Hadoop is based on the Google MapReduce paper published in 2004, and its development began in 2005. Hadoop was originally developed to support Nutch, an open-source web search engine project. Eventually, Hadoop was separated from Nutch and became its own project under the Apache Software Foundation.
Apache Hadoop is a framework that lets users store data across a cluster and perform distributed processing of large data sets on commodity computers using a simple programming model. Each node in the cluster provides both computation and storage. The framework is designed to scale out from a single server to thousands of machines, parallelizing data processing across the nodes of the cluster to speed up large computations and hide I/O latency.
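To make this "simple programming model" concrete, the following is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API. The mapper emits a count of 1 for each word it sees, and the reducer sums the counts per word; the input and output paths are assumed to be passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Mapper: runs in parallel on each input split and emits (word, 1) pairs
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      // Reducer: receives all counts for a given word and sums them
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          result.set(sum);
          context.write(key, result);
        }
      }
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The programmer only writes the map and reduce functions; the framework handles splitting the input, scheduling the parallel tasks across the cluster, and moving intermediate data between them.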
At its core, Hadoop is a Java-based MapReduce framework. However, the rapid adoption of the Hadoop platform created a need to support the non-Java user community as well. Today, Hadoop is the best-known MapReduce framework on the market, and several companies have grown up around it to provide support, consulting, and training services for the Hadoop software.
In clusters of this size, machine failure is the order of the day, so it does not make sense to rely on hardware alone to deliver high availability. Instead, the framework itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
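One concrete way this surfaces to the programmer is HDFS block replication: each block of a file is stored on several nodes, and if a node dies, the NameNode re-replicates its blocks from the surviving copies. The sketch below sets the replication factor through the standard Configuration and FileSystem APIs; the factor of 3 is the usual default, and the file path is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep 3 copies of each block of files created with this configuration,
        // so the loss of any single node does not lose data.
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);
        // The factor can also be changed for an existing file
        // ("/data/example.txt" is a hypothetical path).
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
      }
    }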
Hadoop is a composition of various individual components that together form the Hadoop ecosystem.
For a very long time, Hadoop remained a system in which users submitted jobs that ran on the entire cluster, executed in First In, First Out (FIFO) order. This led to situations in which a long-running but less important job would hog resources and prevent a smaller yet more important job from executing. To solve this problem, more sophisticated job schedulers, such as the Fair Scheduler and the Capacity Scheduler, were added to Hadoop. Even so, Hadoop still had scalability limitations that stemmed from some deeply entrenched design decisions, and addressing them ultimately led to Hadoop 2.
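As a small sketch of how these schedulers are used in practice, a job can be routed to a named Capacity Scheduler queue through its configuration, so that important work is no longer stuck behind long-running jobs. The queue name "important" below is an assumption; queues are defined by the cluster administrator.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class QueueSubmission {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Submit to a specific Capacity Scheduler queue rather than the
        // default queue ("important" is a hypothetical queue name).
        conf.set("mapreduce.job.queuename", "important");
        Job job = Job.getInstance(conf, "high-priority job");
        // ... set mapper, reducer, and input/output paths as usual ...
      }
    }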
Hadoop encompasses a multiplicity of tools designed and implemented to work together.