Hadoop - Let us Admin

Hadoop

Hadoop is an open-source ecosystem for storing, managing, and analyzing large volumes of data. As a data management system, it brings together massive amounts of structured and unstructured data, touching nearly every layer of the traditional enterprise data stack and positioning itself at the center of the data center. It fills a gap in the market by effectively storing, and providing computational capability over, substantial amounts of data. Hadoop differs from previous distributed approaches in the following ways:

  • Data is distributed in advance.
  • Data is replicated throughout a cluster of computers for reliability and availability.
  • Data processing occurs where the data is stored whenever possible, avoiding network bandwidth bottlenecks.
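The replication idea above can be illustrated with a toy sketch (plain Python, not Hadoop code): each data block is copied to several distinct nodes so the loss of one machine does not lose the data. The node names are invented, and the replication factor of 3 matches HDFS's default.

```python
import random

def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes to hold copies of a block.

    A toy stand-in for HDFS replica placement: the real policy also
    considers rack topology, which this sketch ignores.
    """
    return random.sample(nodes, replication)

nodes = ["node1", "node2", "node3", "node4", "node5"]
replicas = place_replicas("blk_0001", nodes)
# `replicas` holds three distinct node names, e.g. ["node4", "node1", "node5"]
```

Because every block lives on several nodes, the scheduler can later run a computation on whichever replica-holding node is least busy, which is what makes the data-locality point above work.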

The Hadoop ecosystem consists of several subprojects, two of which form its foundation: the distributed computation framework MapReduce and the distributed storage layer, the Hadoop Distributed File System (HDFS).
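The MapReduce programming model itself is simple enough to sketch without a cluster. The following self-contained Python toy (not the Hadoop API) runs the classic word count: a map phase emits (word, 1) pairs, a shuffle groups pairs by key, and a reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

In real Hadoop the same three phases run in parallel across the cluster, with map tasks scheduled near the HDFS blocks they read.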

The MapReduce framework is both a parallel computation framework and a scheduling framework. It consists of two components:

  • a single JobTracker, and
  • several TaskTrackers.
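The division of labor between the two components can be sketched as follows (an illustrative toy, not the real Hadoop scheduler): a single "JobTracker" splits a job into tasks and hands them out to the available "TaskTrackers", here simply in round-robin order.

```python
from itertools import cycle

def schedule(tasks, tasktrackers):
    """Assign each task to a TaskTracker in round-robin order.

    The real JobTracker also weighs data locality and tracker load;
    this sketch only shows the master-dispatches-to-workers shape.
    """
    assignments = {}
    trackers = cycle(tasktrackers)
    for task in tasks:
        assignments[task] = next(trackers)
    return assignments

plan = schedule(["map_0", "map_1", "map_2", "reduce_0"],
                ["tracker_a", "tracker_b"])
# plan: map_0 -> tracker_a, map_1 -> tracker_b,
#       map_2 -> tracker_a, reduce_0 -> tracker_b
```

Note that everything flows through the one scheduling function: if that single master fails, no new tasks can be assigned, which is exactly the weakness described next.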

Until the Hadoop 2.x release, HDFS and MapReduce each employed a single-master model, leaving each with a single point of failure.