Hadoop-Let us Admin

Hadoop MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache Hadoop MapReduce was derived from Google's MapReduce concept.

MR1

Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of commodity machines. The core concept of MapReduce in Hadoop is that input can be split into logical chunks, and each chunk can be initially processed independently by a map task. The model is based on two distinct steps, both of which are custom and user-defined for an application:

  • Map: An initial ingestion and transformation step in which individual input records can be processed in parallel. A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all the maps is partitioned, and each partition is sorted.
  • Reduce: An aggregation or summarization step in which all associated records must be processed together by a single entity. There will be one partition for each reduce task. Each partition’s sorted keys and the values associated with the keys are then processed by the reduce task. There can be multiple reduce tasks running in parallel on the cluster.
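
The two steps above can be illustrated with a small single-process word count. This is only a sketch of the data flow (map, partition, sort, reduce) in plain Python. It does not use the Hadoop API, and the names map_task, partition_for, reduce_task, and run_job are hypothetical names chosen for this illustration. A real cluster would run the map and reduce tasks in parallel on different nodes, which this sketch runs sequentially.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def map_task(record):
    """Map: transform one input record (a line of text) into (word, 1) pairs."""
    for word in record.lower().split():
        yield word, 1


def partition_for(key, num_reducers):
    """Pick a partition for a key; the same key always maps to the same partition."""
    return crc32(key.encode("utf-8")) % num_reducers


def reduce_task(key, values):
    """Reduce: aggregate all values associated with one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Run the whole flow sequentially (illustrative only, not distributed)."""
    # One partition per reduce task.
    partitions = [[] for _ in range(num_reducers)]

    # Each split is independent, so a cluster could hand each one to its own map task.
    for split in splits:
        for record in split:
            for key, value in map_task(record):
                partitions[partition_for(key, num_reducers)].append((key, value))

    results = {}
    for partition in partitions:
        # Sort each partition by key so equal keys sit next to each other.
        partition.sort(key=itemgetter(0))
        # Each reduce task sees one partition: its sorted keys and their values.
        for key, pairs in groupby(partition, key=itemgetter(0)):
            word, total = reduce_task(key, [value for _, value in pairs])
            results[word] = total
    return results


if __name__ == "__main__":
    # Two input splits, each a list of records (lines).
    splits = [["the quick brown fox", "the lazy dog"], ["the fox"]]
    print(run_job(splits))
```

Because every occurrence of a key is routed to the same partition, the final counts do not depend on how many reduce tasks are configured; only the grouping of work changes.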

MR2

The current version of Apache MapReduce (MR2) is built on the Apache YARN framework. YARN stands for “Yet Another Resource Negotiator”. It is a newer framework that facilitates writing arbitrary distributed processing frameworks and applications.

YARN’s execution model is more generic than the earlier MapReduce implementation. Unlike the original Apache Hadoop MapReduce (also called MR1), YARN can run applications that do not follow the MapReduce model. Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.