Hadoop-Let us Admin

File Placement

HDFS uses replication to maintain three copies of every chunk by default (one primary and two replicas). Applications that require more copies can specify a higher replication factor, typically at file create time. All copies of a chunk are stored on different data nodes using a rack-aware replica placement policy. The first copy is written to the local storage of the data node where the writer runs, which keeps that write off the network. To tolerate machine failures, the second copy is placed on a randomly chosen data node on the same rack as the first. This improves network bandwidth utilization because intra-rack communication is faster than inter-rack communication, which often goes through intermediate network switches. To keep the data available even if an entire rack fails, HDFS places the third copy on a randomly chosen data node in a different rack.
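The placement rules above can be sketched as a small simulation. This is an illustrative model only, not HDFS code: the function name, the rack map, and the data node names are all made up for the example, and the real block placement logic in the name node is considerably more involved.

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    """Pick three data nodes for a chunk, following the simplified
    rack-aware policy described above:
      1st replica: the writer's local data node,
      2nd replica: a different data node on the same rack,
      3rd replica: a data node on a different rack."""
    # Find the rack that holds the writer's data node.
    local_rack = next(rack for rack, nodes in nodes_by_rack.items()
                      if writer_node in nodes)
    first = writer_node
    # Second copy: random different node on the same rack.
    second = random.choice(
        [n for n in nodes_by_rack[local_rack] if n != first])
    # Third copy: random node on a randomly chosen remote rack.
    remote_rack = random.choice(
        [rack for rack in nodes_by_rack if rack != local_rack])
    third = random.choice(nodes_by_rack[remote_rack])
    return [first, second, third]

# Hypothetical two-rack cluster for the example.
racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
replicas = place_replicas("dn1", racks)
print(replicas)  # e.g. first is dn1, second on rack1, third on rack2
```

Note how the first replica costs no network traffic at all, the second crosses only the rack's own switch, and only the third traverses the inter-rack links.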

HDFS uses a random chunk layout policy to map the chunks of a file onto data nodes. At file create time, the name node randomly selects a data node to store each chunk. This random selection can lead to a sub-optimal file layout that is not uniformly load balanced. The name node is responsible for maintaining the chunk-to-data-node mapping, which clients consult to locate the desired chunk.
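A toy model of this mapping may make the division of labor clearer: the name node only records where each chunk lives, and clients look up that mapping before fetching data directly from the data nodes. The class, method names, and paths below are invented for illustration and do not correspond to the actual HDFS API.

```python
import random

class ToyNameNode:
    """Minimal sketch of a name node's chunk-to-data-node mapping."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        # (filename, chunk_index) -> data node holding that chunk
        self.chunk_map = {}

    def create_chunk(self, filename, chunk_index):
        # Random chunk layout: pick a data node uniformly at random.
        # Over many files this can leave some nodes more loaded
        # than others, hence the sub-optimal balance noted above.
        dn = random.choice(self.datanodes)
        self.chunk_map[(filename, chunk_index)] = dn
        return dn

    def lookup(self, filename, chunk_index):
        # Clients ask the name node where a chunk lives, then read
        # the chunk directly from that data node.
        return self.chunk_map[(filename, chunk_index)]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
for i in range(4):
    nn.create_chunk("/logs/app.log", i)
print(nn.lookup("/logs/app.log", 0))
```

The design point the sketch highlights is that the name node sits on the metadata path only; chunk data never flows through it.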