Hadoop-Let us Admin

DataNode

In HDFS, the daemon responsible for storing and retrieving block data is the DataNode (DN). DataNodes serve read and write requests from clients and perform block operations (creation, deletion, replication) on instruction from the NameNode. Each DataNode stores HDFS blocks on behalf of local or remote clients, saving each block as a separate file in the node's local file system. Because the DataNode abstracts away the details of local storage, nodes do not all have to use the same local file system.

Blocks are created or destroyed on DataNodes at the request of the NameNode, which validates and processes requests from clients. Although the NameNode manages the namespace, clients communicate directly with DataNodes to read or write data at the HDFS block level. A DataNode has no knowledge of HDFS files: on startup, it scans its local file system, generates a list of the HDFS blocks that correspond to those local files, and sends this list (the block report) to the NameNode.
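To make this read path concrete, here is a minimal sketch using the Hadoop Java FileSystem API. The NameNode URI and file path are placeholders for illustration; behind the scenes, the client fetches block locations from the NameNode and then streams the bytes directly from the DataNodes that hold them.

    // A minimal sketch of an HDFS read; the URI and path are hypothetical.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client asks the NameNode for block locations, then
            // streams block data directly from the DataNodes.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

Note that the NameNode never sits in the data path: it only hands out metadata, which keeps it from becoming a bandwidth bottleneck.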

Individual files are broken into blocks of a fixed size (128 MB by default in recent Hadoop releases) and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of each block and its replicas.
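This block layout can be inspected from any client. The following sketch, again with a placeholder URI and path, asks the NameNode for each block's offset, length, and the hosts holding its replicas.

    // A sketch of inspecting block placement for a file; the path is hypothetical.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            // The NameNode returns, for each block, its offset, its length,
            // and the DataNodes holding a replica of it.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }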

Hadoop was designed on the idea that DataNodes are "disposable workers": servers fast enough to do useful work as part of the cluster, but cheap enough to be replaced easily if they fail.

Each data block is stored on multiple machines (three by default), improving both resilience to failure and data locality; the latter matters because network bandwidth is a scarce resource in a large cluster.
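The number of replicas can also be tuned per file. The sketch below, assuming the same placeholder cluster URI and path as before, asks the NameNode to change a file's replication factor; the DataNodes then copy or remove block replicas until the new target is met.

    // A sketch of adjusting a file's replication factor;
    // the path and factor are illustrative, not recommendations.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            // Ask the NameNode to keep two replicas of this file's blocks.
            boolean accepted = fs.setReplication(new Path("/data/sample.txt"), (short) 2);
            System.out.println("Replication change accepted: " + accepted);
        }
    }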