Hadoop-Let us Admin

hdfs-*.xml

hdfs-default.xml: Default HDFS properties. The file is located in the following JAR file: hadoop-hdfs-2.2.0.jar (assuming version 2.2.0).

hdfs-site.xml: Site-specific HDFS properties. Properties configured in this file override the properties in the hdfs-default.xml file.
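The override relationship between the two files can be illustrated with a minimal Python sketch. It is not how Hadoop itself loads configuration; it only parses two small XML documents in the standard Hadoop <configuration>/<property> layout and lets site values replace defaults. The site value of 2 for dfs.replication is an assumption chosen for the example.

```python
import xml.etree.ElementTree as ET

# Stand-in for hdfs-default.xml (only two properties shown).
DEFAULT_XML = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.blocksize</name><value>134217728</value></property>
</configuration>"""

# Stand-in for hdfs-site.xml; the value 2 is illustrative only.
SITE_XML = """<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
</configuration>"""


def parse_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site-specific values win."""
    merged = parse_properties(default_xml)
    merged.update(parse_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
```

In this sketch, config["dfs.replication"] comes from the site file, while config["dfs.blocksize"] falls back to the default because the site file does not mention it.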

The hdfs-default.xml and hdfs-site.xml files configure the properties for the HDFS. Together with the core-site.xml file described next, they configure the HDFS for the cluster. As you learned in Chapter 2, the NameNode and the Secondary NameNode are responsible for managing the HDFS. The hdfs-*.xml files configure the NameNode and Secondary NameNode components of the Hadoop system. They are also used for configuring the runtime properties of the HDFS as well as the properties associated with the physical storage of files in HDFS on the individual data nodes. Although the list of properties covered in this section is not exhaustive, it provides a deeper understanding of the HDFS design at the physical and operational level. This section explores the key properties of the hdfs-*.xml files. Some of the important properties of the hdfs-site.xml file include the following:

  • dfs.namenode.name.dir: Directories on the local file system of the NameNode in which the metadata file table (the fsimage file) is stored. Recall that this file stores the HDFS metadata as of the last snapshot. If this is a comma-delimited list of directories, the file is replicated to all the directories for redundancy. (Ensure that there is no space after the comma in the comma-delimited list of directories.) The default value for this property is file://${hadoop.tmp.dir}/dfs/name. The hadoop.tmp.dir property is specified in the core-site.xml file (or in core-default.xml if core-site.xml does not override it).
  • dfs.namenode.edits.dir: Directories on the local file system of the NameNode in which the metadata transaction file (the edits file) is stored. This file contains changes to the HDFS metadata since the last snapshot. If this is a comma-delimited list of directories, the transaction file is replicated to all the directories for redundancy. The default value is the same as dfs.namenode.name.dir.
  • dfs.namenode.checkpoint.dir: Directories, on a local or network file system accessible to the Secondary NameNode, in which the Secondary NameNode stores the temporary images to merge. Recall from Chapter 2 that this is the location to which the fsimage file from the NameNode is copied for merging with the edits file from the NameNode. If this is a comma-delimited list of directories, the image is replicated in all the directories for redundancy. The default value is file://${hadoop.tmp.dir}/dfs/namesecondary.

  • dfs.namenode.checkpoint.edits.dir: Directories, on a local or network file system accessible to the Secondary NameNode, in which the Secondary NameNode stores the edits file copied from the NameNode. This edits file is merged with the fsimage file copied into the directory defined by the dfs.namenode.checkpoint.dir property. If this is a comma-delimited list of directories, the edits files are replicated in all the directories for redundancy. The default value is the same as dfs.namenode.checkpoint.dir.

  • dfs.namenode.checkpoint.period: The number of seconds between two checkpoints. Each time this interval elapses, the checkpoint process begins, which merges the edits file with the fsimage file from the NameNode. The default value is 3600 (one hour).

  • dfs.blocksize: The default block size for new files, in bytes. The default is 128 MB. Note that block size is not a system-wide parameter; it can be specified on a per-file basis.
  • dfs.replication: The default block replication factor. The replication factor can be specified per file; when it is not specified, this value is used. The default value is 3.
  • dfs.namenode.handler.count: Represents the number of server threads the NameNode uses to communicate with the DataNodes. The default is 10, but the recommendation is about 10% of the number of nodes, with a minimum value of 10. If this value is too low, you might notice messages in the DataNode logs indicating that the connection was refused by the NameNode when the DataNode tried to communicate with the NameNode through heartbeat messages.

  • dfs.datanode.du.reserved: Reserved space in bytes per volume that represents the amount of space to be reserved for non-HDFS use. The default value is 0, but it should be at least 10 GB or 25% of the total disk space, whichever is lower.

  • dfs.hosts: This is a fully qualified path to a file name that contains a list of hosts that are permitted to connect with the NameNode. If the property is not set, all nodes are permitted to connect with the NameNode.
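Several of the rules of thumb above reduce to simple arithmetic, which the following Python sketch works through before rendering the results as an hdfs-site.xml fragment. It is an illustration under stated assumptions rather than a recommended configuration: the helper names, the directory paths, the 250-node cluster, and the 2 TB volume are all hypothetical, and 1 GB is taken to mean 1024^3 bytes.

```python
from xml.sax.saxutils import escape

GB = 1024 ** 3  # assumption: binary gigabytes


def join_dirs(dirs):
    """Build a comma-delimited directory list with no space after the commas."""
    return ",".join(d.strip() for d in dirs)


def handler_count(num_nodes):
    """About 10% of the number of nodes, with a minimum of 10."""
    return max(10, num_nodes // 10)


def du_reserved_bytes(volume_bytes):
    """The lower of 10 GB and 25% of the volume's total size."""
    return min(10 * GB, volume_bytes // 4)


def render_site_xml(props):
    """Render a {name: value} dict in the Hadoop <configuration> layout."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Hypothetical cluster: 250 nodes, 2 TB data volumes, two NameNode directories.
props = {
    "dfs.namenode.name.dir": join_dirs(
        ["file:///data/1/dfs/name", " file:///data/2/dfs/name"]
    ),
    "dfs.namenode.handler.count": handler_count(250),
    "dfs.datanode.du.reserved": du_reserved_bytes(2 * 1024 * GB),
}
site_xml = render_site_xml(props)
```

With these inputs the handler count is 25 and the reserved space is capped at 10 GB, because 25% of a 2 TB volume exceeds that cap. On a 20 GB volume, the same rule would reserve 5 GB instead.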