The Hadoop Distributed File System (HDFS) lets you store large data sets across a cluster of multiple machines. HDFS follows a master/slave architecture: the actual data files are stored across multiple slave nodes called DataNodes, which are managed by a master node called the NameNode. The DataNodes are connected to the NameNode through network switches. In this article we will learn about rack awareness in Hadoop.
As in any other file system, files in HDFS are stored as blocks. The default size of each block is 128 MB. HDFS ensures fault tolerance via the replication factor: according to the value configured for this factor, each block is replicated and the replicas are stored on different DataNodes.
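The arithmetic behind block splitting and replication can be illustrated with a short sketch (this is plain illustration, not a Hadoop API; the constants mirror the defaults mentioned above):

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def block_count(file_size_mb: int) -> int:
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def raw_storage_mb(file_size_mb: int) -> int:
    """Total storage consumed across the cluster, counting every replica."""
    return file_size_mb * REPLICATION

# A 1 GB file is split into 8 blocks, and with 3 replicas per block
# it consumes 3072 MB of raw DataNode storage in total.
print(block_count(1024))     # -> 8
print(raw_storage_mb(1024))  # -> 3072
```

Note that a file smaller than the block size still occupies only its actual size on disk; the block size is an upper bound per block, not a minimum allocation.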
A rack is a collection of 30-40 DataNodes or machines in a Hadoop cluster, located in a single data center. The DataNodes in a rack are connected through a network switch, through which they communicate with the NameNode. A large Hadoop cluster will have multiple racks.
Rack awareness is the process of making Hadoop aware of which machine is part of which rack and how the racks are connected to each other within the cluster. The NameNode keeps the rack IDs of all the DataNodes and uses this rack information to choose the closest DataNode when storing data blocks. In simple terms, knowing how the DataNodes are distributed across the racks, i.e. knowing the cluster topology, is called rack awareness in Hadoop. Rack awareness is important because it ensures data reliability and helps to recover data in case of a rack failure.
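In practice, Hadoop learns the topology by invoking an administrator-supplied topology script (configured via `net.topology.script.file.name` in core-site.xml) that maps a DataNode address to a rack ID. A minimal sketch of such a script follows; the IP addresses and rack IDs in the table are made-up examples:

```python
#!/usr/bin/env python3
# Sketch of a Hadoop topology script: Hadoop calls it with one or more
# DataNode IPs/hostnames as arguments and expects one rack ID per line.
# The IP-to-rack mapping below is a hypothetical example.
import sys

RACK_MAP = {
    "10.1.1.11": "/dc1/rack1",
    "10.1.1.12": "/dc1/rack1",
    "10.1.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # Hadoop's conventional fallback rack ID

def resolve(host: str) -> str:
    """Return the rack ID for a DataNode, or the default rack if unknown."""
    return RACK_MAP.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```

If no topology script is configured, Hadoop assumes every node lives on the single default rack, and the rack-aware placement described below effectively degenerates to random node selection.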
Hadoop keeps multiple copies for all data that is present in HDFS. If Hadoop is aware of the rack topology, each copy of data can be kept in a different rack. By doing this, in case an entire rack suffers a failure for some reason, the data can be retrieved from a different rack.
Replication of data blocks across multiple racks in HDFS via rack awareness is governed by a policy called the Replica Placement Policy.
The policy states that “No more than one replica is placed on one node. And no more than 2 replicas are placed on the same rack.”
The default replication factor is 3, but it can be configured.
At the time of creation of a new block: the first replica is stored on the local node, closest to the writer. The second replica is stored on a node on an altogether different rack. The third replica is stored on the same rack as the second, but on a different node.
At the time of re-replicating a block: if only one replica exists, the second replica is stored on a different rack. If two replicas exist and both are on the same rack, the third replica is stored on a different rack.
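The placement rules for a new block can be sketched as a toy simulation. This is not Hadoop's actual `BlockPlacementPolicy` implementation; nodes are modeled as simple (rack, node) pairs and the cluster layout is invented for illustration:

```python
import random

def place_replicas(writer, nodes, replication=3):
    """Toy model of the default HDFS placement for a new block:
    1st replica on the writer's node, 2nd on a node in a different rack,
    3rd on the 2nd replica's rack but on a different node."""
    chosen = [writer]                                  # 1st replica: local node
    remote = [n for n in nodes if n[0] != writer[0]]   # nodes on other racks
    second = random.choice(remote)
    chosen.append(second)                              # 2nd replica: remote rack
    same_rack = [n for n in nodes
                 if n[0] == second[0] and n != second]
    chosen.append(random.choice(same_rack))            # 3rd: same rack as 2nd
    return chosen[:replication]

# Hypothetical cluster: three racks with two DataNodes each.
nodes = [("rack1", "n1"), ("rack1", "n2"),
         ("rack2", "n3"), ("rack2", "n4"),
         ("rack3", "n5"), ("rack3", "n6")]
replicas = place_replicas(("rack1", "n1"), nodes)
print(replicas)
```

Whatever the random choices, the result always satisfies the policy quoted above: no node holds more than one replica, and no rack holds more than two.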
A simple way of storing data block replicas would be to place each one on a separate rack; however, this would increase the latency of write operations, since every replica would have to cross a rack switch. The replication policy is therefore designed to reduce inter-rack network traffic when writing data, by placing the replicas on only two unique racks, while still ensuring fault tolerance.
Rack awareness in Hadoop also means choosing a DataNode close to the client that issued the read/write request, thereby reducing network traffic. Hadoop supports configuring rack awareness so that one replica of each data block is placed on a different rack, while copies are spread across no more than two racks, which keeps bandwidth utilization and latency low. This makes write operations faster while still providing fault tolerance. It also keeps data available if there is a partition within the cluster or in the event of a network switch failure.
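The "choose a nearby DataNode" idea can be sketched with a distance function in the style of Hadoop's NetworkTopology, where distance grows with hops to a common ancestor (same node = 0, same rack = 2, different racks = 4). The rack and node names below are hypothetical:

```python
def distance(a, b):
    """a and b are (rack, node) pairs; smaller distance means closer."""
    if a == b:
        return 0      # same node
    if a[0] == b[0]:
        return 2      # same rack, different node
    return 4          # different racks

def closest_replica(client, replicas):
    """Pick the replica a client should read from: the one at minimum distance."""
    return min(replicas, key=lambda r: distance(client, r))

replicas = [("rack1", "n2"), ("rack2", "n3"), ("rack2", "n4")]
print(closest_replica(("rack1", "n1"), replicas))  # -> ('rack1', 'n2')
```

A client on rack1 reads from the rack1 replica (distance 2) rather than paying the cross-rack cost (distance 4), which is exactly the traffic reduction described above.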
Big data analysts are at the vanguard of the journey towards an ever more data-centric world. Because they are powerful intellectual resources, companies go the extra mile to hire and retain them. You too can come on board and take this journey with our Big Data Specialization course.