Hadoop YARN Architecture: A Tutorial In 5 Simple Points

10 May 2021

Introduction

Hadoop YARN connects the Hadoop storage layer, the Hadoop Distributed File System (HDFS), with a range of processing tools. YARN stands for Yet Another Resource Negotiator. This tutorial explains what Apache YARN is and what its architecture looks like.

Why YARN?
Introduction to Hadoop YARN
Components of YARN Architecture
Application Submission in YARN
Application Workflow in Hadoop YARN

1. Why YARN?

In Hadoop version 1.0, also known as MapReduce version 1 (MRv1), MapReduce performed both resource management and data processing. A single master, the job tracker, allocated resources, scheduled jobs, and monitored them as they ran. It assigned map and reduce tasks to subordinate processes known as task trackers, which periodically reported their progress back to the job tracker.

However, because only a single job tracker was used, scalability issues arose.

The practical limits of this design were reached at a cluster of about 5,000 nodes running 40,000 tasks concurrently. MRv1 also made it difficult to use computational resources efficiently, and Hadoop was limited to the MapReduce processing paradigm.

YARN was introduced in version 2.0 of Hadoop to overcome these problems. It was introduced in 2012 by Hortonworks and Yahoo. The main idea of YARN was to relieve MapReduce by taking over responsibility for job scheduling and resource management. YARN also gives Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.

2. Introduction to Hadoop YARN

YARN is the core component of Hadoop v2.0. It supports several styles of data processing, including interactive, graph, and stream processing, as well as batch processing over data stored in HDFS.

This opens Hadoop up to many kinds of distributed applications beyond MapReduce and lets users carry out operations with a variety of tools. YARN handles resource management and job scheduling, and it drives processing by scheduling tasks and allocating the resources they need.

Here are the components of the Hadoop YARN architecture:

1. Resource Manager

2. Node Manager

3. Application Master

4. Container

3. Components of YARN Architecture

YARN is like the brain of Hadoop. Here we explain the different components of YARN.

Resource Manager

The resource manager is the ultimate authority on resource allocation. When it receives a processing request, it passes parts of the request to the appropriate node managers, where the actual processing takes place. It is the arbitrator of cluster resources and decides how the available resources are allocated among competing applications. It optimizes cluster utilization by trying to keep all resources in use at all times. The resource manager has two main components: the scheduler and the applications manager.
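To make the arbitration idea concrete, here is a minimal sketch of a first-fit allocator in Python. It is not YARN's scheduler and does not use any Hadoop API; the names `Node`, `ToyResourceManager`, and `allocate`, and the capacity figures, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical record of the free capacity reported for one node."""
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class ToyResourceManager:
    """Toy stand-in for the allocation role of a resource manager."""
    nodes: list
    allocations: list = field(default_factory=list)

    def allocate(self, app_id, memory_mb, vcores):
        """Grant a container on the first node with enough free capacity."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                grant = (app_id, node.name, memory_mb, vcores)
                self.allocations.append(grant)
                return grant
        return None  # no node can satisfy the request right now


# Three applications compete for two small nodes (made-up numbers).
rm = ToyResourceManager(nodes=[Node("node-1", 4096, 4), Node("node-2", 2048, 2)])
first = rm.allocate("app-A", 3072, 2)
second = rm.allocate("app-B", 2048, 2)
third = rm.allocate("app-C", 2048, 1)
print(first, second, third)
```

In this sketch the third request is refused because neither node has 2048 MB free any more. A real scheduler would typically queue such a request and apply a policy such as capacity or fairness rather than first-fit.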

Node Manager

The node manager takes care of an individual node in the Hadoop cluster and manages the user jobs and workflow on that node. It registers with the resource manager and regularly reports the health status of the node. Its main function is to manage the application containers assigned to it by the resource manager, and it keeps the resource manager up to date on their state.
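The health reporting described above can be sketched as a simple heartbeat registry. This is a toy model, not Hadoop code: `ToyClusterRegistry`, its methods, and the 10-second expiry are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class ToyClusterRegistry:
    """Toy stand-in for the resource manager's view of its node managers."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}  # node name -> (timestamp, healthy flag)

    def heartbeat(self, node, healthy, now):
        """Record the latest status report from a node manager."""
        self.last_seen[node] = (now, healthy)

    def usable_nodes(self, now):
        """Nodes that reported healthy within the expiry window."""
        return sorted(
            node
            for node, (seen, healthy) in self.last_seen.items()
            if healthy and now - seen <= self.expiry_seconds
        )


registry = ToyClusterRegistry(expiry_seconds=10)
registry.heartbeat("node-1", healthy=True, now=0)
registry.heartbeat("node-2", healthy=False, now=0)  # reports itself unhealthy
registry.heartbeat("node-3", healthy=True, now=0)   # then goes silent
registry.heartbeat("node-1", healthy=True, now=8)
usable = registry.usable_nodes(now=15)
print(usable)
```

Only node-1 remains usable here: node-2 reported itself unhealthy, and node-3's last report is older than the assumed expiry window.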

Application Master

An application is a single job submitted to the framework. Each application has its own application master, which is a framework-specific entity. It is a process that coordinates the execution of the application in the cluster and manages faults. The application master negotiates resources with the resource manager and then works with the node managers to execute and monitor the tasks.

Container

A container is a collection of physical resources, such as CPU cores and RAM, on a single node. A YARN container is managed through a container launch context (CLC), a record that describes how the container is started, including its environment variables, dependencies, security tokens, and the command needed to create the process.
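As a rough illustration of the two ideas, the sketch below models a container as a bundle of resources on one node and a launch context as a record holding environment variables and a command. The class and field names (`ToyContainer`, `ToyLaunchContext`, `render`) are hypothetical and are not the Hadoop API; the assumption that a launch context carries an environment and a command is a simplification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyContainer:
    """Hypothetical bundle of resources that lives on exactly one node."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class ToyLaunchContext:
    """Hypothetical launch record: environment variables plus one command."""
    command: str
    environment: dict = field(default_factory=dict)

    def render(self):
        """Return the shell line a node could run to start the process."""
        env = " ".join(f"{k}={v}" for k, v in sorted(self.environment.items()))
        return f"{env} {self.command}".strip()


container = ToyContainer(node="node-1", memory_mb=1024, vcores=1)
context = ToyLaunchContext(
    command="python task.py",  # made-up command for illustration
    environment={"TASK_ID": "7", "APP_HOME": "/tmp/app"},
)
launch_line = context.render()
print(container, launch_line)
```

Keeping the resource bundle separate from the launch record mirrors the split in the text: one says how much of a node is reserved, the other says what to run inside that reservation.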

4. Application Submission in YARN

Submit the job
Get the application ID
Build the application submission context
Start the container launch
Allocate the resources
Execute the application
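The ordering of these steps can be sketched as a small state machine that refuses out-of-order calls. This is only an illustration of the sequence listed above; the step identifiers and the `ToySubmission` class are hypothetical and do not correspond to YARN client methods.

```python
# Hypothetical identifiers for the six steps listed above.
SUBMISSION_STEPS = [
    "submit_job",
    "get_application_id",
    "build_submission_context",
    "start_container_launch",
    "allocate_resources",
    "execute",
]


class ToySubmission:
    """Tracks progress through the submission steps, in order."""

    def __init__(self):
        self.completed = []

    @property
    def done(self):
        return len(self.completed) == len(SUBMISSION_STEPS)

    def advance(self, step):
        """Accept `step` only if it is the next one in the sequence."""
        if self.done:
            raise ValueError("submission already finished")
        expected = SUBMISSION_STEPS[len(self.completed)]
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)
        return self


submission = ToySubmission()
for step in SUBMISSION_STEPS:
    submission.advance(step)
print(submission.done, submission.completed)
```

Calling `advance("execute")` on a fresh `ToySubmission` raises `ValueError`, which is the point of the sketch: each step depends on the one before it.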

5. Application Workflow in Hadoop YARN

The client submits an application
The resource manager allocates a container in which to start the application master
The application master registers with the resource manager
The application master requests containers from the resource manager
The application master notifies the node managers to launch the containers
The application code is executed in the containers
The client contacts the resource manager or the application master to monitor the status of the application
The application master unregisters from the resource manager
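The workflow above can be traced end to end with a small simulation. This is a toy model under stated assumptions, not Hadoop code: `ToyCluster` and `run_application` are hypothetical names, "master" refers to the per-application process that section 3 calls the application master, and every task is assumed to succeed.

```python
class ToyCluster:
    """Toy model of the application workflow; all names are hypothetical."""

    def __init__(self):
        self.registered_masters = set()
        self.log = []

    def run_application(self, app_id, task_count):
        self.log.append(f"client submits {app_id}")
        self.log.append(f"resource manager allocates a container for the {app_id} master")
        self.registered_masters.add(app_id)
        self.log.append(f"{app_id} master registers with the resource manager")
        containers = [f"{app_id}-container-{i}" for i in range(1, task_count + 1)]
        self.log.append(f"{app_id} master is granted {len(containers)} containers")
        results = []
        for container in containers:
            # Assumption: every task succeeds; no fault handling is modeled.
            self.log.append(f"node manager launches {container}")
            results.append(f"{container}: finished")
        self.log.append(f"client sees {len(results)}/{task_count} tasks finished")
        self.registered_masters.discard(app_id)
        self.log.append(f"{app_id} master unregisters from the resource manager")
        return results


cluster = ToyCluster()
results = cluster.run_application("app-1", task_count=2)
for line in cluster.log:
    print(line)
```

The log lines appear in the same order as the steps in the list, and the set of registered masters is empty again once the application has finished.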

Conclusion

YARN revolutionized the Hadoop ecosystem, making it more efficient, flexible, and scalable. In 2013, Yahoo went live with YARN, which allowed the company to shrink its Hadoop cluster from 40,000 to 32,000 nodes. It also led to a large increase in the number of jobs run.
