Apache Spark Architecture: A Basic Guide In 2021

Ajay Ohri

Introduction

Hadoop has long been used for Big Data processing and still is today. Apache Spark complements Hadoop and has become a technology of choice. The Apache Spark framework is one of the largest Big Data tools available, with a user base exceeding 2 million members, 200 enterprises, and 500 code contributors. It is a popular framework at enterprises such as Tencent, the social networking company, Alibaba, the e-commerce giant, and Baidu, which work with large volumes of Big Data across various platforms. The Apache Spark suite includes components such as Spark Streaming, Spark SQL, the Spark cluster architecture, the Spark Thrift Server, and more.

  1. Apache Spark Architecture
  2. Spark Architecture Overview
  3. The Spark Driver role-Spark Architecture
  4. Understanding the Run-Time Architecture of a Spark Application

1) Apache Spark Architecture

Apache Spark has a layered architecture with well-defined layers, in which the Spark components are loosely coupled and integrated with several libraries and extensions.

The two main abstractions of the Spark architecture are:

  • RDD, the acronym for Resilient Distributed Dataset.
  • DAG, the acronym for Directed Acyclic Graph.

An RDD (Resilient Distributed Dataset) is a collection of data items split into partitions that are stored on the nodes of the cluster. The Apache Spark architecture supports two RDD types, namely

  • Parallelized collections, which are created from existing Scala collections.
  • Hadoop datasets, which are created from files stored in HDFS.

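Spark RDDs support two kinds of operations, namely

  • Actions, which trigger the computation and return a result to the driver program (for example, count or collect).
  • Transformations, which lazily define a new RDD from an existing one (for example, map or filter).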
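The difference between the two kinds of operations can be illustrated with a small toy model. The sketch below is plain Python and does not use Spark's API: `ToyRDD`, its methods, and the `double` helper are hypothetical names invented for illustration. It only shows the idea that transformations are recorded and nothing runs until an action is called.

```python
class ToyRDD:
    """A toy, single-process stand-in for an RDD (NOT Spark's API)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one list per partition
        self.ops = tuple(ops)         # recorded (kind, function) transformations

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Spread the items of a local collection over the partitions.
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    # Transformations: only record what should be done, compute nothing.
    def map(self, fn):
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.ops + (("filter", pred),))

    # Actions: replay the recorded transformations on every partition.
    def collect(self):
        out = []
        for part in self.partitions:
            items = part
            for kind, fn in self.ops:
                if kind == "map":
                    items = [fn(x) for x in items]
                else:
                    items = [x for x in items if fn(x)]
            out.extend(items)
        return out

    def count(self):
        return len(self.collect())


calls = []  # records every element the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


rdd = ToyRDD.parallelize(range(1, 7), num_partitions=2)
doubled = rdd.map(double).filter(lambda x: x > 4)  # transformations only
calls_before_action = len(calls)                   # still 0: nothing has run
result = sorted(doubled.collect())                 # the action triggers the work
```

In this toy model `calls_before_action` stays at zero, because `map` and `filter` only extend the recorded list of operations. The work happens inside `collect`.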

DAG is the acronym for Directed Acyclic Graph. It describes a sequence of computations performed on the data, in which each node is an RDD partition and each edge is a transformation applied on top of the data. The DAG abstraction lets Spark do away with the multi-stage execution model of Hadoop MapReduce and provides performance enhancements over Hadoop.

The two words in the name describe properties of the graph.

Directed – A transformation is an action that transitions a data partition from one state to another, from A to B.

Acyclic – A transformation cannot return to an earlier partition state, so the graph contains no cycles.
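As a rough illustration of these two properties, the following sketch models a lineage graph with Python's standard `graphlib` module (Python 3.9 or later). The RDD names (`lines`, `words`, `pairs`, `counts`) and the `is_acyclic` helper are made up for the example, and this is not how Spark represents its DAG internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical lineage: each key is an RDD, each value is the set of
# RDDs it is derived from (the direction of the edges).
lineage = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
}

# A directed acyclic graph always has a valid execution order.
order = list(TopologicalSorter(lineage).static_order())


def is_acyclic(graph):
    """Return True if the dependency graph contains no cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# Making "lines" depend on "counts" would close a loop, which a DAG forbids.
cyclic = dict(lineage, lines={"counts"})
```

Here `order` lists the RDDs from the input to the final result, and `is_acyclic(cyclic)` is false because the extra edge would let a later state feed back into an earlier one.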

2) Spark Architecture Overview

Spark architecture follows a master/slave model with two daemons and a cluster manager.

The two daemons are

  • The Master Daemon, which runs the Master or Driver process.
  • The Worker Daemon, which runs the Slave processes.

A Spark cluster has a single Master and may have any number of worker/slave servers processing the data. The driver and the executors each run in their own Java processes.

Users have the option to run them

  • On the same machines, as a horizontal Spark cluster.
  • On separate machines, as a vertical Spark cluster.
  • In a combination of both, which is called a mixed configuration cluster.

3) The Spark Driver role-Spark Architecture

The Spark Driver is the master node of a Spark application. It is the central entry point of the Spark shell, which is available for Scala, Python, and R. The driver program runs the application's main() function and is where the SparkContext is created.

It contains several components that translate the user's Spark code into Spark jobs, which are then executed on the cluster. The components are

  • DAGScheduler
  • SchedulerBackend
  • TaskScheduler 
  • BlockManager 

Role of the Driver in Spark Architecture: The Spark driver program runs on the master node of the Spark cluster and communicates with the cluster manager for Spark cluster setup, job scheduling, job execution, etc.

  • It translates the RDDs into an execution graph and splits the graph into several stages.
  • It stores the metadata about all RDDs and their partitions.
  • It is the cockpit of job and task execution: the driver program converts a user application into small execution units known as tasks. The tasks are then run by the executors, the worker processes that execute individual tasks.
  • It exposes information about the running Spark application through a Web UI available at port 4040.

Role of the Executor in Spark Architecture: An executor is a distributed agent responsible for executing tasks. Each Spark application has its own executor processes. Executors usually run for the entire lifetime of the Spark application, which is called “static allocation of executors”. However, users can also opt for dynamic allocation of executors, in which executors are added or removed dynamically to match the overall workload.

  • The executors perform all the data processing.
  • They read data from and write data to a variety of external sources.
  • They store computation results in memory, in cache, or on hard disk drives.
  • They interact with the storage systems.
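The idea behind dynamic executor allocation can be sketched as a simple sizing rule. In Spark itself this behavior is governed by settings such as `spark.dynamicAllocation.enabled`, `spark.dynamicAllocation.minExecutors`, and `spark.dynamicAllocation.maxExecutors`. The function below is a simplified illustration with hypothetical names and parameters, and it is not the formula Spark uses internally.

```python
import math


def target_executors(pending_tasks, tasks_per_executor,
                     min_executors, max_executors):
    """Toy sizing rule: enough executors for the pending tasks, within bounds.

    All parameter names are hypothetical and chosen for illustration only.
    """
    needed = math.ceil(pending_tasks / tasks_per_executor)
    # Clamp the demand between the configured lower and upper bounds.
    return max(min_executors, min(max_executors, needed))


idle = target_executors(0, tasks_per_executor=4, min_executors=1, max_executors=10)
busy = target_executors(17, tasks_per_executor=4, min_executors=1, max_executors=10)
overloaded = target_executors(1000, tasks_per_executor=4, min_executors=1, max_executors=10)
```

With static allocation the number of executors would stay fixed for the lifetime of the application. In this sketch it follows the workload and never leaves the configured range.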

Role of the Cluster Manager in Spark Architecture: The cluster manager is an external service responsible for acquiring resources on the Spark cluster and allocating them to Spark jobs. A Spark application can use three main types of cluster manager to handle the allocation and deallocation of physical resources such as memory and CPU for client jobs: Apache Mesos, Hadoop YARN, and the standalone cluster manager. Any of them can run in the cloud or on-premises.

The Standalone Cluster Manager is the easiest to use, especially when developing a new Spark application or getting started with Apache Spark. The choice of cluster manager depends on the goals of the application, since the cluster managers differ in their scheduling capabilities.

4) Understanding the Run-Time Architecture of a Spark Application

Let’s study the Spark architecture in detail.

Launching the Spark Program: ‘spark-submit’ is the single script used to submit a Spark program and launch the application on the Spark cluster. Its options let it connect to different cluster managers and control the resources allocated to the application. Ex: With some cluster managers, such as YARN, ‘spark-submit’ can run the driver inside the cluster on a worker node, while with others the driver runs only on the local machine.
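As a rough sketch of how such a submission might be assembled, the helper below builds a ‘spark-submit’ command line as a list of arguments. The flags `--master`, `--deploy-mode`, `--num-executors`, and `--executor-memory` are existing ‘spark-submit’ options. The function name `build_spark_submit`, the application file `my_app.py`, and the argument `input.txt` are hypothetical placeholders.

```python
import shlex


def build_spark_submit(app, master, deploy_mode="client",
                       num_executors=None, executor_memory=None, app_args=()):
    """Assemble a spark-submit argument list (helper name is hypothetical)."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    cmd.append(app)        # the application file comes after the options
    cmd.extend(app_args)   # anything after it is passed to the application
    return cmd


# Hypothetical application and input; "cluster" mode asks YARN to run the
# driver inside the cluster rather than on the submitting machine.
cmd = build_spark_submit(
    "my_app.py",
    master="yarn",
    deploy_mode="cluster",
    num_executors=4,
    executor_memory="2g",
    app_args=["input.txt"],
)
command_line = shlex.join(cmd)
```

The resulting list could be passed to `subprocess.run` on a machine where Spark is installed. The sketch only builds the command and does not execute it.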

Understanding Run-Time Architecture: Here is what happens as soon as a Spark job is submitted. When the client submits the application code, the Spark driver converts the transformations and actions in the code into a logical DAG (directed acyclic graph). It also performs optimizations such as pipelining transformations, and then converts the logical graph into a physical execution plan made up of a set of stages. Once the execution plan is complete, the driver creates tasks, which are smaller physical execution units, under each stage. These tasks are then bundled together and dispatched to the Spark cluster.
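The step from a chain of operations to stages and tasks can be sketched in a few lines. This is a simplified toy model and not Spark's DAGScheduler: the `WIDE` set, the two functions, and the rule "start a new stage at every operation that needs a shuffle, then create one task per partition per stage" are assumptions made for illustration.

```python
# Operations assumed to need a shuffle in this toy model.
WIDE = {"reduceByKey", "groupByKey", "join"}


def split_into_stages(ops):
    """Group a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])   # a shuffle starts a new stage
        stages[-1].append(op)   # narrow operations are pipelined together
    return stages


def make_tasks(stages, num_partitions):
    """Create one (stage index, partition index) task per partition per stage."""
    return [(s, p) for s in range(len(stages)) for p in range(num_partitions)]


# A word-count style chain of operations on a hypothetical input.
ops = ["textFile", "flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(ops)
tasks = make_tasks(stages, num_partitions=3)
```

In this example the first three operations are pipelined into one stage and `reduceByKey` opens a second one, which gives two stages and six tasks for three partitions. In a real job the number of partitions can change after a shuffle, which the sketch ignores.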

The driver program is also responsible for negotiating resources with the cluster manager. Once resources are allocated, the cluster manager launches executors on the worker nodes on behalf of the driver program. The driver then sends tasks to the executors based on data placement. Before they start executing, the executors register with the driver, so that it has an overview of the executors, the tasks, the resources allocated, etc. Once this is done, the executors run their tasks under the supervision of the driver program.

The driver program also schedules future tasks based on data placement, by tracking the location of cached data. When the driver program's main() method exits, or when it calls the stop() method of SparkContext, the application terminates all the executors and releases the resources held from the cluster manager. RDDs are initially always created from input data. New RDDs are derived from existing RDDs via transformations, and the computation takes place only when an action is called on the data. Spark builds the DAG of operations by default, and when the driver runs, it converts the DAG into an executable physical plan.

Conclusion

The Apache Spark architecture is now very popular and widely used across the globe. This article gives an overview of the Spark architecture and its components, and introduces the ways in which it functions.
