Apache Spark Architecture: A Basic Guide In 2021

Ajay Ohri

Introduction

Hadoop has been, and still is, a mainstay of Big Data processing. The Apache Spark architecture complements Hadoop and has become today's technology of choice. Apache Spark is one of the largest Big Data frameworks available, with a user base exceeding 2 million, more than 200 enterprises, and over 500 code contributors. It is therefore the preferred framework for enterprises such as Tencent (social networking), Alibaba (e-commerce), and Baidu, which work on the challenges of large volumes of Big Data across various platforms. The Apache Spark suite comprises components such as Spark Streaming, Spark SQL, the Spark cluster architecture, the Spark Thrift Server, and more.

  1. Apache Spark Architecture
  2. Spark Architecture Overview
  3. The Role of the Spark Driver in Spark Architecture
  4. Understanding the Run-Time Architecture of a Spark Application

1) Apache Spark Architecture

Apache Spark has a layered architecture made up of well-defined layers and components that are loosely coupled and integrated with several libraries and extensions.

The two popular abstractions of the Spark architecture are:

  • RDD, the acronym for Resilient Distributed Dataset.
  • DAG, the acronym for Directed Acyclic Graph.

RDDs (Resilient Distributed Datasets) are collections of data items that are split into partitions and stored across the nodes of the cluster. The Apache Spark architecture supports two RDD types:

  • Parallelized collections, built from existing Scala collections.
  • Hadoop datasets, built from files stored on HDFS.

Spark RDDs support two kinds of operations, illustrated in the sketch after this list:

  • Actions
  • Transformations.
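As a concrete illustration, here is a minimal sketch of the two kinds of operations. The application name, local master URL, and sample data are illustrative assumptions and are not taken from the article.

```scala
import org.apache.spark.sql.SparkSession

object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for illustration only; the master URL is an assumption.
    val spark = SparkSession.builder()
      .appName("rdd-operations-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A parallelized collection built from an ordinary Scala collection.
    val numbers = sc.parallelize(1 to 10)

    // Transformations are lazy: they only describe new RDDs.
    val doubled = numbers.map(_ * 2)
    val large   = doubled.filter(_ > 10)

    // Actions trigger the actual computation.
    println(large.count())            // 5
    println(large.collect().toList)   // List(12, 14, 16, 18, 20)

    spark.stop()
  }
}
```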

DAG stands for Directed Acyclic Graph. It represents the sequence of computations to be performed on the data: the nodes correspond to the RDD partitions and the edges correspond to the transformations applied on top of the data. The DAG in Spark eliminates the multi-stage execution model that Hadoop MapReduce relies on, which gives Spark a performance advantage over Hadoop.

The transformations in a Spark application are both directed and acyclic:

Directed: the transformation moves the data forward, from one partitioned state (a set A) to the next (a set B).

Acyclic: the transformation is irreversible; the transformed data cannot return to its previous partitioned state.
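To make the directed and acyclic properties tangible, the following sketch builds a short chain of transformations and prints its lineage. It assumes an active SparkSession named `spark` (as in the earlier sketch); the variable names and sample data are assumptions.

```scala
// Assumes an active SparkSession named `spark`, e.g. from the sketch above.
val base    = spark.sparkContext.parallelize(1 to 100)   // set A
val shifted = base.map(_ + 1)                            // directed: A -> B
val evens   = shifted.filter(_ % 2 == 0)                 // directed: B -> C, never back to A

// The lineage printed here is the DAG: every RDD points only to its parents,
// so the graph has a direction and cannot contain cycles.
println(evens.toDebugString)
```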

2) Spark Architecture Overview

Spark architecture follows a master-slave relationship between two daemons, coordinated by the cluster manager.

The two daemons are:

  • The Master Daemon, which runs the Driver (master) process.
  • The Worker Daemon, which runs the Slave (executor) processes.

A Spark cluster always uses a single master server and may have several worker (slave) servers that process the data. The driver and executor programs each run as separate Java processes.

Users have the option to run them as:

  • A cluster of horizontal servers.
  • Separate machines forming a vertical Spark cluster.
  • A combination of both, known as a mixed configuration cluster.

3) The Role of the Spark Driver in Spark Architecture

The Spark Driver is the master node of a Spark application. It is the central entry point for the Spark shell (whether in Scala, Python, or R), it runs the application's main() function, and it is where the SparkContext is created.

Several components of the driver translate the user code into Spark jobs, which are then executed on the cluster through the DAG. These components are:

  • DAGScheduler
  • BackendScheduler
  • TaskScheduler
  • BlockManager

Role of the Driver in Spark architecture: The Spark driver program runs on the master node of the Spark cluster and negotiates and communicates with the cluster manager for cluster setup, job scheduling, job execution, and so on.

  • It translates the RDDs into an execution graph and splits that graph into several stages.
  • It stores the metadata about all the RDDs and their partitions.
  • It acts as the cockpit of job and task execution: the driver converts a user application (a job) into smaller execution units called tasks, which are then processed by the executors running in the worker processes.
  • It exposes all information about the running Spark application through the Web UI, available on port 4040 by default (a small sketch follows this list).
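As a small illustration of the last point, with an active SparkSession the driver can report where its Web UI is listening. The snippet assumes a session named `spark`, as in the earlier sketches.

```scala
// The driver hosts the application Web UI; port 4040 is the default.
println(spark.sparkContext.uiWebUrl.getOrElse("Web UI disabled"))
```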

Role of the Executor in Spark architecture: The executor is an agent responsible for executing the tasks distributed to it. Each Spark application has its own executor processes. Executors usually last for the lifetime of the Spark application; this is called "static allocation" of executors. However, users can also opt for dynamic allocation, in which executors are dynamically added or removed to match the overall workload (a configuration sketch follows the list below).

  • All data processing happens in the executor component.
  • It can read data from and write data to a variety of external sources.
  • The executor stores computed data in memory, in cache, or on disk.
  • It is responsible for interacting with the underlying storage systems.
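As a sketch of how dynamic allocation might be switched on, the configuration below uses Spark's dynamic-allocation settings. The executor counts are illustrative values, and an external shuffle service is assumed to be available on the cluster.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: executor bounds are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.shuffle.service.enabled", "true")   // required by dynamic allocation
  .getOrCreate()
```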

Role of the Cluster Manager in Spark architecture: The cluster manager is an external service used to acquire resources on the cluster and allocate them to the Spark jobs being executed. Spark uses three main types of cluster managers to manage the allocation and de-allocation of physical resources such as CPU, memory, and client jobs: the standalone cluster manager, Hadoop YARN, and Apache Mesos. Spark applications can run with any of them, either in the cloud or on-premise.

The easiest cluster manager, especially for new Spark applications and for getting started with Apache Spark, is the Standalone Cluster Manager. The choice of cluster manager depends on the application's final goal, since the scheduling capacities and behavior of the cluster managers differ. The sketch below shows how that choice appears in application code.
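The same application code can target any of the three cluster managers; only the master URL changes. The host names and ports below are placeholders, not real endpoints.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pick exactly one master URL depending on the cluster manager.
val spark = SparkSession.builder()
  .appName("cluster-manager-sketch")
  .master("spark://master-host:7077")    // standalone cluster manager
  // .master("yarn")                     // Hadoop YARN (requires HADOOP_CONF_DIR)
  // .master("mesos://mesos-host:5050")  // Apache Mesos
  .getOrCreate()
```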

4) Understanding the Run-Time Architecture of a Spark Application

Let's study the run-time behavior of the Spark architecture in more detail.

Launching the Spark Program: A single script, 'spark-submit', submits the Spark program and launches the application on the Spark cluster. Through its options, 'spark-submit' connects with the different cluster managers and controls the resources allocated to the driver and to its network of Spark executors. For example, with YARN as the cluster manager, 'spark-submit' can run the driver on a worker node inside the YARN cluster, or it can run the driver on the local machine. A minimal application that could be submitted this way is sketched below.
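This sketch shows an application that could be packaged into a JAR and launched with 'spark-submit'. The class name, file paths, and submit options shown in the comment are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Packaged into a JAR, this object is what 'spark-submit' launches, e.g.:
//   spark-submit --class WordCount --master yarn --deploy-mode cluster app.jar in.txt out
// (the jar name, paths, and options above are placeholders)
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count-sketch").getOrCreate()
    val counts = spark.sparkContext
      .textFile(args(0))               // input path supplied on the command line
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))     // output path supplied on the command line
    spark.stop()
  }
}
```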

Understanding Run-Time Architecture: Here is what happens as soon as a Spark job is submitted. When the client submits the application code, the Spark driver converts the code containing transformations and actions into a directed acyclic graph (DAG). It also applies optimizations such as pipelining transformations, and then converts the logical graph into a physical execution plan made up of stages, each holding a set of tasks. Once the physical plan is ready, the driver creates small physical execution units, called tasks, under each stage. These tasks are then bundled together and dispatched to the Spark cluster.
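A small sketch of how a job splits into stages: the reduceByKey below requires a shuffle, so Spark places it in a new stage, while the preceding map stays in the same stage as its data source. The sample data and names are assumptions, and an active SparkSession named `spark` is assumed as before.

```scala
// Assumes an active SparkSession named `spark`, as in the earlier sketches.
val logLines = spark.sparkContext.parallelize(Seq("INFO start", "ERROR disk", "ERROR net"))
val levels   = logLines.map(line => (line.split(" ")(0), 1)) // narrow transformation: same stage
val counts   = levels.reduceByKey(_ + _)                     // shuffle: a new stage boundary

counts.collect()              // the action triggers the physical plan: two stages of tasks
println(counts.toDebugString) // the lineage shows a ShuffledRDD separating the stages
```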

The driver program is also responsible for negotiating resources with the cluster manager. Once resources are allocated, the cluster manager launches the executors on the worker nodes, as requested by the driver program. At this stage, the driver negotiates data placement with the cluster manager, and all executors register with the driver so that it has an overview of the tasks, executors, and allocated resources. Once this is done, the executors run their tasks under the supervision of the driver program.

The Spark driver program also tracks the location of cached data and uses it to schedule future tasks and data placement on the executors. Whenever the driver program calls stop() on the SparkContext or exits its main() method, the application shuts down the executors and releases the resources held by the cluster manager. When an action is performed on the data, the initial RDDs are always created from the input data, new RDDs are derived from the existing ones through transformations, and only then does the action take place. The DAG is always created by default, and the driver converts it into a physical, executable plan.

Conclusion

The Apache Spark framework is now very popular and widely used across the globe. This article gave an overview of the Spark architecture and its components, and introduced the various ways in which it functions.

If you are interested in making a career in the Data Science domain, our 11-month in-person PGDM in Data Science course can help you immensely in becoming a successful Data Science professional.
