Hadoop was, and still is, widely used for Big Data processing. Apache Spark complements Hadoop and is today's technology of choice for many workloads. Spark is one of the largest Big Data frameworks available, with a user base reported to exceed 2 million members, 200 enterprises, and 500 code contributors. It is a preferred framework for enterprises such as Tencent (the social networking company), Alibaba (the e-commerce giant), and Baidu, which handle large volumes of data across various platforms. The Spark suite includes Spark Streaming, Spark SQL, Spark's cluster-management layer, the Spark Thrift Server, and more.
Apache Spark has a well-defined layered architecture in which the layers and components are loosely coupled and integrated with several libraries and extensions.
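The two main abstractions of the Spark architecture are the Resilient Distributed Dataset (RDD) and the Directed Acyclic Graph (DAG).
Resilient Distributed Datasets (RDDs) are collections of data items split into partitions and stored on the nodes of the cluster. Spark supports two types of RDDs: Hadoop datasets, which are created from files stored in HDFS or other supported storage, and parallelized collections, which are created from existing collections in the driver program.
RDDs support two kinds of operations: transformations, which define a new RDD from an existing one, and actions, which compute a result and return it to the driver program.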
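The difference between the two kinds of RDD operations, transformations that only describe a new dataset and actions that actually compute a result, can be illustrated with a small pure-Python model. This is a toy sketch, not Spark's API: the class `ToyRDD` and its internals are hypothetical, and only the method names `map`, `filter`, `collect`, and `count` mirror real RDD operations.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)  # pending (kind, function) pairs, not yet executed

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._data, self._ops + (("filter", fn),))

    # Actions: run every recorded transformation and return a result.
    def collect(self):
        items = self._data
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return list(items)

    def count(self):
        return len(self.collect())


numbers = ToyRDD(range(1, 6))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(evens_squared.collect())  # [4, 16]
print(evens_squared.count())    # 2
```

Nothing is computed when `map` and `filter` are called; work happens only inside `collect` and `count`, which is the behavior the transformation/action split describes.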
DAG stands for Directed Acyclic Graph. In Spark, a DAG is a sequence of computations performed on data, in which each node is an RDD partition and each edge is a transformation applied on top of the data. The DAG abstraction removes the need for the multi-stage execution model of Hadoop MapReduce and gives Spark a performance advantage over Hadoop.
The two words in the name describe how transformations behave in the graph.
Directed – A transformation moves a data partition from one state to another, from A to B.
Acyclic – A transformation cannot return to an earlier partition state, so the graph never loops back on itself.
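To make the "directed" and "acyclic" properties concrete, here is a minimal sketch that uses Python's standard `graphlib` module (Python 3.9+). The dataset names in `lineage` are hypothetical, and the dictionary is only a stand-in for the graph Spark builds internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
lineage = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
}

# Directed: the edges impose an order in which datasets can be computed.
order = list(TopologicalSorter(lineage).static_order())
print(order)  # ['lines', 'words', 'pairs', 'counts']


def is_acyclic(graph):
    """Return True if the dependency graph contains no cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# Acyclic: making 'lines' depend on 'counts' would create a loop.
cyclic = dict(lineage, lines={"counts"})
print(is_acyclic(lineage))  # True
print(is_acyclic(cyclic))   # False
```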
Spark uses a master/slave architecture built from two daemons and a cluster manager.
The two daemons are the master daemon (the master/driver process) and the worker daemon (the slave process).
A Spark cluster has a single master and any number of workers that process the data. The driver and each executor run as separate Java processes.
Users can run them on the same machine, on separate machines, or in a mixed configuration.
The Spark driver runs on the master node of a Spark application. It is the central entry point of the Spark shell (Scala, Python, and R). The driver program runs the application's main() function and is where the SparkContext is created.
The driver contains several components that translate user code into Spark jobs, which are then executed on the cluster. The components are the DAGScheduler, the TaskScheduler, the SchedulerBackend, and the BlockManager.
Role of the Driver: The driver program runs on the master node of the Spark cluster and communicates with the cluster manager to negotiate resources, schedule jobs, and coordinate their execution.
Role of the Executor: An executor is a distributed agent responsible for executing tasks. Each Spark application has its own executor processes. Executors usually run for the entire lifetime of the application, an arrangement called static allocation of executors. Users can instead choose dynamic allocation, in which executors are added or removed to match the overall workload.
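The idea behind dynamic allocation can be sketched as a simple sizing rule: request enough executors for the pending tasks, but stay within a configured minimum and maximum. The function below is a simplified illustration, not Spark's actual algorithm; `target_executors` and its parameters are hypothetical names, although the bounds play the role of the real settings `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`.

```python
import math


def target_executors(pending_tasks, tasks_per_executor, min_executors, max_executors):
    """Toy sizing rule: executors needed for the backlog, clamped to the bounds."""
    needed = math.ceil(pending_tasks / tasks_per_executor)
    return max(min_executors, min(needed, max_executors))


print(target_executors(0, 4, 1, 10))    # 1  (idle: shrink to the minimum)
print(target_executors(18, 4, 1, 10))   # 5  (scale with the backlog)
print(target_executors(100, 4, 1, 10))  # 10 (capped at the maximum)
```

With static allocation the number would stay fixed for the whole application regardless of the backlog.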
Role of the Cluster Manager: The cluster manager is an external service that acquires resources on the cluster and allocates them to Spark jobs. It handles the allocation and de-allocation of physical resources such as CPU and memory for client jobs. Spark applications can run on three main types of cluster manager: Apache Mesos, Hadoop YARN, and the standalone cluster manager. All of them can run in the cloud or on-premises.
The standalone cluster manager is the easiest to use and is a good starting point for new Spark applications. The choice of cluster manager depends on the goals of the application, because the cluster managers differ in their scheduling capabilities.
Let’s study the Spark architecture in detail.
Launching the Spark Program: The spark-submit script is the single command used to submit a Spark program and launch the application on a cluster. Through its options, spark-submit can connect to different cluster managers and control the resources the application receives, such as the number of executors and the memory given to each. For example, with YARN as the cluster manager, spark-submit can run the driver inside the YARN cluster on a worker node, whereas in client mode the driver runs on the local machine that issued the command.
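As an illustration of how such a launch command is put together, the sketch below assembles a spark-submit argument list in Python. The helper `build_spark_submit`, its default values, and the application file name `my_app.py` are hypothetical; the flags `--master`, `--deploy-mode`, `--num-executors`, and `--executor-memory` are real spark-submit options.

```python
import shlex


def build_spark_submit(app_path, master="yarn", deploy_mode="cluster",
                       num_executors=2, executor_memory="2g"):
    """Assemble a spark-submit argument list (nothing is executed here)."""
    return [
        "spark-submit",
        "--master", master,              # which cluster manager to connect to
        "--deploy-mode", deploy_mode,    # 'cluster': driver runs inside the cluster
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_path,
    ]


cmd = build_spark_submit("my_app.py")
print(shlex.join(cmd))
# spark-submit --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 2g my_app.py
```

Passing `deploy_mode="client"` instead would keep the driver on the submitting machine, which is the other case described above.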
Understanding Run-Time Architecture: Here is what happens as soon as a Spark job is submitted. When the client submits the application code, the driver converts the transformations and actions in that code into a logical DAG (directed acyclic graph). It also applies optimizations such as pipelining transformations, and then converts the logical graph into a physical execution plan made up of a set of stages. Once the physical plan is ready, the driver creates tasks, the smallest physical units of execution, under each stage. These tasks are then bundled together and sent to the Spark cluster.
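The step from a chain of operations to a set of stages can be sketched as follows. This is a deliberately simplified model that assumes a linear chain of operations and assumes that a new stage starts at every operation requiring a shuffle; the set `WIDE` and the function `split_into_stages` are hypothetical names, not part of Spark.

```python
# Operations assumed (for this sketch) to require a shuffle across the cluster.
WIDE = {"reduceByKey", "groupByKey", "join"}


def split_into_stages(operations):
    """Group a linear chain of operations into stages, cutting before each shuffle."""
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(op)
    return [stage for stage in stages if stage]


plan = ["textFile", "flatMap", "map", "reduceByKey", "filter"]
for number, stage in enumerate(split_into_stages(plan)):
    print(number, stage)
# 0 ['textFile', 'flatMap', 'map']
# 1 ['reduceByKey', 'filter']
```

The operations grouped into one stage are the ones that can be pipelined and run together as tasks over each partition.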
The driver program is also responsible for negotiating resources with the cluster manager. Once resources are allocated, the cluster manager launches executors on the worker nodes on behalf of the driver. The driver then assigns tasks based on data placement. Before they start executing, the executors register with the driver so that it has a complete view of the executors, the tasks, and the allocated resources. The executors then run their tasks under the supervision of the driver program.
The driver program also schedules future tasks based on data placement, by tracking the location of cached data. When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, the driver terminates all executors and releases the resources held through the cluster manager. Within an application, RDDs are first created from input data, and new RDDs are derived from existing ones through transformations. When an action is invoked on an RDD, the driver builds the DAG by default and converts it into an executable physical plan.
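The locality-aware scheduling mentioned above, where the driver prefers executors that already hold the cached data, can be illustrated with a toy placement rule. The function `pick_executor`, the executor names, and the `cached_on` mapping are all hypothetical, and the real scheduler considers more factors than this sketch does.

```python
def pick_executor(partition_id, cached_on, executors):
    """Prefer an executor that caches the partition; otherwise take the first one."""
    for executor in executors:
        if partition_id in cached_on.get(executor, set()):
            return executor
    return executors[0]


executors = ["exec-1", "exec-2", "exec-3"]
cached_on = {"exec-2": {0, 1}, "exec-3": {2}}  # partitions cached per executor

assignments = {p: pick_executor(p, cached_on, executors) for p in range(4)}
print(assignments)
# {0: 'exec-2', 1: 'exec-2', 2: 'exec-3', 3: 'exec-1'}
```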
Apache Spark is now very popular and widely used across the globe. This article gave an overview of the Spark architecture, its components, and the ways in which they work together.