Oozie – the workflow scheduler for Hadoop – is perhaps the only major component in the Hadoop ecosystem that does not work on or handle data directly, whether by ingestion or by processing. Let us look at what it does, and where and how it is used, through a production-scenario case study.
Consider a typical Hadoop project in production. Data ingestion, i.e. loading data into HDFS, accounts for almost 50% of the time and effort of many projects, for various reasons. In most enterprises, data comes into the Big Data systems from multiple sources. In the Insurance domain, for instance, policy data can arrive through the normal sales channels or from online transactions via the website. Likewise, other details such as payments or claim requests can come from different sources too.
Since data comes from different channels, there are bound to be discrepancies and inconsistencies in the data fields and content. Some sources may send additional fields that need to be filtered out. Fields from some channels may contain junk or control characters that need to be removed. Another common inconsistency: one source may populate a field, say Gender, with "F" or "M", while another fills the same field with the full forms "Female" and "Male". All of this needs to be cleaned up and the inconsistencies ironed out before the data is ready to be loaded for downstream analytics applications.
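To make the cleaning step concrete, here is a minimal shell sketch of the kind of normalisation described above. The field layout ("id,gender,amount") and the record values are hypothetical, not taken from the case-study dataset.

```shell
#!/bin/sh
# Hypothetical cleaning sketch: strip control characters and
# normalise a Gender field that arrives as either "F"/"M" or
# "Female"/"Male".
clean_record() {
    # Delete control characters (keeping tab and newline), then
    # rewrite the long Gender forms to the short ones. "Female" must
    # be replaced before "Male", since "Male" is a substring of it.
    printf '%s\n' "$1" \
        | tr -d '\000-\010\013-\037' \
        | sed -e 's/Female/F/g' -e 's/Male/M/g'
}

clean_record 'P1001,Male,2500'      # prints P1001,M,2500
clean_record 'P1002,Female,1800'    # prints P1002,F,1800
```

In the pipeline described below, light record-level fixes like this belong in the first (Shell) action; heavier restructuring is left to Pig.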
An efficient, robust data pipeline that takes the incoming data and then cleans, extracts and transforms it into a form that can be loaded into Hadoop for downstream operations is essential for the smooth functioning of the entire project, much as in a Data Warehouse application.
Oozie allows us to build a workflow for these operations as a Directed Acyclic Graph (DAG), and its Coordinator functionality allows us to schedule that workflow to be triggered at a specific time or on a specific event, such as the availability of data files.
We at Jigsaw Academy have taken a typical production scenario from the Auto Insurance domain, with a dataset of 9000+ records, for this case study. The dataset contains two types of data:
After looking at sample records from both data files, we decided to:
So our workflow basically consists of three actions for these cleaning and ETL tasks, as shown below.
Shell Script → Pig Script → Hive Queries
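Wired into an Oozie workflow definition, this chain might look roughly like the sketch below. The app name, script names (`cleanse.sh`, `transform.pig`, `load.hql`) and the `fail` node are illustrative placeholders, not taken from the case study.

```xml
<!-- workflow.xml (sketch): a Shell -> Pig -> Hive chain -->
<workflow-app name="ingest-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="cleanse"/>
    <action name="cleanse">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>cleanse.sh</exec>
            <file>${appPath}/cleanse.sh#cleanse.sh</file>
        </shell>
        <ok to="transform"/>
        <error to="fail"/>
    </action>
    <action name="transform">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>transform.pig</script>
        </pig>
        <ok to="load"/>
        <error to="fail"/>
    </action>
    <action name="load">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Ingestion failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action declares where control flows on success (`ok`) and on failure (`error`), which is how the DAG is expressed.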
This sequence of actions needs to run as a workflow for each of the two files as and when they arrive. Once a month, both files may arrive at the same time. To handle this scenario, instead of scheduling a second workflow with the same actions for the two files, we have used the Decision control node and the Fork and Join control nodes of Oozie Workflow.
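Inside the workflow definition, the decision and fork/join nodes might be sketched as follows. The node names and the `fileAAvailable`/`fileBAvailable` properties are illustrative; in practice such flags would be passed in via job.properties or computed by the Coordinator.

```xml
<!-- Fragment of workflow.xml (sketch): choose one file, the other,
     or process both in parallel -->
<start to="which-files"/>

<decision name="which-files">
    <switch>
        <!-- both files present: fork and process them in parallel -->
        <case to="do-both">${fileAAvailable == "true" and fileBAvailable == "true"}</case>
        <case to="process-A">${fileAAvailable == "true"}</case>
        <default to="process-B"/>
    </switch>
</decision>

<fork name="do-both">
    <path start="process-A"/>
    <path start="process-B"/>
</fork>

<!-- process-A and process-B are each a Shell -> Pig -> Hive chain;
     when forked, both branches must transition to the join node -->
<join name="done-both" to="end"/>
```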
A question may come up: why not use cron, the scheduler that ships with Linux, or another scheduling tool such as Autosys, which is available in most mid-to-large data-flow environments? Unlike these general-purpose schedulers, Oozie is Hadoop-aware: it submits work as native Hadoop actions, can trigger on the availability of data in HDFS rather than only on the clock, and expresses retries and DAG dependencies between actions directly.
To implement an Oozie Workflow as well as a Coordinator, we just need to define two files for each: an XML definition (workflow.xml or coordinator.xml) and a job.properties file that supplies its parameters.
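A Coordinator that fires when the data files land in a directory might be sketched as below. All host names, paths, dates and the `_SUCCESS` done-flag here are placeholders, not values from the case study; `${nameNode}` and friends would come from job.properties.

```xml
<!-- coordinator.xml (sketch): trigger the workflow when the day's
     input directory appears in HDFS -->
<coordinator-app name="ingest-coord" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="incoming" frequency="${coord:days(1)}"
                 initial-instance="2016-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/incoming/${YEAR}${MONTH}${DAY}</uri-template>
            <!-- wait until the producer writes this marker file -->
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="incoming">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/user/etl/app/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>
```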
The Workflow is built, and the Coordinator is defined to trigger it on the event of the data files becoming available in a specified directory. The workflow is as shown in the chart below. Oozie's Web UI depicts the workflow graphically in a similar fashion, highlighting the path that is being executed.
Oozie is tailor-made for scheduling workflows on Hadoop, and is eminently suitable not only for data ingestion but also for downstream descriptive-analytics operations, such as generating standard reports.
Working through and implementing this case study gives you sufficient understanding of and familiarity with Oozie and its features to be production-ready.