As storage technologies and algorithms for handling massive data matured and big data emerged, built on the school of thought that any and every piece of data is potentially useful, the idea of the data lake took shape. A data lake is just what the name conveys: a huge centralized pool into which structured and unstructured data can be dumped, with appropriate pipelines in place to tap this data at massive scale.
The traditional process of deciding up front which data is vital to the business, and how much of it is enough to be meaningful, is discarded here: you store your data as it is, without any predefined structure. This raw data is then available to tools that can work on semi-structured and unstructured data in addition to structured data, providing insights through dashboards, visualizations and reports, generated even in real time, that help management make timely decisions. This scale of data is what machine learning and artificial intelligence thrive on.
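This store-as-is, structure-later approach is often called schema-on-read. The sketch below illustrates it with a hypothetical JSON-lines file standing in for the raw store: heterogeneous records are dumped without any upfront schema, and structure is imposed only when an analyst queries them.

```python
import json
import os
import tempfile

# Hypothetical lake path and record shapes, for illustration only.
lake_dir = tempfile.mkdtemp()
raw_file = os.path.join(lake_dir, "events.jsonl")

# Ingest: dump heterogeneous records exactly as they arrive, no upfront schema.
incoming = [
    {"type": "tweet", "text": "loving the new release", "user": "a1"},
    {"type": "log", "level": "ERROR", "msg": "disk full"},
    {"type": "order", "id": 42, "amount": 19.99},  # structured record
]
with open(raw_file, "w") as f:
    for record in incoming:
        f.write(json.dumps(record) + "\n")

# Analyze: impose structure only at read time (schema-on-read).
def read_lake(path, record_type):
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r.get("type") == record_type]

orders = read_lake(raw_file, "order")
print(orders[0]["amount"])  # -> 19.99
```

Note that the cost of understanding the data is deferred to read time, which is exactly the trade-off the data lake makes against the warehouse.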
So when we say unstructured data, what does it encompass? It can be binary data like videos and images. Semi-structured data includes tweets, comments, logs, blogs, JSON files and plain text. Structured data, obviously, comes from sources like relational databases, neatly stacked in rows and columns. James Dixon, the chief technology officer at Pentaho, coined the term data lake when proposing it as an alternative to the data marts prevalent at that time, and even now, in data warehouse based architectures.
What problem does a data lake solve?
Data lakes were conceptualized based on the idea that any data may have a value worth mining information out of. Data lakes are used by data scientists and analysts to mine and analyze large amounts of Big Data.
To truly understand data lake architecture, let's first look at how an EDW, or traditional data warehouse, is structured.
A high-level representation of a traditional EDW.
The idea behind a data warehouse is to extract, transform and load structured data that is perceived to hold useful information for the business. At the end of the process sits the presentation layer, with reporting, visualization and analysis tools that work on data from data marts or OLAP cubes to provide insights into the business domain.
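The defining trait of this pipeline is that transformation happens before loading (schema-on-write). The minimal sketch below uses an in-memory SQLite table as a stand-in warehouse; the table layout and cleansing rules are hypothetical, but the pattern is the one described above: only rows that conform to the predefined schema ever reach the warehouse.

```python
import sqlite3

# Stand-in warehouse with a predefined schema (hypothetical table layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Extract: rows as they arrive from a source system.
raw_rows = [
    ("north", "100.5"),
    ("SOUTH", "200"),
    ("north", None),  # dirty row: rejected during transform
]

# Transform: cleanse and conform the data BEFORE it is loaded.
clean = [(region.lower(), float(amount))
         for region, amount in raw_rows
         if amount is not None]

# Load: only conformed rows enter the warehouse.
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # -> 300.5
```

All of this modelling and cleansing work has to happen up front, which is precisely what becomes a bottleneck at big data scale, as the next paragraphs explain.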
When it comes to scaling up to Big Data levels, this infrastructure faces some stiff challenges.
The time it takes to first understand the data (identify the source system, its structure and cardinality), then perform data modelling based on business requirements, data cleansing, exploratory data analysis and so on, is a poor fit for the volume of data that big data brings in and the speed at which it arrives.
With the exponential proliferation of data all around us, the definition of analyzable data is no longer the same, and an alternative model like the data lake makes much more sense.
So here is the data lake conceptual architecture.
Data lake architecture at the conceptual level.
All structured and unstructured data is brought into a raw data store through an extract and load process; note that there is no transformation step. The storage here is commodity hardware-based, which is economical to procure at this scale. The analytical sandbox is a logical module used for exploratory data analysis and for applying data science tools to build models, including predictive ones. All data in the system is catalogued and curated under secure governance policies. Notice also the real-time processing engine, which handles data streams that must be acted on immediately for their value to be realized.
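The extract-and-load step plus cataloguing can be sketched as follows. The catalog fields (source, path, size, checksum) are illustrative assumptions, not a standard; real lakes delegate this to tools such as Hive Metastore or AWS Glue. The key point is that the payload is stored untouched while a catalog entry makes it findable and governable later.

```python
import hashlib
import time

# In-memory stand-in for a data catalog (real lakes use a metastore).
catalog = []

def land_raw(source: str, payload: bytes) -> dict:
    """Store the payload untouched and record a catalog entry."""
    path = f"/lake/raw/{source}/{int(time.time())}.bin"  # hypothetical path
    entry = {
        "source": source,
        "path": path,
        "bytes": len(payload),
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    catalog.append(entry)  # curation: every landed object stays discoverable
    return entry

entry = land_raw("clickstream", b'{"page": "/home", "user": "u7"}')
print(entry["bytes"])  # -> 31
```

Because there is no transform step, ingestion is cheap and fast; the governance burden shifts to the catalog, which is why curation is called out as essential in the architecture above.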
Data lakes provide a solution suited to how massively the data landscape has changed since the EDW era began.
Let’s go through the differences between the approaches.
Data lake implementations can be deployed stagewise, in an agile fashion:
1. Set up a basic data lake that accepts raw data from all identified sources.
2. Apply data science tools and methodologies to build analytical data models.
3. Run it in parallel with the enterprise data warehouse, storing low-priority data as well.
4. Take over from the data warehouse, making the data lake the single source of all enterprise data.
The most valuable advantage of data lakes is the flexibility they offer in adjusting to changing business requirements and changing data ecosystems. Other major benefits are listed here.
With ever-increasing amounts of data, it is important that your business has the right setup to handle this avalanche. Data lakes make it possible to scale to enterprise level and beyond.
Tools from the big data ecosystem, such as Kafka and Flume, can acquire data arriving at great speed and volume. Data lakes can use these tools to pool in all that data for further analysis.
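A common pattern for absorbing such high-velocity streams is micro-batching: buffer incoming events and flush them to the lake in chunks. The sketch below simulates the stream with a generator so it is self-contained; a real deployment would read from a Kafka topic (for example via kafka-python's KafkaConsumer) and write each batch as a file in the raw store.

```python
def event_stream(n):
    """Stand-in for a Kafka topic: yields raw event strings."""
    for i in range(n):
        yield f'{{"event_id": {i}}}'

def ingest(stream, batch_size):
    """Buffer events and flush them to the lake in micro-batches."""
    batches, buffer = [], []
    for event in stream:
        buffer.append(event)
        if len(buffer) == batch_size:
            batches.append(buffer)  # in practice: write a file to the lake
            buffer = []
    if buffer:
        batches.append(buffer)      # flush the partial final batch
    return batches

batches = ingest(event_stream(10), batch_size=4)
print(len(batches))  # -> 3 (two full batches of 4, one partial batch of 2)
```

Batching amortizes write overhead, which is what lets commodity storage keep up with stream velocity.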
Exploratory data analysis and modelling take place when and where they are needed, unlike in the DWH architecture, where this work has to be done even before data is ingested.
Data lakes allow data to be stored schema-less, which is a useful feature at ingestion time, particularly for high-speed sources.
The large volumes of data are ideal for machine learning based analytical tools, which analyse large sets and produce accurate statistical models for use in predictive analytics.
Data silos are a major bone of contention in the DWH world, but in data lakes you always access the same data in one unified view, available to all users with the right credentials.
For large enterprises that intend to capture every bit of data flowing through their business, extract value from it, and stay ahead of competitors, data lakes are the way to go.
If you are interested in making a career in the Data Science domain, our 11-month in-person Postgraduate Certificate Diploma in Data Science course can help you immensely in becoming a successful Data Science professional.