There are several models that data can be fit into for a thorough analysis. But before you do so, you have to determine which model is an ideal fit for the data at hand. To do that, you explore the data, its shape and its characteristics, and come up with a summary that describes its current state and whether it needs further processing before it can be modeled with statistical and scientific techniques. This exploration of data, usually with the help of descriptive statistics, visualization tools, and presentation techniques, makes up what we call Exploratory Data Analysis, or EDA, in data science.
EDA in data science is quite like the service advisor doing a rough inspection of your car, asking a few preliminary questions, setting expectations and then taking the car in for service. It is one of the first things done with the data, so it is a critical phase, as many inferences and consequent actions depend on this exploration.
Exploratory Data Analysis (EDA) is the initial analysis of data supplied or extracted, using descriptive statistics and visualization tools, to understand the trends, underlying limitations, quality, patterns, and relationships between the various entities within the data set. EDA will give you a fair idea of which model better fits the data and whether any data cleansing and massaging might be required before taking the data through advanced modelling techniques or even Machine Learning and Artificial Intelligence algorithms.
EDA stands for Exploratory Data Analysis. Any firm that works with data benefits from it. It enables data scientists to examine the data before making any assumptions, and it helps ensure that the findings are reliable and relevant to the objectives and outcomes of the company.
Significance of using EDA in data science for analyzing data sets is:
OK, time for some examples that might give you an idea of what EDA really entails, what you are looking for, and what questions you are trying to answer.
With data come a lot of anomalies. One of them is missing data. Although the overall data might be good, some columns within the data set might be missing values. This can skew your results and lead to an inaccurate model.
One great way to identify missing values visually is the missingno package in Python. It is most useful for a large data set, where it gives you a graphical picture of how much data is missing and in which variables.
Without going into the details of the coding, the above graphic was achieved with a single line of code. The white lines in the above graph indicate missing values.
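As a non-graphical complement to the plot, missing values can also be counted per column. The sketch below uses only the Python standard library; the sample records and the helper names `missing_counts` and `missing_share` are made up for illustration and are not from the original example. In pandas, `DataFrame.isna().sum()` gives the same kind of per-column count.

```python
# Hypothetical sample records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": None, "city": "Mumbai"},
]


def missing_counts(records):
    """Return {column: number of rows where the value is None}."""
    columns = records[0].keys()
    return {col: sum(1 for r in records if r.get(col) is None) for col in columns}


def missing_share(records):
    """Return {column: fraction of rows where the value is missing}."""
    total = len(records)
    return {col: n / total for col, n in missing_counts(records).items()}


counts = missing_counts(rows)
shares = missing_share(rows)
for col in counts:
    print(f"{col:<8} missing={counts[col]}  share={shares[col]:.0%}")
```

A column with a high share of missing values is a candidate for imputation or removal before modeling.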
Another example is summary statistics, which give you a fair idea of your numeric data.
As shown above, the columns are features of the data set, and the statistics in the left column describe each one.
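A table like this can be approximated for a single numeric column with Python's built-in `statistics` module. This is a minimal sketch, not the code behind the table above: the function name `describe`, the sample values, and the choice of the `"inclusive"` quantile method are all assumptions made for illustration.

```python
import statistics


def describe(values):
    """Summary statistics for a list of numbers, skipping None entries."""
    data = sorted(v for v in values if v is not None)
    # Quartiles by linear interpolation between the sorted data points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": data[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": data[-1],
    }


# Hypothetical feature with one missing value.
stats = describe([1, 2, None, 3, 4, 5])
for name, value in stats.items():
    print(f"{name:>5}: {value}")
```

Running the same function over every numeric feature gives one column of the summary table per feature.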
Outliers are data points that lie at the extreme of, or even outside, the range of values that a variable should normally hold, thereby giving you a hint or an opportunity to explore.
You can immediately notice an outlier: a good percentage of customers are buying more than 50 products. An investigation can be initiated and, in many cases, these customers turn out to be resellers. This can be seen as an opportunity to develop a B2B relationship with the resellers and grow it as a separate vertical in the business.
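One common way to flag such values programmatically is the interquartile range (IQR) rule, which marks anything more than 1.5 IQRs beyond the quartiles. The article does not say which rule was used for the chart, so treat this as one possible sketch. The purchase counts and the function name `iqr_outliers` are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical number of products bought per customer.
products_per_customer = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 55, 60]
outliers = iqr_outliers(products_per_customer)
print(outliers)
```

The multiplier `k` is a convention rather than a law. A larger value flags only the most extreme customers.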
There are broadly two categories of EDA, graphical and non-graphical. These two are further divided into univariate and multivariate EDA, based on interdependency of variables in your data.
Univariate non-graphical: Here, the data features a single variable, and the EDA is done in mostly tabular form, for example, summary statistics. These non-graphical analyses give you a statistic that indicates how skewed your data might be or which is the dominant value for your variable if any.
Univariate graphical: The EDA here involves graphic tools like bar charts and histograms to get a quick view of how the values of a variable are distributed and how its categories stack up against each other.
Multivariate non-graphical: Non-graphical methods like crosstabs are used to depict the relationship between two or more variables. Statistical values like the correlation coefficient indicate whether there is a possible relationship and how strong it is.
Multivariate graphical: A graphical representation always gives you a better understanding of the relationship, especially among multiple variables.
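The two non-graphical categories can be sketched with the standard library alone. The first half of the block computes a dominant value and a skewness figure for one variable, and the second half builds a small crosstab for two variables. The data, the variable names, and the helper functions `skewness` and `crosstab` are all hypothetical. The skewness formula used is the simple moment-based one, which is only one of several definitions in use.

```python
import statistics
from collections import Counter


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


def crosstab(pairs):
    """Count how often each (row_value, column_value) combination occurs."""
    counts = Counter(pairs)
    row_keys = sorted({r for r, _ in pairs})
    col_keys = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in col_keys} for r in row_keys}


# Univariate non-graphical: one variable at a time (made-up order sizes).
order_sizes = [1, 2, 2, 2, 3, 3, 10]
dominant = statistics.mode(order_sizes)
skew = skewness(order_sizes)
print(f"dominant value={dominant}, skewness={skew:.2f}")

# Multivariate non-graphical: two categorical variables (made-up pairs).
orders = [
    ("retail", "online"), ("retail", "store"), ("retail", "online"),
    ("reseller", "online"), ("reseller", "online"), ("retail", "store"),
]
table = crosstab(orders)
for segment, row in table.items():
    print(segment, row)
```

pandas provides `pandas.crosstab` for the same purpose on real data frames.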
The most commonly used software tools to perform EDA are Python and R.
Both enjoy massive community support and frequent updates to the packages that can be used for EDA. Let's look at the various graphical instruments that can be used to carry out an EDA.
Box plots are used where there is a need to summarize data on an interval scale like the ones on the stock market, where ticks observed in one whole day may be represented in a single box, highlighting the lowest, highest, median and outliers.
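The numbers behind each box can be computed before anything is drawn. The following sketch groups ticks by day and derives the lowest, highest, median and quartile values for each day. The tick prices and the function name `box_summary` are invented for illustration, and a plotting library would normally do this step for you.

```python
import statistics
from collections import defaultdict

# Hypothetical (day, price) ticks.
ticks = [
    ("Mon", 100.0), ("Mon", 101.5), ("Mon", 99.0), ("Mon", 102.0), ("Mon", 100.5),
    ("Tue", 102.0), ("Tue", 103.0), ("Tue", 101.0), ("Tue", 104.0), ("Tue", 110.0),
]


def box_summary(values):
    """Five-number summary that a box plot draws for one group."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"low": data[0], "q1": q1, "median": median, "q3": q3, "high": data[-1]}


# Collect all prices observed on each day.
by_day = defaultdict(list)
for day, price in ticks:
    by_day[day].append(price)

# One summary (one box) per day.
boxes = {day: box_summary(prices) for day, prices in by_day.items()}
for day, box in boxes.items():
    print(day, box)
```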
Heatmaps are most often used for the representation of the correlation between variables. Here is an example of a heatmap.
As you can see from the chart, there is a strong correlation between density and residual sugar and absolutely no correlation between alcohol and residual sugar.
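A heatmap of this kind is a colored rendering of a correlation matrix. The sketch below computes pairwise Pearson correlations for three columns. The numbers are made up so that one pair is perfectly correlated and another is uncorrelated; they are not taken from the data set behind the chart. The helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: correlation}} for every pair of columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Made-up values chosen to show a strong and a zero correlation.
columns = {
    "residual_sugar": [1.0, 2.0, 3.0, 4.0, 5.0],
    "density": [0.991, 0.992, 0.993, 0.994, 0.995],
    "alcohol": [12.0, 11.0, 10.0, 11.0, 12.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate none. A heatmap simply maps these numbers to colors.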
The histogram is the graphical representation of numerical data that splits the data into ranges. The taller the bar, the greater the number of data points falling in that range. A good example here is the height data of a class of students. You would notice that the height data for a particular class looks like a bell curve, with most of the data lying within a certain range and a few points outside it. There will be outliers too, either very short or very tall.
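The binning step behind a histogram can be sketched in a few lines. The heights below are invented, and the function name `histogram`, the 10 cm bin width and the 140 cm starting point are arbitrary choices made for illustration.

```python
from collections import Counter


def histogram(values, bin_width, start):
    """Count integer values per bin, keyed by each bin's lower edge."""
    counts = Counter(
        start + ((v - start) // bin_width) * bin_width for v in values
    )
    return dict(sorted(counts.items()))


# Hypothetical student heights in centimetres.
heights = [148, 152, 155, 157, 158, 160, 161, 162, 163, 165, 166, 168, 171, 175, 183]
bins = histogram(heights, bin_width=10, start=140)

# Text rendering: one '#' per student in the bin.
for lower, count in bins.items():
    print(f"{lower}-{lower + 9}: {'#' * count}")
```

The middle bin holds the most students and the counts fall away on both sides, which is the bell shape described above.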
The main goal of exploratory data analysis in Python is to identify the underlying structure of the data: the trends, patterns, and connections between the different data sets, based on how they are organized. A firm has to take a thorough, analytical look at the data set before it can draw any conclusions or make any assumptions from the vast amount of data.
Exploratory Data Analysis enables Data Scientists to identify flaws, disprove hypotheses, and choose an effective prediction model.
EDA in Python aims to deliver precise results from the data set while also enabling the Data Scientist to get deeper insight into it.
Exploratory Data Analysis is essential in the analysis of massive data sets to ensure that you have the right data for the chosen statistical model. You certainly would not want to find out at a later stage that the data is not a good fit for the statistical model you are trying to build. A sound EDA must be performed before any data mining, data analysis, or data modeling occurs.