Data is proliferating everywhere around us, and despite better data-capture systems, a great deal of it is still incomplete, inconsistent, or downright incorrect. Consequently, data freshly sourced from various streams is not yet ready to be run through your Machine Learning algorithms. Why? Because all the churning that specialized software does to extract meaningful information from your data comes to nothing if the raw data it is based on is of poor quality: it will skew your results and lead to incorrect decisions.
Precisely for this reason, a data quality plan must be drawn up: one that monitors data and cleans it by removing, reducing, or correcting problem records. Although it receives less attention than modelling, data cleaning is an essential phase before any advanced analytical processing is done on the data. What follows covers the definition of data quality, common quality issues in your data, how to find them, how to fix them, and why Python is well suited to data cleaning.
Consider data to sit at the very bottom of a pyramid whose usefulness increases as you move up. Each step takes input from the step below and refines it, with high-quality, actionable knowledge sitting at the top. If the data supplied to this system is of low quality, the knowledge derived from it will be of no use to the business.
There are several ways in which the quality of data can be compromised within a data set. A data cleaning exercise detects these quality issues and takes corrective action before the data is fed to the layers above that refine it into information and knowledge. In the content ahead, we shall go through a few common data quality issues, how to detect them using Python, and ways to deal with them so that relatively clean data comes out the other end.
Data cleaning is the process of detecting inaccurate or corrupt records in a data set and taking measures to correct or remove the anomalies. It might involve removing errors, or validating and correcting values against a known list of entities. Data enhancement, where data is made more complete by adding related information, is a common practice in data cleaning. Other techniques include harmonization, which brings together varying representations of the same data and transforms them into one homogeneous form, such as converting abbreviations like Rd, Road, and rd to the consistent word "road".
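As a quick illustrative sketch of harmonization, the snippet below collapses common street-name abbreviations onto a single token. The sample addresses and the `abbrev_map` dictionary are hypothetical, not from any particular data set:

```python
import pandas as pd

# Illustrative data: the same street type spelled three different ways
addresses = pd.Series(["12 Baker Rd", "34 Elm Road", "56 Oak rd"])

# Map varying abbreviations onto one homogeneous token;
# \b word boundaries keep "rd" inside other words untouched
abbrev_map = {r"\b(Rd|rd|Road)\b": "road"}
harmonized = addresses.replace(abbrev_map, regex=True)
print(harmonized.tolist())
```

The regex-based `replace` keeps the rest of each address intact while normalizing only the street-type token.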
The strong Python community offers a robust set of libraries and tools that can easily ingest data into suitable data structures, analyse it to find inconsistencies, and then make changes so that the data set is fit for consumption downstream. Libraries such as NumPy, Pandas, and PrettyPandas offer data structures that are well suited to these tasks. We shall look at each type of anomaly, how it is detected, and how it is dealt with in Python.
Missing data is one of the most prevalent issues in data destined for advanced analysis. Most models do not accept missing data, or give unpredictable results with it. It is important that such data is either filled in from other sources or discarded altogether.
There are a few techniques for finding missing data in a given data set. They can be graphical, or a summary metric per feature indicating the percentage of values missing.
In Python, a quick few lines of code generate a graphical representation of the missing values.
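A minimal sketch of what such code might look like is shown below. The `df` data frame and the colour choices are illustrative stand-ins for your own data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in DataFrame with a few deliberately missing values
df = pd.DataFrame({"a": [1, None, 3], "b": [None, 2, 3], "c": [1, 2, 3]})

cols = df.columns[:30]              # first 30 columns of the data set
colours = ["#000099", "#ffff00"]    # blue = present, yellow = missing
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))
plt.show()
```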
Here, cols is assigned the first 30 columns of the data set, df is the data frame containing the data, the second line prepares a colour palette, and the third line uses the seaborn package to generate the heatmap. Only the first 30 columns of the data set are passed into the heatmap function. The graphic below is the output; the yellow lines represent missing information within the data set.
You can list the missing percentage for each feature in the data set like so:
```python
for col in df.columns:
    # Fraction of nulls in the column, scaled to a percentage
    pct_missing = df[col].isnull().mean() * 100
    print(f"{col} - {pct_missing:.1f}%")
```
The code above calculates a percentage value for each column and prints the column name alongside it.
A histogram depicting missing values % in each column will also give you a pictorial representation.
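A hedged sketch of such a plot follows; the `df` data frame is an illustrative stand-in, and a bar per column (rather than a strict histogram) is used to show each column's missing percentage:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data with differing amounts of missing values per column
df = pd.DataFrame({"a": [1, None, 3, 4],
                   "b": [None, None, 3, 4],
                   "c": [1, 2, 3, 4]})

pct_missing = df.isnull().mean() * 100   # missing % per column
pct_missing.plot.bar()                   # one bar per column
plt.ylabel("% missing")
plt.show()
```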
In the code above, the missing-value percentage is computed for each column and plotted, one bar per column, generating the output below.
Now that you have seen how to look for missing values in Python, let's look at ways to resolve these problems.
You can drop observations containing missing values altogether from the data set, which means the entire row goes. Do so only after a careful analysis of the data, to ensure that dropping those observations will not adversely affect your results.
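With Pandas this is a one-liner; the small `df` below is an illustrative stand-in:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, None]})

# Drop every row (observation) that contains at least one missing value
df_complete = df.dropna()
print(len(df_complete))  # only the fully populated rows remain
```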
You could also opt to keep a subset of the data, eliminating only the rows that have more than a certain number of columns missing.
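Pandas expresses this with the `thresh` parameter of `dropna`, which keeps rows having at least that many non-missing values (the data frame below is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, None],
                   "b": [4, 5, None],
                   "c": [7, 8, None]})

# Keep only rows with at least 2 non-missing values;
# the last row, missing all three columns, is dropped
df_subset = df.dropna(thresh=2)
```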
Instead of dropping observations whose gaps run across many rows, you may opt to drop the feature itself, after evaluating the importance of that feature to the final solution.
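Dropping a feature is a single call; the column name here is an illustrative stand-in for whichever feature your evaluation deems expendable:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "mostly_missing": [None, None, 3]})

# Drop a feature (column) judged unimportant after evaluation
df = df.drop(columns=["mostly_missing"])
```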
You could perform what is called imputation: deriving a reasonably accurate value and replacing the missing value with it. The mean of the entire feature is often a good fit for replacing a missing value.
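Mean imputation is a one-liner with `fillna`; the `age` column below is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({"age": [20.0, None, 40.0]})

# Replace missing values with the mean of the feature (here, 30.0)
df["age"] = df["age"].fillna(df["age"].mean())
```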
Another solution makes the missing entry explicit by inserting a sentinel: a known value that signals that the value is missing.
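A hedged sketch of such a sentinel fill, using -1 as a value that cannot occur naturally in the (hypothetical) income column:

```python
import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000]})

# Fill with a sentinel value (-1) that cannot occur naturally,
# so downstream code can recognize "missing" explicitly
df["income"] = df["income"].fillna(-1)
```

Unlike imputation, the sentinel preserves the fact that the value was missing, which some models can exploit as a signal in its own right.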
This article has hopefully shown how Python can be used for data cleansing with a few lines of code. There are many more data cleansing tasks that Python's data structures can support.
If you are interested in making a career in the Data Science domain, our 11-month in-person Postgraduate Certificate Diploma in Data Science course can help you immensely in becoming a successful Data Science professional.