Data Preprocessing: A Step-By-Step Guide For 2021

Ajay Ohri


Data mining is the art of discovering patterns in large sets of data. The goal of this field of computer science is to work with data and find the most relevant information in it. The extracted information is then converted into a comprehensible structure for better understanding. Because the task is both complex and important, data mining involves several steps before the final results are reached. One such step is data preprocessing.

  1. What is Preprocessing Data?
  2. What are the steps in Data Preprocessing?

1) What is Preprocessing Data?

Data preprocessing is an important step in data mining. It is here that the information relevant to the query is extracted and prepared before being sent on for further processing. Its work is so important that, without preprocessing, "garbage in, garbage out" best describes the results one may get. This step comes right after the primary task of data gathering. Because data gathering often ends up collecting unnecessary data, that data has to be filtered and prepared first, which is why data preprocessing is important.

2) What are the steps in Data Preprocessing?

Just like the task it’s made to support, data preprocessing is complex and needs a proper understanding of its elements to make the best use of it. The following are the data preprocessing steps one should know about before getting into the world of data mining.

1. Data Cleaning:

 The first step in data preprocessing is to clean the data. Just as one washes and cleans ingredients before cooking them, one needs to clean the data that has been extracted from massive piles of information, keeping whatever is important and doing away with the rest. As one of the first data preprocessing steps, it works with different kinds of data extracted from different sources, and the problems it deals with fall into two classes: Missing Data and Noisy Data.

(a). Missing Data

 There is plenty to work with in the task of data preprocessing, and there are times when some information is missing from the data that has been extracted. There are two ways to deal with this:

Ignore the tuples or Fill the Missing values

 When a large data set is available to work with, one may choose to ignore the entire tuple if any of its values is missing. However, this is only suitable when working with large volumes of data, where losing a few tuples does little harm. As for the second option, one can fill in the missing values, either manually or with an estimate such as the attribute mean or the most probable value. It’s a suitable alternative to ignoring the tuples and can be used when the extracted data set is small and every record counts.
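The two options above can be sketched in a few lines of Python. This is only a minimal illustration: the records, the field names age and income, and the choice of the column mean as the fill value are assumptions made for the example, not something prescribed by the method.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 31, "income": None},
    {"age": 40, "income": 58000},
]

def drop_incomplete(records):
    """Option 1: ignore every tuple that has at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def fill_with_mean(records):
    """Option 2: replace each missing value with the mean of its column."""
    columns = list(records[0].keys())
    means = {
        c: mean(r[c] for r in records if r[c] is not None)
        for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in records
    ]

dropped = drop_incomplete(rows)   # keeps only the complete tuples
filled = fill_with_mean(rows)     # keeps every tuple, with gaps filled
```

Note that dropping halves this tiny data set, which is why ignoring tuples is usually reserved for large volumes of data.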

(b). Noisy Data

 Noisy data is data that contains random errors or meaningless values, which machines cannot interpret properly. It can arise from faulty data collection, data entry errors, and similar problems. However, just as missing data can be handled, noisy data can also be taken care of in various ways.

Binning Method, Regression and Clustering:

 The Binning method is used to smooth data that has been sorted. Here, the sorted data is first divided into segments (bins) of equal size. Each segment is then handled separately. One can replace all the values in a segment with the segment’s mean, or replace each value with the closest boundary (minimum or maximum) value of its segment.
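As a rough sketch of binning, the Python below splits sorted values into bins of equal size and smooths them by bin means and by bin boundaries. The price values and the bin size of 3 are assumptions chosen for illustration.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these values, the first bin [4, 8, 15] becomes [9, 9, 9] when smoothed by its mean and [4, 4, 15] when smoothed by its boundaries.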

In the second method, data is smoothed using regression, that is, by fitting it to a function. One can use a single independent variable under linear regression, or several independent variables under multiple regression, as required. The third method is clustering. In this, similar data is grouped into clusters; values that fall within a cluster are kept, while outliers that fall outside every cluster are rejected.
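Both ideas can be sketched briefly. The Python below fits a straight line by ordinary least squares and replaces each observation with the fitted value, then uses a deliberately simple one-dimensional grouping as a stand-in for clustering. The sample data, the function names, and the max_gap and min_cluster_size parameters are all assumptions for illustration; a real project would more likely rely on an established clustering algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth_with_line(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

def split_outliers(values, max_gap, min_cluster_size):
    """Group sorted 1-D values into clusters of neighbours within max_gap.

    Values in clusters smaller than min_cluster_size are reported as outliers.
    """
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # too far: start a new cluster
    kept = [v for c in clusters if len(c) >= min_cluster_size for v in c]
    outliers = [v for c in clusters if len(c) < min_cluster_size for v in c]
    return kept, outliers

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_with_line(xs, ys)

kept, outliers = split_outliers(
    [10, 11, 12, 50, 13, 14, 95, 96, 97], max_gap=3, min_cluster_size=2
)
```

In the second call, the value 50 sits far from both groups of readings, so it ends up alone in its own cluster and is rejected as an outlier.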

2. Data Transformation:

 The next step in understanding what preprocessing in data mining involves is data transformation, that is, how data is transformed to suit one’s needs. Any of the following methods can be adopted.

  1. Normalization:
     This method is used to set a scale and then put data in that range (-1.0 to 1.0 or 0.0 to 1.0)
  2. Attribute Selection:
     New attributes are constructed from the given set of attributes to help classify the data and then transform it accordingly.
  3. Discretization:
     Interval or conceptual levels are used to replace the raw values of numeric attributes, for example replacing an exact age with a range or with a label such as adult.
  4. Concept Hierarchy Generation:
     Here, attributes are generalized based on a hierarchy, and a value at a lower level can be replaced by one at a higher level. For example, the attribute city can be converted to country.
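Three of the transformations listed above lend themselves to a short sketch: normalization, discretization, and concept hierarchy generation. The Python below is illustrative only. The min-max formula is one common way to normalize, and the age cut points, the labels, and the city-to-country table are assumptions invented for the example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        return [new_min for _ in values]  # all values identical; nothing to spread
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

def discretize_age(age):
    """Replace a numeric age with a conceptual label (cut points are illustrative)."""
    if age < 18:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"

# Hypothetical concept hierarchy: city (lower level) -> country (higher level).
CITY_TO_COUNTRY = {"Indore": "India", "Mumbai": "India", "Lyon": "France"}

def generalize_city(city):
    """Move a value one level up the hierarchy; unmapped cities become 'unknown'."""
    return CITY_TO_COUNTRY.get(city, "unknown")

ages = [12, 25, 47, 60, 72]
unit_range = min_max_scale(ages)                # scaled to 0.0 .. 1.0
signed_range = min_max_scale(ages, -1.0, 1.0)   # scaled to -1.0 .. 1.0
age_labels = [discretize_age(a) for a in ages]
countries = [generalize_city(c) for c in ["Indore", "Lyon", "Mumbai"]]
```

The same ages end up either as numbers on a fixed scale or as a handful of labels, depending on what the later mining step needs.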

3. Data Reduction:

 With great amounts of data comes a greater need to process it efficiently, and analysis with tons of data on board can be a difficult task. Therefore, data reduction techniques are employed in data preprocessing to obtain a smaller representation of the data that still gives the required results. This can be done in the following ways.

  1. Data Cube Aggregation:
     A data cube is constructed by applying aggregation operations to the data, for example summarizing quarterly sales as yearly totals.
  2. Attribute Subset Selection:
     Using only attributes that are highly relevant is usually the correct way to deal with things, and irrelevant attributes can be discarded. In attribute selection, a significance level can be decided, and any attribute of lesser significance can be discarded.
  3. Numerosity Reduction:
     In this case, only a model of the data (for example, the parameters of a regression model) is stored instead of the whole data set, and the unnecessary data is discarded.
  4. Dimensionality Reduction:
     Using various encoding mechanisms, the size of the data can be reduced. Depending on how it’s done, one may or may not lose data. If the original data can be fully reconstructed from the reduced data, the reduction is considered lossless. Otherwise, it is lossy, and the discarded detail is lost for good.
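A small sketch may help make two of these ideas concrete: aggregation and the difference between lossless and lossy reduction. The Python below is a simplified illustration under stated assumptions. The sales figures are invented, and run-length encoding and rounding are used only as easy stand-ins for lossless and lossy encodings; they are not the dimensionality reduction techniques used in practice.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
sales = [
    (2019, "Q1", 224), (2019, "Q2", 408), (2019, "Q3", 350), (2019, "Q4", 586),
    (2020, "Q1", 310), (2020, "Q2", 295), (2020, "Q3", 402), (2020, "Q4", 511),
]

def aggregate_by_year(records):
    """Data cube aggregation: roll quarterly amounts up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)

def run_length_encode(values):
    """Lossless encoding: store each run of repeated values as [value, count]."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

def run_length_decode(encoded):
    """Rebuild the exact original sequence from [value, count] pairs."""
    return [v for v, count in encoded for _ in range(count)]

def round_to_step(values, step):
    """Lossy reduction: keep only the nearest multiple of step for each value."""
    return [step * round(v / step) for v in values]

yearly = aggregate_by_year(sales)          # 8 records reduced to 2 totals
readings = [7, 7, 7, 7, 3, 3, 9, 9, 9]
encoded = run_length_encode(readings)      # 9 values stored as 3 pairs
restored = run_length_decode(encoded)      # identical to readings: lossless
coarse = round_to_step([12, 17, 23, 38], 10)  # original values cannot be recovered
```

The decoded readings match the original exactly, whereas the rounded values no longer reveal whether a 20 started out as 17 or 23.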


There are numerous data preprocessing techniques in data mining that one may use as per their needs. Data preprocessing is an important part of data mining and is used by many as and when required. If done well, it can make the whole data mining process a whole lot easier.



