Different Types of Missing Data Values and How to Handle Them

Introduction 

Business analytics (BA) is a set of skills, technology, and methods used to study a company’s information and performance to obtain insights and make data-driven future choices. BA determines which datasets are useful and can increase revenue, productivity, and efficiency. Almost 33% of large-sized enterprises will use decision intelligence by 2023. Businesses benefit from data analytics by making decisions 5x faster. 

Many datasets contain outliers and missing values. Outliers can distort statistical estimates, causing results to be overestimated or underestimated. Missing values reduce the amount of usable data and may make the results less reliable. Hence, learning how to handle missing data and deal with outliers is important. Business analysts use several methods to address both problems. Here’s how to deal with outliers and missing values in a given dataset. 

What Is an Outlier? 

An outlier is an observation in a random population sample that lies at an abnormal distance from other values. In other words, it deviates substantially from the normal values in the population. The analyst decides what counts as abnormal, but normal observations must be characterized first. Outliers can occur in any type of dataset. 

For example, consider a dataset consisting of the values 1, 500, 550, 560, and 575. The value 1 is an outlier because it differs greatly from the other observations. 
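
One common way to make "differs greatly" concrete is the interquartile range (IQR) rule. It is not the only possible definition, and the article does not prescribe it. The sketch below applies it to the example values and assumes the conventional 1.5 × IQR fences; the variable names are illustrative.

```python
import pandas as pd

# Example values from the text above.
values = pd.Series([1, 500, 550, 560, 575])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [1]
```

The multiplier 1.5 is a rule of thumb. A stricter or looser threshold may suit a particular dataset better.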

Types of Outliers 

Three types of outliers are found in various types of datasets. 

1. Global Outliers 

Global outliers are observations that deviate significantly from normal observations, and they lie farthest away from the normal data points. For example, the number of flights operating during a lockdown is a global outlier compared with the number operating at other times. 

2. Collective Outliers 

Some observations cannot be individually considered outliers, but collectively they deviate significantly from the rest of the data points. They are considered outliers when aggregated. 

3. Contextual Outliers 

Contextual outliers deviate significantly from normal observations, but only within a specific context. For example, a temperature may be considered abnormally low during summer but normal during winter. 
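
The temperature example can be sketched in code by comparing each reading with what is typical for its own season rather than for the whole dataset. The data, the column names, and the 10-degree threshold below are all made up for illustration.

```python
import pandas as pd

# Hypothetical temperature readings in degrees Celsius.
df = pd.DataFrame({
    "season": ["summer"] * 4 + ["winter"] * 4,
    "temp_c": [30, 32, 5, 31, 4, 6, 5, 3],
})

# Typical value within each context (season).
season_median = df.groupby("season")["temp_c"].transform("median")

# Assumed threshold: more than 10 degrees away from the seasonal median.
df["contextual_outlier"] = (df["temp_c"] - season_median).abs() > 10

print(df[df["contextual_outlier"]])
```

Only the 5 °C reading recorded in summer is flagged. The same 5 °C reading in winter is treated as normal.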

Why Do We Need to Treat Outliers? 

Outliers are abnormally different from normal observations, and their presence in a dataset may introduce bias in statistical estimates. There are many reasons why it is important to treat outliers. 

  • Outliers can make predictions less precise. 
  • Outliers can cause misleading predictions because of their extreme values. 
  • Outliers may lead to overestimating or underestimating statistics such as the mean and the standard deviation. The median and the mode are far less affected. 
  • Outliers can strongly influence statistical models such as linear regression, support vector machines, and logistic regression. 
  • The statistical power of tests and models is reduced by the presence of outliers. 
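
To see how a single outlier shifts summary statistics, the sketch below reuses the small example dataset from earlier and compares the mean and the median with and without the value 1.

```python
import pandas as pd

with_outlier = pd.Series([1, 500, 550, 560, 575])

# Drop the value 1, which the earlier example identified as the outlier.
without_outlier = with_outlier[with_outlier != 1]

print(with_outlier.mean(), with_outlier.median())        # 437.2 550.0
print(without_outlier.mean(), without_outlier.median())  # 546.25 555.0
```

The mean moves by more than 100 when the outlier is removed, while the median moves by only 5.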

What Are Missing Values? 

Missing values occur in datasets when no value is recorded for a variable in an observation. They can arise for many reasons, such as non-response from a study participant, loss of data during recording, or errors made by the recorder. Missing values are found in all types of datasets, and they may significantly affect the accuracy of the results. They may introduce bias, lead to incorrect results, and cause machine learning models to fail. Hence it is essential to tackle missing values properly. 

Detecting the Missing Values  

The first step in dealing with missing values is to find them. There are several ways to detect missing values in Python. With pandas, the isnull() function is the one most often used. dataframe.isnull().values.any() tells you whether the dataset contains any missing value at all, and dataframe.isnull().sum() gives the number of missing values in each column. 
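
Here is a minimal sketch of both calls on a small, made-up DataFrame. The column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
dataframe = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "income": [50000, 62000, np.nan, np.nan],
})

# True if at least one cell anywhere in the DataFrame is missing.
has_missing = dataframe.isnull().values.any()

# Number of missing cells in each column.
missing_per_column = dataframe.isnull().sum()

print(has_missing)
print(missing_per_column)
```

In this example, age and city each have one missing value and income has two.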

Strategies for Dealing With Missing Values and Outliers 

How should missing data and outliers be handled in different types of datasets? Can they be deleted? Can they be replaced? Several strategies are used for handling missing values and outliers. You need to analyze the dataset properly to decide which strategy suits it best. Some of the commonly used strategies are: 

1. Deleting the Missing Values 

This quick technique is used to eliminate missing values from a dataset. However, it is not a recommended strategy. You may lose important data by deleting the observations with missing values. This may lead to incorrect conclusions and misinterpretations.  

  • Deleting the entire row: If a row has multiple missing values, the entire row is sometimes deleted. 
  • Deleting the entire column: If a column contains many missing values, you may choose to delete the entire column. 
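
Both kinds of deletion can be sketched with the pandas dropna() method. The data and the thresholds below are assumptions chosen for illustration. What counts as "too many" missing values depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is mostly empty, and so is the "notes" column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, np.nan, 58000, 61000],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Delete rows: keep only rows with at least 2 non-missing values.
rows_dropped = df.dropna(thresh=2)

# Delete columns: keep only columns with at least 3 non-missing values.
cols_dropped = df.dropna(axis=1, thresh=3)

print(rows_dropped)
print(cols_dropped)
```

Calling dropna() with no arguments removes every row that contains any missing value, which here would leave only the last row.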

2. Imputing the Missing Values and the Outliers 

Under this method, missing values are replaced with substitute values before working on the data. Several imputation methods can be used. 

  • Replacing with an arbitrary value: In this method, missing values are replaced with arbitrary values based on guesses. However, the guesses cannot be random. They must be educated guesses that make sense in the given context. 
  • Replacing with the mean: This is a very widely used imputation method. However, the mean is itself pulled toward extreme values, so it does not work well for imputing outliers. 
  • Replacing with the mode: The mode is the most frequently occurring value in the dataset. Hence it can make sense to replace a missing value or an outlier with the mode, particularly for categorical variables. 
  • Replacing with the median: The median is the middle value of the sorted dataset. Because it is not affected by extreme values, this method is better suited for imputing outliers. 
  • Replacing with the previous value: This method is also known as ‘Forward Fill.’ The missing value or the outlier is replaced with the value that precedes it. 
  • Replacing with the next value: This is called a ‘Backward Fill.’ The missing value or the outlier is replaced with the value that follows it. 
  • Interpolation: Missing values and outliers can be imputed using various methods of interpolation, such as linear, quadratic, and polynomial methods. 
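
Most of the imputation methods listed above map onto a single pandas call. The sketch below applies them to one made-up numeric series so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
s = pd.Series([10.0, np.nan, 14.0, 12.0, np.nan, 12.0, 20.0])

mean_filled = s.fillna(s.mean())          # mean of observed values: 13.6
median_filled = s.fillna(s.median())      # median of observed values: 12.0
mode_filled = s.fillna(s.mode().iloc[0])  # most frequent observed value: 12.0

forward_filled = s.ffill()    # copy the previous value forward
backward_filled = s.bfill()   # copy the next value backward

# Linear interpolation between the neighbouring observed values.
interpolated = s.interpolate(method="linear")

print(interpolated.tolist())  # [10.0, 12.0, 14.0, 12.0, 12.0, 12.0, 20.0]
```

An outlier can be treated the same way by first turning it into a missing value, for example with Series.mask(), and then applying one of the methods above.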

Conclusion 

Business Analytics deals with analyzing various types of datasets, resolving issues like missing values or outliers, and making informed decisions. If you find Business Analytics interesting, you can enroll in the Integrated Program in Business Analytics by UNext. It will enable you to learn data visualization, statistical modeling, and other advanced data analytics topics.
