The first step in any Data Science or Machine Learning workflow is cleaning the data obtained in raw form. Raw data is usually too messy and unstructured to be used directly to build Machine Learning models. Without clean, well-formatted data, it is difficult to see the insights that exist during Data Exploration. The real challenge with messy or raw data appears when the actual training of the Machine Learning model begins. A central requirement in building any Machine Learning model is that the data be clean and ready for use in the model. In the context of Machine Learning and Data Science, cleaning the data is a process of filtering and altering it so that insights can be derived and explored easily and the data can be modelled in the required format. Filtering removes the data that is irrelevant to the business problem at hand, and altering converts the parts of the data that are needed into a format usable in the business context.
Broadly, in any machine learning workflow, data cleaning follows data collection and data wrangling and plays a vital role in preparing the data to be fed into a model. This part of data cleaning deals with handling missing values, encoding categorical data that is collected in a text format, dropping redundant features and reducing the dimensionality using standard dimensionality reduction techniques. This step prepares the data as a whole so that machine learning algorithms can be applied to it.
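Two of the cleaning steps named above, handling missing values and dropping redundant features, can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the column names and the helper functions `fill_missing` and `drop_constant_columns` are hypothetical, and mean imputation is just one of several possible strategies. In practice a library such as pandas or scikit-learn would typically be used for this.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune", "country": "India"},
    {"age": None, "city": "Delhi", "country": "India"},
    {"age": 35, "city": None, "country": "India"},
]


def fill_missing(rows, column, fill_value):
    """Return new rows with missing entries in `column` replaced by `fill_value`."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


def drop_constant_columns(rows):
    """Drop columns that hold the same value in every row (redundant features)."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if len({row[c] for row in rows}) > 1]
    return [{c: row[c] for c in keep} for row in rows]


# Numeric column: impute with the mean of the observed values.
observed_ages = [row["age"] for row in rows if row["age"] is not None]
cleaned = fill_missing(rows, "age", mean(observed_ages))

# Text column: impute with a placeholder category.
cleaned = fill_missing(cleaned, "city", "Unknown")

# "country" is identical in every row, so it carries no information.
cleaned = drop_constant_columns(cleaned)

for row in cleaned:
    print(row)
```

Missing values are filled before constant columns are dropped, so that a column is judged on its final contents.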
Raw data can contain various types of data, both structured and unstructured, and needs to be processed to bring it into a form that is usable in Machine Learning models. At a high level, data is divided into Structured Data and Unstructured Data. Structured data is further classified into Numeric and Categorical data. Some of the common structured data types used in Machine Learning and Data Science are listed below:
Continuous data (Numeric). Examples: Height or Weight of an individual, Rate of Interest on loans, etc.
Discrete data (Numeric). Examples: Student count in a class, Colour count in a Rainbow.
Nominal data (Categorical). Examples: States in a country, zip codes of areas.
Ordinal data (Categorical). Examples: Ratings for a restaurant (e.g. very good, good, bad, very bad), Level of Education of an individual (e.g. Doctorate, Post Graduate, Undergraduate), etc.
Binary data (Categorical). Examples: Gender (male or female), Fraudulent transaction (Yes or No), Cancerous Cell (True or False).
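As a rough illustration of how such types might be told apart programmatically, the sketch below applies a simple heuristic to a few hypothetical columns. The function `infer_feature_type` and the sample data are assumptions made for this example. Values alone cannot separate nominal from ordinal features, so both are reported as "categorical" and the ordering has to come from domain knowledge.

```python
def infer_feature_type(values):
    """Heuristically classify a column as binary, discrete, continuous or categorical."""
    distinct = set(values)
    # Exactly two distinct values is treated as a binary feature.
    if len(distinct) == 2:
        return "binary"
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous"
    # Anything else (typically text) is treated as categorical.
    return "categorical"


# Hypothetical columns mirroring the examples listed above.
columns = {
    "height_cm": [172.5, 160.0, 181.2],
    "students_in_class": [30, 28, 35],
    "state": ["Kerala", "Goa", "Punjab"],
    "is_fraud": ["Yes", "No", "No"],
}

feature_types = {name: infer_feature_type(values) for name, values in columns.items()}
print(feature_types)
```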
After identifying the data types of the features present in the data set, the next step is to process the data into a form suitable for Machine Learning models. This pre-processing consists of several steps, including Missing Value Treatment, Feature Encoding and Dimensionality Reduction. Of these techniques, the one most closely related to the data types discussed above is Feature Encoding. Feature Encoding is the conversion of Categorical features to numeric values, because most Machine Learning models cannot handle text data directly. The performance of most Machine Learning algorithms varies with the way the Categorical data is encoded. Popular techniques for converting Categorical values to Numeric values follow two different methods.
Level of Education
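As a minimal sketch of how an ordinal feature such as Level of Education could be converted to numbers, the example below maps each level to an integer rank. The ordering, the sample values and the variable names are assumptions made for illustration; the approach is usually called label or ordinal encoding.

```python
# Hypothetical ordering of the levels, from lowest to highest.
education_order = ["Undergraduate", "Post Graduate", "Doctorate"]

# Map each level to an integer rank that preserves the order (1, 2, 3).
education_codes = {level: rank for rank, level in enumerate(education_order, start=1)}

# Hypothetical column of raw text values.
level_of_education = ["Post Graduate", "Doctorate", "Undergraduate", "Post Graduate"]

# Replace each text value with its integer code.
encoded_education = [education_codes[level] for level in level_of_education]
print(encoded_education)
```

Integer codes suit ordinal features because the numeric order mirrors the real order of the categories. Applied to a nominal feature, the same codes would suggest an order that does not exist.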
An alternative way would look like:
Fraudulent_Transaction
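One common alternative to mapping each category to a single integer is one-hot encoding, in which every category gets its own 0/1 indicator column. The sketch below assumes the Fraudulent_Transaction column holds the text values Yes and No; the helper `one_hot_encode` and the generated column names are hypothetical.

```python
def one_hot_encode(values, prefix):
    """Return one dict per row with a 0/1 indicator for each distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical column of raw text values.
fraudulent_transaction = ["Yes", "No", "No", "Yes"]

encoded_fraud = one_hot_encode(fraudulent_transaction, "Fraudulent_Transaction")
for row in encoded_fraud:
    print(row)
```

For a binary feature the two indicator columns are mirror images of each other, so a single 0/1 column carries the same information.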
Identifying the data types of the features in the data set and then applying the techniques discussed above is a necessary step in making the data ready for a machine learning model. The encoding techniques discussed are among the most popular and apply to most data sets that need to be modelled using Machine Learning algorithms.