Information gain measures the reduction in entropy produced by transforming a dataset. It is used when constructing decision trees from a training dataset: the algorithm evaluates the information gain of each variable and selects the variable that maximizes the gain, which produces the largest entropy drop when the dataset is split on it. It is also used to assess the mutual information between two random variables and for feature selection, where the gain of each variable is evaluated against the target variable.
Information gain calculates the reduction in entropy, or surprise, obtained by splitting a dataset according to a given value of a random variable. A large gain means lower entropy after the split and therefore less surprise in the resulting predictions. This connects prediction to probability distributions: lower-probability events carry more information, information can be quantified in bits, and entropy measures how much information a random variable carries on average. A skewed distribution (unsurprising events) has low entropy, whereas a distribution with equal probabilities (surprising events) has high entropy.
Datasets also have entropy. For a binary classification dataset, the entropy of the observed probability distribution over the two classes is calculated as follows.
Entropy = -(p(0) * log2(p(0)) + p(1) * log2(p(1)))
Thus, if a sample drawn randomly from the dataset follows a 50/50 class split, the entropy (surprise) is at its maximum of 1 bit. If the distribution is imbalanced, say a 10/90 split, the entropy (surprise) is lower. This can be implemented as an information-gain calculator in Python, and the result shows that for a binary classification example drawn at random from an imbalanced dataset, fewer than 1 bit of information is required to encode the class label. How balanced the distribution is indicates the dataset's purity, which can also be calculated: an entropy of 0 bits means the dataset contains only one class, while the maximum entropy (1 bit for two balanced classes, and more bits as the number of balanced classes grows) corresponds to a perfectly balanced dataset.
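As a rough illustration, the calculation above can be sketched in a few lines of Python. The binary_entropy helper below is hypothetical, written only for this example, not a library function.

```python
from math import log2

def binary_entropy(p0, p1):
    """Entropy in bits of a two-class distribution (p0 + p1 = 1)."""
    # Treat 0 * log2(0) as 0 so a pure (single-class) dataset gives 0 bits.
    return sum(-p * log2(p) for p in (p0, p1) if p > 0)

# A balanced 50/50 split carries the maximum 1 bit of surprise.
print(binary_entropy(0.5, 0.5))   # 1.0
# A skewed 10/90 split is far less surprising.
print(binary_entropy(0.1, 0.9))   # ~0.469
# A pure dataset (one class only) has zero entropy.
print(binary_entropy(0.0, 1.0))   # 0.0
```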
Suppose we split a dataset S using the values of a random variable a. The information gain is
IG(S, a) = H(S) - H(S | a)
where H(S) is the entropy of the dataset S before the change and H(S | a) is the conditional entropy of the dataset given the variable a. The number of bits saved by the transformation is the information gain. The conditional entropy is calculated over each observed value of a, as the sum of the fraction of the data falling into each group multiplied by that group's entropy:
H(S | a) = sum over v in a of |Sa(v)| / |S| * H(Sa(v))
Suppose one needs to calculate the information gain for a dataset with two classes, class-0 and class-1, by computing the entropy of the full dataset and of the groups produced by a split. Take a dataset of 20 examples, where class-1 has 7 examples and class-0 has 13 examples; its entropy is just under 1 bit. Suppose a variable has two unique values, "value1" and "value2". To calculate the information gain of this variable, one first splits the dataset on value1, which yields 8 samples with 7 in class-0 and 1 in class-1, and calculates the entropy of that group. Next, one splits the dataset on value2, which yields 12 samples with 6 in each class and an entropy of exactly 1 bit.
Finally, calculate the information gain for this variable from the entropy of the whole dataset and the entropies of the groups created for each of its values, weighted by group size. The IG is calculated as below.
IG = Entropy(13/20, 7/20) - (8/20 * Entropy(7/8, 1/8) + 12/20 * Entropy(6/12, 6/12))
For the full dataset, the entropy is just under 1 bit (about 0.934 bits). For the first group it is about 0.544 bits, and for the second it is exactly 1 bit. The information gain for the chosen variable works out to about 0.117 bits.
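The arithmetic can be checked with a short Python sketch; the entropy helper below is illustrative only, not part of any library.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [13, 7]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Whole dataset: 13 examples of class-0 and 7 of class-1.
h_dataset = entropy([13, 7])            # ~0.934 bits

# Split on the chosen variable: value1 -> 7 vs 1, value2 -> 6 vs 6.
h_value1 = entropy([7, 1])              # ~0.544 bits
h_value2 = entropy([6, 6])              # 1.0 bit

# IG(S, a) = H(S) - sum over values of (group size / dataset size) * H(group)
gain = h_dataset - (8 / 20 * h_value1 + 12 / 20 * h_value2)
print(round(gain, 3))                   # ~0.117 bits
```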
One of the most popular uses of information gain in machine learning is in decision trees, through the ID3 (Iterative Dichotomiser 3) algorithm for decision tree construction. Information gain can also be used as the split criterion in the CART (Classification and Regression Tree) algorithm implemented in scikit-learn, the Python machine learning library, which supports entropy-based splitting through model configuration. Information gain is also used for feature selection, and scikit-learn's mutual_info_classif() function calculates the mutual information between two variables.
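A hedged sketch of the two scikit-learn pieces mentioned above, run on a small synthetic dataset generated for illustration; the exact scores will depend on the data used.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif

# Toy synthetic dataset, used here only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Decision tree configured to use entropy (information gain) as its split criterion.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Estimated mutual information between each feature and the target,
# usable as a feature-selection score.
scores = mutual_info_classif(X, y, random_state=0)
print(scores)  # one non-negative score per feature
```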
Mutual information measures the reduction in uncertainty about one variable when the value of the other variable is known. It is stated as
I(X ; Y) = H(X) - H(X | Y)
where I(X ; Y) is the mutual information of the two variables X and Y, H(X) is the entropy of X, and H(X | Y) is the conditional entropy of X given Y; the result is expressed in bits of information. Mutual information is symmetric and is also called the mutual dependence of two random variables. It can equivalently be expressed using the Kullback-Leibler (KL) divergence, a measure of the difference between two probability distributions, applied to the joint distribution of the two variables and the product of their marginal distributions.
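As an illustration of this equivalence, here is a small Python computation over an assumed toy joint distribution; the probability values are made up for the example.

```python
from math import log2

# Toy joint distribution P(X, Y) over two binary variables (values assumed).
joint = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

# Marginal distributions P(X) and P(Y).
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# I(X; Y) as the KL divergence between the joint and the product of marginals.
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# The same quantity via I(X; Y) = H(X) - H(X | Y).
h_x = -sum(p * log2(p) for p in px.values())
h_x_given_y = -sum(p * log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

print(round(mi, 4), round(h_x - h_x_given_y, 4))  # both ~0.1245 bits
```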
Information gain and mutual information are one and the same quantity. It is called information gain when applied to decision trees and the effect of transforming a dataset, and mutual information when applied to feature selection or the dependence between two variables. Applied to the same data, the two are quantitatively identical.
From the above discussion, we see that the larger the information gain, or equivalently the mutual information, the greater the difference between the joint probability distribution of the two variables and the product of their marginal distributions. Information gain is used in decision tree training to quantify the surprise, or entropy, removed when a dataset is transformed, by comparing the entropy before and after the transformation. Applied to variable selection, it is called mutual information and quantifies the statistical dependence between two variables.
There are no right or wrong ways of learning AI and ML technologies: the more, the better! These valuable resources can be the starting point for your journey on how to learn Artificial Intelligence and Machine Learning. Does pursuing AI and ML interest you? If you want to step into the world of emerging tech, you can accelerate your career with the Machine Learning and AI Courses by Jigsaw Academy.