Information Gain And Mutual Information: Overview In 5 Basic Points


Information gain measures the reduction in entropy produced by transforming a dataset, most commonly by splitting it. It is used when training decision trees: the algorithm evaluates the information gain of each variable and splits the dataset on the variable with the largest gain, that is, the largest drop in entropy. It is also used for feature selection, where the gain of each variable is evaluated against the target variable; in that setting it is known as the mutual information between 2 random variables.

  1. What Is Information Gain?
  2. How to Calculate Information Gain?
  3. Examples of Information Gain in Machine Learning
  4. What Is Mutual Information?
  5. How Are Mutual Information and Information Gain Related?

1. What Is Information Gain?

Information gain calculates the reduction in entropy, also described as surprise, that results from splitting a dataset according to the values of a random variable. A large gain means the groups produced by the split have low entropy, so there is less surprise left when predicting the class within each group. The underlying idea comes from information theory: lower-probability events carry more information, which can be quantified in bits, and entropy quantifies how much information a random variable carries on average. A skewed distribution, in which outcomes are unsurprising, has low entropy, whereas a distribution whose events are equally probable, and therefore surprising, has high entropy.

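Datasets also have entropy. For a binary classification dataset, it is calculated from the observed probability distribution of the 2 classes as follows, where p(0) and p(1) are the proportions of examples in class 0 and class 1 and log is the base-2 logarithm.

Entropy = -(p(0) * log(p(0)) + p(1) * log(p(1)))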

Thus, if a sample drawn randomly from the dataset is split 50/50 between the 2 classes, the entropy, or surprise, is at its maximum of 1 bit. If the distribution is imbalanced, like a 10/90 split, the entropy is lower. Implemented in Python, this calculation shows that for an imbalanced dataset fewer than 1 bit of information is needed, on average, to encode the class label of a randomly selected example. Entropy therefore indicates the dataset’s purity, that is, how close it is to containing a single class. An entropy of 0 bits means the dataset has only 1 class, 1 bit is the maximum for 2 classes and occurs when they are balanced, and values above 1 bit are possible only when there are more than 2 classes.
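The entropy calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name entropy and the variable names are assumptions made for this example.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with probability 0 are skipped: they contribute nothing.
    return sum(-p * log2(p) for p in probabilities if p > 0)

balanced = entropy([0.5, 0.5])   # 50/50 split: the 2-class maximum
skewed = entropy([0.1, 0.9])     # 10/90 split: lower entropy
single = entropy([1.0])          # only 1 class: no surprise at all

print(f"balanced={balanced:.3f} bits, skewed={skewed:.3f} bits")
```

The balanced split gives exactly 1 bit, while the 10/90 split gives roughly 0.47 bits, which matches the claim that an imbalanced dataset needs less than 1 bit per class label.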

Suppose we split a dataset S using the values of a random variable a. The information gain is then calculated as

IG(S, a) = H(S) - H(S | a)

where IG(S, a) is the information gain for the dataset S and the variable a, H(S) is the entropy of the dataset before the change, and H(S | a) is the dataset’s conditional entropy given the variable a. The information gain is the number of bits saved by the transformation. The conditional entropy is calculated by splitting the dataset into one group for each observed value of a and summing, over the groups, the fraction of examples that falls in each group multiplied by the group’s entropy:

H(S | a) = sum over v in a of (|Sa(v)| / |S|) * H(Sa(v))

where Sa(v) is the group of examples for which a takes the value v.
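As a rough sketch of the two formulas, the function below computes IG(S, a) from a list of variable values and a matching list of class labels. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S): entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, a) = H(S) - H(S | a) for one variable a."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # H(S | a): each group's entropy weighted by its share of the examples.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [0, 0, 1, 1]
# Hypothetical variable that separates the classes perfectly.
perfect_gain = information_gain(["x", "x", "y", "y"], labels)
# Hypothetical variable that says nothing about the class.
useless_gain = information_gain(["x", "y", "x", "y"], labels)
```

With this toy data the first variable recovers the full 1 bit of dataset entropy, while the second recovers nothing.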

2. How to Calculate Information Gain?

Suppose one needs to calculate the information gain for a dataset with 2 classes, class-0 and class-1. Take a dataset of 20 examples in which class-0 has 13 examples and class-1 has 7 examples. One first calculates the entropy of the whole dataset, which is just under 1 bit. Suppose a variable has 2 unique values, “value1” and “value2”. To calculate the information gain of this variable, one splits the dataset by its values. The group with value1 contains 8 samples, 7 in class-0 and 1 in class-1, and one calculates the entropy of this group. The group with value2 contains 12 samples, 6 in each class, so its entropy is equal to 1 bit.

Finally, calculate the information gain for this variable from the entropy of the dataset and the entropy of the group created for each of its values, weighted by group size. IG is calculated as below.

Entropy(13/20, 7/20) - (8/20 * Entropy(7/8, 1/8) + 12/20 * Entropy(6/12, 6/12))

For the dataset given, the entropy is about 0.934 bits, just less than 1 bit. For the first group it is about 0.544 bits, and for the second it is 1 bit. The information gain calculated for the chosen variable is therefore about 0.117 bits.
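The arithmetic of this worked example can be checked with a short script. It is a sketch that plugs the class proportions from the text into the entropy formula; the helper name entropy and the variable names are assumptions for illustration.

```python
from math import log2

def entropy(*probabilities):
    """Entropy, in bits, of the given class proportions."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

h_dataset = entropy(13 / 20, 7 / 20)   # whole dataset: 13 vs 7 examples
h_value1 = entropy(7 / 8, 1 / 8)       # group with value1: 7 vs 1
h_value2 = entropy(6 / 12, 6 / 12)     # group with value2: 6 vs 6

# Weight each group's entropy by its share of the 20 examples.
gain = h_dataset - (8 / 20 * h_value1 + 12 / 20 * h_value2)

print(f"dataset={h_dataset:.3f}, value1={h_value1:.3f}, "
      f"value2={h_value2:.3f}, gain={gain:.3f}")
```

Rounded to three decimals, the gain comes out at 0.117 bits.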

3. Examples of Information Gain in Machine Learning

One of the most popular uses of information gain in machine learning is the construction of decision trees with the ID3 (Iterative Dichotomiser 3) algorithm. Information gain is also available as a split criterion in the CART (Classification and Regression Trees) implementation in scikit-learn, the Python library for machine learning, where the model can be configured to use entropy calculations. Finally, information gain is used for feature selection in modelling; the scikit-learn function mutual_info_classif() estimates the mutual information between each input variable and the target variable.
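To make the decision tree use concrete, here is a minimal sketch of the single step that an ID3-style algorithm repeats: score every variable by information gain and split on the best one. It is written in plain Python, it is not scikit-learn code, and the dataset, feature names and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy dataset: one informative and one uninformative feature.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {feature: information_gain(rows, labels, feature) for feature in rows[0]}
best_feature = max(gains, key=gains.get)   # the feature a tree would split on
```

A full tree builder would apply the same selection recursively to each resulting group until the groups are pure or no features remain.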

4. What Is Mutual Information?

Mutual information measures the reduction in uncertainty about 1 variable when the value of the other variable is known. It is stated as

I(X ; Y) = H(X) - H(X | Y)

where I(X ; Y) is the mutual information of the 2 variables X and Y, H(X) is the entropy of X, and H(X | Y) is the conditional entropy of X when Y is given. The result is stated in bits of information. Mutual information is symmetric, so I(X ; Y) = I(Y ; X), and it is also described as a measure of the mutual dependence of 2 random variables. It can equivalently be computed as the Kullback-Leibler (KL) divergence, a measure of the difference between 2 probability distributions, taken between the joint distribution of X and Y and the product of their marginal distributions.
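The definition and its symmetry can be checked numerically. The sketch below uses a made-up joint distribution for 2 binary variables and computes the mutual information three ways: as H(X) - H(X | Y), as H(Y) - H(Y | X), and as the KL divergence between the joint distribution and the product of the marginals. All names and numbers are assumptions for illustration.

```python
from math import log2

def entropy(probabilities):
    """Entropy, in bits, of a discrete distribution."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

# Hypothetical joint distribution p(x, y); keys are (x, y) pairs.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(axis):
    """Marginal distribution of X (axis=0) or Y (axis=1)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_entropy(given_axis):
    """Entropy of the other variable, averaged over values of the given one."""
    total = 0.0
    for value, p_value in marginal(given_axis).items():
        conditional = [p / p_value for pair, p in joint.items()
                       if pair[given_axis] == value]
        total += p_value * entropy(conditional)
    return total

px, py = marginal(0), marginal(1)
mi_xy = entropy(px.values()) - conditional_entropy(given_axis=1)  # H(X) - H(X | Y)
mi_yx = entropy(py.values()) - conditional_entropy(given_axis=0)  # H(Y) - H(Y | X)
mi_kl = sum(p * log2(p / (px[x] * py[y]))
            for (x, y), p in joint.items() if p > 0)
```

All three values agree at about 0.278 bits for this distribution.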

5. How Are Mutual Information and Information Gain Related?

Information gain and mutual information are one and the same quantity. It is called information gain (IG) when applied to decision trees and the effects of transforming a dataset, and mutual information (MI) when applied to feature selection or to the dependence between 2 variables. When applied to the same data, both give the same value.
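The equivalence can be illustrated on the 20-example dataset from section 2. The sketch below computes the information gain of the split and, separately, the mutual information between the variable and the class from the same counts. The variable names are assumptions for illustration.

```python
from math import log2

def entropy(probabilities):
    """Entropy, in bits, of a discrete distribution."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

# Counts from section 2: (variable value, class) -> number of examples.
counts = {("value1", 0): 7, ("value1", 1): 1,
          ("value2", 0): 6, ("value2", 1): 6}
n = sum(counts.values())

value_totals, class_totals = {}, {}
for (value, cls), k in counts.items():
    value_totals[value] = value_totals.get(value, 0) + k
    class_totals[cls] = class_totals.get(cls, 0) + k

# Information gain view: H(S) - H(S | a).
h_s = entropy(k / n for k in class_totals.values())
h_s_given_a = sum(
    value_totals[v] / n
    * entropy(counts[(v, c)] / value_totals[v] for c in class_totals)
    for v in value_totals
)
info_gain = h_s - h_s_given_a

# Mutual information view: joint distribution vs. product of marginals.
mutual_info = sum(
    (k / n) * log2((k / n) / ((value_totals[v] / n) * (class_totals[c] / n)))
    for (v, c), k in counts.items() if k > 0
)
```

Both calculations give about 0.117 bits, differing only by floating-point rounding.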


From the above discussion, we see that the larger the information gain, the greater the difference between the joint probability distribution of the 2 variables and the product of their marginal distributions. In decision tree training, information gain quantifies the reduction in entropy, or surprise, achieved by transforming a dataset, by comparing the entropy before and after the transformation. Applied to variable selection, it is called mutual information and quantifies the statistical dependence between 2 variables.


