What is information gain in data mining?

2 February 2020

Information gain is commonly used when building decision trees from a training data set: the information gain is evaluated for each variable, and the variable that maximizes it is selected. This in turn minimizes entropy and best splits the data set into groups for effective classification.

What is information gain?

Information gain, or IG for short, measures the reduction in entropy, or surprise, achieved by splitting a data set according to a given value of a random variable.

A larger information gain suggests a lower-entropy group or groups of samples, and therefore less surprise.
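One common way to write this definition down is shown here as a sketch. The symbols are notation assumed for illustration and do not come from the original text: $S$ is the data set, $A$ is the variable used for the split, $S_v$ is the subset of $S$ where $A$ takes the value $v$, and $H$ is entropy.

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

The second term is the weighted average entropy of the groups after the split, so the gain is large when the groups are much purer than the original data set.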

You may recall that information quantifies how surprising an event is, measured in bits. Lower-probability events carry more information; higher-probability events carry less. Entropy quantifies the amount of information in a random variable, or more precisely in its probability distribution. A skewed distribution has low entropy, whereas a distribution in which all events are equally likely has higher entropy.
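As a minimal sketch of the two ideas above, the following Python snippet computes the information of a single event and the entropy of a distribution. The function names `surprisal` and `entropy` and the example probabilities are chosen here for illustration and are not from the original post.

```python
import math

def surprisal(p):
    """Information content (in bits) of an event with probability p."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A rare event is more surprising than a common one.
print(surprisal(0.1), surprisal(0.9))

# A skewed distribution has lower entropy than a uniform one.
print(entropy([0.9, 0.1]), entropy([0.5, 0.5]))
```

The rare event should print the larger information value, and the uniform distribution should print the larger entropy.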

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar (homogeneous) values. The ID3 algorithm uses entropy to measure the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is evenly divided between two classes, the entropy is one.

Entropy, sometimes also denoted by the letter “H”, is defined as:

$$H = -\sum_{i} p_i \log_2 p_i$$

Here $p_i$ is simply the probability (relative frequency) of class $i$ in our data. For simplicity, let’s assume that we only have two classes, a positive class and a negative class, so $i$ can be either + or −. If we had a total of 100 data points in our data set, of which 30 belonged to the positive class and 70 belonged to the negative class, then $p_+$ would be 3/10 and $p_-$ would be 7/10. Quite simple.

If I calculate the entropy of the classes in this example, here is what I get:

$$H = -\left(p_+ \log_2 p_+ + p_- \log_2 p_-\right) = -\left(0.3 \log_2 0.3 + 0.7 \log_2 0.7\right) \approx 0.88$$

The entropy here is about 0.88. This is considered high entropy: a high level of disorder, which means a low level of purity. For two classes, entropy lies between 0 and 1. (Depending on the number of classes in the data set, entropy may be greater than 1, but this means the same thing: a very high level of disorder. For simplicity, the examples in this blog will have entropy between 0 and 1.)
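The 30/70 example can be checked with a few lines of Python. This is a sketch only: the helper name `label_entropy` and the way the labels are encoded as `"+"` and `"-"` strings are assumptions made for illustration.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the class labels in a sample."""
    total = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / total          # relative frequency of this class
        h -= p * math.log2(p)
    return h

# 100 data points: 30 positive, 70 negative (the example from the text).
labels = ["+"] * 30 + ["-"] * 70
print(round(label_entropy(labels), 2))  # 0.88

# With four equally likely classes the entropy exceeds 1.
print(label_entropy(["a", "b", "c", "d"]))  # 2.0
```

A completely pure sample gives 0 with this helper, and an even two-class split gives 1, matching the limits described above.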

Decision Tree

The main idea of a decision tree is to identify the features that contain the most information about the target, and then split the data set along the values of those features so that the target values in the resulting nodes are as pure as possible. The feature that best reduces the uncertainty about the target is considered the most informative feature. The search for the most informative feature is repeated until we end up with pure leaf nodes.
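The search for the most informative feature can be sketched as follows. This is a simplified illustration rather than a full ID3 implementation: it performs a single split selection, and the toy data set, the feature names `outlook` and `windy`, and the helper names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target before the split minus the weighted
    entropy of the groups obtained by splitting on `feature`."""
    parent = entropy_of([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy_of(g)
                   for g in groups.values())
    return parent - weighted

# Hypothetical toy data: "outlook" determines the target, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

features = ["outlook", "windy"]
best = max(features, key=lambda f: information_gain(rows, f, "play"))
print(best)  # outlook
```

In this toy example, splitting on `outlook` yields two pure groups, so it is selected. A tree builder would then repeat the same selection inside each resulting group until the leaves are pure.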