… in which we thoroughly motivate common impurity measures used for Decision Tree learning and show that each optimises some specific loss function.
The Problem
To grow (construct, learn) a decision tree model, we need the following basic ingredients:
- A splitting criterion that evaluates candidate splits and thus determines where to split a node.
- A leaf combiner that produces a prediction from the leaf that the query example falls in.
Further, as in any learning task, we use a loss function to evaluate the performance of our model.
Here’s how Decision Trees are introduced in virtually all of the teaching materials I’ve seen:
- A diagram illustrating how the data space is iteratively split, yielding a tree structure.
- A list of so-called “impurity measures” which can, allegedly, be used as splitting criteria.
Often omitted or mentioned only in passing are:
- How does a learned Decision Tree actually produce a prediction?
- How, if at all, is any of this related to the loss function we are using to evaluate our tree?
To illustrate my point, consider the impurity measure that is almost always one of the prime examples.
Gini Impurity
Let $p_\ell$ be the probability distribution of classes in leaf $\ell$, and let $p_{\ell,c}$ be the probability of drawing an example of class $c$ from leaf $\ell$. The Gini impurity is defined as

$$G(\ell) = \sum_{c} p_{\ell,c}\,(1 - p_{\ell,c}) = 1 - \sum_{c} p_{\ell,c}^2.$$
Cool, you think, nodding smartly at the slide. You’ve seen some of these symbols before. If you stare really hard, you may see that the Gini impurity is the probability of drawing two examples of different classes from the leaf $\ell$. If you are very unlikely to draw two examples of different classes, then the class distribution in the leaf is very pure, at least validating the name.
But how do we know this is actually helpful in constructing a “good” tree? And what even distinguishes a good tree from a not-so-good tree? In other words, how do we know the splitting criterion of our choice is actually sensible? And what about the leaf combiner, can I just pick something that seems reasonable?
Ultimately, the only relevant measure of quality is our loss function $L$.
The quality of a single decision tree $T$ is measured with respect to a loss function $L$ evaluated over some test set $D_{\text{test}}$. We are ultimately interested in finding a tree that minimises the loss over the test set:

$$T^\star = \arg\min_{T} \; \frac{1}{|D_{\text{test}}|} \sum_{(x,y) \in D_{\text{test}}} L\big(T(x), y\big).$$
So, how do we know that our splitting criterion actually improves this quantity? Or would it even be possible that splitting a tree node hurts performance?
But alas, the professor (or YouTuber, or your attention) has already carried on and you’re left with nothing but an uneasy feeling.
Claim
The loss function, splitting criterion (impurity measure) and leaf combiner are tightly related, and choosing one implies the others.
Claim
Choosing the loss function, impurity measure and leaf combiner in accordance yields the theoretical guarantee that a tree model is always at least as good after the split of a leaf node as before (provided we stop splitting when no impurity reduction is possible).
Let’s take a step back and consider the basic nature of decision trees.
Tree leaves partition the data space
A decision tree is constructed (grown, learned) as follows:
- initial case: take the entire training dataset $D$ and use it to find a binary split along some dimension at some threshold value, forming two child nodes
- recursion: for a child node, take the training examples that are within its decision boundaries and use them to find a binary split (as above)
At some point (at the latest when each node only contains a single example) we stop splitting. The remaining nodes are the leaves.
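To make the recursion concrete, here is a minimal sketch in Python. It is not a reference CART implementation, just an illustration under simplifying assumptions: a regression setting, variance as the impurity, and an exhaustive search over feature values as candidate thresholds. All names (`grow`, `best_split`, `min_leaf`) are my own.

```python
import numpy as np

def variance(y):
    """Impurity of a node: the variance of the targets it contains."""
    return np.var(y) if len(y) > 0 else 0.0

def best_split(X, y):
    """Search all (dimension, threshold) pairs and return the one with the
    lowest weighted child impurity, or None if no split reduces impurity."""
    n, d = X.shape
    best, best_score = None, variance(y)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:  # the largest value would leave one child empty
            left = X[:, j] <= t
            score = (left.sum() * variance(y[left])
                     + (~left).sum() * variance(y[~left])) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow(X, y, min_leaf=2):
    """Recursively perform greedy binary splits; stop when no split helps."""
    split = best_split(X, y) if len(y) > min_leaf else None
    if split is None:
        return {"prediction": y.mean()}  # leaf: aggregate the targets (here: mean)
    j, t = split
    left = X[:, j] <= t
    return {"dim": j, "threshold": t,
            "left": grow(X[left], y[left], min_leaf),
            "right": grow(X[~left], y[~left], min_leaf)}
```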
Observe that we are recursively partitioning the training dataset $D$. But we are not just assigning training examples to nodes — we are finding threshold values (and dimensions) in the value range of the data space $\mathcal{X}$. In other words, we are partitioning the data space $\mathcal{X}$. By proxy, this also partitions $D$. But nothing prevents us from taking any new query point $x$ and checking our tree’s decision boundaries to tell which leaf $x$ would belong to.
The resulting tree leaves represent a partition in the mathematical sense: Each element of the underlying set belongs to exactly one partition cell (leaf), cells/leaves do not overlap, and their union constitutes the entire space.
Formally, we can identify a grown tree $T$ with a partition of $\mathcal{X}$ and can write $T = \{\ell_1, \dots, \ell_k\}$ if the $\ell_i$ are leaves (partition cells).
This is nice, because it allows us to make statements across the entire space by only considering the partition cells (leaves) individually. For example, if $\{\ell_1, \dots, \ell_k\}$ is a partition of $D$ into $k$ cells, we can write the sum of any quantity $f(x)$ over all points in $D$ in two different ways:
- Simply by summing over all points in $D$: $\sum_{x \in D} f(x)$
- As the sum of the sums over each individual partition cell: $\sum_{i=1}^{k} \sum_{x \in D \cap \ell_i} f(x)$
In the following, we will extend this idea and, instead of the sum, talk about the mean (or expectation) over the entire set. We then essentially arrive at a special case of the Law of Total Expectation (see Section 3.2.2).
Leaf losses
Recall that, given a query example $x$, a decision tree $T$ produces a prediction $T(x)$ as follows:
- Find the leaf $\ell \in T$ whose bounds (decision rules) contain $x$.
- Aggregate the training examples within the bounds of $\ell$ to produce the prediction $\hat{y}_\ell$.
This means that for all queries $x$ that fall within leaf $\ell$, the tree makes the same prediction $\hat{y}_\ell$.
In other words, tree predictions are constant over individual leaves.
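Continuing the sketch from above (same hypothetical nested-dict representation), prediction is just a descent to the unique leaf whose decision rules contain the query point:

```python
def predict(tree, x):
    """Descend to the leaf containing x and return the prediction stored there."""
    while "prediction" not in tree:  # internal nodes carry a split rule
        branch = "left" if x[tree["dim"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["prediction"]        # the aggregate of that leaf's training targets
```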
Using this intuition and some basic arithmetic, we can write

$$L(T, D) = \frac{1}{|D|} \sum_{(x,y) \in D} L\big(T(x), y\big) = \sum_{\ell \in T} \frac{|D_\ell|}{|D|} \cdot \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(T(x), y\big),$$

where $D_\ell \subseteq D$ denotes the training examples falling into leaf $\ell$.
Because $T(x) = \hat{y}_\ell$ for all $x \in \ell$ (predictions are constant over a leaf), we have

$$L(T, D) = \sum_{\ell \in T} \frac{|D_\ell|}{|D|} \underbrace{\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big)}_{=:\, L(\ell)}.$$

A tree partitions the data space (and consequently the training data $D$) into leaves. The loss of a tree is exactly equal to a weighted sum of the leaf losses $L(\ell)$. This means we need only consider individual leaves in isolation. If we want a tree with a small overall loss, we need to ensure that each leaf has a low loss.
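This decomposition is easy to check numerically. The snippet below is a sketch that reuses the hypothetical `grow` and `predict` functions from above, takes the squared error as $L$, and compares the overall loss to the weighted sum of leaf losses:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

tree = grow(X, y)  # from the sketch above

def leaf_id(tree, x):
    """Identify the leaf a point falls into by the path taken to reach it."""
    path = ""
    while "prediction" not in tree:
        branch = "left" if x[tree["dim"]] <= tree["threshold"] else "right"
        path, tree = path + branch[0], tree[branch]
    return path

preds = np.array([predict(tree, x) for x in X])
ids = np.array([leaf_id(tree, x) for x in X])

total_loss = np.mean((preds - y) ** 2)   # loss of the tree over all of D
leaf_sum = sum(                          # sum of |D_l|/|D| times the leaf loss
    (ids == leaf).mean() * np.mean((preds[ids == leaf] - y[ids == leaf]) ** 2)
    for leaf in np.unique(ids)
)
print(np.isclose(total_loss, leaf_sum))  # True
```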
The best leaf combiner is the centroid
Given a loss function $L$ and assuming the tree structure (and thus the leaves $\ell$) is fixed, how should we define the leaf combiner $\hat{y}_\ell$? From the equation above, we can already see that the leaf loss is minimised if and only if

$$\hat{y}_\ell = \arg\min_{z} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L(z, y).$$
This is somewhat tautological: We say the best leaf combiner is the one that is, well, the best with respect to . However, there is another way of looking at the above equation which will be the key to actually understanding how everything comes together.
Recall that the purpose of a loss function is to measure the discrepancy between two outcomes. That is, if $\hat{y}$ and $y$ are very different, $L(\hat{y}, y)$ should be large; and if $\hat{y}$ and $y$ are very similar, $L(\hat{y}, y)$ should be small. Imagining $\hat{y}$ and $y$ as points in a mathematical space, $L(\hat{y}, y)$ is reminiscent of a distance measure between $\hat{y}$ and $y$[^2]. The minimiser $\hat{y}_\ell$ above is then the point for which the sum of “distances” to all other points is minimal. In other words, $\hat{y}_\ell$ is the centroid of the points in $\ell$ with respect to $L$!
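As a small illustration of the centroid idea (a sketch with made-up numbers and a brute-force grid search standing in for the $\arg\min$): under the squared-error “distance”, the point with the smallest total distance to the observed outcomes is, numerically, their arithmetic mean.

```python
import numpy as np

def centroid(ys, loss, candidates):
    """Brute-force centroid: the candidate with the smallest total loss
    ("distance") to all observed outcomes ys."""
    return min(candidates, key=lambda z: sum(loss(z, y) for y in ys))

ys = [1.0, 2.0, 2.0, 7.0]            # outcomes in one leaf
grid = np.linspace(0.0, 10.0, 1001)  # candidate predictions

best = centroid(ys, lambda z, y: (z - y) ** 2, grid)
print(best, np.mean(ys))             # both ~3.0
```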
Maximising purity means minimising leaf loss
The next question is how we can find good decision boundaries. In principle, we could run a single optimisation procedure to find the best possible model for us. This, however, is very expensive to compute (the problem is NP-hard). When faced with such a dire situation, we can settle for an optimisation strategy that may find only an almost-optimal solution but is much faster. One broad category is greedy optimisation, in which we do not try to make all choices (find all parameters) simultaneously but greedily make the next-best choice that heuristically seems to lead us in a good direction.
The vanilla algorithm to construct decision trees is such a greedy optimisation technique. The basic idea is to iteratively perform binary splits. Note that nobody is saying that this is an exceptionally good method per se — it is just a pragmatic means to solve an otherwise challenging problem[^1]. Countless alternative procedures have been proposed that, arguably, can grow better trees.
Even though our rule to find just the next-best split may not always yield the best possible decision tree, it does always improve the loss of the tree. Showing that this is true is our main objective, as described in the introduction. A split is performed according to some measure of “purity”. Hence, the key question is how finding a “pure” split improves the leaf loss and, by proxy, the loss of the entire tree model.
Squared-error loss and Variance Reduction
Consider the regression task under the squared-error loss $L(\hat{y}, y) = (\hat{y} - y)^2$. A commonly used impurity measure is the squared-error variance (also known as the CART criterion):

$$I(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} \big(y - \bar{y}_\ell\big)^2 \quad \text{with} \quad \bar{y}_\ell = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} y.$$

Or, writing $L$ instead of the squared difference:

$$I(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\bar{y}_\ell, y\big).$$

$I(\ell)$ is the statistical variance of the targets in $\ell$ and $\bar{y}_\ell$ is their arithmetic mean. The arithmetic mean is in fact the centroid with respect to the squared-error loss.
If, as argued earlier, we indeed define the leaf combiner as the centroid w.r.t. $L$, then we have $\hat{y}_\ell = \bar{y}_\ell$ and

$$I(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big) = L(\ell).$$

Consequently, simply by virtue of definition, we can see that $I(\ell) = L(\ell)$. In less formal terms, we have now seen that
- If the leaf combiner is indeed chosen to be the centroid, …
- … the impurity measure used here (statistical variance) is an instance of a generalised kind of variance with respect to $L$, of the shape $\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L(\hat{y}_\ell, y)$, …
- … and the impurity of the leaf is exactly the loss of the leaf.
Result
Splitting according to $I(\ell)$ (variance reduction) minimises the squared-error loss of the tree if the leaf combiner is the arithmetic mean.
Note that we have not used any properties of the squared-error loss except that its centroid is the arithmetic mean. This is the correspondence between the squared-error loss and the arithmetic mean leaf combiner. So, in principle, if we were to find other loss functions and combiners with the same correspondence (the combiner is the centroid), the same argument should hold. As we will now show, this is indeed the case.
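As a quick sanity check of the identity $I(\ell) = L(\ell)$ for a single leaf, here is a sketch with made-up targets:

```python
import numpy as np

y_leaf = np.array([2.0, 3.0, 5.0, 10.0])  # targets of the examples in one leaf
combiner = y_leaf.mean()                   # arithmetic mean = centroid w.r.t. squared error

impurity = np.var(y_leaf)                      # statistical variance of the leaf
leaf_loss = np.mean((combiner - y_leaf) ** 2)  # average squared-error loss of the prediction

print(np.isclose(impurity, leaf_loss))  # True
```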
0-1 loss and majority vote
The majority vote is a centroid with respect to the 0-1 loss $L(\hat{y}, y) = \mathbf{1}[\hat{y} \neq y]$. The implied impurity measure is the error rate (one minus the accuracy).
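The same pattern can be sketched numerically for one leaf: with the majority vote as the combiner, the average 0-1 loss of the leaf is exactly its error rate.

```python
from collections import Counter

labels = ["a", "a", "a", "b", "c"]                                 # class labels in one leaf

majority = Counter(labels).most_common(1)[0][0]                    # majority-vote combiner
zero_one_loss = sum(y != majority for y in labels) / len(labels)   # average 0-1 loss
error_rate = 1 - Counter(labels)[majority] / len(labels)           # 1 - accuracy

print(zero_one_loss, error_rate)  # both 0.4
```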
Impurities of probability distributions
Our leading example, the Gini impurity, talks about probabilities. To connect it to the leaf loss we derived above, we have to do some extra work to bridge the gap. One essential ingredient is that we define the leaf combiner to simply be the probability distribution of classes in the leaf, $\hat{y}_\ell = p_\ell$. I currently don’t have a very good derivation for this choice, except that if you work through the equations below, you will see that it is a very natural one to make the scheme work out.
Let $p_{\ell,c}$ be the $c$-th entry of the distribution $p_\ell$, i.e. the fraction of examples in $D_\ell$ with class $c$. Then

$$L(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big) = \sum_{c} p_{\ell,c}\, L\big(\hat{y}_\ell, c\big),$$

where the second equality simply groups the examples in the leaf by their class.
Gini impurity
To measure the purity of class labels in a cell, one may consider the probability of drawing two different outcomes from the examples in the current cell. Let $p_\ell$ be the probability distribution of classes in leaf $\ell$, and let $p_{\ell,c}$ be the probability of drawing an example of class $c$ from leaf $\ell$. The probability of drawing one example of class $c$ and one of a different class is $p_{\ell,c}(1 - p_{\ell,c})$. The probability of drawing two examples of any two different classes then is the Gini impurity

$$G(\ell) = \sum_{c} p_{\ell,c}\,(1 - p_{\ell,c}) = 1 - \sum_{c} p_{\ell,c}^2.$$
Unlike in the squared-error case, it is not so obvious which loss function the Gini impurity corresponds to.
The above result expresses the leaf loss under the assumption that the leaf combiner is the class distribution, i.e. $\hat{y}_\ell = p_\ell$. Let us compare it to the definition of the Gini impurity:

$$L(\ell) = \sum_{c} p_{\ell,c}\, L\big(p_\ell, c\big) \qquad \text{vs.} \qquad G(\ell) = \sum_{c} p_{\ell,c}\, \big(1 - p_{\ell,c}\big).$$

The two coincide term by term if we choose $L(\hat{y}, y) = 1 - \hat{y}_y$, i.e. one minus the probability the prediction assigns to the true class $y$.
This means that if we minimise the Gini impurity for a leaf, we are in fact minimising the leaf loss for this choice of $L$.
It remains to be seen that $p_\ell$ is in fact a centroid with respect to this $L$. We will start from a general characterisation of the centroid and then see that $p_\ell$ in fact fulfils that characterisation.
The centroid is

$$\hat{y}_\ell = \arg\min_{q} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L(q, y) = \arg\min_{q} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} \big(1 - q_y\big) = \arg\max_{q} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} q_y.$$

Let $e_y$ be the vector that contains a $1$ at position $y$ and $0$s otherwise. Then

$$\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} q_y = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} \langle q, e_y \rangle = \Big\langle q, \; \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} e_y \Big\rangle.$$

Further, $\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} e_y$ is exactly the vector of class frequencies $p_\ell$. Continuing, we have

$$\arg\max_{q} \; \langle q, p_\ell \rangle,$$

which is maximised if and only if $q = p_\ell$.
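Numerically, the correspondence between the Gini impurity and the leaf loss is easy to verify for a single leaf. A sketch, with the loss taken to be $1 - \hat{y}_y$ (one minus the probability the combiner assigns to the true class):

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 2])                  # class labels in one leaf
classes, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()                           # combiner: class distribution p_l
p_of = dict(zip(classes, p))                        # probability assigned to each class

gini = np.sum(p * (1 - p))                          # Gini impurity of the leaf
leaf_loss = np.mean([1 - p_of[y] for y in labels])  # average loss 1 - p_l[y]

print(np.isclose(gini, leaf_loss))  # True
```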
Entropy and Information Gain
Using the same strategy, we can show that the Information Gain criterion corresponds to the entropy impurity and the cross-entropy loss.
Consider the entropy impurity measure

$$I(\ell) = H(p_\ell) = -\sum_{c} p_{\ell,c} \log p_{\ell,c}.$$

Information Gain is exactly the reduction of this quantity achieved by a split. If the leaf combiner is the distribution of classes in $\ell$, i.e. $\hat{y}_\ell = p_\ell$, then $p_\ell$ maximises

$$\sum_{c} p_{\ell,c} \log \hat{y}_{\ell,c}$$

over all distributions $\hat{y}_\ell$ (Gibbs’ inequality), i.e. it is the centroid with respect to the loss $L(\hat{y}, y) = -\log \hat{y}_y$, and the resulting leaf loss is exactly the entropy $H(p_\ell)$. Rewritten over all examples, this loss is the log-loss, also known as the cross-entropy loss.
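The analogous check for the entropy (a sketch): with the class distribution as the combiner, the average log-loss over the examples in the leaf equals the entropy of that distribution.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 2])                          # class labels in one leaf
classes, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()                                   # combiner: class distribution p_l
p_of = dict(zip(classes, p))

entropy = -np.sum(p * np.log(p))                            # entropy impurity of the leaf
leaf_log_loss = np.mean([-np.log(p_of[y]) for y in labels]) # average cross-entropy loss

print(np.isclose(entropy, leaf_log_loss))  # True
```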
Summary
We have now seen that for a variety of widely-used impurity measures, we can derive exactly the loss function they are optimising. This, in turn, implies a choice of combiner function.
| Impurity | Combiner / Centroid | Loss |
|---|---|---|
| Squared-Error Variance | Arithmetic Mean | Squared-Error |
| Error Rate | Majority Vote | 0-1 loss |
| Gini Impurity | Distribution $p_\ell$ | $1 - \hat{y}_y$ (probability not assigned to the target class) |
| Information Gain (Entropy) | Distribution $p_\ell$ | Cross-Entropy (log-loss) |
Note that I am not making any claims about any of these combinations being empirically better than the others.
Consequently, if we split according to a given impurity measure (and the split indeed reduces impurity), we now know that this also directly improves a certain loss function. In other words, we have the theoretical guarantee that the tree model is always at least as good after the split of a leaf node as before (provided we stop when no impurity reduction is possible).
Outlook
Given that the leaf combiner $\hat{y}_\ell$ is a centroid w.r.t. the loss function $L$, the overall leaf loss

$$L(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big)$$

becomes a generalised variance. Plugging in the squared-error loss and the arithmetic mean combiner, we obtain the usual “statistical” variance. It turns out that there is a whole class of loss functions for which this structure holds, named Bregman divergences. Note that the 0-1 loss is not a Bregman divergence; nevertheless, the intuition behind it remains the same. Bregman divergences have very convenient properties which allow us to make similarly optimistic guarantees for Random Forests.