… in which we thoroughly motivate common impurity measures used for Decision Tree learning and show that each optimises some specific loss function.
The Problem
To grow (construct, learn) a decision tree model, we need the following basic ingredients:
- A splitting criterion that evaluates candidate splits and thus determines where to split a node.
- A leaf combiner that produces a prediction from the leaf that the query example falls in.
Further, as in any learning task, we use a loss function to evaluate the performance of our model.
Here’s how Decision Trees are introduced in virtually all of the teaching materials I’ve seen:
- A diagram illustrating how the data space is iteratively split, yielding a tree structure.
- A list of so-called “impurity measures” which can, allegedly, be used as splitting criteria.
Often omitted or mentioned only in passing are:
- How does a learned Decision Tree actually produce a prediction?
- How, if at all, is any of this related to the loss function we are using to evaluate our tree?
To illustrate my point, consider the impurity measure that is almost always one of the prime examples.
Gini Impurity
Let $p_\ell$ be the probability distribution of classes in leaf $\ell$, and let $p_{\ell,c}$ be the probability of drawing an example of class $c$ from leaf $\ell$. The Gini impurity is defined as

$$G(\ell) = \sum_{c} p_{\ell,c}\,(1 - p_{\ell,c}) = 1 - \sum_{c} p_{\ell,c}^2.$$
Cool, you think, nodding smartly at the slide. You’ve seen some of these symbols before. If you stare really hard, you may see that the Gini impurity is the probability of drawing two examples of different classes from the leaf $\ell$. If you are very unlikely to draw two examples of different classes, then the class distribution in the leaf is very pure, at least validating the name.
But how do we know this is actually helpful in constructing a “good” tree? And what even distinguishes a good tree from a not-so-good tree? In other words, how do we know the splitting criterion of our choice is actually sensible? And what about the leaf combiner, can I just pick something that seems reasonable?
Ultimately, the only relevant measure of quality is our loss function $L$.
The quality of a single decision tree $T$ is measured with respect to a loss function $L$ evaluated over some test set $D_{\text{test}}$. We are ultimately interested in finding a tree that minimises the loss over the test set:

$$T^\star = \arg\min_{T} \; \frac{1}{|D_{\text{test}}|} \sum_{(x,y) \in D_{\text{test}}} L\big(T(x), y\big).$$
So, how do we know that our splitting criterion actually improves this quantity? Or would it even be possible that splitting a tree node hurts performance?
But alas, the professor (or YouTuber, or your attention) has already carried on and you’re left with nothing but an uneasy feeling.
Claim
The loss function, splitting criterion (impurity measure) and leaf combiner are tightly related, and choosing one implies the others.
Claim
Choosing the loss function, impurity measure and leaf combiner in accordance yields the theoretical guarantee that a tree model is always at least as good after the split of a leaf node as before (provided we stop splitting when no impurity reduction is possible).
Let’s take a step back and consider the basic nature of decision trees.
Tree leaves partition the data space
A decision tree is constructed (grown, learned) as follows:
- initial case: take the entire training dataset $D$ and use it to find a binary split along some dimension at some threshold value, forming two child nodes
- recursion: for a child node, take the training examples that are within its decision boundaries and use them to find a binary split (as above)
At some point (at the latest when each node only contains a single example) we stop splitting. The remaining nodes are the leaves.
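To make the recursion concrete, here is a minimal sketch in Python. It is not a reference CART implementation, just an illustration under simplifying assumptions: a regression setting, variance as the impurity, and an exhaustive search over feature values as candidate thresholds. All names (`grow`, `best_split`, `min_leaf`) are my own.

```python
import numpy as np

def variance(y):
    """Impurity of a node: the variance of the targets it contains."""
    return np.var(y) if len(y) > 0 else 0.0

def best_split(X, y):
    """Search all (dimension, threshold) pairs and return the one with the
    lowest weighted child impurity, or None if no split reduces impurity."""
    n, d = X.shape
    best, best_score = None, variance(y)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:  # the largest value would leave one child empty
            left = X[:, j] <= t
            score = (left.sum() * variance(y[left])
                     + (~left).sum() * variance(y[~left])) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow(X, y, min_leaf=2):
    """Recursively perform greedy binary splits; stop when no split helps."""
    split = best_split(X, y) if len(y) > min_leaf else None
    if split is None:
        return {"prediction": y.mean()}  # leaf: aggregate the targets (here: mean)
    j, t = split
    left = X[:, j] <= t
    return {"dim": j, "threshold": t,
            "left": grow(X[left], y[left], min_leaf),
            "right": grow(X[~left], y[~left], min_leaf)}
```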
Observe that we are recursively partitioning the training dataset $D$. But we are not just assigning training examples to nodes — we are finding threshold values (and dimensions) in the value range of the data space $\mathcal{X}$. In other words, we are partitioning the data space $\mathcal{X}$. By proxy, this also partitions $D$. But nothing prevents us from taking any new query point $x$ and checking our tree’s decision boundaries to tell which leaf $x$ would belong to.
The resulting tree leaves represent a partition in the mathematical sense: Each element of the underlying set belongs to exactly one partition cell (leaf), cells/leaves do not overlap, and their union constitutes the entire space.
Formally, we can identify a grown tree $T$ with a partition of $\mathcal{X}$ and can write $T = \{\ell_1, \dots, \ell_k\}$ if the $\ell_i$ are leaves (partition cells).
This is nice, because it allows us to make statements across the entire space by only considering the partition cells (leaves) individually. For example, if $\{\ell_1, \dots, \ell_k\}$ is a partition of $D$ into $k$ cells, we can write the sum of any quantity $f(x)$ over all points in $D$ in two different ways:
- Simply by summing over all points in $D$: $\sum_{x \in D} f(x)$
- As the sum of the sums over each individual partition cell: $\sum_{i=1}^{k} \sum_{x \in D \cap \ell_i} f(x)$
In the following, we will extend this idea and, instead of the sum, talk about the mean (or expectation) over the entire set. We then essentially arrive at a special case of the Law of Total Expectation (see Section 3.2.2).
Leaf losses
Recall that, given a query example $x$, a decision tree $T$ produces a prediction $T(x)$ as follows:
- Find the leaf $\ell \in T$ whose bounds (decision rules) contain $x$.
- Aggregate the training examples within the bounds of $\ell$ to produce the prediction $\hat{y}_\ell$.
This means that for all queries $x$ that fall within leaf $\ell$, the tree makes the same prediction $\hat{y}_\ell$.
In other words, tree predictions are constant over individual leaves.
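Continuing the sketch from above (same hypothetical nested-dict representation), prediction is just a descent to the unique leaf whose decision rules contain the query point:

```python
def predict(tree, x):
    """Descend to the leaf containing x and return the prediction stored there."""
    while "prediction" not in tree:  # internal nodes carry a split rule
        branch = "left" if x[tree["dim"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["prediction"]        # the aggregate of that leaf's training targets
```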
Using this intuition and some basic arithmetic, we can write

$$L(T, D) = \frac{1}{|D|} \sum_{(x,y) \in D} L\big(T(x), y\big) = \sum_{\ell \in T} \frac{|D_\ell|}{|D|} \cdot \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(T(x), y\big),$$

where $D_\ell \subseteq D$ denotes the training examples falling into leaf $\ell$.
Because $T(x) = \hat{y}_\ell$ for all $x \in \ell$ (predictions are constant over a leaf), we have

$$L(T, D) = \sum_{\ell \in T} \frac{|D_\ell|}{|D|} \underbrace{\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big)}_{=:\, L(\ell)}.$$

A tree partitions the data space (and consequently the training data $D$) into leaves. The loss of a tree is exactly equal to a weighted sum of the leaf losses $L(\ell)$. This means we need only consider individual leaves in isolation. If we want a tree with a small overall loss, we need to ensure that each leaf has a low loss.
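This decomposition is easy to check numerically. The snippet below is a sketch that reuses the hypothetical `grow` and `predict` functions from above, takes the squared error as $L$, and compares the overall loss to the weighted sum of leaf losses:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

tree = grow(X, y)  # from the sketch above

def leaf_id(tree, x):
    """Identify the leaf a point falls into by the path taken to reach it."""
    path = ""
    while "prediction" not in tree:
        branch = "left" if x[tree["dim"]] <= tree["threshold"] else "right"
        path, tree = path + branch[0], tree[branch]
    return path

preds = np.array([predict(tree, x) for x in X])
ids = np.array([leaf_id(tree, x) for x in X])

total_loss = np.mean((preds - y) ** 2)   # loss of the tree over all of D
leaf_sum = sum(                          # sum of |D_l|/|D| times the leaf loss
    (ids == leaf).mean() * np.mean((preds[ids == leaf] - y[ids == leaf]) ** 2)
    for leaf in np.unique(ids)
)
print(np.isclose(total_loss, leaf_sum))  # True
```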
The best leaf combiner is the centroid
Given a loss function $L$ and assuming the tree structure (and thus the leaves $\ell$) is fixed, how should we define the leaf combiner $\hat{y}_\ell$? From the equation above, we can already see that the leaf loss is minimised if and only if

$$\hat{y}_\ell = \arg\min_{z} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L(z, y).$$
This is somewhat tautological: We say the best leaf combiner is the one that is, well, the best with respect to . However, there is another way of looking at the above equation which will be the key to actually understanding how everything comes together.
Recall that the purpose of a loss function is to measure the discrepancy between two outcomes. That is, if $\hat{y}$ and $y$ are very different, $L(\hat{y}, y)$ should be large; and if $\hat{y}$ and $y$ are very similar, $L(\hat{y}, y)$ should be small. Imagining $\hat{y}$ and $y$ as points in a mathematical space, $L(\hat{y}, y)$ is reminiscent of a distance measure between $\hat{y}$ and $y$[^2]. The minimiser $\hat{y}_\ell$ above is then the point for which the sum of “distances” to all other points is minimal. In other words, $\hat{y}_\ell$ is the centroid of the points in $\ell$ with respect to $L$!
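As a small illustration of the centroid idea (a sketch with made-up numbers and a brute-force grid search standing in for the $\arg\min$): under the squared-error “distance”, the point with the smallest total distance to the observed outcomes is, numerically, their arithmetic mean.

```python
import numpy as np

def centroid(ys, loss, candidates):
    """Brute-force centroid: the candidate with the smallest total loss
    ("distance") to all observed outcomes ys."""
    return min(candidates, key=lambda z: sum(loss(z, y) for y in ys))

ys = [1.0, 2.0, 2.0, 7.0]            # outcomes in one leaf
grid = np.linspace(0.0, 10.0, 1001)  # candidate predictions

best = centroid(ys, lambda z, y: (z - y) ** 2, grid)
print(best, np.mean(ys))             # both ~3.0
```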
Maximising purity means minimising leaf loss
The next question is how we can find good decision boundaries. In principle, we could run a single optimisation procedure to find the best possible model for us. This, however, is very expensive to compute (the problem is NP-hard). When faced with such a dire situation, we can settle for an optimisation strategy that may find only an almost-optimal solution but is much faster. One broad category is greedy optimisation, in which we do not try to make all choices (find all parameters) simultaneously but greedily make the next-best choice that heuristically seems to lead us in a good direction.
The vanilla algorithm to construct decision trees is such a greedy optimisation technique. The basic idea is to iteratively perform binary splits. Note that nobody is saying that this is an exceptionally good method per se — it is just a pragmatic means to solve an otherwise challenging problem[^1]. Countless alternative procedures have been proposed that, arguably, can grow better trees.
Even though our rule to find just the next-best split may not always yield the best possible decision tree, it does always improve the loss of the tree. Showing that this is true is our main objective, as described in the introduction. A split is performed according to some measure of “purity”. Hence, the key question is how finding a “pure” split improves the leaf loss and, by proxy, the loss of the entire tree model.
Squared-error loss and Variance Reduction
Consider the regression task under the squared-error loss $L(\hat{y}, y) = (\hat{y} - y)^2$. A commonly used impurity measure is the squared-error variance (also known as the CART criterion):

$$I(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} \big(y - \bar{y}_\ell\big)^2 \quad \text{with} \quad \bar{y}_\ell = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} y.$$

Or, writing $L$ instead of the squared difference:

$$I(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\bar{y}_\ell, y\big).$$

$I(\ell)$ is the statistical variance of the targets in $\ell$ and $\bar{y}_\ell$ is their arithmetic mean. The arithmetic mean is in fact the centroid with respect to the squared-error loss.
If, as argued earlier, we indeed define the leaf combiner as the centroid w.r.t. $L$, then we have $\hat{y}_\ell = \bar{y}_\ell$ and

$$I(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big) = L(\ell).$$

Consequently, simply by virtue of definition, we can see that $I(\ell) = L(\ell)$. In less formal terms, we have now seen that
- If the leaf combiner is indeed chosen to be the centroid, …
- … the impurity measure used here (statistical variance) is an instance of a generalised kind of variance with respect to $L$, of the shape $\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L(\hat{y}_\ell, y)$, …
- … and the impurity of the leaf is exactly the loss of the leaf.
Result
Splitting according to $I(\ell)$ (variance reduction) minimises the squared-error loss of the tree if the leaf combiner is the arithmetic mean.
Note that we have not used any properties of the squared-error loss except that its centroid is the arithmetic mean. This is the correspondence between the squared-error loss and the arithmetic mean leaf combiner. So, in principle, if we were to find other loss functions and combiners with the same correspondence (the combiner is the centroid), the same argument should hold. As we will now show, this is indeed the case.
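As a quick sanity check of the identity $I(\ell) = L(\ell)$ for a single leaf, here is a sketch with made-up targets:

```python
import numpy as np

y_leaf = np.array([2.0, 3.0, 5.0, 10.0])  # targets of the examples in one leaf
combiner = y_leaf.mean()                   # arithmetic mean = centroid w.r.t. squared error

impurity = np.var(y_leaf)                      # statistical variance of the leaf
leaf_loss = np.mean((combiner - y_leaf) ** 2)  # average squared-error loss of the prediction

print(np.isclose(impurity, leaf_loss))  # True
```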
0-1 loss and majority vote
The majority vote is a centroid with respect to the 0-1 loss $L(\hat{y}, y) = \mathbf{1}[\hat{y} \neq y]$. The implied impurity measure is the error rate (one minus the accuracy).
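The same pattern can be sketched numerically for one leaf: with the majority vote as the combiner, the average 0-1 loss of the leaf is exactly its error rate.

```python
from collections import Counter

labels = ["a", "a", "a", "b", "c"]                                 # class labels in one leaf

majority = Counter(labels).most_common(1)[0][0]                    # majority-vote combiner
zero_one_loss = sum(y != majority for y in labels) / len(labels)   # average 0-1 loss
error_rate = 1 - Counter(labels)[majority] / len(labels)           # 1 - accuracy

print(zero_one_loss, error_rate)  # both 0.4
```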
Impurities of probability distributions
Our leading example, the Gini impurity, talks about probabilities. To connect it to the leaf loss we derived above, we have to do some extra work to bridge the gap. One essential ingredient is that we define the leaf combiner to simply be the probability distribution of classes in the leaf, $\hat{y}_\ell = p_\ell$. I currently don’t have a very good derivation for this choice, except that if you work through the equations below, you will see that it is a very natural one to make the scheme work out.
Let $p_{\ell,c}$ be the $c$-th entry of the distribution $p_\ell$, i.e. the fraction of examples in $D_\ell$ with class $c$. Then

$$L(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big) = \sum_{c} p_{\ell,c}\, L\big(\hat{y}_\ell, c\big),$$

where the second equality simply groups the examples in the leaf by their class.
Gini impurity
To measure the purity of class labels in a cell, one may consider the probability of drawing two different outcomes from the examples in the current cell. Let $p_\ell$ be the probability distribution of classes in leaf $\ell$, and let $p_{\ell,c}$ be the probability of drawing an example of class $c$ from leaf $\ell$. The probability of drawing one example of class $c$ and one of a different class is $p_{\ell,c}(1 - p_{\ell,c})$. The probability of drawing two examples of any two different classes then is the Gini impurity

$$G(\ell) = \sum_{c} p_{\ell,c}\,(1 - p_{\ell,c}) = 1 - \sum_{c} p_{\ell,c}^2.$$
Unlike in the squared-error case, it is not so obvious which loss function the Gini impurity corresponds to.
The above result expresses the leaf loss under the assumption that the leaf combiner is the class distribution, i.e. $\hat{y}_\ell = p_\ell$. Let us compare it to the definition of the Gini impurity:

$$L(\ell) = \sum_{c} p_{\ell,c}\, L\big(p_\ell, c\big) \qquad \text{vs.} \qquad G(\ell) = \sum_{c} p_{\ell,c}\, \big(1 - p_{\ell,c}\big).$$

The two coincide term by term if we choose $L(\hat{y}, y) = 1 - \hat{y}_y$, i.e. one minus the probability the prediction assigns to the true class $y$.
This means that if we minimise the Gini impurity for a leaf, we are in fact minimising the leaf loss for this choice of $L$.
It remains to be seen that $p_\ell$ is in fact a centroid with respect to this $L$. We will start from a general characterisation of the centroid and then see that $p_\ell$ in fact fulfils that characterisation.
The centroid is

$$\hat{y}_\ell = \arg\min_{q} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L(q, y) = \arg\min_{q} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} \big(1 - q_y\big) = \arg\max_{q} \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} q_y.$$

Let $e_y$ be the vector that contains a $1$ at position $y$ and $0$s otherwise. Then

$$\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} q_y = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} \langle q, e_y \rangle = \Big\langle q, \; \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} e_y \Big\rangle.$$

Further, $\frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} e_y$ is exactly the vector of class frequencies $p_\ell$. Continuing, we have

$$\arg\max_{q} \; \langle q, p_\ell \rangle,$$

which is maximised if and only if $q = p_\ell$.
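Numerically, the correspondence between the Gini impurity and the leaf loss is easy to verify for a single leaf. A sketch, with the loss taken to be $1 - \hat{y}_y$ (one minus the probability the combiner assigns to the true class):

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 2])                  # class labels in one leaf
classes, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()                           # combiner: class distribution p_l
p_of = dict(zip(classes, p))                        # probability assigned to each class

gini = np.sum(p * (1 - p))                          # Gini impurity of the leaf
leaf_loss = np.mean([1 - p_of[y] for y in labels])  # average loss 1 - p_l[y]

print(np.isclose(gini, leaf_loss))  # True
```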
Entropy and Information Gain
Using the same strategy, we can show that the Information Gain criterion corresponds to the entropy impurity and the cross-entropy loss.
Consider the entropy impurity measure

$$I(\ell) = H(p_\ell) = -\sum_{c} p_{\ell,c} \log p_{\ell,c}.$$

Information Gain is exactly the reduction of this quantity achieved by a split. If the leaf combiner is the distribution of classes in $\ell$, i.e. $\hat{y}_\ell = p_\ell$, then $p_\ell$ maximises

$$\sum_{c} p_{\ell,c} \log \hat{y}_{\ell,c}$$

over all distributions $\hat{y}_\ell$ (Gibbs’ inequality), i.e. it is the centroid with respect to the loss $L(\hat{y}, y) = -\log \hat{y}_y$, and the resulting leaf loss is exactly the entropy $H(p_\ell)$. Rewritten over all examples, this loss is the log-loss, also known as the cross-entropy loss.
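The analogous check for the entropy (a sketch): with the class distribution as the combiner, the average log-loss over the examples in the leaf equals the entropy of that distribution.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 2])                          # class labels in one leaf
classes, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()                                   # combiner: class distribution p_l
p_of = dict(zip(classes, p))

entropy = -np.sum(p * np.log(p))                            # entropy impurity of the leaf
leaf_log_loss = np.mean([-np.log(p_of[y]) for y in labels]) # average cross-entropy loss

print(np.isclose(entropy, leaf_log_loss))  # True
```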
Summary
We have now seen that for a variety of widely-used impurity measures, we can derive exactly the loss function they are optimising. This, in turn, implies a choice of combiner function.
| Impurity | Combiner / Centroid | Loss |
|---|---|---|
| Squared-Error Variance | Arithmetic Mean | Squared-Error |
| Error Rate | Majority Vote | 0-1 loss |
| Gini Impurity | Distribution $p_\ell$ | $1 - \hat{y}_y$ (probability not assigned to the target class) |
| Information Gain (Entropy) | Distribution $p_\ell$ | Cross-Entropy (log-loss) |
Note that I am not making any claims about any of these combinations being empirically better than the others.
Consequently, if we split according to a given impurity measure (and the split indeed reduces impurity), we now know that this also directly improves a certain loss function. In other words, we have the theoretical guarantee that the tree model is always at least as good after the split of a leaf node as before (provided we stop when no impurity reduction is possible).
Outlook
Given that the leaf combiner $\hat{y}_\ell$ is a centroid w.r.t. the loss function $L$, the overall leaf loss

$$L(\ell) = \frac{1}{|D_\ell|} \sum_{(x,y) \in D_\ell} L\big(\hat{y}_\ell, y\big)$$

becomes a generalised variance. Plugging in the squared-error loss and the arithmetic mean combiner, we obtain the usual “statistical” variance. It turns out that there is a whole class of loss functions for which this structure holds, named Bregman divergences. Note that the 0-1 loss is not a Bregman divergence; nevertheless, the intuition behind it remains the same. Bregman divergences have very convenient properties which allow us to make similarly optimistic guarantees for Random Forests.