
are called leaf nodes. Every node except a leaf has branches coming out of it, and each branch is defined by an 'if' statement. Likewise, every node except a leaf has child nodes, which are the nodes coming out of another node, called the parent node. What we now need to understand is how decision trees construct these if-statements.
2 How are Decision Trees built?
First, we need to understand the splitting rule. The splitting rule splits a node (the parent node) into two other nodes (the child nodes). A splitting rule requires an explanatory variable and a threshold, just as in Figure 1; a small sketch of such a rule is given below.
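As a small illustration (not taken from the text), such a splitting rule can be written as a function that sends each observation to the left or right child depending on one explanatory variable and a threshold; the data, feature index, and threshold below are made up.

# Minimal sketch of a splitting rule: one explanatory variable, one threshold.
def split_node(X, feature, threshold):
    """Send each row of X to the left child if row[feature] <= threshold, else right."""
    left = [row for row in X if row[feature] <= threshold]
    right = [row for row in X if row[feature] > threshold]
    return left, right

# Example: split on the first explanatory variable at the (hypothetical) threshold 2.5.
X = [[1.0, 7.2], [3.4, 0.1], [2.1, 5.5], [4.8, 1.9]]
left, right = split_node(X, feature=0, threshold=2.5)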
To understand how nodes are split, we also need to understand the term impurity. Let $p_i$ denote the proportion of observations in class $i$. Impurity is a measure of how much information is needed to classify the observations in a node. We say that a node is pure if it only contains observations from one class. But how do we measure impurity? We can use entropy (as in logistic regression) or the Gini index.
Mathematically we can write Entropy as:
\[
\text{Entropy} = -\sum_{i=1}^{n} p_i \cdot \log(p_i) \qquad (1)
\]
The interpretation of impurity through entropy is simple: entropy is the amount of information needed to accurately describe the data. If the data is homogeneous, meaning all observations belong to the same class, the entropy is 0. If the data is divided equally between two classes, the entropy is 1 (using the base-2 logarithm). Let us also take a look at the Gini index:
\[
\text{Gini index} = 1 - \sum_{i=1}^{n} p_i^2 \qquad (2)
\]
The interpretation is very similar to that of entropy: the Gini index is a measure of inequality in the data. A small numerical sketch of both impurity measures is given below.
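As an illustration only (the text gives no code), the two impurity measures in equations (1) and (2) can be computed from the class proportions of a node roughly as follows; the function names are hypothetical, and entropy is taken with a base-2 logarithm.

import math

def entropy(proportions):
    # Equation (1) with a base-2 logarithm; terms with p_i = 0 contribute nothing.
    return sum(-p * math.log2(p) for p in proportions if p > 0)

def gini(proportions):
    # Equation (2): one minus the sum of squared class proportions.
    return 1 - sum(p ** 2 for p in proportions)

# A pure node versus a node divided equally between two classes.
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5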
But how exactly is a node split then? The objective of the algorithm is to maximize the gain in purity. To understand what this means, consider a parent node with $m$ observations, where $m_1$ and $m_2$ are the numbers of observations in child 1 and child 2, so that $m = m_1 + m_2$. The objective (gain in purity) is then defined as:
\[
\text{gain} = 1 - \frac{m_1}{m}\,\text{impurity}_1 - \frac{m_2}{m}\,\text{impurity}_2 \qquad (3)
\]
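As a purely hypothetical illustration of equation (3): suppose a parent node with $m = 100$ observations is split into children with $m_1 = 40$ and $m_2 = 60$ observations, whose Gini impurities are $0.2$ and $0.3$. The gain would then be

\[
\text{gain} = 1 - \frac{40}{100}\cdot 0.2 - \frac{60}{100}\cdot 0.3 = 1 - 0.08 - 0.18 = 0.74.
\]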
The algorithm finds the split with maximal gain in purity by full enumeration: it goes through all explanatory variables and tries every possible split for each variable, and the variable and split that maximize the gain in purity are chosen. The Gini index is most commonly used for calculating the impurity. A sketch of this search is given below.
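To make the full enumeration concrete, here is a rough Python sketch of how such a search could be implemented. It is only an illustration under stated assumptions (Gini impurity, binary splits of the form "variable <= threshold", thresholds taken from the observed values), not the implementation described in the text, and all function names are hypothetical.

def gini(proportions):
    # Gini index as in equation (2).
    return 1 - sum(p ** 2 for p in proportions)

def class_proportions(labels):
    # Proportion of observations in each class within a node.
    return [labels.count(c) / len(labels) for c in set(labels)]

def best_split(X, y):
    # Full enumeration: try every explanatory variable and every observed value
    # as a threshold, and keep the pair with the largest gain in purity.
    m = len(y)
    best_feature, best_threshold, best_gain = None, None, float("-inf")
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left_y = [y[i] for i, row in enumerate(X) if row[feature] <= threshold]
            right_y = [y[i] for i, row in enumerate(X) if row[feature] > threshold]
            if not left_y or not right_y:
                continue  # skip splits that leave one child empty
            # Gain in purity as in equation (3), with the Gini index as impurity.
            gain = (1 - (len(left_y) / m) * gini(class_proportions(left_y))
                      - (len(right_y) / m) * gini(class_proportions(right_y)))
            if gain > best_gain:
                best_feature, best_threshold, best_gain = feature, threshold, gain
    return best_feature, best_threshold, best_gain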
This splitting could continue until every node is pure, but if the algorithm were run to that stage the model would be overfitted. We therefore need to define a stopping criterion that tells the algorithm when to stop splitting parent nodes and make a classification.
The usual stopping criterion is simple. The number of observations in each node should be above a