
Decision Tree

Cornell University

Motivation for Decision Trees

Recall the kD tree data structure. Decision trees also divide the data into different regions, except that they stop dividing when the current region is pure (all points in it have the same label). Therefore, if a test point falls into any bucket, we can simply return the label of that bucket. Our new goal becomes to build a tree that is:

  1. Maximally compact (of minimal size)
  2. Only has pure leaves

Finding such a minimum-size tree turns out to be NP-hard. But the good news is that we can approximate it very effectively with a greedy strategy: we keep splitting the data so as to minimize an impurity function that measures label purity amongst the children.

Impurity Functions

Data: $S=\left\{ \left( \mathbf{x}_1,y_1 \right),\dots,\left( \mathbf{x}_n,y_n \right) \right\},\ y_i\in \left\{ 1,\dots,c \right\}$, where $c$ is the number of classes.

Let $S_k\subseteq S$ be the inputs with label $k$. Formally, $S_k=\left\{ \left( \mathbf{x},y \right)\in S: y=k \right\}$, so $S=S_1\cup \dots \cup S_c$. Define

$$p_k=\frac{\left| S_k \right|}{\left| S \right|}\leftarrow \textrm{fraction of inputs in } S \textrm{ with label } k$$

We know what we don't want (the uniform distribution): $p_1=p_2=\dots=p_c=\frac{1}{c}$. This is the worst case, since each class is equally likely and any decision made in such a region is no better than random guessing.
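As a quick illustration, here is a minimal sketch (assuming NumPy and a made-up label vector) of how the fractions $p_k$ are computed in practice:

```python
import numpy as np

# Hypothetical label vector with c = 3 classes; p_k is the fraction of label k in S.
y = np.array([1, 2, 2, 3, 3, 3])
labels, counts = np.unique(y, return_counts=True)
p = counts / len(y)
print(dict(zip(labels, p)))  # {1: 0.166..., 2: 0.333..., 3: 0.5}
```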

Gini impurity

So the Gini impurity of a leaf is:

$$G(S) = \sum_{k=1}^{c}p_k(1-p_k)$$

Fig: The Gini impurity function in the binary case reaches its maximum at $p=0.5$.

This matches the intuition that the uniform distribution is the worst case.

The Gini impurity of a tree is defined recursively over its splits, where a split partitions $S$ into disjoint children $S_L$ and $S_R$:

$$G^T(S)=\frac{\left| S_L \right|}{\left| S \right|}G^T(S_L)+\frac{\left| S_R \right|}{\left| S \right|}G^T(S_R)$$
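A minimal sketch of these two formulas, assuming NumPy; the helper names gini and gini_split are illustrative, and gini_split computes the weighted impurity of a single split:

```python
import numpy as np

def gini(y):
    """Gini impurity of a leaf: G(S) = sum_k p_k (1 - p_k)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def gini_split(y_left, y_right):
    """Gini impurity of a split: children weighted by their share of the data."""
    n = len(y_left) + len(y_right)
    return len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)

# Example: a pure leaf has impurity 0, a 50/50 leaf has impurity 0.5.
print(gini(np.array([1, 1, 1, 1])))                     # 0.0
print(gini(np.array([0, 0, 1, 1])))                     # 0.5
print(gini_split(np.array([0, 0]), np.array([1, 1])))   # 0.0 -- a perfect split
```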

Entropy

Define the impurity as how close we are to the uniform distribution, and use the $KL$-divergence to describe this "closeness" from $p$ to $q$. (Note: the $KL$-divergence is not a metric because it is not symmetric, i.e. $KL(p||q)\neq KL(q||p)$.)

$$KL(p||q)=\sum_{k=1}^{c}p_k\log\frac{p_k}{q_k}$$

In our case, $p$ is our label distribution, and we let $q_1,\dots,q_c$ be the uniform distribution, i.e. $\forall k,\; q_k=\frac{1}{c}$. We then calculate the KL-divergence between $p$ and $q$:

$$\begin{align}
KL(p||q) &= \sum_{k=1}^{c}p_k\log\frac{p_k}{q_k}\\
&= \sum_{k}p_k\log(p_k)-p_k\log(q_k) && \left(q_k=\tfrac{1}{c}\right)\\
&= \sum_{k}p_k\log(p_k)+p_k\log(c)\\
&= \sum_{k}p_k\log(p_k)+\log(c)\sum_{k}p_k
\end{align}$$

If $p$ is uniform (our least wanted case), then $\log \frac{p_k}{q_k} = \log 1 = 0$ for every $k$, so $KL = 0$. From this, we see that we want to maximize the distance $KL(p||q)$.

Since $\log(c)$ is constant and $\sum_{k}p_k=1$,

$$\begin{align}
\max_{p}KL(p||q) &= \max_{p}\sum_{k}p_k\log(p_k) \\
&= \min_{p}-\sum_{k}p_k\log(p_k) \\
&= \min_{p}H(S)
\end{align}$$

Define the entropy of a set $S$ as $H(S) = -\sum_{k}p_k\log(p_k)$. We have therefore translated the problem of maximizing the distance from the uniform distribution into minimizing the entropy.

Entropy over a tree is defined in the same way as the Gini impurity:

$$H(S)=\frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R)$$
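A small numerical sketch of these formulas (assuming NumPy; the function names are illustrative). It also checks the identity implied by the derivation above: when $q$ is uniform, $KL(p||q) = \log(c) - H(S)$, so maximizing the KL-divergence is the same as minimizing the entropy.

```python
import numpy as np

def entropy(p):
    """H = -sum_k p_k log(p_k); terms with p_k = 0 contribute 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

c = 4
p = np.array([0.7, 0.1, 0.1, 0.1])       # an arbitrary label distribution
q = np.full(c, 1.0 / c)                  # the uniform distribution

kl = float(np.sum(p * np.log(p / q)))    # KL(p || q)
print(np.isclose(kl, np.log(c) - entropy(p)))  # True: max KL <=> min H

def split_entropy(y_left, y_right):
    """Entropy of a split, weighted by the size of each child (same form as Gini)."""
    def H(y):
        _, counts = np.unique(y, return_counts=True)
        return entropy(counts / counts.sum())
    n = len(y_left) + len(y_right)
    return len(y_left) / n * H(y_left) + len(y_right) / n * H(y_right)
```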

ID3-Algorithm

Base Cases

In two situations, we say that we have reached the base case:

  1. there is no need to split: all the data points in a subset have the same label, so return that label
  2. there is no way to split: there are no more attributes that could be used to split the subset (all the data points have the same value in every entry), so return the most common label
$$\textrm{ID3}(S):\left\{ \begin{array}{ll} \textrm{if } \exists \bar{y}\textrm{ s.t. }\forall(x,y)\in S,\ y=\bar{y} &\Rightarrow \textrm{return leaf with label } \bar{y}\\ \textrm{if } \exists\bar{x}\textrm{ s.t. }\forall(x,y)\in S,\ x=\bar{x} &\Rightarrow \textrm{return leaf with mode}(y:(x,y)\in S)\textrm{ or mean (regression)} \end{array} \right.$$
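These two base cases translate almost directly into code. The following is only a sketch under assumed conventions (NumPy arrays, a leaf represented simply by the label it predicts); the recursive split is covered in the next subsection:

```python
import numpy as np

def id3(X, y):
    # Base case 1: no need to split -- all points share the same label.
    if np.all(y == y[0]):
        return y[0]                       # leaf with that label
    # Base case 2: no way to split -- all points have identical feature vectors;
    # return the most common label (or the mean of y for regression).
    if np.all(X == X[0]):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]  # leaf with the mode of the labels
    # Recursive case: pick the impurity-minimizing (feature, threshold) split,
    # partition S into S_L and S_R, and recurse on both children (see below).
    ...
```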

Note: we do not stop even if no split can improve the impurity.

Why? Look at the following XOR example.

Fig: The XOR example.

Decision trees are myopic: ID3 (like essentially all practical algorithms for building decision trees) is a greedy algorithm and only looks at one feature at a time. In the XOR example, the first split does NOT decrease the impurity measure; however, combining two feature splits reduces the impurity to 0.
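A quick numerical check of this myopia on a tiny XOR dataset (a sketch assuming NumPy; gini is the leaf impurity from above, repeated here so the snippet is self-contained):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

# 2-D XOR data: the label is 1 exactly when the two binary features differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

print(gini(y))                                   # 0.5 at the root
left, right = y[X[:, 0] == 0], y[X[:, 0] == 1]
print(0.5 * gini(left) + 0.5 * gini(right))      # still 0.5 after splitting on feature 0
# A second split on feature 1 inside each child produces four pure leaves:
for mask in (X[:, 0] == 0, X[:, 0] == 1):
    for v in (0, 1):
        print(gini(y[mask & (X[:, 1] == v)]))    # 0.0 each
```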

Recursion Step and Time Analysis

At each recursive step, we find the split that minimizes impurity by searching over two parameters: the feature (dimension) $f$ to split on and the threshold $t$ at which to split it.

For a specific feature/dimension $f$, there are $n-1 = O(n)$ possible places to draw the line (the same as putting dividers between balls). When we draw such a line at threshold $t$, we divide the set into $x_f\le t$ and $x_f > t$, so

$$\textrm{Define: }\begin{bmatrix} S_L=\left\{ (x,y)\in S: x_f\leq t \right\}\\ S_R=\left\{ (x,y)\in S: x_f> t \right\} \end{bmatrix}$$

To draw such a line based on dimension $f$, we first have to sort all data points by $x_f$, their value in this dimension. This sort takes $O(n\log n)$ time, and we then sweep over all $O(n)$ possible line positions (updating the label counts incrementally) to determine where to draw the line. Therefore, it takes $O(n\log n)$ time per feature.

For all features and all corresponding places to draw the line, we choose the division that minimizes the impurity function. There are $d$ features in total, so finding the best split at a node takes

$$O(dn\log n)$$
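A sketch of this search, assuming NumPy, integer class labels in $\{0,\dots,c-1\}$, and Gini impurity; the function name best_split is illustrative. Each feature is sorted once ($O(n\log n)$) and then swept in $O(n)$ by maintaining running class counts, giving $O(dn\log n)$ overall:

```python
import numpy as np

def gini_from_counts(counts, total):
    # G = 1 - sum_k p_k^2, which equals sum_k p_k (1 - p_k) since sum_k p_k = 1.
    p = counts / total
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y, n_classes):
    """Return the (feature, threshold) that minimizes the weighted Gini impurity."""
    n, d = X.shape
    best_f, best_t, best_score = None, None, np.inf
    for f in range(d):
        order = np.argsort(X[:, f])                      # O(n log n) sort per feature
        xs, ys = X[order, f], y[order]
        left = np.zeros(n_classes)                       # class counts in S_L
        right = np.bincount(ys, minlength=n_classes).astype(float)  # counts in S_R
        for i in range(n - 1):                           # O(n) sweep over dividers
            left[ys[i]] += 1                             # move point i to the left
            right[ys[i]] -= 1
            if xs[i] == xs[i + 1]:
                continue                                 # no valid threshold here
            score = ((i + 1) / n) * gini_from_counts(left, i + 1) + \
                    ((n - i - 1) / n) * gini_from_counts(right, n - i - 1)
            if score < best_score:
                best_f, best_t, best_score = f, (xs[i] + xs[i + 1]) / 2, score
    return best_f, best_t, best_score
```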

Regression Trees - CART

CART stands for Classification and Regression Trees.

Assume the labels are continuous: $y_i\in\mathbb{R}$. We can then use the squared loss as our impurity function:

$$\begin{align}
L(S)&=\frac{1}{|S|}\sum_{(x,y)\in S}(y-\bar{y}_S)^2 &&\leftarrow \textrm{average squared difference from the average label}\\
&\textrm{where }\bar{y}_S=\frac{1}{|S|}\sum_{(x,y)\in S}y &&\leftarrow\textrm{average label}
\end{align}$$
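A one-function sketch of this impurity (assuming NumPy; the name squared_loss_impurity is made up):

```python
import numpy as np

def squared_loss_impurity(y):
    """L(S): average squared difference from the average label (the label variance)."""
    return float(np.mean((y - np.mean(y)) ** 2))

print(squared_loss_impurity(np.array([2.0, 2.1, 1.9])))   # small: labels nearly pure
print(squared_loss_impurity(np.array([0.0, 5.0, 10.0])))  # large: labels spread out
```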

This model has the following characteristics:

Parametric vs. Non-parametric algorithms

So far we have introduced a variety of algorithms. One can categorize these into different families, such as generative vs. discriminative, or probabilistic vs. non-probabilistic. Here we will introduce another one: parametric vs. non-parametric. Roughly speaking, a parametric algorithm has a model size that is fixed (or bounded) before the training data are observed, whereas the model size of a non-parametric algorithm grows with the amount of training data.

Decision trees are an interesting edge case. If they are trained to full depth, they are non-parametric, as the depth of a decision tree scales as a function of the training data (in practice $O(\log_2(n))$). If we however limit the tree depth by a maximum value, they become parametric (as an upper bound on the model size is now known prior to observing the training data).