
Boosting

Cornell University

Generic Boosting

Boosting Reduces Bias

Scenario: A hypothesis class $\mathbb{H}$ whose classifiers have large bias, so the training error is high (e.g. CART trees with very limited depth).

Famous question: Can weak learners ($h$) be combined to generate a strong learner with low bias?

Answer: Yes!

Solution: Create the ensemble classifier $H_T(\vec x) = \sum_{t = 1}^{T}\alpha_t h_t(\vec{x})$. This ensemble classifier is built in an iterative fashion: in iteration $t$ we add the classifier $\alpha_t h_t(\vec x)$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
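To make the test-time evaluation concrete, here is a minimal Python sketch of evaluating such an ensemble; the weak learners and weights below are illustrative, not taken from the notes:

```python
import numpy as np

def ensemble_predict(x, weak_learners, alphas):
    """Evaluate H_T(x) = sum_t alpha_t * h_t(x) for one input x.

    `weak_learners` is a list of callables h_t and `alphas` the matching
    weights; both names are placeholders chosen for this sketch.
    """
    return sum(a * h(x) for a, h in zip(alphas, weak_learners))

# Example: three stump-like weak learners combined into one ensemble.
hs = [lambda x: 1.0 if x[0] > 0 else -1.0,
      lambda x: 1.0 if x[1] > 0.5 else -1.0,
      lambda x: -1.0 if x[0] + x[1] > 1 else 1.0]
alphas = [0.8, 0.5, 0.3]
print(ensemble_predict(np.array([0.2, 0.9]), hs, alphas))  # weighted sum; its sign gives the class
```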

Let $\ell$ denote a (convex and differentiable) loss function.

$$
\ell(H)=\frac{1}{n}\sum_{i=1}^n \ell(H(\mathbf{x}_i),y_i)
$$

Assume we have already finished $t$ iterations and have an ensemble classifier $H_t(\vec{x})$. Now at iteration $t+1$ we add one more weak learner $h_{t+1}$ that minimizes the loss:

$$
h_{t+1} = \textrm{argmin}_{h \in \mathbb{H}} \; \ell(H_t + \alpha h).
$$

So $H_{t+1} := H_t + \alpha h_{t+1}$.

Finding $h$ - Gradient Descent in Functional Space

As before, we find the minimum by gradient descent. However, instead of finding a parameter vector that minimizes the loss function, this time we find a function $h$. Therefore we do gradient descent in function space.

Given $H$, we want to find the step-size $\alpha$ and the weak learner $h$ that minimize the loss $\ell(H+\alpha h)$. Use the Taylor approximation of $\ell(H+\alpha h)$:

$$
\ell(H+\alpha h) \approx \ell(H) + \alpha\langle\nabla \ell(H),h\rangle
$$

This approximation only holds within a small region around $H$. We therefore fix $\alpha$ to a small constant (e.g. $\alpha\approx 0.1$). With the step-size $\alpha$ fixed, we can use the approximation above to find an almost optimal $h$:

$$
h = \textrm{argmin}_{h\in \mathbb{H}}\ell(H+\alpha h) \approx \textrm{argmin}_{h\in \mathbb{H}}\langle\nabla \ell(H),h\rangle
$$

In function space, the inner product is defined as $\langle f,g \rangle=\int_x f(x)g(x)\,dx$. This is intractable most of the time because $x$ comes from an infinite space. Since we only have the training set, we can rewrite the inner product as $\langle f,g \rangle= \sum_{i = 1}^{n} f(\mathbf{x}_i)g(\mathbf{x}_i)$, where $f$ is the gradient $\nabla\ell(H)$ and $g$ is the model $h$.

$$
h = \textrm{argmin}_{h \in \mathbb{H}}\sum_{i = 1}^{n} \frac{\partial \ell}{\partial H}(\mathbf{x}_i)\, h(\mathbf{x}_i)
$$

Note that $\frac{\partial \ell}{\partial H}$ is the derivative of a function with respect to another function, which is tricky, so we need a work-around. Since $\ell(H) = \sum_{i = 1}^{n}\ell(H(\mathbf{x}_i))$, we can write $\frac{\partial \ell}{\partial H}(\mathbf{x}_i) = \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}$, the derivative of the loss with respect to a specific function value. Now the optimization problem becomes:

$$
h_{t+1} = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} \underbrace{\frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}}_{r_i} h(\mathbf x_i)
$$

In order to make progress, this $h$ does not have to be optimal (i.e. reach the minimum). We still make progress as long as $\sum_{i = 1}^{n} r_i h(\mathbf{x}_i)<0$. That is because

$$
\begin{align}
\ell(H_{t+1}) &= \ell(H_t+\alpha h_{t+1}) \\
&\approx \ell(H_t) + \alpha\langle\nabla \ell(H_t),h_{t+1}\rangle\\
&= \ell(H_t) + \alpha \sum_{i = 1}^{n} r_i h_{t+1}(\mathbf{x}_i)\\
&< \ell(H_{t})
\end{align}
$$

In this way, we decrease the loss by adding the new model $h_{t+1}$. However, if this inner product is positive, there is nothing we can do with gradient descent.

Generic Boosting (a.k.a. AnyBoost) in Pseudocode

Note that the figure below simply writes out in pseudocode what we described in the previous section; it is exactly the same procedure.

[Figure: Generic boosting (AnyBoost) pseudocode]
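In case the figure does not render, here is a minimal Python sketch of the same procedure. The callables `loss_grad` and `fit_weak_learner` are placeholders assumed for this sketch, not part of the original notes:

```python
import numpy as np

def anyboost(X, y, loss_grad, fit_weak_learner, T, alpha=0.1):
    """Generic boosting (AnyBoost) as functional gradient descent.

    loss_grad(preds, y)    -> vector r with r_i = d loss / d H(x_i)
    fit_weak_learner(X, r) -> weak learner h approximately minimizing sum_i r_i * h(x_i)
    """
    H = np.zeros(len(y))            # current ensemble predictions H_t(x_i)
    learners, alphas = [], []
    for _ in range(T):
        r = loss_grad(H, y)         # functional gradient at the training points
        h = fit_weak_learner(X, r)
        h_vals = h(X)
        if np.dot(r, h_vals) >= 0:  # no descent direction found: stop
            break
        H += alpha * h_vals         # H_{t+1} = H_t + alpha * h_{t+1}
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas
```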

Case study #1: Gradient Boosted Regression Tree (GBRT)

Background and Setting

Goal

We want to find a tree $h$ that minimizes $\sum_{i = 1}^{n} r_i h(\mathbf{x}_i)$, i.e. $h = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} r_i h(\mathbf{x}_i)$, where $r_i = \frac{\partial \ell}{\partial H(\mathbf{x}_i)}$.

Assumptions

  1. First, we assume that $\sum_{i = 1}^{n} h^2(\mathbf{x}_i)$ is constant. So we are essentially fixing the vector $\big(h(\mathbf{x}_1),\dots,h(\mathbf{x}_n)\big)$ in $\sum_{i=1}^n h(\mathbf{x}_i)r_i$ to lie on a circle, and we are only concerned with its direction, not its length.
  2. Define the negative gradient as $t_i = -r_i = -\frac{\partial \ell}{\partial H(\mathbf{x}_i)}$.

Algorithm

$$
\begin{align}
h &= \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} r_i h(\mathbf{x}_i) &&\text{original AnyBoost formulation} \\
&= \textrm{argmin}_{h \in \mathbb{H}} -2\sum_{i = 1}^{n} t_i h(\mathbf{x}_i) &&\text{substituting $t_i=-r_i$ and multiplying by the constant 2} \\
&= \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} \underbrace{t_i^2}_{\textrm{constant}} - 2t_i h(\mathbf{x}_i) + \underbrace{(h(\mathbf{x}_i))^2}_{\textrm{constant}} &&\text{adding the constant $\sum_i t_i^2+h(\mathbf{x}_i)^2$} \\
&=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i = 1}^{n}(h(\mathbf{x}_i)-t_i)^2
\end{align}
$$

In the second-to-last step, note that $t_i^2$ depends only on the ensemble $H$ we have already chosen (it is the squared gradient of the loss at the current ensemble) and is independent of the model $h$ we are choosing, so it is a constant. Likewise, we constrained in the assumptions above that $\sum_{i = 1}^{n} h^2(\mathbf{x}_i)$ is a constant.

Therefore, we have translated the original problem of finding a regression tree that minimizes an arbitrary loss function into the problem of finding a regression tree that minimizes the squared loss with the new "label" $t$: we are now fitting $t$.

If the loss function $\ell$ is the squared loss, i.e. $\ell(H)=\frac{1}{2}\sum_{i=1}^n (H(\mathbf{x}_i)-y_i)^2$, then $t$ has a special meaning: it is easy to show that

$$
t_i=-\frac{\partial \ell}{\partial H(\mathbf{x}_i)}=y_i-H(\mathbf{x}_i),
$$

which is simply the error between our current prediction and the correct label, so the newly added model is just fitting the error the current ensemble makes. However, it is important to keep in mind that you can use any other differentiable and convex loss function $\ell$, and the solution for your next weak learner $h$ will always be the regression tree minimizing the squared loss to the negative gradient $t$.

GBRT in Pseudo Code

[Figure: GBRT pseudocode]
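A minimal sketch of GBRT with the squared loss, assuming scikit-learn's `DecisionTreeRegressor` as the weak learner (function names are illustrative, not from the notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, alpha=0.1, max_depth=3):
    """GBRT with squared loss: each tree fits the current residuals t_i = y_i - H(x_i)."""
    H = np.zeros(len(y))                      # ensemble predictions H_t(x_i)
    trees = []
    for _ in range(T):
        t = y - H                             # t_i = -dl/dH(x_i), i.e. the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, t)                        # regression tree minimizing squared loss to t
        H += alpha * tree.predict(X)          # H_{t+1} = H_t + alpha * h
        trees.append(tree)
    return trees

def gbrt_predict(X, trees, alpha=0.1):
    """Evaluate the ensemble: weighted sum of all trees."""
    return alpha * sum(tree.predict(X) for tree in trees)
```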

Case Study #2: AdaBoost

Finding the best weak learner

In AdaBoost the labels are binary, $y_i\in\{+1,-1\}$, the weak learners are classifiers with $h(\mathbf{x})\in\{+1,-1\}$, and the loss is the exponential loss $\ell(H)=\sum_{i=1}^{n} e^{-y_i H(\mathbf{x}_i)}$. At each update step, first we compute the gradient $r_i=\frac{\partial \ell}{\partial H(\mathbf{x}_i)}=-y_i e^{-y_i H(\mathbf{x}_i)}$. For notational convenience, define $w_i= \frac{1}{Z}e^{-y_iH(\mathbf{x}_i)}$, where $Z=\sum_{i=1}^{n} e^{-y_iH(\mathbf{x}_i)}$ is a normalizing factor so that $\sum_{i=1}^{n} w_i=1$. Note that the normalizing constant $Z$ is identical to the loss function. Each weight $w_i$ can now be interpreted as the relative contribution of the training point $(\mathbf{x}_i,y_i)$ to the overall loss. Therefore, at each update step we reweight the data points according to the current loss, and later iterations of the algorithm will prioritize the points we got really wrong.

We can translate the optimization problem of minimizing the loss into an optimization problem of minimizing the weighted classification error.

$$
\begin{align}
h&=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i=1}^{n}r_ih(\mathbf{x}_i) && \Big(\textrm{substitute in: } r_i=-y_ie^{-H(\mathbf{x}_i)y_i}\Big)\\
&=\textrm{argmin}_{h \in \mathbb{H}}-\sum_{i=1}^n y_i e^{-H(\mathbf{x}_i)y_i}h(\mathbf{x}_i) && \Big(\textrm{substitute in: } w_i=\frac{1}{Z}e^{-H(\mathbf{x}_i)y_i}, \textrm{ where $Z$ is constant}\Big)\\
&=\textrm{argmin}_{h \in \mathbb{H}}-\sum_{i=1}^{n} w_i y_i h(\mathbf{x}_i) && \Big(y_ih(\mathbf{x}_i)\in \{+1,-1\} \textrm{ with } h(\mathbf{x}_i)y_i=1 \iff h(\mathbf{x}_i)=y_i \Big)\\
&=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i: h(\mathbf{x}_i)\neq y_i} w_i - \sum_{i: h(\mathbf{x}_i)= y_i} w_i && \Big(\sum_{i: h(\mathbf{x}_i)= y_i} w_i + \sum_{i: h(\mathbf{x}_i)\neq y_i} w_i= 1\Big)\\
&=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i: h(\mathbf{x}_i)\neq y_i} w_i && \Big(\textrm{this is the weighted classification error}\Big)
\end{align}
$$

Let us denote this weighted classification error as $\epsilon=\sum_{i:h(\mathbf{x}_i)y_i=-1} w_i$, i.e. the total weight of the wrongly classified points. So for AdaBoost, we only need a classifier that reduces this weighted classification error on the wrongly labeled training samples. It doesn't have to do all that well: in order for the inner product $\sum_i r_i h(\mathbf{x}_i)$ to be negative, it just needs a weighted training error $\epsilon<0.5$.

Finding the stepsize $\alpha$

In the previous example, GBRT, we set the stepsize $\alpha$ to be a small constant. As it turns out, in the AdaBoost setting we can find the optimal stepsize (i.e. the one that minimizes $\ell$ the most) in closed form every time we take a "gradient" step.

If we take the derivative of the loss function $\ell$ with respect to $\alpha$, we find that there is a nice closed form for $\alpha$:

$$
\alpha = \frac{1}{2}\ln \frac{1-\epsilon}{\epsilon}
$$
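For completeness, here is a sketch of that calculation, using the exponential loss and the weights $w_i$ and error $\epsilon$ defined above:

$$
\begin{align}
\ell(H+\alpha h) &= \sum_{i=1}^{n} e^{-y_i\left(H(\mathbf{x}_i)+\alpha h(\mathbf{x}_i)\right)}
  = Z\Big(e^{-\alpha}\sum_{i:h(\mathbf{x}_i)=y_i} w_i + e^{\alpha}\sum_{i:h(\mathbf{x}_i)\neq y_i} w_i\Big)
  = Z\left(e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon\right)\\
\frac{\partial \ell}{\partial \alpha} &= Z\left(-e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon\right) = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
  \;\Longrightarrow\; \alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}.
\end{align}
$$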

It is unusual that we can find the optimal step-size in such a simple closed form. One consequence is that AdaBoost converges extremely fast.

Re-normalization

After you take a step, i.e. $H_{t+1}=H_{t}+\alpha h$, you need to re-compute all the weights and then re-normalize. However, if we update $w$ using the formula below, $w$ will remain normalized.

$$
w_i\leftarrow w_i\frac{e^{-\alpha h(\mathbf{x}_i)y_i}}{2\sqrt{\epsilon(1-\epsilon)}}
$$
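To see why the weights remain normalized, plug in $e^{\alpha}=\sqrt{\frac{1-\epsilon}{\epsilon}}$ from the optimal step size above:

$$
\sum_{i=1}^{n} w_i e^{-\alpha h(\mathbf{x}_i)y_i}
 = e^{-\alpha}\sum_{i:h(\mathbf{x}_i)=y_i} w_i + e^{\alpha}\sum_{i:h(\mathbf{x}_i)\neq y_i} w_i
 = \sqrt{\tfrac{\epsilon}{1-\epsilon}}\,(1-\epsilon) + \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\epsilon
 = 2\sqrt{\epsilon(1-\epsilon)},
$$

so dividing each updated weight by $2\sqrt{\epsilon(1-\epsilon)}$ makes the weights sum to one again.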

AdaBoost Pseudo-code

[Figure: AdaBoost pseudocode]
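A minimal Python sketch of the same algorithm; `fit_weak_learner` is an assumed placeholder for whatever weighted weak learner (e.g. a decision stump) you use:

```python
import numpy as np

def adaboost(X, y, fit_weak_learner, T):
    """AdaBoost as an instance of AnyBoost (labels y_i in {+1,-1}).

    fit_weak_learner(X, y, w) -> classifier h with weighted error below 1/2.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                    # normalized weights w_i
    learners, alphas = [], []
    for _ in range(T):
        h = fit_weak_learner(X, y, w)
        pred = h(X)                            # predictions in {+1,-1}
        eps = np.sum(w[pred != y])             # weighted classification error
        if eps >= 0.5:                         # no better than a coin toss: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # optimal step size in closed form
        w = w * np.exp(-alpha * y * pred)      # reweight: up-weight the mistakes
        w /= 2 * np.sqrt(eps * (1 - eps))      # stays normalized (see the check above)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Sign of the weighted sum of weak learners."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, learners)))
```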

The inner loop terminates when the error reaches $\epsilon=\frac{1}{2}$, and in most cases it will converge to $\frac{1}{2}$ over time. In that case the latest weak learner $h$ is only as good as a coin toss and cannot benefit the ensemble, so boosting stops. Also note that if $\epsilon=\frac{1}{2}$, the step-size $\alpha$ would be zero.

Further analysis

We can in fact show that the training loss decreases exponentially! Even further, we can show that after $O(\log(n))$ iterations your training error must be zero. In practice, it often makes sense to keep boosting even after you make no more mistakes on the training set.