
Bagging

Cornell University

Bagging is an ensemble method.

Bagging Reduces Variance

Remember the Bias-Variance decomposition:

$$\underbrace{\mathbb{E}[(h_D(x) - y)^2]}_\mathrm{Error} = \underbrace{\mathbb{E}[(h_D(x)-\bar{h}(x))^2]}_\mathrm{Variance} + \underbrace{\mathbb{E}[(\bar{h}(x)-\bar{y}(x))^2]}_\mathrm{Bias} + \underbrace{\mathbb{E}[(\bar{y}(x)-y(x))^2]}_\mathrm{Noise}$$

Our goal is to reduce the variance term $\mathbb{E}[(h_D(x)-\bar{h}(x))^2]$. For this, we want $h_D \to \bar{h}$.

Weak law of large numbers

The weak law of large numbers says: for i.i.d. random variables $x_i$ with mean $\bar{x}$, we have

$$\lim_{m \to \infty} \frac{1}{m}\sum_{i = 1}^{m}x_i = \bar{x}$$
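
As a quick numerical sanity check (my own illustration, not part of the original notes), a few lines of NumPy show the sample mean of i.i.d. draws approaching the true mean as $m$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # i.i.d. draws with true mean 2.0

for m in [10, 1_000, 1_000_000]:
    # the running sample mean gets closer to 2.0 as m increases
    print(m, x[:m].mean())
```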

Apply this to classifiers: assume we have $m$ training sets $D_1, D_2, \dots, D_m$, each drawn independently from $P^n$. Train a classifier on each one and average the results:

$$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \;\longrightarrow\; \bar{h} \quad \text{as } m \to \infty$$

We refer to this average of multiple classifiers as an ensemble of classifiers. Good news: if $\hat{h}\rightarrow \bar{h}$, the variance component of the error must also vanish, i.e. $\mathbb{E}[(\hat{h}(x)-\bar{h}(x))^2]\rightarrow 0$. However, the problem is that we don't have $m$ data sets $D_1, \dots, D_m$; we only have a single $D$.
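
Before turning to the fix, here is a hedged simulation sketch of the idealized setting above: draw many independent training sets from a known toy distribution $P$, fit a decision tree on each, average, and watch the variance term shrink as $m$ grows. The data-generating function, sample sizes, and choice of base learner are my own illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(-3, 3, 200).reshape(-1, 1)

def draw_dataset(n=100):
    """Draw one training set D_i ~ P^n from a toy distribution P."""
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

def ensemble_prediction(m):
    """h_hat: average of m trees, each trained on its own independent D_i."""
    preds = [DecisionTreeRegressor().fit(*draw_dataset()).predict(x_test) for _ in range(m)]
    return np.mean(preds, axis=0)

# Estimate the variance term E[(h_hat(x) - h_bar(x))^2] by repeating the experiment.
for m in [1, 10, 100]:
    runs = np.stack([ensemble_prediction(m) for _ in range(20)])
    print(f"m={m:4d}  estimated variance: {runs.var(axis=0).mean():.4f}")
```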

Solution: Bagging (Bootstrap Aggregating)

We need to sample a total of $m$ data sets $D_1, \dots, D_m$, each of size $n = |D|$.

To do this, simulate drawing from $P$ by drawing uniformly with replacement from the set $D$.

Mathematically, let $Q(X,Y|D)$ be a probability distribution that picks a training sample $(\mathbf{x}_i,y_i)$ from $D$ uniformly at random, i.e. $\forall (\mathbf{x}_i, y_i)\in D, \; Q((\mathbf{x}_i, y_i)|D) = \frac{1}{n}$ with $n=|D|$. We sample each set $D_i\sim Q^n$, with $|D_i| = n$, so $D_i$ is picked with replacement from the distribution $Q$.
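
A minimal sketch of drawing one $D_i \sim Q^n$ with NumPy (the function and variable names are my own): sampling $n$ indices uniformly at random with replacement gives every point in $D$ probability $\frac{1}{n}$ on each draw, which is exactly $Q$.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw D_i ~ Q^n: n indices picked uniformly at random, with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # each index has probability 1/n per draw
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = np.arange(10).reshape(-1, 1), np.arange(10)   # a toy D with n = 10
X_i, y_i = bootstrap_sample(X, y, rng)               # |D_i| = n; duplicates are expected
```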

Notice that we cannot use the W.L.L.N. here: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}$, because the sets $D_i$ are not drawn i.i.d. from the original distribution $P$, which is where $h_D$'s training data comes from; they are drawn i.i.d. from the distribution $Q$, which is conditioned on $D$. However, in practice bagging still reduces variance very effectively.

Analysis

Although we cannot prove that the new samples are i.i.d., we can show that they are drawn from the original distribution $P$, namely $Q(X=x_i)=P(X=x_i)$. A proof can be found in the original lecture notes.
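
For completeness, here is a hedged one-line sketch of that argument (assuming a discrete feature space so that point probabilities are well defined); marginalizing over the random draw of $D \sim P^n$:

$$Q(X=x) = \mathbb{E}_{D\sim P^n}\!\left[\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[x_i = x]\right] = \frac{1}{n}\sum_{i=1}^{n}\Pr_{x_i\sim P}(x_i = x) = \frac{1}{n}\cdot n\cdot P(X=x) = P(X=x)$$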

Bagging summarized

  1. Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
  2. For each $D_j$ train a classifier $h_j(\cdot)$.
  3. The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.

In practice, a larger $m$ results in a better ensemble, but at some point you will obtain diminishing returns. Note that setting $m$ unnecessarily high will only slow down your classifier; it will not increase its error.
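
The three steps above translate almost directly into code. Below is a minimal sketch (my own illustration, using a scikit-learn decision tree as the base classifier $h_j$ and a simple vote over binary labels; any base learner and any sensible averaging scheme would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_ensemble(X, y, m, rng):
    """Steps 1 and 2: train m classifiers, each on its own bootstrap sample D_j."""
    n = len(X)
    ensemble = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                    # sample D_j with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_bagged(ensemble, X):
    """Step 3: average the individual predictions (here: a majority vote on 0/1 labels)."""
    votes = np.stack([h_j.predict(X) for h_j in ensemble])  # shape (m, n_test)
    return (votes.mean(axis=0) > 0.5).astype(int)
```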

Advantages of Bagging

Random Forest

A Random Forest is essentially bagged decision trees, with a slightly modified splitting criterion: don't consider all features at every split.

The algorithm works as follows:

  1. Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
  2. For each $D_j$ train a full decision tree $h_j$ (max-depth $=\infty$) with one small modification: before each split, randomly subsample $k\leq d$ features (without replacement) and only consider these features for your split. This further increases the variance of the trees. A good choice for $k$ is $k=\sqrt{d}$.
  3. The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.

The randomness in a random forest comes from the random sampling in step 1 and the random feature selection in step 2.
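
In practice this recipe is rarely implemented by hand. scikit-learn's `RandomForestClassifier` follows it closely (bootstrap samples, deep trees, and a random feature subset at every split); a minimal usage sketch with illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data

forest = RandomForestClassifier(
    n_estimators=100,     # m: number of trees
    max_features="sqrt",  # k = sqrt(d) features considered at each split
    max_depth=None,       # grow full trees
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy of the ensemble
```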

The Random Forest is one of the best and easiest-to-use models. There are two reasons:

Useful variants of Random Forests: