
Bagging

Cornell University

Bagging is an ensemble method.

Bagging Reduces Variance

Remember the Bias-Variance decomposition:

$$\underbrace{\mathbb{E}[(h_D(x) - y)^2]}_\mathrm{Error} = \underbrace{\mathbb{E}[(h_D(x)-\bar{h}(x))^2]}_\mathrm{Variance} + \underbrace{\mathbb{E}[(\bar{h}(x)-\bar{y}(x))^2]}_\mathrm{Bias} + \underbrace{\mathbb{E}[(\bar{y}(x)-y(x))^2]}_\mathrm{Noise}$$

Our goal is to reduce the variance term $\mathbb{E}[(h_D(x)-\bar{h}(x))^2]$. For this, we want $h_D \to \bar{h}$.

Weak law of large numbers

The weak law of large numbers says: for i.i.d. random variables $x_i$ with mean $\bar{x}$, we have

$$\lim_{m \to \infty} \frac{1}{m}\sum_{i = 1}^{m}x_i = \bar{x}$$
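
As a quick numerical sanity check (my own illustration, not part of the original notes), a few lines of NumPy show the sample mean of i.i.d. draws approaching the true mean as $m$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # i.i.d. draws with true mean 2.0

for m in [10, 1_000, 1_000_000]:
    # the running sample mean gets closer to 2.0 as m increases
    print(m, x[:m].mean())
```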

Apply this to classifiers: assume we have $m$ training sets $D_1, D_2, \dots, D_m$, each drawn independently from $P^n$. Train a classifier on each one and average the results:

$$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \;\longrightarrow\; \bar{h} \quad \text{as } m \to \infty$$

We refer to this average of multiple classifiers as an ensemble of classifiers. Good news: if $\hat{h}\rightarrow \bar{h}$, the variance component of the error must also vanish, i.e. $\mathbb{E}[(\hat{h}(x)-\bar{h}(x))^2]\rightarrow 0$. However, the problem is that we don't have $m$ data sets $D_1, \dots, D_m$; we only have a single $D$.
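
Before turning to the fix, here is a hedged simulation sketch of the idealized setting above: draw many independent training sets from a known toy distribution $P$, fit a decision tree on each, average, and watch the variance term shrink as $m$ grows. The data-generating function, sample sizes, and choice of base learner are my own illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(-3, 3, 200).reshape(-1, 1)

def draw_dataset(n=100):
    """Draw one training set D_i ~ P^n from a toy distribution P."""
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

def ensemble_prediction(m):
    """h_hat: average of m trees, each trained on its own independent D_i."""
    preds = [DecisionTreeRegressor().fit(*draw_dataset()).predict(x_test) for _ in range(m)]
    return np.mean(preds, axis=0)

# Estimate the variance term E[(h_hat(x) - h_bar(x))^2] by repeating the experiment.
for m in [1, 10, 100]:
    runs = np.stack([ensemble_prediction(m) for _ in range(20)])
    print(f"m={m:4d}  estimated variance: {runs.var(axis=0).mean():.4f}")
```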

Solution: Bagging (Bootstrap Aggregating)

We need to sample a total of $m$ data sets $D_1, \dots, D_m$, each of size $n = |D|$.

To do this, simulate drawing from $P$ by drawing uniformly with replacement from the set $D$.

Mathematically, let $Q(X,Y|D)$ be a probability distribution that picks a training sample $(\mathbf{x}_i,y_i)$ from $D$ uniformly at random, i.e. $\forall (\mathbf{x}_i, y_i)\in D, \; Q((\mathbf{x}_i, y_i)|D) = \frac{1}{n}$ with $n=|D|$. We sample each set $D_i\sim Q^n$, with $|D_i| = n$, so $D_i$ is picked with replacement from the distribution $Q$.
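
A minimal sketch of drawing one $D_i \sim Q^n$ with NumPy (the function and variable names are my own): sampling $n$ indices uniformly at random with replacement gives every point in $D$ probability $\frac{1}{n}$ on each draw, which is exactly $Q$.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw D_i ~ Q^n: n indices picked uniformly at random, with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # each index has probability 1/n per draw
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = np.arange(10).reshape(-1, 1), np.arange(10)   # a toy D with n = 10
X_i, y_i = bootstrap_sample(X, y, rng)               # |D_i| = n; duplicates are expected
```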

Notice that we cannot use the W.L.L.N. here: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}$, because the sets $D_i$ are not drawn i.i.d. from the original distribution $P$, which is where $h_D$'s training data comes from; they are drawn i.i.d. from the distribution $Q$, which is conditioned on $D$. However, in practice bagging still reduces variance very effectively.

Analysis

Although we cannot prove that the new samples are i.i.d., we can show that they are drawn from the original distribution $P$, namely $Q(X=x_i)=P(X=x_i)$. A proof can be found in the original lecture notes.
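
For completeness, here is a hedged one-line sketch of that argument (assuming a discrete feature space so that point probabilities are well defined); marginalizing over the random draw of $D \sim P^n$:

$$Q(X=x) = \mathbb{E}_{D\sim P^n}\!\left[\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[x_i = x]\right] = \frac{1}{n}\sum_{i=1}^{n}\Pr_{x_i\sim P}(x_i = x) = \frac{1}{n}\cdot n\cdot P(X=x) = P(X=x)$$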

Bagging summarized

  1. Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
  2. For each $D_j$ train a classifier $h_j(\cdot)$.
  3. The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.

In practice, a larger $m$ results in a better ensemble, but at some point you will obtain diminishing returns. Note that setting $m$ unnecessarily high will only slow down your classifier; it will not increase its error.
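
The three steps above translate almost directly into code. Below is a minimal sketch (my own illustration, using a scikit-learn decision tree as the base classifier $h_j$ and a simple vote over binary labels; any base learner and any sensible averaging scheme would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_ensemble(X, y, m, rng):
    """Steps 1 and 2: train m classifiers, each on its own bootstrap sample D_j."""
    n = len(X)
    ensemble = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                    # sample D_j with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_bagged(ensemble, X):
    """Step 3: average the individual predictions (here: a majority vote on 0/1 labels)."""
    votes = np.stack([h_j.predict(X) for h_j in ensemble])  # shape (m, n_test)
    return (votes.mean(axis=0) > 0.5).astype(int)
```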

Advantages of Bagging

Random Forest

A Random Forest is essentially bagged decision trees, with a slightly modified splitting criterion: don't consider all features at every split.

The algorithm works as follows:

  1. Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
  2. For each $D_j$ train a full decision tree $h_j$ (max-depth $=\infty$) with one small modification: before each split, randomly subsample $k\leq d$ features (without replacement) and only consider these features for your split. This further increases the variance of the trees. A good choice for $k$ is $k=\sqrt{d}$.
  3. The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.

The randomness in a random forest comes from the random sampling in step 1 and the random feature selection in step 2.
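
In practice this recipe is rarely implemented by hand. scikit-learn's `RandomForestClassifier` follows it closely (bootstrap samples, deep trees, and a random feature subset at every split); a minimal usage sketch with illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data

forest = RandomForestClassifier(
    n_estimators=100,     # m: number of trees
    max_features="sqrt",  # k = sqrt(d) features considered at each split
    max_depth=None,       # grow full trees
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy of the ensemble
```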

The Random Forest is one of the best and easiest-to-use models. There are two reasons:

Useful variants of Random Forests: