
Bias-Variance Tradeoff

Cornell University

Setting Up

As usual, we are given a training dataset $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, drawn i.i.d. from some distribution $P(X,Y)$. Throughout this lecture we assume a regression setting, i.e. $y \in \mathbb{R}$, and we will decompose the generalization error of a learned model into three rather interpretable terms.

Expected Label

Two data points can have identical features, $\mathbf{x}_1 = \mathbf{x}_2$, yet different labels, $y_1 \neq y_2$. For example, if your vector $\mathbf{x}$ describes features of a house (e.g. #bedrooms, square footage, ...) and the label $y$ its price, you could imagine two houses with identical descriptions selling for different prices. So for any given feature vector $\mathbf{x}$, there is a distribution over possible labels. Following this idea, we define, for a given $\mathbf{x} \in \mathbb{R}^d$,

$$\bar{y}(\mathbf{x}) = E_{y \mid \mathbf{x}}\left[Y\right] = \int_y y \, \Pr(y \mid \mathbf{x}) \, dy.$$

This is the expected label: the label you would expect to obtain, on average, given the feature vector $\mathbf{x}$.
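To make this concrete, here is a minimal sketch that approximates $\bar{y}(\mathbf{x})$ at a single point by sampling. The conditional distribution $P(y \mid \mathbf{x})$ below (a quadratic signal plus Gaussian noise) is made up purely for illustration:

```python
import numpy as np

# Toy example: assume y | x is x^2 plus Gaussian noise, so ybar(x) = x^2.
# Both the distribution and the query point x0 are made up for illustration.
rng = np.random.default_rng(0)

def sample_y_given_x(x, size):
    """Draw labels y from the (made-up) conditional distribution P(y | x)."""
    return x**2 + rng.normal(scale=0.1, size=size)

x0 = 0.5
y_bar_estimate = sample_y_given_x(x0, size=100_000).mean()
print(y_bar_estimate)  # approaches ybar(x0) = 0.25 as the sample grows
```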

Expected Model

After drawing a training set $D$ i.i.d. from the distribution $P$, we will use some machine learning algorithm $\mathcal{A}$ on this dataset $D$ to train a model $h$. Formally, we denote this process as $h_D = \mathcal{A}(D)$. By the same reasoning as above, the same learning algorithm $\mathcal{A}$ can produce different models $h_D$ from different datasets $D$, so we also want to find the expected model $\bar{h}$:

$$\bar{h} = E_{D \sim P^n}\left[h_D\right] = \int_D h_D \Pr(D) \, dD$$

where $\Pr(D)$ is the probability of drawing $D$ from $P^n$. In practice, we cannot integrate over all possible datasets, so we approximate the expectation by sampling several datasets and averaging the resulting models. The sampled datasets enter with equal weight because each one is drawn i.i.d. from the same distribution $P^n$; this is a standard Monte Carlo estimate.
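As a sketch of this sampling procedure, continuing the made-up setup from above (`draw_dataset` and `train` are illustrative stand-ins for $P^n$ and $\mathcal{A}$, not anything from the lecture):

```python
def draw_dataset(n=50):
    """Draw one dataset D of size n from the toy P(X, Y) above."""
    x = rng.uniform(-1, 1, size=n)
    return x, sample_y_given_x(x, size=n)

def train(x, y, degree=3):
    """The learning algorithm A: least-squares polynomial regression."""
    return np.polynomial.Polynomial.fit(x, y, degree)

# Approximate the expected model h-bar by averaging the predictions of
# models trained on m independently drawn datasets.
m = 200
x_grid = np.linspace(-1, 1, 101)
h_bar = np.mean([train(*draw_dataset())(x_grid) for _ in range(m)], axis=0)
```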

Expected Error

For a given $h_D$, learned on dataset $D$ with algorithm $\mathcal{A}$, we can compute the generalization error, measured here with the squared loss. (One can use other loss functions; we use the squared loss because it makes the derivation later easier.) This is the error of our model:

$$\epsilon_{h_D} = E_{(\mathbf{x},y) \sim P}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \int_{\mathbf{x}} \int_y \left(h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

where $(\mathbf{x}, y)$ is a test point drawn from $P$.
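A Monte Carlo sketch of this quantity for a single fixed model, reusing the illustrative `draw_dataset` and `train` helpers from above:

```python
# Estimate the squared-loss generalization error of one fixed model h_D
# by averaging over fresh test points (x, y) ~ P.
h_D = train(*draw_dataset())              # one model from one training set
x_te, y_te = draw_dataset(n=10_000)       # fresh i.i.d. test points
eps_h_D = np.mean((h_D(x_te) - y_te) ** 2)
```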

Given the idea of the “expected model” discussed above, we can also compute the expected test error given only $\mathcal{A}$, by taking the expectation over $D$ as well. This gives us the error of an algorithm, which produces different models depending on the training data it receives:

$$\epsilon_{\mathcal{A}} = E_{\substack{(\mathbf{x},y) \sim P \\ D \sim P^n}}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \int_D \int_{\mathbf{x}} \int_y \left(h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x}, y) \Pr(D) \, d\mathbf{x} \, dy \, dD$$

where $D$ ranges over training datasets drawn from $P^n$ and the $(\mathbf{x}, y)$ pairs are test points. We are interested in exactly this expression, because it evaluates the quality of a machine learning algorithm $\mathcal{A}$ with respect to a data distribution $P(X,Y)$. In the following we will show that this expression decomposes into three meaningful terms.
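Estimating $\epsilon_{\mathcal{A}}$ adds one more Monte Carlo loop over training sets (again with the made-up helpers from the earlier sketches):

```python
def expected_error(n_datasets=200, n_test=10_000):
    """Monte Carlo estimate of the expected test error of the algorithm."""
    x_te, y_te = draw_dataset(n=n_test)   # test points (x, y) ~ P
    errs = [np.mean((train(*draw_dataset())(x_te) - y_te) ** 2)
            for _ in range(n_datasets)]   # training sets D ~ P^n
    return np.mean(errs)
```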

Derivation

\begin{align}
E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] &= E_{\mathbf{x},y,D}\left[\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right) + \left(\bar{h}(\mathbf{x}) - y\right)\right]^2\right] \nonumber \\
&= E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right] + 2\, E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] + E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]
\end{align}

The middle term of the above equation is 0, as we show below. The key observation is that $\left(\bar{h}(\mathbf{x}) - y\right)$ does not depend on $D$, so it can be pulled out of the inner expectation over $D$:

\begin{align*}
E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] &= E_{\mathbf{x},y}\left\{E_D\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right]\right\} \\
&= E_{\mathbf{x},y}\left[E_D\left[h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right]\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[\left(E_D\left[h_D(\mathbf{x})\right] - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= 0
\end{align*}

Returning to the earlier expression, we’re left with the variance and another term:

$$E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \underbrace{E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\mathrm{Variance}} + E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]$$

We can break down the second term in the above equation just as we did with $\epsilon_{\mathcal{A}}$ at first, this time inserting the expected label $\bar{y}(\mathbf{x})$:

\begin{align}
E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right] &= E_{\mathbf{x},y}\left[\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right) + \left(\bar{y}(\mathbf{x}) - y\right)\right]^2\right] \\
&= \underbrace{E_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\mathrm{Noise}} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\mathrm{Bias}^2} + 2\, E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right]
\end{align}

The third term in the equation above is 0, shown in the same way as before, except that this time we decompose $E_{\mathbf{x},y}$ into $E_{\mathbf{x}} E_{y \mid \mathbf{x}}$ (correctness can be verified by writing out the definition of expectation) and use the fact that $E_{y \mid \mathbf{x}}\left[\bar{y}(\mathbf{x}) - y\right] = 0$:

\begin{align*}
E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right] &= E_{\mathbf{x}}\left[E_{y \mid \mathbf{x}}\left[\bar{y}(\mathbf{x}) - y\right]\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\right] = 0
\end{align*}

This gives us the decomposition of expected test error:

$$\underbrace{E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right]}_{\mathrm{Expected\;Test\;Error}} = \underbrace{E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\mathrm{Variance}} + \underbrace{E_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\mathrm{Noise}} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\mathrm{Bias}^2}$$
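Under the made-up setup from the earlier sketches (where $\bar{y}(x) = x^2$ is known by construction), we can check the decomposition numerically; up to Monte Carlo error, the three terms on the right should sum to the expected test error:

```python
n_datasets, n_test = 500, 10_000

x_te = rng.uniform(-1, 1, size=n_test)
y_bar = x_te**2                              # known ybar(x) in the toy setup
y_te = sample_y_given_x(x_te, size=n_test)   # noisy test labels

# Predictions of models trained on independently drawn datasets.
preds = np.stack([train(*draw_dataset())(x_te) for _ in range(n_datasets)])
h_bar = preds.mean(axis=0)                   # approximates h-bar(x)

test_error = np.mean((preds - y_te) ** 2)    # E_{x,y,D}[(h_D(x) - y)^2]
variance = np.mean((preds - h_bar) ** 2)     # E_{x,D}[(h_D(x) - h-bar(x))^2]
noise = np.mean((y_bar - y_te) ** 2)         # E_{x,y}[(ybar(x) - y)^2]
bias_sq = np.mean((h_bar - y_bar) ** 2)      # E_x[(h-bar(x) - ybar(x))^2]

print(test_error, variance + noise + bias_sq)  # the two should nearly match
```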

Interpretation


Fig 1: Graphical illustration of bias and variance with a dartboard. Bias is a systematic offset of the throws from the bullseye; variance is the scatter of the throws around their average. Noise is like the board itself constantly shaking, so even a perfectly aimed throw can miss.

Detecting High Bias and High Variance

If a model is under-performing (its test or training error is too high), the first step is to determine the root cause of the problem.


Fig 2: Test and training error as the number of training instances increases.

The graph above plots the training error and the test error and can be divided into two overarching regimes. In the first regime (on the left side of the graph), training error is below the desired error threshold (denoted by $\epsilon$), but test error is significantly higher. In the second regime (on the right side of the graph), test error is remarkably close to training error, but both are above the desired tolerance of $\epsilon$.
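A sketch of how such a learning curve can be produced with the toy helpers from the earlier snippets: train on increasingly large datasets and compare training and test error:

```python
x_te, y_te = draw_dataset(n=10_000)          # held-out test set
for n in (10, 30, 100, 300, 1000, 3000):
    x_tr, y_tr = draw_dataset(n=n)           # training set of size n
    h = train(x_tr, y_tr)
    train_err = np.mean((h(x_tr) - y_tr) ** 2)
    test_err = np.mean((h(x_te) - y_te) ** 2)
    print(f"n={n:5d}  train={train_err:.4f}  test={test_err:.4f}")
```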

Regime 1 (High Variance)

In the first regime, the cause of the poor performance is high variance.

Symptoms:

  1. Training error is much lower than test error
  2. Training error is lower than $\epsilon$
  3. Test error is above $\epsilon$

Remedies:

  1. Add more training data
  2. Reduce model complexity (complex models are prone to high variance)
  3. Bagging (to be covered later in the course)

Regime 2 (High Bias)

Unlike the first regime, the second regime indicates high bias: the model being used is not expressive enough to produce an accurate prediction.

Symptoms:

  1. Training error is higher than $\epsilon$

Remedies:

  1. Use a more complex model (e.g. kernelize, use non-linear models)
  2. Add more features
  3. Boosting (to be covered later in the course)

Thought process to determine whether a model suffers from high bias or high variance: first check the training error. If it is above $\epsilon$, the model has high bias. If instead the training error is low but the test error is far above it, the model has high variance.

Example:

|                    | Bias | Variance |
| ------------------ | ---- | -------- |
| kNN with small $k$ | low  | high     |
| kNN with large $k$ | high | low      |
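The following sketch illustrates the variance column of the table with scikit-learn's `KNeighborsRegressor` on the toy data from the earlier snippets (the dataset helper is made up, as before): retraining with small $k$ yields predictions that fluctuate far more across datasets than with large $k$.

```python
from sklearn.neighbors import KNeighborsRegressor

x_grid_2d = np.linspace(-1, 1, 101).reshape(-1, 1)
for k in (1, 25):                            # small vs. large k
    preds = []
    for _ in range(200):                     # retrain on fresh datasets
        x, y = draw_dataset(n=50)
        knn = KNeighborsRegressor(n_neighbors=k).fit(x.reshape(-1, 1), y)
        preds.append(knn.predict(x_grid_2d))
    preds = np.stack(preds)
    variance = np.mean((preds - preds.mean(axis=0)) ** 2)
    print(f"k={k:2d}: variance across retrainings ≈ {variance:.4f}")
```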

Ideal Case

Training and test error are both below the acceptable test error line.

Reduce Noise

Noise can arise for two reasons, each with a corresponding remedy: the feature representation may be too weak to disambiguate the labels (remedy: add more informative features), or the labels themselves may be collected imprecisely (remedy: clean the data or improve the data collection process).