
Bias-Variance Tradeoff

Cornell University

Setting Up

As usual, we are given a training dataset $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, drawn i.i.d. from some distribution $P(X,Y)$. Throughout this lecture we assume a regression setting, i.e. $y \in \mathbb{R}$, and we will decompose the generalization error of a learned model into three rather interpretable terms.

Expected Label

Two data points can have identical features, $\mathbf{x}_1 = \mathbf{x}_2$, yet different labels, $y_1 \neq y_2$. For example, if your vector $\mathbf{x}$ describes features of a house (e.g. #bedrooms, square footage, ...) and the label $y$ its price, you could imagine two houses with identical descriptions selling for different prices. So for any given feature vector $\mathbf{x}$, there is a distribution over possible labels. Following this idea, we define, for a given $\mathbf{x} \in \mathbb{R}^d$,

$$\bar{y}(\mathbf{x}) = E_{y \mid \mathbf{x}}\left[Y\right] = \int_y y \, \Pr(y \mid \mathbf{x}) \, dy.$$

This is the expected label: the label you would expect to obtain, on average, given the feature vector $\mathbf{x}$.
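To make this concrete, here is a minimal sketch that approximates $\bar{y}(\mathbf{x})$ at a single point by sampling. The conditional distribution $P(y \mid \mathbf{x})$ below (a quadratic signal plus Gaussian noise) is made up purely for illustration:

```python
import numpy as np

# Toy example: assume y | x is x^2 plus Gaussian noise, so ybar(x) = x^2.
# Both the distribution and the query point x0 are made up for illustration.
rng = np.random.default_rng(0)

def sample_y_given_x(x, size):
    """Draw labels y from the (made-up) conditional distribution P(y | x)."""
    return x**2 + rng.normal(scale=0.1, size=size)

x0 = 0.5
y_bar_estimate = sample_y_given_x(x0, size=100_000).mean()
print(y_bar_estimate)  # approaches ybar(x0) = 0.25 as the sample grows
```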

Expected Model

After drawing a training set $D$ i.i.d. from the distribution $P$, we will use some machine learning algorithm $\mathcal{A}$ on this dataset $D$ to train a model $h$. Formally, we denote this process as $h_D = \mathcal{A}(D)$. By the same reasoning as above, the same learning algorithm $\mathcal{A}$ can produce different models $h_D$ from different datasets $D$, so we also want to find the expected model $\bar{h}$:

$$\bar{h} = E_{D \sim P^n}\left[h_D\right] = \int_D h_D \Pr(D) \, dD$$

where $\Pr(D)$ is the probability of drawing $D$ from $P^n$. In practice, we cannot integrate over all possible datasets, so we approximate the expectation by sampling several datasets and averaging the resulting models. The sampled datasets enter with equal weight because each one is drawn i.i.d. from the same distribution $P^n$; this is a standard Monte Carlo estimate.
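As a sketch of this sampling procedure, continuing the made-up setup from above (`draw_dataset` and `train` are illustrative stand-ins for $P^n$ and $\mathcal{A}$, not anything from the lecture):

```python
def draw_dataset(n=50):
    """Draw one dataset D of size n from the toy P(X, Y) above."""
    x = rng.uniform(-1, 1, size=n)
    return x, sample_y_given_x(x, size=n)

def train(x, y, degree=3):
    """The learning algorithm A: least-squares polynomial regression."""
    return np.polynomial.Polynomial.fit(x, y, degree)

# Approximate the expected model h-bar by averaging the predictions of
# models trained on m independently drawn datasets.
m = 200
x_grid = np.linspace(-1, 1, 101)
h_bar = np.mean([train(*draw_dataset())(x_grid) for _ in range(m)], axis=0)
```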

Expected Error

For a given $h_D$, learned on dataset $D$ with algorithm $\mathcal{A}$, we can compute the generalization error, measured here with the squared loss. (One can use other loss functions; we use the squared loss because it makes the derivation later easier.) This is the error of our model:

$$\epsilon_{h_D} = E_{(\mathbf{x},y) \sim P}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \int_{\mathbf{x}} \int_y \left(h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

where $(\mathbf{x}, y)$ is a test point drawn from $P$.
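A Monte Carlo sketch of this quantity for a single fixed model, reusing the illustrative `draw_dataset` and `train` helpers from above:

```python
# Estimate the squared-loss generalization error of one fixed model h_D
# by averaging over fresh test points (x, y) ~ P.
h_D = train(*draw_dataset())              # one model from one training set
x_te, y_te = draw_dataset(n=10_000)       # fresh i.i.d. test points
eps_h_D = np.mean((h_D(x_te) - y_te) ** 2)
```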

Given the idea of the “expected model” discussed above, we can also compute the expected test error given only $\mathcal{A}$, by taking the expectation over $D$ as well. This gives us the error of an algorithm, which produces different models depending on the training data it receives:

$$\epsilon_{\mathcal{A}} = E_{\substack{(\mathbf{x},y) \sim P \\ D \sim P^n}}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \int_D \int_{\mathbf{x}} \int_y \left(h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x}, y) \Pr(D) \, d\mathbf{x} \, dy \, dD$$

where $D$ ranges over training datasets drawn from $P^n$ and the $(\mathbf{x}, y)$ pairs are test points. We are interested in exactly this expression, because it evaluates the quality of a machine learning algorithm $\mathcal{A}$ with respect to a data distribution $P(X,Y)$. In the following we will show that this expression decomposes into three meaningful terms.
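Estimating $\epsilon_{\mathcal{A}}$ adds one more Monte Carlo loop over training sets (again with the made-up helpers from the earlier sketches):

```python
def expected_error(n_datasets=200, n_test=10_000):
    """Monte Carlo estimate of the expected test error of the algorithm."""
    x_te, y_te = draw_dataset(n=n_test)   # test points (x, y) ~ P
    errs = [np.mean((train(*draw_dataset())(x_te) - y_te) ** 2)
            for _ in range(n_datasets)]   # training sets D ~ P^n
    return np.mean(errs)
```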

Derivation

\begin{align}
E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] &= E_{\mathbf{x},y,D}\left[\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right) + \left(\bar{h}(\mathbf{x}) - y\right)\right]^2\right] \nonumber \\
&= E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right] + 2\, E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] + E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]
\end{align}

The middle term of the above equation is 0, as we show below. The key observation is that $\left(\bar{h}(\mathbf{x}) - y\right)$ does not depend on $D$, so it can be pulled out of the inner expectation over $D$:

\begin{align*}
E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] &= E_{\mathbf{x},y}\left\{E_D\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right]\right\} \\
&= E_{\mathbf{x},y}\left[E_D\left[h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right]\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[\left(E_D\left[h_D(\mathbf{x})\right] - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= 0
\end{align*}

Returning to the earlier expression, we’re left with the variance and another term:

$$E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \underbrace{E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\mathrm{Variance}} + E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]$$

We can break down the second term in the above equation just as we did with $\epsilon_{\mathcal{A}}$ at first, this time inserting the expected label $\bar{y}(\mathbf{x})$:

\begin{align}
E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right] &= E_{\mathbf{x},y}\left[\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right) + \left(\bar{y}(\mathbf{x}) - y\right)\right]^2\right] \\
&= \underbrace{E_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\mathrm{Noise}} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\mathrm{Bias}^2} + 2\, E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right]
\end{align}

The third term in the equation above is 0, shown in the same way as before, except that this time we decompose $E_{\mathbf{x},y}$ into $E_{\mathbf{x}} E_{y \mid \mathbf{x}}$ (correctness can be verified by writing out the definition of expectation) and use the fact that $E_{y \mid \mathbf{x}}\left[\bar{y}(\mathbf{x}) - y\right] = 0$:

\begin{align*}
E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right] &= E_{\mathbf{x}}\left[E_{y \mid \mathbf{x}}\left[\bar{y}(\mathbf{x}) - y\right]\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\right] = 0
\end{align*}

This gives us the decomposition of expected test error:

$$\underbrace{E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right]}_{\mathrm{Expected\;Test\;Error}} = \underbrace{E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\mathrm{Variance}} + \underbrace{E_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\mathrm{Noise}} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\mathrm{Bias}^2}$$
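Under the made-up setup from the earlier sketches (where $\bar{y}(x) = x^2$ is known by construction), we can check the decomposition numerically; up to Monte Carlo error, the three terms on the right should sum to the expected test error:

```python
n_datasets, n_test = 500, 10_000

x_te = rng.uniform(-1, 1, size=n_test)
y_bar = x_te**2                              # known ybar(x) in the toy setup
y_te = sample_y_given_x(x_te, size=n_test)   # noisy test labels

# Predictions of models trained on independently drawn datasets.
preds = np.stack([train(*draw_dataset())(x_te) for _ in range(n_datasets)])
h_bar = preds.mean(axis=0)                   # approximates h-bar(x)

test_error = np.mean((preds - y_te) ** 2)    # E_{x,y,D}[(h_D(x) - y)^2]
variance = np.mean((preds - h_bar) ** 2)     # E_{x,D}[(h_D(x) - h-bar(x))^2]
noise = np.mean((y_bar - y_te) ** 2)         # E_{x,y}[(ybar(x) - y)^2]
bias_sq = np.mean((h_bar - y_bar) ** 2)      # E_x[(h-bar(x) - ybar(x))^2]

print(test_error, variance + noise + bias_sq)  # the two should nearly match
```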

Interpretation


Fig 1: Graphical illustration of bias and variance with a dartboard. Bias is a systematic offset of the throws from the bullseye; variance is the scatter of the throws around their average. Noise is like the board itself constantly shaking, so even a perfectly aimed throw can miss.

Detecting High Bias and High Variance

If a model is under-performing (its test or training error is too high), the first step is to determine the root cause of the problem.


Fig 2: Test and training error as the number of training instances increases.

The graph above plots the training error and the test error and can be divided into two overarching regimes. In the first regime (on the left side of the graph), training error is below the desired error threshold (denoted by $\epsilon$), but test error is significantly higher. In the second regime (on the right side of the graph), test error is remarkably close to training error, but both are above the desired tolerance of $\epsilon$.
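A sketch of how such a learning curve can be produced with the toy helpers from the earlier snippets: train on increasingly large datasets and compare training and test error:

```python
x_te, y_te = draw_dataset(n=10_000)          # held-out test set
for n in (10, 30, 100, 300, 1000, 3000):
    x_tr, y_tr = draw_dataset(n=n)           # training set of size n
    h = train(x_tr, y_tr)
    train_err = np.mean((h(x_tr) - y_tr) ** 2)
    test_err = np.mean((h(x_te) - y_te) ** 2)
    print(f"n={n:5d}  train={train_err:.4f}  test={test_err:.4f}")
```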

Regime 1 (High Variance)

In the first regime, the cause of the poor performance is high variance.

Symptoms:

  1. Training error is much lower than test error
  2. Training error is lower than $\epsilon$
  3. Test error is above $\epsilon$

Remedies:

  1. Add more training data
  2. Reduce model complexity (complex models are prone to high variance)
  3. Bagging (to be covered later in the course)

Regime 2 (High Bias)

Unlike the first regime, the second regime indicates high bias: the model being used is not expressive enough to produce an accurate prediction.

Symptoms:

  1. Training error is higher than $\epsilon$

Remedies:

  1. Use a more complex model (e.g. kernelize, use non-linear models)
  2. Add more features
  3. Boosting (to be covered later in the course)

Thought process to determine whether a model suffers from high bias or high variance: first check the training error. If it is above $\epsilon$, the model has high bias. If instead the training error is low but the test error is far above it, the model has high variance.

Example:

|                    | Bias | Variance |
| ------------------ | ---- | -------- |
| kNN with small $k$ | low  | high     |
| kNN with large $k$ | high | low      |
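The following sketch illustrates the variance column of the table with scikit-learn's `KNeighborsRegressor` on the toy data from the earlier snippets (the dataset helper is made up, as before): retraining with small $k$ yields predictions that fluctuate far more across datasets than with large $k$.

```python
from sklearn.neighbors import KNeighborsRegressor

x_grid_2d = np.linspace(-1, 1, 101).reshape(-1, 1)
for k in (1, 25):                            # small vs. large k
    preds = []
    for _ in range(200):                     # retrain on fresh datasets
        x, y = draw_dataset(n=50)
        knn = KNeighborsRegressor(n_neighbors=k).fit(x.reshape(-1, 1), y)
        preds.append(knn.predict(x_grid_2d))
    preds = np.stack(preds)
    variance = np.mean((preds - preds.mean(axis=0)) ** 2)
    print(f"k={k:2d}: variance across retrainings ≈ {variance:.4f}")
```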

Ideal Case

Training and test error are both below the acceptable test error line.

Reduce Noise

Noise can arise for two reasons, each with a corresponding remedy: the feature representation may be too weak to disambiguate the labels (remedy: add more informative features), or the labels themselves may be collected imprecisely (remedy: clean the data or improve the data collection process).