
Model Selection Tricks

Cornell University

Overfitting and Underfitting

There are two problematic cases that can arise when learning a classifier on a data set: underfitting and overfitting. Each describes how well the patterns learned from the training set extrapolate to unseen data:

Identifying the Regularizer λ


Figure 1: overfitting and underfitting. Note the x-axis is λ:

  • the bigger λ is, the simpler the model, so it is more likely to underfit;
  • the smaller λ is, the more complex the model, so it is more likely to overfit.

K-Fold Cross Validation

To estimate validation error during training, we can use k-fold cross validation: divide the training data into k partitions. Train on k−1 of them and hold one out as the validation set. Do this k times (i.e. leave out every partition exactly once) and average the validation error across runs. This gives you a good estimate of the validation error (and even its standard deviation).
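The procedure above can be sketched as follows. This is a minimal illustration, not a library API: `train_and_eval` is a hypothetical placeholder for whatever learner you use.

```python
import numpy as np

def k_fold_errors(X, y, train_and_eval, k=5, seed=0):
    """Estimate validation error with k-fold cross validation.

    train_and_eval(X_tr, y_tr, X_val, y_val) -> validation error (float);
    a placeholder for training and scoring your own model.
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)          # k (nearly) equal partitions
    errors = []
    for i in range(k):                      # leave out each fold exactly once
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_eval(X[tr], y[tr], X[val], y[val]))
    errors = np.array(errors)
    return errors.mean(), errors.std()      # average error and its spread
```

Setting `k = len(y)` recovers LOOCV, since each fold then contains a single point.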

In the extreme case, you can set k = n, i.e. leave out only a single data point each time (this is often referred to as LOOCV, Leave-One-Out Cross Validation). LOOCV is important when your data set is small and you cannot afford to hold out many data points for evaluation.

We can also use k-fold cross validation solely to determine the hyperparameter λ: for each candidate λ, run k-fold cross validation and record the average validation error; then select the λ that gives the best validation error.

No matter what we use k-fold for, after selecting the best model/hyperparameter based on k-fold, we have to train the model on the whole dataset one more time (remember we had split it into k−1 training partitions and 1 validation partition), and use that as our final model.
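Putting λ selection and the final retraining together, here is a hedged sketch. Ridge regression (closed form) stands in for an arbitrary model; the function names are illustrative, not from any library.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def select_lambda(X, y, lambdas, k=5, seed=0):
    """Pick the lambda with the lowest average k-fold validation MSE,
    then retrain on the FULL dataset with that lambda."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)

    def cv_error(lam):
        errs = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            w = ridge_fit(X[tr], y[tr], lam)
            errs.append(np.mean((X[val] @ w - y[val]) ** 2))
        return np.mean(errs)

    best = min(lambdas, key=cv_error)
    return best, ridge_fit(X, y, best)   # final model trained on all data
```

Note the last line: the returned model is refit on the entire dataset, exactly as the paragraph above prescribes.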

Do two searches:

  1. First find the best order of magnitude for λ.
  2. Then do a more fine-grained search around the best λ found so far, within that order of magnitude.

For example, first you try λ = 0.01, 0.1, 1, 10, 100. It turns out 10 is the best performing value. Then you try λ = 5, 10, 15, 20, 25, ..., 95 to test values “around” 10.
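The two-stage search can be sketched as below. This is a minimal illustration; `cv_error` is a hypothetical callable returning the cross-validation error for a given λ, and the fine grid (0.5× to 9.5× the coarse winner) simply mirrors the 5, 10, ..., 95 example above.

```python
import numpy as np

def coarse_to_fine(cv_error, magnitudes=(0.01, 0.1, 1, 10, 100)):
    """Two-stage search for lambda: first over orders of magnitude,
    then a finer sweep around the coarse winner."""
    coarse_best = min(magnitudes, key=cv_error)
    # fine sweep around the coarse winner,
    # e.g. 5, 10, ..., 95 when coarse_best = 10
    fine_grid = coarse_best * np.arange(0.5, 10, 0.5)
    return min(fine_grid, key=cv_error)
```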


Figure 2: Grid search vs. random search (red indicates lower loss, blue indicates higher loss).

Early Stopping

Stop your optimization after M (≥ 0) gradient steps, even if the optimization has not converged yet, because the fully converged model almost always overfits.

In practice, save the model as you train, and keep the one with the smallest validation error (i.e. we do not necessarily use the final model, which effectively means we stopped training early).
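The save-as-you-train idea can be sketched as follows. This is a minimal, assumed setup: gradient descent on squared loss for a linear model, snapshotting the weights whenever validation error improves.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.05, max_steps=1000):
    """Gradient descent on squared loss for a linear model,
    keeping a snapshot of the best-validation weights."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_err = w.copy(), np.inf
    for step in range(max_steps):
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:          # save the best model so far
            best_err, best_w = val_err, w.copy()
    return best_w, best_err             # not necessarily the final w
```

Returning `best_w` instead of the final `w` is exactly the point: the step at which validation error bottomed out plays the role of M in Figure 3.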


Figure 3: Early stopping. Note that unlike Figure 1, the x-axis here is M, the number of gradient steps.