
Empirical Risk Minimization

Cornell University

Recap

Remember the unconstrained SVM formulation:

$$\min_{\mathbf{w}}\ C\underset{\text{Hinge-Loss}}{\underbrace{\sum_{i=1}^{n}\max[1-y_{i}\underset{h(\mathbf{x}_i)}{\underbrace{(\mathbf{w}^{\top}\mathbf{x}_i+b)}},0]}}+\underset{l_{2}\text{-Regularizer}}{\underbrace{\left\Vert \mathbf{w}\right\Vert _{2}^{2}}}$$

where $h(\mathbf{x}_i) = \mathbf{w}^{\top}\mathbf{x}_i+b$ is our prediction, the hinge loss is the SVM's error function, and the $l_{2}$-regularizer reflects the complexity of the solution and penalizes complex solutions (those with large $\mathbf{w}$).

We can generalize problems of this form as empirical risk minimization:

$$\min_{\mathbf{w}}\frac{1}{n}\sum_{i=1}^{n}\underset{\text{Loss}}{\underbrace{\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_{i})}}+\underset{\text{Regularizer}}{\underbrace{\lambda r(\mathbf{w})}},$$

where, for the SVM, $\ell = \max[1-y_{i}h(\mathbf{x}_i), 0]$, $r(\mathbf{w}) = \left\Vert \mathbf{w}\right\Vert _{2}^{2}$, and $\lambda = \frac{1}{C}$.
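
As a sanity check on this correspondence, here is a minimal NumPy sketch (not part of the original notes; data and function names are made up) that evaluates the SVM objective and its ERM form and confirms they agree up to an overall constant. Because the ERM form averages over $n$, the exact match uses $\lambda = \frac{1}{Cn}$; the notes' $\lambda = \frac{1}{C}$ absorbs the constant factor $n$.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Unconstrained SVM objective: C * sum of hinge losses + l2 regularizer.
    X is (n, d); y holds labels in {-1, +1}."""
    margins = y * (X @ w + b)                  # y_i * h(x_i)
    hinge = np.maximum(1.0 - margins, 0.0)     # max[1 - y_i h(x_i), 0]
    return C * hinge.sum() + np.dot(w, w)      # + ||w||_2^2

def erm_objective(w, b, X, y, lam):
    """The same model in ERM form: (1/n) * sum of losses + lambda * r(w)."""
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0)
    return hinge.mean() + lam * np.dot(w, w)

# Toy check: with lambda = 1/(C*n) the two objectives agree up to the factor C*n.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
w, b, C = rng.normal(size=3), 0.1, 2.0
n = X.shape[0]
print(np.isclose(svm_objective(w, b, X, y, C),
                 C * n * erm_objective(w, b, X, y, 1.0 / (C * n))))  # True
```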

Binary Classification Loss Functions

A loss function for binary classification is always a function of $h(\mathbf{x}_i)y_{i}$, which measures the correctness of the classification.

| Loss $\ell$ | Usage | Comments |
| --- | --- | --- |
| Hinge-Loss $\max\left[1-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i},0\right]^{p}$ | Standard SVM ($p=1$); (differentiable) squared hinge loss SVM ($p=2$) | When used for the standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere with $p=2$, but then it penalizes mistakes much more aggressively. |
| Log-Loss $\log(1+e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}})$ | Logistic Regression | One of the most popular loss functions in ML, since its outputs are well-calibrated probabilities. |
| Exponential Loss $e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}}$ | AdaBoost | Very aggressive: the loss of a misprediction increases exponentially with the value of $-h_{\mathbf{w}}(\mathbf{x}_i)y_i$. This can lead to nice convergence results, for example in the case of AdaBoost, but it can also cause problems with noisy or mislabeled data. |
| Zero-One Loss $\delta(\textrm{sign}(h_{\mathbf{w}}(\mathbf{x}_{i}))\neq y_{i})$ | Actual classification loss | Non-continuous and thus impractical to optimize. |

Figure 4.1: Plots of Common Classification Loss Functions:

  • x-axis: $h(\mathbf{x}_{i})y_{i}$, or “correctness” of prediction
  • y-axis: loss value

A minor point: from the graph we can see that the exponential loss is a strict upper bound on the 0/1 loss. This will be useful later when we prove the convergence of AdaBoost.
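
The classification losses above are straightforward to evaluate as functions of the "correctness" $h_{\mathbf{w}}(\mathbf{x}_i)y_i$. Below is a small NumPy sketch (illustrative only; the function names are mine) that reproduces the curves in Figure 4.1:

```python
import numpy as np

def hinge_loss(z, p=1):
    """Hinge loss max[1 - z, 0]^p, where z = h_w(x_i) * y_i."""
    return np.maximum(1.0 - z, 0.0) ** p

def log_loss(z):
    """Log loss log(1 + exp(-z))."""
    return np.log1p(np.exp(-z))

def exponential_loss(z):
    """Exponential loss exp(-z)."""
    return np.exp(-z)

def zero_one_loss(z):
    """Zero-one loss: 1 if the sign of h_w(x_i) disagrees with y_i, else 0."""
    return (z <= 0).astype(float)

# Evaluate the losses on a grid of "correctness" values, as in Figure 4.1.
z = np.linspace(-2.0, 2.0, 9)
for name, fn in [("hinge (p=1)", hinge_loss), ("log", log_loss),
                 ("exponential", exponential_loss), ("zero-one", zero_one_loss)]:
    print(f"{name:12s}", np.round(fn(z), 3))
# Note: as stated above, the exponential loss upper-bounds the zero-one loss.
```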

Regression Loss Functions

A loss function in regression is always a function of the offset between the prediction and the true value, $z = y - h(\mathbf{x})$.

| Loss $\ell$ | Comments |
| --- | --- |
| Squared Loss $\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}$ | Most popular regression loss function. Also known as Ordinary Least Squares (OLS). Estimates the mean label. ADVANTAGE: differentiable everywhere. DISADVANTAGE: somewhat sensitive to outliers/noise. |
| Absolute Loss $\vert h(\mathbf{x}_{i})-y_{i}\vert$ | Also a very popular loss function. Estimates the median label. ADVANTAGE: less sensitive to noise. DISADVANTAGE: not differentiable at 0 (the point which minimization is intended to bring us to). |
| Huber Loss $\begin{cases} \frac{1}{2}\left(h(\mathbf{x}_{i})-y_{i}\right)^{2} & \text{if } \vert h(\mathbf{x}_{i})-y_{i}\vert<\delta\\ \delta(\vert h(\mathbf{x}_{i})-y_{i}\vert-\frac{\delta}{2}) & \text{otherwise} \end{cases}$ | Also known as Smooth Absolute Loss. ADVANTAGE: "best of both worlds" of squared and absolute loss; once differentiable; behaves like the squared loss when the error is small and like the absolute loss when the error is large. |
| Log-Cosh Loss $\log(\cosh(h(\mathbf{x}_{i})-y_{i}))$, where $\cosh(x)=\frac{e^{x}+e^{-x}}{2}$ | ADVANTAGE: similar to Huber loss, but twice differentiable everywhere. |

With the squared loss, the largest error dominates all the others: the optimizer will do whatever it can to reduce the biggest residual. The absolute loss is in this sense an "improvement" over the squared loss because it treats all errors more fairly. For example, under the squared loss, 10 samples each off by 1 contribute a total loss of 10, while a single sample off by 10 contributes 100. Under the absolute loss, both scenarios contribute a total loss of 10.
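
This arithmetic is easy to reproduce. Here is a short NumPy sketch (illustrative, not from the notes) of the four regression losses, including the 10-residuals-of-1 versus 1-residual-of-10 comparison:

```python
import numpy as np

def squared_loss(z):
    """Squared loss z^2, where z = h(x_i) - y_i."""
    return z ** 2

def absolute_loss(z):
    """Absolute loss |z|."""
    return np.abs(z)

def huber_loss(z, delta=1.0):
    """Huber loss: quadratic for |z| < delta, linear otherwise."""
    return np.where(np.abs(z) < delta,
                    0.5 * z ** 2,
                    delta * (np.abs(z) - 0.5 * delta))

def log_cosh_loss(z):
    """Log-cosh loss log(cosh(z))."""
    return np.log(np.cosh(z))

# "10 residuals of 1" versus "1 residual of 10"
many_small = np.ones(10)
one_large = np.array([10.0])
print(squared_loss(many_small).sum(), squared_loss(one_large).sum())    # 10.0 vs 100.0
print(absolute_loss(many_small).sum(), absolute_loss(one_large).sum())  # 10.0 vs 10.0
```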


Figure 4.2: Plots of Common Regression Loss Functions:

  • x-axis: $h(\mathbf{x}_{i})-y_{i}$, or “error” of prediction
  • y-axis: loss value

Regularizers

Recall the method of Lagrange multipliers: for every $B\geq0$ there exists a $\lambda\geq0$ such that the two problems below are equivalent, and vice versa.

$$\min_{\mathbf{w}} f(\mathbf{w}) \textrm{ s.t. } g(\mathbf{w})\leq B \quad\Leftrightarrow\quad \min_{\mathbf{w}} f(\mathbf{w}) +\lambda g(\mathbf{w})$$

We can therefore rewrite the regularized optimization problem in an equivalent constrained form, which gives a better geometric intuition:

$$\min_{\mathbf{w},b} \sum_{i=1}^n\ell(h_\mathbf{w}(\mathbf{x}_i),y_i)+\lambda r(\mathbf{w}) \quad\Leftrightarrow\quad \min_{\mathbf{w},b} \sum_{i=1}^n\ell(h_\mathbf{w}(\mathbf{x}_i),y_i) \textrm{ subject to: } r(\mathbf{w})\leq B$$

| Regularizer $r(\mathbf{w})$ | Properties |
| --- | --- |
| $l_{2}$-Regularization $r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \Vert\mathbf{w}\Vert_{2}^{2}$ | ADVANTAGE: strictly convex, differentiable. DISADVANTAGE: dense solutions (it puts weight on all features, i.e. relies on all features to some degree; ideally we would like to avoid this). |
| $l_{1}$-Regularization $r(\mathbf{w}) = \Vert\mathbf{w}\Vert_{1}$ | Convex (but not strictly). DISADVANTAGE: not differentiable at 0 (the point which minimization is intended to bring us to). Effect: sparse (i.e. not dense) solutions. |
| $l_p$-Norm $\Vert\mathbf{w}\Vert_{p} = \left(\sum_{i=1}^d \vert w_{i}\vert^{p}\right)^{1/p}$ (often $0<p\leq1$) | DISADVANTAGE: non-convex, not differentiable, initialization dependent. ADVANTAGE: very sparse solutions. |

Figure 4.3: Plots of Common Regularizers
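
As a quick empirical illustration of the dense-versus-sparse behavior described in the table above, the following sketch fits $l_2$- and $l_1$-regularized least squares on synthetic data. It uses scikit-learn's Ridge and Lasso, which the notes do not reference, and arbitrary hyperparameters:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]                    # only 3 of the 20 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)               # l2-regularized squared loss
lasso = Lasso(alpha=0.1).fit(X, y)               # l1-regularized squared loss

print("nonzero weights (l2 / Ridge):", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 20
print("nonzero weights (l1 / Lasso):", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically ~3
```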

Famous Special Cases

This section covers several famous special cases of empirical risk minimization, including Ordinary Least Squares, Ridge Regression, Lasso, and Logistic Regression. Table 4.4 summarizes their loss functions, regularizers, and how they are solved.

| Loss and Regularizer | Comments |
| --- | --- |
| Ordinary Least Squares $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}$ | Squared loss. No regularization. Closed form solution: $\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^{\top}$, where $\mathbf{X}=[\mathbf{x}_{1}, \dots, \mathbf{x}_{n}]$ and $\mathbf{y}=[y_{1},\dots,y_{n}]$. |
| Ridge Regression $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\Vert\mathbf{w}\Vert_{2}^{2}$ | Squared loss. $l_2$-regularization. Closed form solution: $\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top}+\lambda\mathbb{I})^{-1}\mathbf{X}\mathbf{y}^{\top}$. |
| Lasso $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\Vert\mathbf{w}\Vert_{1}$ | + Sparsity inducing (good for feature selection). + Convex. - Not strictly convex (no unique solution). - Not differentiable (at 0). Solve with (sub-)gradient descent or SVEN. |
| Elastic Net $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\alpha\Vert\mathbf{w}\Vert_{1}+(1-\alpha)\Vert\mathbf{w}\Vert_{2}^{2}$, with $\alpha\in[0, 1)$ | + Strictly convex (i.e. unique solution). + Sparsity inducing (good for feature selection). + Dual of squared-loss SVM, see SVEN. - Non-differentiable. |
| Logistic Regression $\min_{\mathbf{w},b} \frac{1}{n}\sum_{i=1}^n \log(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)})$ | Often $l_1$- or $l_2$-regularized. Solve with gradient descent. Outputs well-calibrated probabilities $\Pr(y\mid\mathbf{x}) = \frac{1}{1+e^{-y(\mathbf{w}^{\top}\mathbf{x}+b)}}$. |
| Linear Support Vector Machine $\min_{\mathbf{w},b}\ C\sum_{i=1}^n \max[1-y_{i}(\mathbf{w}^\top\mathbf{x}_i+b), 0]+\Vert\mathbf{w}\Vert_2^2$ | Typically $l_2$-regularized (sometimes $l_1$). Quadratic program. When kernelized, leads to sparse solutions. The kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO). |

Table 4.4: Special Cases
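
The closed-form solutions for Ordinary Least Squares and Ridge Regression in Table 4.4 can be checked numerically. Below is a sketch under my own conventions: $\mathbf{X}$ stores the $\mathbf{x}_i$ as columns, as in the table, and the gradient check uses the unnormalized sum of squared errors, so the table's $\frac{1}{n}$ factor is absorbed into $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 50
X = rng.normal(size=(d, n))     # X = [x_1, ..., x_n], one column per example
y = rng.normal(size=n)          # y = [y_1, ..., y_n]
lam = 0.5

# Ordinary Least Squares: w = (X X^T)^{-1} X y^T
w_ols = np.linalg.solve(X @ X.T, X @ y)

# Ridge Regression: w = (X X^T + lambda I)^{-1} X y^T
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Sanity check: the gradient of the (unnormalized) objective vanishes at the solution.
grad_ols = 2 * X @ (X.T @ w_ols - y)                          # d/dw sum_i (w^T x_i - y_i)^2
grad_ridge = 2 * X @ (X.T @ w_ridge - y) + 2 * lam * w_ridge  # + d/dw lambda ||w||_2^2
print(np.allclose(grad_ols, 0.0), np.allclose(grad_ridge, 0.0))  # True True
```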