
SVM - Support Vector Machine

Cornell University
[Figure: left, possible separating hyperplanes found by the perceptron; right, the maximum margin separating hyperplane found by the SVM.]

The left picture shows what we have achieved so far with the perceptron. The perceptron can return either the blue line or the red line, so the result does not generalize well. We can view the SVM (on the right) as an improvement of the perceptron: it finds the separating hyperplane that maximizes the distance to the closest data points from both classes. We say it is the maximum margin separating hyperplane.

Defining Margin

We know the distance from a point $\mathbf{x}$ to a hyperplane $\mathcal{H} = \left\{\mathbf{x} \mid \mathbf{w}^T\mathbf{x}+b=0\right\}$ is $d(\mathbf{x}, \mathcal{H}) = \frac{\left|\mathbf{w}^T\mathbf{x}+b\right|}{\left\|\mathbf{w}\right\|_{2}}$.

Using the definition of margin from the perceptron, the margin of $\mathcal{H}$ with respect to our dataset $D$ is

$$\gamma(\mathbf{w},b)=\min_{\mathbf{x}\in D}\frac{\left|\mathbf{w}^T\mathbf{x}+b\right|}{\left\|\mathbf{w}\right\|_{2}}$$

By definition, the margin and hyperplane are scale invariant: $\gamma(\beta\mathbf{w},\beta b)=\gamma(\mathbf{w},b),\ \forall \beta \neq 0$.
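To make these two definitions concrete, here is a minimal sketch (the toy points and hyperplane parameters are made up purely for illustration) that computes $\gamma(\mathbf{w},b)$ and checks the scale invariance numerically:

```python
import numpy as np

# Made-up toy dataset and hyperplane parameters, purely for illustration.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
w, b = np.array([1.0, 1.0]), -0.5

def margin(w, b, X):
    """gamma(w, b): smallest distance from any point in X to the hyperplane w^T x + b = 0."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

print(margin(w, b, X))              # margin of this hyperplane with respect to the data
print(margin(3.0 * w, 3.0 * b, X))  # identical value: the margin is scale invariant
```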

Max Margin Classifier

We can formulate our search for the maximum margin separating hyperplane as a constrained optimization problem. The objective is to maximize the margin under the constraints that all data points must lie on the correct side of the hyperplane:

$$\underbrace{\max_{\mathbf{w},b}\gamma(\mathbf{w},b)}_{\text{maximize margin}} \quad \text{such that} \quad \underbrace{\forall i\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 0}_{\text{separating hyperplane}}$$

If we plug in the definition of $\gamma$ we obtain:

$$\underbrace{\max_{\mathbf{w},b}\underbrace{\frac{1}{\left\|\mathbf{w}\right\|_{2}}\min_{\mathbf{x}_{i}\in D}\left|\mathbf{w}^T\mathbf{x}_{i}+b\right|}_{\gamma(\mathbf{w},b)}}_{\text{maximize margin}} \quad \text{s.t.} \quad \underbrace{\forall i\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 0}_{\text{separating hyperplane}}$$

Maximizing a minimum directly is hard, so we want to remove the $\min$ from the quantity we are maximizing by turning it into a constraint. Because the margin and hyperplane are scale invariant, we can fix the scale of $\mathbf{w},b$ any way we want. We choose it such that

$$\min_{\mathbf{x}\in D}\left|\mathbf{w}^T\mathbf{x}+b\right|=1$$

so our objective becomes:

$$\max_{\mathbf{w},b}\frac{1}{\left\|\mathbf{w}\right\|_{2}}\min_{\mathbf{x}\in D}\left|\mathbf{w}^T\mathbf{x}+b\right| = \max_{\mathbf{w},b}\frac{1}{\left\|\mathbf{w}\right\|_{2}}\cdot 1 = \min_{\mathbf{w},b}\left\|\mathbf{w}\right\|_{2} = \min_{\mathbf{w},b}\mathbf{w}^\top\mathbf{w}$$

Adding this rescaling as an equality constraint to the overall problem, our new optimization problem to solve is:

$$\begin{aligned} &\min_{\mathbf{w},b}\ \mathbf{w}^\top\mathbf{w} \\ &\textrm{s.t. }\ \forall i,\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 0 \\ &\phantom{\textrm{s.t. }\ } \min_{i}\left|\mathbf{w}^T\mathbf{x}_{i}+b\right| = 1 \end{aligned}$$

These two constraints together are equivalent to $\forall i\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b) \geq 1$ at the optimum. One direction is immediate: on the correct side we have $y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)=\left|\mathbf{w}^T\mathbf{x}_{i}+b\right|\geq\min_{i}\left|\mathbf{w}^T\mathbf{x}_{i}+b\right|=1$. For the other direction, note that any minimizer of $\mathbf{w}^\top\mathbf{w}$ subject to $\forall i\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 1$ must satisfy $\min_{i}\left|\mathbf{w}^T\mathbf{x}_{i}+b\right|=1$; otherwise we could rescale $\mathbf{w},b$ by that minimum and obtain a feasible solution with smaller $\mathbf{w}^\top\mathbf{w}$. So we can rewrite our optimization problem as:

$$\begin{aligned} &\min_{\mathbf{w},b}\ \mathbf{w}^T\mathbf{w} \\ &\textrm{s.t. }\ \forall i\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b) \geq 1 \end{aligned}$$

This new formulation is simply a quadratic optimization problem: the objective is quadratic and the constraints are all linear, so we can solve it efficiently with any QP or QCQP (Quadratically Constrained Quadratic Program) solver. It has a unique solution whenever a separating hyperplane exists. It also has a nice interpretation: find the simplest hyperplane (where simpler means smaller $\mathbf{w}^\top\mathbf{w}$) such that all inputs lie at least 1 unit away from the hyperplane on the correct side.
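To make this concrete, here is a minimal sketch that hands the problem to a generic convex solver. It assumes `cvxpy` and `numpy` are installed, and the toy data below is made up (and linearly separable) purely for illustration:

```python
import numpy as np
import cvxpy as cp  # generic convex optimization; any QP solver would work

# Made-up, linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min w^T w   s.t.   y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin =", 1 / np.linalg.norm(w.value))  # gamma = 1 / ||w||_2 at the optimum
```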

An intuition behind this formulation: the new constraint can be written as $\left|\mathbf{w}^T\mathbf{x}_{i}+b\right| \geq 1$. Note that this is $\left\|\mathbf{w}\right\|_{2}\, d(\mathbf{x}_i, \mathcal{H}) \geq 1$, i.e. $\sqrt{\mathbf{w}^T\mathbf{w}}\, d(\mathbf{x}_i, \mathcal{H}) \geq 1$. Since we are minimizing $\mathbf{w}^T\mathbf{w}$ (and therefore $\sqrt{\mathbf{w}^T\mathbf{w}}$), the only way to satisfy the $\geq 1$ constraint is to make the distances $d(\mathbf{x}_i, \mathcal{H})$ large, which is exactly what we want.

We can write the prediction as $h(\mathbf{x}_i) = \mathbf{w}^{\top}\mathbf{x}_i+b$, so $h(\mathbf{x}_i)\,y_{i}$ measures the correctness of the classification: if it is positive, the prediction is correct, and the larger this value, the more confidently correct it is; if it is negative, the prediction is wrong.
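As a tiny illustration (the hyperplane parameters and the labeled point below are made up):

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5   # hypothetical hyperplane parameters
x, y = np.array([1.5, 1.0]), 1      # a labeled point

h = w @ x + b      # h(x) = w^T x + b
print(h * y)       # > 0: correct; the larger the value, the further x is on the correct side
```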

Support Vectors

For the optimal $\mathbf{w},b$ pair, some training points will have tight constraints, i.e.

$$y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b) = 1$$

We refer to these training points as support vectors. These vectors define the maximum margin of the hyperplane to the data set and they therefore determine the shape of the hyperplane. If you were to move one of them and retrain the SVM, the resulting hyperplane would change. The opposite is the case for non-support vectors.
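A short sketch of this, assuming scikit-learn is installed and reusing the made-up separable toy data from above: scikit-learn's `SVC` solves the soft-margin variant introduced in the next section, and setting its regularization parameter `C` very large approximates the hard-margin problem, so the fitted model exposes exactly those points whose constraints are tight.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# The support vectors are exactly the points with y_i (w^T x_i + b) = 1 (up to solver tolerance).
w, b = clf.coef_.ravel(), clf.intercept_[0]
print("y_i (w^T x_i + b) =", y * (X @ w + b))
```

Dropping a non-support vector and refitting would leave `w` and `b` unchanged; dropping a support vector generally would not.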

SVM with soft constraints

If the data is low dimensional, it is often the case that there is no separating hyperplane between the two classes, so there is no solution to the optimization problems above. We can fix this by allowing the constraints to be violated ever so slightly (letting a point $\mathbf{x}_i$ lie closer to the hyperplane than the margin, or even on the wrong side) through the introduction of slack variables $\xi_i$. At the same time, we don't want to loosen the constraints too much, so we also minimize these slack variables along the way:

$$\begin{aligned} &\min_{\mathbf{w},b}\ \mathbf{w}^T\mathbf{w}+C\sum_{i=1}^{n}\xi_{i} \\ &\textrm{s.t. }\ \forall i\ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 1-\xi_i \\ &\phantom{\textrm{s.t. }\ } \forall i\ \xi_i \geq 0 \end{aligned}$$

As the slack $\xi_i$ gets larger, we allow the point $\mathbf{x}_i$ to be closer to the hyperplane, $y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b) < 1$, or even on the other side of it, $y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b) < 0$.

If $C$ is very large, the SVM becomes very strict and tries to get all points to be on the right side of the hyperplane. If $C$ is very small, the SVM becomes very loose and may “sacrifice” some points to obtain a simpler (i.e. lower $\|\mathbf{w}\|_2^2$) solution.
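Here is a hedged sketch of the soft-margin program with explicit slack variables, again using cvxpy on made-up data that is not linearly separable; the two values of $C$ are arbitrary and only meant to show the strict-versus-loose behaviour described above:

```python
import numpy as np
import cvxpy as cp  # assumes cvxpy is installed

# Made-up toy data that is NOT linearly separable (one +1 point lies past the -1 points).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.5, -1.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

def soft_margin_svm(C):
    w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(len(y))
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value

for C in (0.01, 100.0):
    w, b, xi = soft_margin_svm(C)
    print(f"C={C}: ||w||_2 = {np.linalg.norm(w):.3f}, total slack = {xi.sum():.3f}")
```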

Unconstrained Formulation

We will express our “SVM with soft constraints” in an unconstrained form.

Note that if $y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)<1$, i.e. the current point is too close to the hyperplane (or on the other side of it), then $\xi_i$ is the amount by which the point falls short of the margin. If the point $\mathbf{x}_i$ is just as far away as the other good points, $\xi_i$ is simply 0. We can formulate this as:

$$\xi_i=\left\{\begin{array}{ll} 1-y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b) & \textrm{ if } y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)<1\\ 0 & \textrm{ if } y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 1 \end{array}\right.$$

which is equivalent to the following closed form:

$$\xi_i=\max\left(1-y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b),\,0\right)$$

This closed form is called the hinge loss. If we plug it into the objective of our SVM optimization problem, we obtain the following unconstrained version, consisting of a loss term and a regularizer:

$$\min_{\mathbf{w},b}\underbrace{\mathbf{w}^T\mathbf{w}}_{l_{2}\text{-regularizer}}+ C\,\sum_{i=1}^{n}\underbrace{\max\left[1-y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b),\,0\right]}_{\text{hinge loss}}$$

We optimize the SVM parameters $\mathbf{w},b$ by minimizing this loss function just like we did in logistic regression (e.g. through gradient descent). The only difference is that we have the hinge loss here instead of the logistic loss.
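A minimal sketch of this unconstrained view using subgradient descent (the hinge loss is not differentiable at its kink, so we use a subgradient); the toy data is made up, and the values of C, the learning rate, and the step count are arbitrary choices:

```python
import numpy as np

# Made-up toy data with labels in {-1, +1}; hyperparameters are arbitrary choices.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C, lr, steps = 10.0, 0.01, 2000

w, b = np.zeros(2), 0.0
for _ in range(steps):
    margins = y * (X @ w + b)
    active = margins < 1                      # points with nonzero hinge loss
    # Subgradient of  w^T w + C * sum_i max(1 - y_i (w^T x_i + b), 0)
    grad_w = 2 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("w =", w, " b =", b)
print("training predictions:", np.sign(X @ w + b))
```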