
Kernels

Cornell University

Linear classifiers are great, but what if there exists no linear decision boundary? The key observation is that a nonlinear classifier in our (usually low-dimensional) space is often just a linear classifier in a higher-dimensional space: when we project that higher-dimensional linear classifier back into our lower-dimensional space, it appears nonlinear. As it turns out, there is an elegant way to incorporate such non-linearities into most linear classifiers.

Handcrafted Feature Expansion

We can make linear classifiers non-linear by expanding the existing features. Formally, for a data vector $\mathbf{x}\in\mathbb{R}^d$, we apply the transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$, where $\phi(\mathbf{x})\in\mathbb{R}^D$. Usually $D \gg d$ because we add dimensions that capture non-linear interactions among the original features.

Consider the following example: $\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ \vdots \\ x_d \end{pmatrix}$, and define $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots \\x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d\\ \vdots \\x_1x_2\cdots x_d \end{pmatrix}$.

This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high: $D = 2^d$, one coordinate for every subset of the original features. This makes our algorithm unbearably (and quickly prohibitively) slow.
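To see how quickly this blows up, here is a minimal sketch (the helper name `phi_explicit` and the example data are made up for illustration): it builds $\phi(\mathbf{x})$ explicitly by enumerating one product term per subset of coordinates.

```python
import itertools
import numpy as np

def phi_explicit(x):
    """Explicit feature expansion: one product term per subset of coordinates.
    The empty subset gives the constant 1, singletons give x_i, pairs give
    x_i * x_j, and so on -- 2^d entries in total."""
    d = len(x)
    return np.array([np.prod([x[i] for i in subset])
                     for r in range(d + 1)
                     for subset in itertools.combinations(range(d), r)])

x = np.array([2.0, -1.0, 3.0])           # d = 3
print(phi_explicit(x))                   # 2^3 = 8 entries
print(len(phi_explicit(np.ones(16))))    # already 2^16 = 65,536 entries
```

Already for a few dozen original features the expansion no longer fits in memory, which is exactly the problem the kernel trick sidesteps.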

The Kernel Trick

Gradient Descent with Squared Loss

The kernel trick is a way to get around this dilemma by learning a function in the much higher-dimensional space without ever explicitly computing a single vector $\phi(\mathbf{x})$ or the full weight vector $\mathbf{w}$. Everything will be expressed only in terms of the inputs $\mathbf{x}$ and labels $y$.

It is based on the following observation: if we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, consider the squared loss:

$$
\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2
$$

The gradient descent rule updates $\mathbf{w}$ over time with step size / learning rate $s > 0$:

$$
\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\ \textrm{ where: } \frac{\partial \ell}{\partial \mathbf{w}}=\sum_{i=1}^n \underbrace{2(\mathbf{w}^\top \mathbf{x}_i-y_i)}_{\gamma_i :\ \textrm{function of } \mathbf{x}_i, y_i} \mathbf{x}_i = \sum_{i=1}^n\gamma_i \mathbf{x}_i
$$

where $\gamma_i$ is the derivative of the $i$-th loss term with respect to the prediction $h(\mathbf{x}_i)=\mathbf{w}^\top\mathbf{x}_i$: $\gamma_i = 2(h(\mathbf{x}_i)-y_i)$.
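A quick numerical sanity check of this observation (a minimal numpy sketch with made-up data and names) confirms that the directly computed gradient equals the linear combination $\sum_i \gamma_i \mathbf{x}_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))                  # rows are the inputs x_i
y = rng.normal(size=n)
w = rng.normal(size=d)

# Direct gradient of sum_i (w^T x_i - y_i)^2 with respect to w
grad_direct = 2 * X.T @ (X @ w - y)

# The same gradient written as sum_i gamma_i * x_i
gamma = 2 * (X @ w - y)                      # gamma_i = 2 (w^T x_i - y_i)
grad_combo = (gamma[:, None] * X).sum(axis=0)

print(np.allclose(grad_direct, grad_combo))  # True
```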

We will now show by induction that we can express $\mathbf{w}$ as a linear combination of all input vectors, namely

$$
\mathbf{w}=\sum_{i=1}^n \alpha_i \mathbf{x}_i.
$$

Base Case: Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, set $\mathbf{w}_0=\begin{pmatrix}0 \\ \vdots \\ 0\end{pmatrix}$. This is trivially a linear combination of the $\mathbf{x}_i$, with all coefficients $\alpha_i^0 = 0$.

Inductive Step:

$$
\begin{align}
\mathbf{w}_1=&\mathbf{w}_0-s\sum_{i=1}^n2(\mathbf{w}_0^\top \mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^0 \mathbf{x}_i-s\sum_{i=1}^n\gamma_i^0\mathbf{x}_i=\sum_{i=1}^n\alpha_i^1\mathbf{x}_i&(\textrm{with } \alpha_i^1=\alpha_i^0-s\gamma_i^0)\nonumber\\
\mathbf{w}_2=&\mathbf{w}_1-s\sum_{i=1}^n2(\mathbf{w}_1^\top \mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^1\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^1\mathbf{x}_i=\sum_{i=1}^n\alpha_i^2\mathbf{x}_i&(\textrm{with } \alpha_i^2=\alpha_i^1-s\gamma_i^1)\nonumber\\
\cdots & \qquad\qquad\qquad\cdots &\cdots\nonumber\\
\mathbf{w}_t=&\mathbf{w}_{t-1}-s\sum_{i=1}^n2(\mathbf{w}_{t-1}^\top \mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^{t-1}\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^{t-1}\mathbf{x}_i=\sum_{i=1}^n\alpha_i^t\mathbf{x}_i&(\textrm{with } \alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1})\nonumber
\end{align}
$$

The update rule for $\alpha_i^t$ is thus

$$
\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1}
$$

Since $\alpha_i^0 = 0$, we can write the closed form

$$
\alpha_i^t=-s\sum_{r=0}^{t-1}\gamma_i^{r}
$$

Therefore, we have shown that we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1,\dots,\alpha_n$.
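The following sketch (illustrative code with made-up names, assuming the squared loss above) runs gradient descent twice, once on $\mathbf{w}$ and once on the coefficients $\alpha_i$, and checks that $\mathbf{w}_t = \sum_i \alpha_i^t \mathbf{x}_i$ holds at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s, T = 6, 4, 0.01, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)          # primal weights, w_0 = 0
alpha = np.zeros(n)      # coefficients, alpha_i^0 = 0

for _ in range(T):
    gamma = 2 * (X @ w - y)        # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * X.T @ gamma        # w <- w - s * sum_i gamma_i x_i
    alpha = alpha - s * gamma      # alpha_i <- alpha_i - s * gamma_i

# w is still exactly the linear combination sum_i alpha_i x_i
print(np.allclose(w, X.T @ alpha))  # True
```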

Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the prediction purely in terms of inner products between the input $\mathbf{x}_t$ and the training inputs:

$$
h(\mathbf{x}_t)=\mathbf{w}^\top \mathbf{x}_t=\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top \mathbf{x}_t.
$$

Consequently, we can also re-write the squared loss $\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2$ entirely in terms of inner products between training inputs:

$$
\ell(\mathbf{\alpha}) = \sum_{i=1}^n \left(\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top \mathbf{x}_i-y_i\right)^2
$$

Do you notice a theme? The only information we ever need in order to learn a hyperplane classifier with the squared loss is inner products between all pairs of data vectors.

Inner-Product Computation

Let’s go back to the previous example, $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots \\x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d\\ \vdots \\x_1x_2\cdots x_d \end{pmatrix}$.

The inner product $\phi(\mathbf{x})^\top \phi(\mathbf{z})$ can be formulated as:

$$
\phi(\mathbf{x})^\top \phi(\mathbf{z})=1\cdot 1+x_1z_1+x_2z_2+\cdots +x_1x_2z_1z_2+ \cdots +x_1\cdots x_d z_1\cdots z_d=\prod_{k=1}^d(1+x_kz_k)\text{.}
$$

The sum of $2^d$ terms becomes the product of $d$ terms. Define the kernel function $\mathsf{k}$ as:
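A small check of this identity (illustrative code; `phi_explicit` enumerates the same subset products as in the sketch above) compares the $O(2^d)$ explicit inner product with the $O(d)$ product form:

```python
import itertools
import numpy as np

def phi_explicit(x):
    """All 2^d subset products of the coordinates: 1, x_i, x_i x_j, ..., x_1...x_d."""
    d = len(x)
    return np.array([np.prod([x[i] for i in subset])
                     for r in range(d + 1)
                     for subset in itertools.combinations(range(d), r)])

rng = np.random.default_rng(2)
x, z = rng.normal(size=4), rng.normal(size=4)

lhs = phi_explicit(x) @ phi_explicit(z)   # explicit: 2^d multiplications
rhs = np.prod(1 + x * z)                  # kernel form: d multiplications
print(np.allclose(lhs, rhs))              # True
```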

$$
\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) =\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)
$$

With a finite training set of $n$ samples, inner products are often pre-computed and stored in a Kernel Matrix:

$$
\mathsf{K}_{ij} = \mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)
$$
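For instance, with the product kernel from above, the matrix can be precomputed in one pass over all pairs (a minimal sketch with made-up names such as `product_kernel`):

```python
import numpy as np

def product_kernel(x, z):
    """k(x, z) = prod_k (1 + x_k z_k): the kernel of the all-products expansion."""
    return np.prod(1 + x * z)

rng = np.random.default_rng(3)
n, d = 5, 3
X = rng.normal(size=(n, d))

# K_ij = k(x_i, x_j), an n x n matrix computed once before training
K = np.array([[product_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print(K.shape)                  # (5, 5)
print(np.allclose(K, K.T))      # True: K is symmetric
```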

Obviously, $\mathsf{K}$ is symmetric. If we store the matrix $\mathsf{K}$, we only need to do simple matrix look-ups and low-dimensional computations throughout the gradient descent algorithm. To make the formulas more readable, we will use the “kernel function” representation instead of the “matrix” representation. The final classifier becomes:

$$
h(\mathbf{x}_t)=\sum_{i=1}^n\alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x}_t)
$$

So we can rewrite $\gamma_i$ as

$$
\gamma_i = 2(h(\mathbf{x}_i)-y_i) = 2 \left[\left(\sum_{j=1}^n \alpha_j \mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) \right)-y_i \right]
$$

The gradient update becomes:

$$
\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1} = \alpha_i^{t-1} - 2s\left[\left(\sum_{j=1}^n \alpha_j^{t-1} \mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) \right)-y_i \right]
$$
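Putting the pieces together, the training loop only ever touches the kernel matrix, and prediction only evaluates $\mathsf{k}$ against the training points. Here is a minimal sketch (assuming the product kernel and made-up names and data):

```python
import numpy as np

def product_kernel(x, z):
    return np.prod(1 + x * z)      # k(x, z) = prod_k (1 + x_k z_k)

rng = np.random.default_rng(4)
n, d, s, T = 8, 3, 1e-3, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Precompute the kernel matrix once; training never forms phi(x) or w.
K = np.array([[product_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

alpha = np.zeros(n)
for _ in range(T):
    gamma = 2 * (K @ alpha - y)    # gamma_i = 2 (sum_j alpha_j k(x_i, x_j) - y_i)
    alpha = alpha - s * gamma      # alpha_i <- alpha_i - s * gamma_i

def h(x_test):
    """Kernelized prediction h(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * product_kernel(xi, x_test) for a, xi in zip(alpha, X))

print(h(rng.normal(size=d)))       # prediction for a new point
```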

General Kernels

Kernel functions

Can any function $\mathsf{k}(\cdot,\cdot)$ be used as a kernel?

No. The matrix $\mathsf{K}$ with entries $\mathsf{K}_{ij}=\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner products of $\phi(\mathbf{x})$, i.e. of the inputs $\mathbf{x}$ after some transformation $\phi$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.

Proof: Recall the definition of positive semi-definite: a matrix $A\in \mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall \mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q}\geq 0$.
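Numerically, one can sanity-check this condition on a sample of points: for a valid kernel, all eigenvalues of $\mathsf{K}$ are nonnegative up to round-off, and $\mathbf{q}^\top \mathsf{K}\mathbf{q}\geq 0$ for any $\mathbf{q}$. An illustrative sketch, reusing the product kernel from above:

```python
import numpy as np

def product_kernel(x, z):
    return np.prod(1 + x * z)      # a valid kernel: it equals phi(x)^T phi(z)

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))
K = np.array([[product_kernel(xi, xj) for xj in X] for xi in X])

# Positive semi-definite: all eigenvalues >= 0 (up to numerical round-off)
print(np.linalg.eigvalsh(K).min() >= -1e-8)   # True

q = rng.normal(size=10)
print(q @ K @ q >= -1e-8)                     # q^T K q >= 0 for any q
```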