
Kernels

Cornell University

Linear classifiers are great, but what if there exists no linear decision boundary? The key observation is that a nonlinear classifier in our (usually low-dimensional) space is often just a linear classifier in a higher-dimensional space: when we project that higher-dimensional linear classifier back into our lower-dimensional space, it appears nonlinear. As it turns out, there is an elegant way to incorporate such non-linearities into most linear classifiers.

Handcrafted Feature Expansion

We can make linear classifiers non-linear by expanding the existing features. Formally, for a data vector $\mathbf{x}\in\mathbb{R}^d$, we apply the transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$, where $\phi(\mathbf{x})\in\mathbb{R}^D$. Usually $D \gg d$ because we add dimensions that capture non-linear interactions among the original features.

Consider the following example: $\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ \vdots \\ x_d \end{pmatrix}$, and define $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots \\x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d\\ \vdots \\x_1x_2\cdots x_d \end{pmatrix}$.

This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high: $D = 2^d$, one coordinate for every subset of the original features. This makes our algorithm unbearably (and quickly prohibitively) slow.
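To see how quickly this blows up, here is a minimal sketch (the helper name `phi_explicit` and the example data are made up for illustration): it builds $\phi(\mathbf{x})$ explicitly by enumerating one product term per subset of coordinates.

```python
import itertools
import numpy as np

def phi_explicit(x):
    """Explicit feature expansion: one product term per subset of coordinates.
    The empty subset gives the constant 1, singletons give x_i, pairs give
    x_i * x_j, and so on -- 2^d entries in total."""
    d = len(x)
    return np.array([np.prod([x[i] for i in subset])
                     for r in range(d + 1)
                     for subset in itertools.combinations(range(d), r)])

x = np.array([2.0, -1.0, 3.0])           # d = 3
print(phi_explicit(x))                   # 2^3 = 8 entries
print(len(phi_explicit(np.ones(16))))    # already 2^16 = 65,536 entries
```

Already for a few dozen original features the expansion no longer fits in memory, which is exactly the problem the kernel trick sidesteps.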

The Kernel Trick

Gradient Descent with Squared Loss

The kernel trick is a way to get around this dilemma by learning a function in the much higher-dimensional space without ever explicitly computing a single vector $\phi(\mathbf{x})$ or the full weight vector $\mathbf{w}$. Everything will be expressed only in terms of the inputs $\mathbf{x}$ and labels $y$.

It is based on the following observation: if we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, consider the squared loss:

$$
\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2
$$

The gradient descent rule updates $\mathbf{w}$ over time with step size / learning rate $s > 0$:

$$
\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\ \textrm{ where: } \frac{\partial \ell}{\partial \mathbf{w}}=\sum_{i=1}^n \underbrace{2(\mathbf{w}^\top \mathbf{x}_i-y_i)}_{\gamma_i :\ \textrm{function of } \mathbf{x}_i, y_i} \mathbf{x}_i = \sum_{i=1}^n\gamma_i \mathbf{x}_i
$$

where $\gamma_i$ is the derivative of the $i$-th loss term with respect to the prediction $h(\mathbf{x}_i)=\mathbf{w}^\top\mathbf{x}_i$: $\gamma_i = 2(h(\mathbf{x}_i)-y_i)$.
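A quick numerical sanity check of this observation (a minimal numpy sketch with made-up data and names) confirms that the directly computed gradient equals the linear combination $\sum_i \gamma_i \mathbf{x}_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))                  # rows are the inputs x_i
y = rng.normal(size=n)
w = rng.normal(size=d)

# Direct gradient of sum_i (w^T x_i - y_i)^2 with respect to w
grad_direct = 2 * X.T @ (X @ w - y)

# The same gradient written as sum_i gamma_i * x_i
gamma = 2 * (X @ w - y)                      # gamma_i = 2 (w^T x_i - y_i)
grad_combo = (gamma[:, None] * X).sum(axis=0)

print(np.allclose(grad_direct, grad_combo))  # True
```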

We will now show by induction that we can express $\mathbf{w}$ as a linear combination of all input vectors, namely

$$
\mathbf{w}=\sum_{i=1}^n \alpha_i \mathbf{x}_i.
$$

Base Case: Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, set $\mathbf{w}_0=\begin{pmatrix}0 \\ \vdots \\ 0\end{pmatrix}$. This is trivially a linear combination of the $\mathbf{x}_i$, with all coefficients $\alpha_i^0 = 0$.

Inductive Step:

$$
\begin{align}
\mathbf{w}_1=&\mathbf{w}_0-s\sum_{i=1}^n2(\mathbf{w}_0^\top \mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^0 \mathbf{x}_i-s\sum_{i=1}^n\gamma_i^0\mathbf{x}_i=\sum_{i=1}^n\alpha_i^1\mathbf{x}_i&(\textrm{with } \alpha_i^1=\alpha_i^0-s\gamma_i^0)\nonumber\\
\mathbf{w}_2=&\mathbf{w}_1-s\sum_{i=1}^n2(\mathbf{w}_1^\top \mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^1\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^1\mathbf{x}_i=\sum_{i=1}^n\alpha_i^2\mathbf{x}_i&(\textrm{with } \alpha_i^2=\alpha_i^1-s\gamma_i^1)\nonumber\\
\cdots & \qquad\qquad\qquad\cdots &\cdots\nonumber\\
\mathbf{w}_t=&\mathbf{w}_{t-1}-s\sum_{i=1}^n2(\mathbf{w}_{t-1}^\top \mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^{t-1}\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^{t-1}\mathbf{x}_i=\sum_{i=1}^n\alpha_i^t\mathbf{x}_i&(\textrm{with } \alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1})\nonumber
\end{align}
$$

The update rule for $\alpha_i^t$ is thus

$$
\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1}
$$

Since $\alpha_i^0 = 0$, we can write the closed form

$$
\alpha_i^t=-s\sum_{r=0}^{t-1}\gamma_i^{r}
$$

Therefore, we have shown that we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1,\dots,\alpha_n$.
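The following sketch (illustrative code with made-up names, assuming the squared loss above) runs gradient descent twice, once on $\mathbf{w}$ and once on the coefficients $\alpha_i$, and checks that $\mathbf{w}_t = \sum_i \alpha_i^t \mathbf{x}_i$ holds at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s, T = 6, 4, 0.01, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)          # primal weights, w_0 = 0
alpha = np.zeros(n)      # coefficients, alpha_i^0 = 0

for _ in range(T):
    gamma = 2 * (X @ w - y)        # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * X.T @ gamma        # w <- w - s * sum_i gamma_i x_i
    alpha = alpha - s * gamma      # alpha_i <- alpha_i - s * gamma_i

# w is still exactly the linear combination sum_i alpha_i x_i
print(np.allclose(w, X.T @ alpha))  # True
```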

Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the prediction purely in terms of inner products between the input $\mathbf{x}_t$ and the training inputs:

$$
h(\mathbf{x}_t)=\mathbf{w}^\top \mathbf{x}_t=\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top \mathbf{x}_t.
$$

Consequently, we can also re-write the squared loss $\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2$ entirely in terms of inner products between training inputs:

$$
\ell(\mathbf{\alpha}) = \sum_{i=1}^n \left(\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top \mathbf{x}_i-y_i\right)^2
$$

Do you notice a theme? The only information we ever need in order to learn a hyperplane classifier with the squared loss is inner products between all pairs of data vectors.

Inner-Product Computation

Let’s go back to the previous example, $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots \\x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d\\ \vdots \\x_1x_2\cdots x_d \end{pmatrix}$.

The inner product $\phi(\mathbf{x})^\top \phi(\mathbf{z})$ can be formulated as:

$$
\phi(\mathbf{x})^\top \phi(\mathbf{z})=1\cdot 1+x_1z_1+x_2z_2+\cdots +x_1x_2z_1z_2+ \cdots +x_1\cdots x_d z_1\cdots z_d=\prod_{k=1}^d(1+x_kz_k)\text{.}
$$

The sum of $2^d$ terms becomes the product of $d$ terms. Define the kernel function $\mathsf{k}$ as:
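A small check of this identity (illustrative code; `phi_explicit` enumerates the same subset products as in the sketch above) compares the $O(2^d)$ explicit inner product with the $O(d)$ product form:

```python
import itertools
import numpy as np

def phi_explicit(x):
    """All 2^d subset products of the coordinates: 1, x_i, x_i x_j, ..., x_1...x_d."""
    d = len(x)
    return np.array([np.prod([x[i] for i in subset])
                     for r in range(d + 1)
                     for subset in itertools.combinations(range(d), r)])

rng = np.random.default_rng(2)
x, z = rng.normal(size=4), rng.normal(size=4)

lhs = phi_explicit(x) @ phi_explicit(z)   # explicit: 2^d multiplications
rhs = np.prod(1 + x * z)                  # kernel form: d multiplications
print(np.allclose(lhs, rhs))              # True
```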

$$
\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) =\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)
$$

With a finite training set of $n$ samples, inner products are often pre-computed and stored in a Kernel Matrix:

$$
\mathsf{K}_{ij} = \mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)
$$
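For instance, with the product kernel from above, the matrix can be precomputed in one pass over all pairs (a minimal sketch with made-up names such as `product_kernel`):

```python
import numpy as np

def product_kernel(x, z):
    """k(x, z) = prod_k (1 + x_k z_k): the kernel of the all-products expansion."""
    return np.prod(1 + x * z)

rng = np.random.default_rng(3)
n, d = 5, 3
X = rng.normal(size=(n, d))

# K_ij = k(x_i, x_j), an n x n matrix computed once before training
K = np.array([[product_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print(K.shape)                  # (5, 5)
print(np.allclose(K, K.T))      # True: K is symmetric
```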

Obviously, $\mathsf{K}$ is symmetric. If we store the matrix $\mathsf{K}$, we only need to do simple matrix look-ups and low-dimensional computations throughout the gradient descent algorithm. To make the formulas more readable, we will use the “kernel function” representation instead of the “matrix” representation. The final classifier becomes:

$$
h(\mathbf{x}_t)=\sum_{i=1}^n\alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x}_t)
$$

So we can rewrite $\gamma_i$ as

$$
\gamma_i = 2(h(\mathbf{x}_i)-y_i) = 2 \left[\left(\sum_{j=1}^n \alpha_j \mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) \right)-y_i \right]
$$

The gradient update becomes:

$$
\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1} = \alpha_i^{t-1} - 2s\left[\left(\sum_{j=1}^n \alpha_j^{t-1} \mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) \right)-y_i \right]
$$
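Putting the pieces together, the training loop only ever touches the kernel matrix, and prediction only evaluates $\mathsf{k}$ against the training points. Here is a minimal sketch (assuming the product kernel and made-up names and data):

```python
import numpy as np

def product_kernel(x, z):
    return np.prod(1 + x * z)      # k(x, z) = prod_k (1 + x_k z_k)

rng = np.random.default_rng(4)
n, d, s, T = 8, 3, 1e-3, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Precompute the kernel matrix once; training never forms phi(x) or w.
K = np.array([[product_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

alpha = np.zeros(n)
for _ in range(T):
    gamma = 2 * (K @ alpha - y)    # gamma_i = 2 (sum_j alpha_j k(x_i, x_j) - y_i)
    alpha = alpha - s * gamma      # alpha_i <- alpha_i - s * gamma_i

def h(x_test):
    """Kernelized prediction h(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * product_kernel(xi, x_test) for a, xi in zip(alpha, X))

print(h(rng.normal(size=d)))       # prediction for a new point
```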

General Kernels

Kernel functions

Can any function $\mathsf{k}(\cdot,\cdot)$ be used as a kernel?

No. The matrix $\mathsf{K}$ with entries $\mathsf{K}_{ij}=\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner products of $\phi(\mathbf{x})$, i.e. of the inputs $\mathbf{x}$ after some transformation $\phi$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.

Proof: Recall the definition of positive semi-definite: a matrix $A\in \mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall \mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q}\geq 0$.
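Numerically, one can sanity-check this condition on a sample of points: for a valid kernel, all eigenvalues of $\mathsf{K}$ are nonnegative up to round-off, and $\mathbf{q}^\top \mathsf{K}\mathbf{q}\geq 0$ for any $\mathbf{q}$. An illustrative sketch, reusing the product kernel from above:

```python
import numpy as np

def product_kernel(x, z):
    return np.prod(1 + x * z)      # a valid kernel: it equals phi(x)^T phi(z)

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))
K = np.array([[product_kernel(xi, xj) for xj in X] for xi in X])

# Positive semi-definite: all eigenvalues >= 0 (up to numerical round-off)
print(np.linalg.eigvalsh(K).min() >= -1e-8)   # True

q = rng.normal(size=10)
print(q @ K @ q >= -1e-8)                     # q^T K q >= 0 for any q
```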