
More on Kernels

Cornell University

Well-defined kernels

We can build kernels by recursively combining one or more of the following rules:

  1. $\mathsf{k}(\mathbf{x}, \mathbf{z})=c\,\mathsf{k_1}(\mathbf{x},\mathbf{z})$
  2. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathsf{k_1}(\mathbf{x},\mathbf{z})+\mathsf{k_2}(\mathbf{x},\mathbf{z})$
  3. $\mathsf{k}(\mathbf{x}, \mathbf{z})=g\left(\mathsf{k_1}(\mathbf{x},\mathbf{z})\right)$
  4. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathsf{k_1}(\mathbf{x},\mathbf{z})\,\mathsf{k_2}(\mathbf{x},\mathbf{z})$
  5. $\mathsf{k}(\mathbf{x}, \mathbf{z})=f(\mathbf{x})\,\mathsf{k_1}(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$
  6. $\mathsf{k}(\mathbf{x}, \mathbf{z})=e^{\mathsf{k_1}(\mathbf{x},\mathbf{z})}$
  7. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathbf{x}^\top \mathbf{A} \mathbf{z}$

where $\mathsf{k_1},\mathsf{k_2}$ are well-defined kernels, $c\geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $\mathbf{A}\succeq 0$ is positive semi-definite.
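As a quick sanity check (not a proof), we can verify numerically that a kernel composed from these rules yields a positive semi-definite Gram matrix. The following is a minimal sketch assuming NumPy; the particular composition `k_composed` and its constants are made up for illustration.

```python
import numpy as np

def k1(x, z):
    # Base kernel: the plain inner product, which is a well-defined kernel.
    return x @ z

def k_composed(x, z, c=0.5):
    # Compose k1 using the rules above: scaling (rule 1), a polynomial with
    # positive coefficients (rule 3), exponentiation (rule 6), and addition (rule 2).
    return np.exp(c * k1(x, z)) + k1(x, z) ** 2 + 2.0 * k1(x, z)

# The Gram matrix of a well-defined kernel must be positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random points in R^3
K = np.array([[k_composed(x, z) for z in X] for x in X])
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to numerical error
```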

Kernel Machines

Kernelizing an Algorithm

An algorithm can be kernelized in 3 steps:

  1. Prove that the solution lies in the span of the training points (i.e. $\mathbf{w}=\sum_{i=1}^n \alpha_i \mathbf{x}_i$ for some $\alpha_i$)
  2. Replace $\mathbf{w}$ with $\sum_{i=1}^n\alpha_i \mathbf{x}_i$ in the algorithm, so the prediction becomes $h(\mathbf{x}_t)= \mathbf{w}^\top \mathbf{x}_t = \sum_{i=1}^n\alpha_i \mathbf{x}_i^\top \mathbf{x}_t$
  3. Define a kernel function and substitute $\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top \mathbf{x}_j$, so $h(\mathbf{x}_t) = \sum_{i=1}^n\alpha_i \mathbf{x}_i^\top \mathbf{x}_t= \sum_{i=1}^n\alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x}_t)$
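As a concrete illustration of step 3, here is a minimal sketch (assuming NumPy) of the kernelized prediction $h(\mathbf{x}_t)=\sum_{i=1}^n \alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x}_t)$. The helper names and the choice of an RBF kernel are illustrative; how the $\alpha_i$ are obtained depends on the algorithm being kernelized (see the squared-loss example below).

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # One possible well-defined kernel; any kernel k(x_i, x_j) would do here.
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma**2))

def h(x_t, X_train, alpha, kernel=rbf_kernel):
    # Step 3: h(x_t) = sum_i alpha_i * k(x_i, x_t), where the weights alpha
    # come from whatever training procedure the original algorithm used.
    return sum(a_i * kernel(x_i, x_t) for a_i, x_i in zip(alpha, X_train))
```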

Example: Algorithm with Squared Loss

Recap

Linear regression (ordinary least squares, OLS) minimizes the following squared loss:

$$\min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i -y_i)^2$$

The solution of OLS can be written in closed form:

$$\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y}$$

Note that our $\mathbf{x}$ has already gone through the transformation $\phi$ into the feature space, so $\mathbf{x} = \phi(\mathbf{x}_{\mathrm{original}})$, and $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]$ is the matrix whose columns are the (transformed) training points. Therefore, the conclusions we draw here generalize to any kernel with $\mathsf{k}(\mathbf{x}_{\mathrm{og}\,i},\mathbf{x}_{\mathrm{og}\,j}) =\phi(\mathbf{x}_{\mathrm{og}\,i})^\top \phi(\mathbf{x}_{\mathrm{og}\,j})$ and $\mathsf{K} = \mathbf{X}^\top \mathbf{X}$.
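As a quick numerical check of this closed form (a sketch assuming NumPy; the random data are purely illustrative), recall that with the convention above the columns of $\mathbf{X}$ are training points, so OLS minimizes $\|\mathbf{X}^\top\mathbf{w}-\mathbf{y}\|_2^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.normal(size=(d, n))  # columns are (already transformed) training points
y = rng.normal(size=n)

# Closed-form OLS solution: w = (X X^T)^{-1} X y
w_closed = np.linalg.solve(X @ X.T, X @ y)

# Cross-check against a generic least-squares solve of ||X^T w - y||^2
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))  # True whenever X X^T is invertible
```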

Kernelization

We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs:

$$\mathbf{w}=\sum_{i=1}^{n} \alpha_i\mathbf{x}_i=\mathbf{X}\vec{\alpha}$$

We derived in the previous lecture that such a vector $\vec{\alpha}$ must always exist. We also rewrite the prediction $h$ as:

$$h(\mathbf{z})=\mathbf{w}^\top \mathbf{z} = \sum_{i=1}^n\alpha_i \mathbf{x}_i^\top\mathbf{z}.$$

Revisit our minimization problem:

$$\begin{align} & \min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i -y_i)^2 \\ =& \min_\mathbf{w} \left\| \mathbf{X}^\top\mathbf{w} - \mathbf{y} \right\|_2^2 \\ =& \min_{\vec{\alpha}} \left\| \mathbf{X}^\top \mathbf{X}\vec{\alpha} - \mathbf{y} \right\|_2^2 \\ =& \min_{\vec{\alpha}} \left\| \mathsf{K} \vec{\alpha} - \mathbf{y} \right\|_2^2 \end{align}$$
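The last two lines substitute $\mathbf{w}=\mathbf{X}\vec{\alpha}$ and then use the definition of the kernel matrix, so the minimization is now over $\vec{\alpha}$ rather than $\mathbf{w}$; entry by entry,

$$[\mathbf{X}^\top\mathbf{X}\vec{\alpha}]_i = \mathbf{x}_i^\top \sum_{j=1}^{n}\alpha_j\mathbf{x}_j = \sum_{j=1}^{n}\alpha_j\,\mathbf{x}_i^\top\mathbf{x}_j = \sum_{j=1}^{n}\alpha_j\,\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j) = [\mathsf{K}\vec{\alpha}]_i.$$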

We would attain the minimum value (zero) at $\mathsf{K} \vec{\alpha} - \mathbf{y} = 0$, i.e. $\vec{\alpha} = \mathsf{K}^{-1}\mathbf{y}$, but this only works when $\mathsf{K}$ is invertible; since $\mathsf{K}$ is positive semi-definite, that happens exactly when $\mathsf{K}$ is positive definite (all eigenvalues strictly positive). Because $\mathsf{K}$ is in general only positive semi-definite, its invertibility is not guaranteed, so we generalize it to:

$$(\mathsf{K} + \tau^2 \mathbf{I}) \vec{\alpha} - \mathbf{y} = 0$$

and the solution becomes

$$\vec{\alpha} = (\mathsf{K} + \tau^2 \mathbf{I})^{-1} \mathbf{y}$$
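Putting this together, here is a minimal kernel ridge regression training sketch (assuming NumPy; the RBF kernel, the helper names, and the hyperparameter values are illustrative, and the rows of `X_train` are data points, i.e. the transpose of the $\mathbf{X}$ convention used above):

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # Gram matrix with entries k(a_i, b_j) for an illustrative RBF kernel.
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def fit_alpha(X_train, y_train, tau=0.1, sigma=1.0):
    # Solve (K + tau^2 I) alpha = y; solving the linear system is preferable
    # to forming the matrix inverse explicitly.
    K = rbf_kernel_matrix(X_train, X_train, sigma)
    return np.linalg.solve(K + tau**2 * np.eye(len(y_train)), y_train)
```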

Testing

The prediction of a test point $\mathbf{z}$ then becomes

$$h(\mathbf{z})=\mathbf{z}^\top \mathbf{w} =\mathbf{z}^\top\underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}} =\underbrace{\mathbf{z}^\top\mathbf{X}}_{\mathbf{k}_*}\underbrace{(\mathsf{K}+\tau^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}}=\mathbf{k}_*\vec{\alpha}$$

where $\mathbf{k}_*$ is the kernel vector of the test point with the training points after the mapping into feature space through $\phi$, i.e. the $i^{th}$ dimension corresponds to $[\mathbf{k}_*]_{i}=\phi(\mathbf{z})^\top\phi(\mathbf{x}_i)$.
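Continuing the sketch above (same illustrative helpers and conventions), prediction computes $\mathbf{k}_*$ for each test point and takes its inner product with $\vec{\alpha}$:

```python
def predict(Z_test, X_train, alpha, sigma=1.0):
    # Each row of K_star is the k_* vector of one test point; h(z) = k_* @ alpha.
    K_star = rbf_kernel_matrix(Z_test, X_train, sigma)
    return K_star @ alpha

# Tiny end-to-end example on synthetic 1-D data.
rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)
alpha = fit_alpha(X_train, y_train, tau=0.1, sigma=1.0)
print(predict(np.array([[0.0], [1.5]]), X_train, alpha))  # roughly sin(0) and sin(1.5)
```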