More on Kernels
November 4, 2021
Well-defined kernels

We can build kernels by recursively combining one or more of the following rules:

1. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = c\,\mathsf{k_1}(\mathbf{x},\mathbf{z})$
2. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = \mathsf{k_1}(\mathbf{x},\mathbf{z}) + \mathsf{k_2}(\mathbf{x},\mathbf{z})$
3. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = g\big(\mathsf{k_1}(\mathbf{x},\mathbf{z})\big)$
4. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = \mathsf{k_1}(\mathbf{x},\mathbf{z})\,\mathsf{k_2}(\mathbf{x},\mathbf{z})$
5. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = f(\mathbf{x})\,\mathsf{k_1}(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$
6. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = e^{\mathsf{k_1}(\mathbf{x},\mathbf{z})}$
7. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{A}\,\mathbf{z}$

where $\mathsf{k_1}, \mathsf{k_2}$ are well-defined kernels, $c \geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $\mathbf{A} \succeq 0$ is positive semi-definite.
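As a quick numerical sanity check of these rules, here is a minimal sketch (not from the notes; the base kernels, data, and constants are chosen only for illustration) that builds a composite kernel using scaling, sum, product, and exponentiation, and verifies that its Gram matrix is positive semi-definite:

```python
import numpy as np

# Two base kernels that are known to be well-defined.
def k_lin(x, z):           # linear kernel
    return x @ z

def k_poly(x, z, d=2):     # polynomial kernel: g(k_lin) with positive coefficients
    return (1.0 + x @ z) ** d

# Composite kernel built with the rules above:
# scaling (c*k1), product (k1*k2), exponentiation (e^{k1}), and sum (k1+k2).
def k_composite(x, z):
    return 2.0 * k_lin(x, z) + k_poly(x, z) * np.exp(0.1 * k_lin(x, z))

def gram(kernel, X):
    """Gram matrix K[i, j] = k(x_i, x_j) for the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gram(k_composite, X)

# A well-defined kernel must give a symmetric PSD Gram matrix:
# all eigenvalues >= 0 (up to numerical tolerance).
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```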
Kernel Machines

Kernelizing an Algorithm

An algorithm can be kernelized in 3 steps:

1. Prove that the solution lies in the span of the training points, i.e. $\mathbf{w} = \sum_{i=1}^n \alpha_i \mathbf{x}_i$ for some $\alpha_i$.
2. Replace $\mathbf{w}$ with $\sum_i \alpha_i \mathbf{x}_i$ in the algorithm, so the prediction becomes $h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{x}_t$.
3. Define a kernel function and substitute $\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top \mathbf{x}_j$, so $h(\mathbf{x}_t) = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{x}_t = \sum_{i=1}^n \alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x}_t)$.

A generic sketch of the resulting predictor is shown below.
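To make step 3 concrete, this small sketch (an illustration, not part of the original notes; the RBF kernel, coefficients, and data are assumed) shows a kernelized predictor that only touches the data through $\mathsf{k}(\cdot,\cdot)$:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """An example kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_predict(z, X_train, alpha, kernel=rbf_kernel):
    """h(z) = sum_i alpha_i * k(x_i, z): data enters only through the kernel."""
    return sum(a * kernel(x_i, z) for a, x_i in zip(alpha, X_train))

# Tiny demo with made-up training points and coefficients.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
alpha = np.array([0.5, -0.2, 0.3])
print(kernel_predict(np.array([1.0, 0.5]), X_train, alpha))
```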
Example: Algorithm with Squared Loss

Recap

Linear regression minimizes the following squared loss:

$$\min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$$

The solution of OLS can be written in closed form:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y}$$

Note that here $\mathbf{x}$ has already gone through the transformation $\phi$ into the feature space, so $\mathbf{x} = \phi(\mathbf{x}_{\text{original}})$. Therefore, the conclusions we draw here generalize to any kernel with $\mathsf{k}(\mathbf{x}_i^{\text{orig}}, \mathbf{x}_j^{\text{orig}}) = \phi(\mathbf{x}_i^{\text{orig}})^\top \phi(\mathbf{x}_j^{\text{orig}})$ and $\mathsf{K} = \mathbf{X}^\top \mathbf{X}$, where the columns of $\mathbf{X}$ are the (feature-mapped) training points $\mathbf{x}_1, \dots, \mathbf{x}_n$.
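As a quick numerical check of the closed form under this column convention (a minimal sketch; the dimensions, true weights, and noise level are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.normal(size=(d, n))             # columns of X are the training points
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * rng.normal(size=n)

# Closed-form OLS: w = (X X^T)^{-1} X y, computed via a linear solve.
w_ols = np.linalg.solve(X @ X.T, X @ y)
print(w_ols)                            # close to w_true
```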
Kernelization

We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i = \mathbf{X}\vec{\alpha}$$

We derived in the previous lecture that such a vector $\vec{\alpha}$ must always exist. We also rewrite the prediction $h$ as:

$$h(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{z}.$$

Now revisit the minimization problem:
$$
\begin{align}
& \min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 \\
=& \min_\mathbf{w} \|\mathbf{X}^\top \mathbf{w} - \mathbf{y}\|_2^2 \\
=& \min_{\vec{\alpha}} \|\mathbf{X}^\top \mathbf{X}\vec{\alpha} - \mathbf{y}\|_2^2 \\
=& \min_{\vec{\alpha}} \|\mathsf{K}\vec{\alpha} - \mathbf{y}\|_2^2
\end{align}
$$

The minimum is attained when $\mathsf{K}\vec{\alpha} - \mathbf{y} = 0$, i.e. when $\vec{\alpha} = \mathsf{K}^{-1}\mathbf{y}$. However, this only works when $\mathsf{K}$ is invertible, which for a positive semi-definite matrix happens exactly when $\mathsf{K}$ is positive definite (all eigenvalues strictly positive). Since $\mathsf{K}$ is only guaranteed to be positive semi-definite, its invertibility is not guaranteed, so we generalize the condition to:
$$(\mathsf{K} + \tau^2 \mathbf{I})\vec{\alpha} - \mathbf{y} = 0$$

and the solution becomes

$$\vec{\alpha} = (\mathsf{K} + \tau^2 \mathbf{I})^{-1}\mathbf{y}$$

Adding $\tau^2\mathbf{I}$ makes the matrix strictly positive definite (hence invertible) and corresponds to adding an $\ell_2$ regularizer; this is exactly kernel ridge regression. A short sketch of solving for $\vec{\alpha}$ is shown below.
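As an implementation note (a minimal sketch; the RBF kernel, data, and value of $\tau$ are assumptions for the example), $\vec{\alpha}$ is typically obtained by solving the linear system $(\mathsf{K} + \tau^2\mathbf{I})\vec{\alpha} = \mathbf{y}$ rather than forming the inverse explicitly:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2); rows of X are points."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))      # rows are training points (the notes use columns)
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)

tau = 0.1
K = rbf_gram(X_train)
# Solve (K + tau^2 I) alpha = y instead of inverting the matrix.
alpha = np.linalg.solve(K + tau**2 * np.eye(len(y_train)), y_train)
```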
Testing

The prediction for a test point $\mathbf{z}$ then becomes

$$h(\mathbf{z}) = \mathbf{z}^\top \mathbf{w} = \mathbf{z}^\top \underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}} = \underbrace{\mathbf{z}^\top \mathbf{X}}_{\mathbf{k}_*} \underbrace{(\mathsf{K} + \tau^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}} = \mathbf{k}_* \vec{\alpha}$$

where $\mathbf{k}_*$ is the kernel vector of the test point with the training points after the mapping into feature space through $\phi$, i.e. its $i^{th}$ entry is $[\mathbf{k}_*]_i = \phi(\mathbf{z})^\top \phi(\mathbf{x}_i)$.
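Putting training and testing together, here is a minimal end-to-end sketch (an illustration only; the RBF kernel, synthetic 1-D data, and hyperparameters are not from the notes):

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """Kernel matrix with entries k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(42)
X_train = rng.uniform(-3, 3, size=(50, 1))          # rows are training points
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)

# Training: alpha = (K + tau^2 I)^{-1} y
tau, gamma = 0.1, 0.5
K = rbf_kernel_matrix(X_train, X_train, gamma)
alpha = np.linalg.solve(K + tau**2 * np.eye(len(y_train)), y_train)

# Testing: h(z) = k_* alpha, where [k_*]_i = k(z, x_i)
X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
k_star = rbf_kernel_matrix(X_test, X_train, gamma)  # one row of k_* per test point
y_pred = k_star @ alpha
print(np.round(y_pred, 2))                          # roughly follows sin on [-3, 3]
```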