More on Kernels
November 4, 2021
Well-defined kernels

We can build kernels by recursively combining one or more of the following rules:

1. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = c\,\mathsf{k_1}(\mathbf{x},\mathbf{z})$
2. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = \mathsf{k_1}(\mathbf{x},\mathbf{z}) + \mathsf{k_2}(\mathbf{x},\mathbf{z})$
3. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = g\big(\mathsf{k_1}(\mathbf{x},\mathbf{z})\big)$
4. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = \mathsf{k_1}(\mathbf{x},\mathbf{z})\,\mathsf{k_2}(\mathbf{x},\mathbf{z})$
5. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = f(\mathbf{x})\,\mathsf{k_1}(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$
6. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = e^{\mathsf{k_1}(\mathbf{x},\mathbf{z})}$
7. $\mathsf{k}(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{A}\,\mathbf{z}$

where $\mathsf{k_1}, \mathsf{k_2}$ are well-defined kernels, $c \geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $\mathbf{A} \succeq 0$ is positive semi-definite.
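As a quick numerical sanity check of these rules, here is a minimal sketch (not from the notes; the base kernels, data, and constants are chosen only for illustration) that builds a composite kernel using scaling, sum, product, and exponentiation, and verifies that its Gram matrix is positive semi-definite:

```python
import numpy as np

# Two base kernels that are known to be well-defined.
def k_lin(x, z):           # linear kernel
    return x @ z

def k_poly(x, z, d=2):     # polynomial kernel: g(k_lin) with positive coefficients
    return (1.0 + x @ z) ** d

# Composite kernel built with the rules above:
# scaling (c*k1), product (k1*k2), exponentiation (e^{k1}), and sum (k1+k2).
def k_composite(x, z):
    return 2.0 * k_lin(x, z) + k_poly(x, z) * np.exp(0.1 * k_lin(x, z))

def gram(kernel, X):
    """Gram matrix K[i, j] = k(x_i, x_j) for the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gram(k_composite, X)

# A well-defined kernel must give a symmetric PSD Gram matrix:
# all eigenvalues >= 0 (up to numerical tolerance).
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```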
Kernel Machines

Kernelizing an Algorithm

An algorithm can be kernelized in 3 steps:

1. Prove that the solution lies in the span of the training points, i.e. $\mathbf{w} = \sum_{i=1}^n \alpha_i \mathbf{x}_i$ for some $\alpha_i$.
2. Replace $\mathbf{w}$ with $\sum_i \alpha_i \mathbf{x}_i$ in the algorithm, so the prediction becomes $h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{x}_t$.
3. Define a kernel function and substitute $\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top \mathbf{x}_j$, so $h(\mathbf{x}_t) = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{x}_t = \sum_{i=1}^n \alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x}_t)$.

A generic sketch of the resulting predictor is shown below.
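To make step 3 concrete, this small sketch (an illustration, not part of the original notes; the RBF kernel, coefficients, and data are assumed) shows a kernelized predictor that only touches the data through $\mathsf{k}(\cdot,\cdot)$:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """An example kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_predict(z, X_train, alpha, kernel=rbf_kernel):
    """h(z) = sum_i alpha_i * k(x_i, z): data enters only through the kernel."""
    return sum(a * kernel(x_i, z) for a, x_i in zip(alpha, X_train))

# Tiny demo with made-up training points and coefficients.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
alpha = np.array([0.5, -0.2, 0.3])
print(kernel_predict(np.array([1.0, 0.5]), X_train, alpha))
```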
Example: Algorithm with Squared Loss

Recap

Linear regression minimizes the following squared loss:

$$\min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$$

The solution of OLS can be written in closed form:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y}$$

Note that here $\mathbf{x}$ has already gone through the transformation $\phi$ into the feature space, so $\mathbf{x} = \phi(\mathbf{x}_{\text{original}})$. Therefore, the conclusions we draw here generalize to any kernel with $\mathsf{k}(\mathbf{x}_i^{\text{orig}}, \mathbf{x}_j^{\text{orig}}) = \phi(\mathbf{x}_i^{\text{orig}})^\top \phi(\mathbf{x}_j^{\text{orig}})$ and $\mathsf{K} = \mathbf{X}^\top \mathbf{X}$, where the columns of $\mathbf{X}$ are the (feature-mapped) training points $\mathbf{x}_1, \dots, \mathbf{x}_n$.
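As a quick numerical check of the closed form under this column convention (a minimal sketch; the dimensions, true weights, and noise level are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.normal(size=(d, n))             # columns of X are the training points
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * rng.normal(size=n)

# Closed-form OLS: w = (X X^T)^{-1} X y, computed via a linear solve.
w_ols = np.linalg.solve(X @ X.T, X @ y)
print(w_ols)                            # close to w_true
```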
Kernelization

We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i = \mathbf{X}\vec{\alpha}$$

We derived in the previous lecture that such a vector $\vec{\alpha}$ must always exist. We also rewrite the prediction $h$ as:

$$h(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{z}.$$

Now revisit the minimization problem:
$$
\begin{align}
& \min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 \\
=& \min_\mathbf{w} \|\mathbf{X}^\top \mathbf{w} - \mathbf{y}\|_2^2 \\
=& \min_{\vec{\alpha}} \|\mathbf{X}^\top \mathbf{X}\vec{\alpha} - \mathbf{y}\|_2^2 \\
=& \min_{\vec{\alpha}} \|\mathsf{K}\vec{\alpha} - \mathbf{y}\|_2^2
\end{align}
$$

The minimum is attained when $\mathsf{K}\vec{\alpha} - \mathbf{y} = 0$, i.e. when $\vec{\alpha} = \mathsf{K}^{-1}\mathbf{y}$. However, this only works when $\mathsf{K}$ is invertible, which for a positive semi-definite matrix happens exactly when $\mathsf{K}$ is positive definite (all eigenvalues strictly positive). Since $\mathsf{K}$ is only guaranteed to be positive semi-definite, its invertibility is not guaranteed, so we generalize the condition to:
$$(\mathsf{K} + \tau^2 \mathbf{I})\vec{\alpha} - \mathbf{y} = 0$$

and the solution becomes

$$\vec{\alpha} = (\mathsf{K} + \tau^2 \mathbf{I})^{-1}\mathbf{y}$$

Adding $\tau^2\mathbf{I}$ makes the matrix strictly positive definite (hence invertible) and corresponds to adding an $\ell_2$ regularizer; this is exactly kernel ridge regression. A short sketch of solving for $\vec{\alpha}$ is shown below.
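As an implementation note (a minimal sketch; the RBF kernel, data, and value of $\tau$ are assumptions for the example), $\vec{\alpha}$ is typically obtained by solving the linear system $(\mathsf{K} + \tau^2\mathbf{I})\vec{\alpha} = \mathbf{y}$ rather than forming the inverse explicitly:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2); rows of X are points."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))      # rows are training points (the notes use columns)
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)

tau = 0.1
K = rbf_gram(X_train)
# Solve (K + tau^2 I) alpha = y instead of inverting the matrix.
alpha = np.linalg.solve(K + tau**2 * np.eye(len(y_train)), y_train)
```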
Testing

The prediction for a test point $\mathbf{z}$ then becomes

$$h(\mathbf{z}) = \mathbf{z}^\top \mathbf{w} = \mathbf{z}^\top \underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}} = \underbrace{\mathbf{z}^\top \mathbf{X}}_{\mathbf{k}_*} \underbrace{(\mathsf{K} + \tau^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}} = \mathbf{k}_* \vec{\alpha}$$

where $\mathbf{k}_*$ is the kernel vector of the test point with the training points after the mapping into feature space through $\phi$, i.e. its $i^{th}$ entry is $[\mathbf{k}_*]_i = \phi(\mathbf{z})^\top \phi(\mathbf{x}_i)$.
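Putting training and testing together, here is a minimal end-to-end sketch (an illustration only; the RBF kernel, synthetic 1-D data, and hyperparameters are not from the notes):

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """Kernel matrix with entries k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(42)
X_train = rng.uniform(-3, 3, size=(50, 1))          # rows are training points
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)

# Training: alpha = (K + tau^2 I)^{-1} y
tau, gamma = 0.1, 0.5
K = rbf_kernel_matrix(X_train, X_train, gamma)
alpha = np.linalg.solve(K + tau**2 * np.eye(len(y_train)), y_train)

# Testing: h(z) = k_* alpha, where [k_*]_i = k(z, x_i)
X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
k_star = rbf_kernel_matrix(X_test, X_train, gamma)  # one row of k_* per test point
y_pred = k_star @ alpha
print(np.round(y_pred, 2))                          # roughly follows sin on [-3, 3]
```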