Linear classifiers are great, but what if there exists no linear decision boundary? The key observation is that a nonlinear classifier in our (usually low-dimensional) space can often be viewed as a linear classifier in a higher-dimensional space: when we project this higher-dimensional linear classifier back into our lower-dimensional space, it appears nonlinear. As it turns out, there is an elegant way to incorporate non-linearities into most linear classifiers.
We can make linear classifiers non-linear by expanding the existing features. Formally, for a data vector x∈Rd, we apply the transformation x→ϕ(x) where ϕ(x)∈RD. Usually D≫d because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well behaved (i.e., you can still use your original gradient descent code, just with the higher-dimensional representation; a short sketch of this appears below).
Disadvantage: ϕ(x) might be very high dimensional.
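To illustrate this, here is a minimal sketch (the degree-2 feature map, the data, and all names are illustrative assumptions, not taken from the notes) that expands each input with its degree-2 monomials and then runs completely ordinary gradient descent on the squared loss in the expanded space:

```python
import numpy as np

def expand_degree2(X):
    """Map each row x to (1, x_1, ..., x_d, x_i * x_j for all i <= j): a degree-2 expansion phi(x)."""
    n, d = X.shape
    feats = [np.ones((n, 1)), X]
    for i in range(d):
        for j in range(i, d):
            feats.append((X[:, i] * X[:, j]).reshape(n, 1))
    return np.hstack(feats)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] * X[:, 1] - X[:, 2]**2     # target is non-linear in the original features

Phi = expand_degree2(X)                # explicit feature expansion
w = np.zeros(Phi.shape[1])             # weight vector lives in the D-dimensional expanded space
lr = 0.01
for _ in range(2000):                  # unchanged squared-loss gradient descent, just on Phi instead of X
    w -= lr * 2 * Phi.T @ (Phi @ w - y) / len(y)
```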
Consider the following example: $x=\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_d\end{pmatrix}$, and define $\phi(x)=\begin{pmatrix}1\\ x_1\\ \vdots\\ x_d\\ x_1x_2\\ \vdots\\ x_{d-1}x_d\\ \vdots\\ x_1x_2\cdots x_d\end{pmatrix}$.
This new representation, $\phi(x)$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high: $D=2^d$. This makes our algorithm unbearably (and quickly prohibitively) slow.
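To see where $D=2^d$ comes from, note that the expansion above contains one product for every subset of the $d$ features (the empty product giving the constant $1$). For example, with $d=3$:
$$\phi(x)=\bigl(1,\;x_1,\;x_2,\;x_3,\;x_1x_2,\;x_1x_3,\;x_2x_3,\;x_1x_2x_3\bigr)^\top,\qquad D=2^3=8.$$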
The kernel trick is a way to get around this dilemma by learning a function in the much higher-dimensional space, without ever explicitly computing a single vector $\phi(x)$ or the full weight vector $w$. Everything we need will be expressed purely in terms of the inputs $x$ and labels $y$.
It is based on the following observation: if we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, for the squared loss:
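One standard way to write the squared loss (the exact constants and scaling are a matter of convention) is $\ell(w)=\sum_{i=1}^n\left(w^\top x_i-y_i\right)^2$, whose gradient is
$$\nabla_w\,\ell(w)=\sum_{i=1}^n 2\left(w^\top x_i-y_i\right)x_i=\sum_{i=1}^n\gamma_i\,x_i,\qquad \gamma_i:=2\left(w^\top x_i-y_i\right),$$
i.e., a linear combination of the training inputs $x_1,\ldots,x_n$.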
Base Case: Since the loss is convex, the final solution is independent of the initialization, and we can initialize $w_0$ to be whatever we want. For convenience, set $w_0=\begin{pmatrix}0\\ \vdots\\ 0\end{pmatrix}$. This is trivially a linear combination of the training inputs $x_1,\ldots,x_n$ (with all coefficients equal to zero).
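Inductive Step: Suppose $w_t=\sum_{i=1}^n\alpha_i x_i$ for some coefficients $\alpha_1,\ldots,\alpha_n$. Using the $\gamma_i$ from the gradient above and a step size $s>0$, the gradient descent update gives
$$w_{t+1}=w_t-s\,\nabla_w\,\ell(w_t)=\sum_{i=1}^n\alpha_i x_i-s\sum_{i=1}^n\gamma_i x_i=\sum_{i=1}^n\left(\alpha_i-s\,\gamma_i\right)x_i,$$
so $w_{t+1}$ is again a linear combination of the inputs, with updated coefficients $\alpha_i\leftarrow\alpha_i-s\,\gamma_i$.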
Therefore, we have shown that we can perform the entire gradient descent update rule without ever expressing w explicitly. We just keep track of the n coefficients α1,…,αn.
Now that w can be written as a linear combination of the training set, we can also express the prediction result purely in terms of inner-products between training inputs:
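Concretely, with $w=\sum_{i=1}^n\alpha_i x_i$, the prediction on any input $x_j$ becomes
$$w^\top x_j=\Bigl(\sum_{i=1}^n\alpha_i x_i\Bigr)^{\!\top}x_j=\sum_{i=1}^n\alpha_i\left(x_i^\top x_j\right),$$
which depends on the data only through inner products.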
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
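It is convenient to collect all of these inner products in an $n\times n$ matrix $K$, the kernel (or Gram) matrix, with entries
$$K_{ij}=\phi(x_i)^\top\phi(x_j)$$
(for the plain, un-expanded data this is simply $K_{ij}=x_i^\top x_j$).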
Obviously, $K$ is symmetric. If we store the matrix $K$, we only need to do simple matrix look-ups and low-dimensional computations throughout the gradient descent algorithm. To make the formula more readable, we will use the “kernel function” representation instead of the “matrix” representation. The final classifier becomes:
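Writing $K(x_i,z)=\phi(x_i)^\top\phi(z)$ for the kernel function, the resulting classifier can be written as
$$h(z)=w^\top\phi(z)=\sum_{i=1}^n\alpha_i\,K(x_i,z).$$
Commonly used kernel functions include the following: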
Linear: $K(x,z)=x^\top z$, with $\phi(x)=x$. The linear kernel is equivalent to just using a linear classifier, but it can run faster because the updates are performed in kernel form (what it means to “update in a kernel way” is covered in the next lecture).
Polynomial: $K(x,z)=(1+x^\top z)^d$. Here $\phi(x)$ is similar to the $\phi$ above, but of even higher dimension: this kernel contains all monomials of the original features up to degree $d$, including monomials in which the same feature appears more than once (e.g. $x_1^2x_3$ when $d\geq 3$), which are not covered by the feature expansion $\phi$ defined earlier.
Radial Basis Function (RBF) (aka Gaussian kernel): $K(x,z)=e^{-\frac{\|x-z\|^2}{\sigma^2}}$. Though the polynomial kernel maps into a very high-dimensional space, that space is still finite-dimensional; the feature vector $\phi(x)$ corresponding to the RBF kernel, in contrast, is infinite-dimensional and cannot be computed explicitly. However, an effective low-dimensional approximation exists.
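To make the whole procedure concrete, here is a minimal sketch of kernelized gradient descent for the squared loss with an RBF kernel (the data, function names, and hyperparameters are illustrative assumptions, not taken from the notes). Note that it only ever touches the $n\times n$ kernel matrix and the $n$ coefficients $\alpha_i$, never an explicit $\phi(x)$ or $w$:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / sigma^2) for rows a_i of A and b_j of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 1.0)   # circular (non-linear) decision boundary

K = rbf_kernel(X, X)          # n x n kernel matrix, computed once up front
alpha = np.zeros(len(y))      # one coefficient per training point; implicitly w = sum_i alpha_i phi(x_i)
lr = 0.001                    # step size s
for _ in range(500):
    preds = K @ alpha         # preds_j = sum_i alpha_i K(x_i, x_j) = w^T phi(x_j)
    gamma = 2 * (preds - y)   # squared-loss gradient coefficients gamma_i
    alpha -= lr * gamma       # the w-update, expressed entirely on the alphas

def predict(Z):
    """h(z) = sum_i alpha_i K(x_i, z) for every row z of Z."""
    return rbf_kernel(X, Z).T @ alpha

train_accuracy = np.mean(np.sign(predict(X)) == y)
```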
Can any function $K(\cdot,\cdot)$ be used as a kernel? No: the matrix of values $K(x_i,x_j)$ has to correspond to real inner products of $\phi(x)$, i.e., of $x$ after some transformation. This is the case if and only if $K$ is positive semi-definite.
Proof: Recall the definition of positive semi-definiteness: a matrix $A\in\mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall q\in\mathbb{R}^n$, $q^\top Aq\geq 0$.
All Kernel functions are positive semi-definite:
Remember that $K_{ij}=\phi(x_i)^\top\phi(x_j)$, so $K=\Phi^\top\Phi$, where $\Phi=[\phi(x_1),\ldots,\phi(x_n)]$. It follows that $K$ is p.s.d., because for any $q$, $q^\top Kq=q^\top\Phi^\top\Phi q=(\Phi q)^\top(\Phi q)=\|\Phi q\|^2\geq 0$.
Every positive semi-definite matrix produces a kernel:
If a matrix $A$ is p.s.d., it can be decomposed as $A=\Phi^\top\Phi$ for some realization of $\Phi$.
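One standard construction (assuming, as holds for any Gram matrix, that $A$ is symmetric): take the eigendecomposition $A=V\Lambda V^\top$, whose eigenvalues satisfy $\lambda_i\geq 0$ by positive semi-definiteness, and set
$$\Phi=\Lambda^{1/2}V^\top\quad\Longrightarrow\quad\Phi^\top\Phi=V\Lambda^{1/2}\Lambda^{1/2}V^\top=V\Lambda V^\top=A,$$
so the columns of $\Phi$ serve as (abstract) feature vectors $\phi(x_1),\ldots,\phi(x_n)$ whose inner products reproduce the entries of $A$.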