Preconditioning and Element Specific Learning Rate
Last time, we saw how momentum can speed up GD when the condition number $\kappa$ is large, reducing the number of iterations from $O(\kappa \log(1/\epsilon))$ to $O(\sqrt{\kappa} \log(1/\epsilon))$. This time, we will see how preconditioning and adaptive learning rates can speed up computation further.
Level Sets - Visualize Condition Number¶
We define a level set of a function $f$ at level $c$ as
$$\{x : f(x) = c\}.$$
It is just the contour line of the function $f$ at level $c$.
Take the same function as in the last lecture, where
We say that the function is well-conditioned when it has a small condition number; in that case, the level sets look like circles.
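To make this concrete, here is a minimal matplotlib sketch (not from the original notes; the matrices and plotting code are illustrative choices) that draws the level sets of a 2-D quadratic $f(x) = \frac{1}{2} x^T A x$ for a well-conditioned and an ill-conditioned $A$:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_level_sets(A, ax, title):
    """Draw contour lines (level sets) of f(x) = 0.5 * x^T A x on a 2-D grid."""
    xs = np.linspace(-3, 3, 200)
    X, Y = np.meshgrid(xs, xs)
    pts = np.stack([X, Y], axis=-1)                       # shape (200, 200, 2)
    F = 0.5 * np.einsum("...i,ij,...j->...", pts, A, pts)
    ax.contour(X, Y, F, levels=10)
    ax.set_aspect("equal")
    ax.set_title(title)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
plot_level_sets(np.diag([1.0, 1.0]),  ax1, "kappa = 1: circles")
plot_level_sets(np.diag([1.0, 25.0]), ax2, "kappa = 25: stretched ellipses")
plt.show()
```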
Preconditioning¶
Transforming Space¶
One nice thing to note is that when we stretch the space, we preserve the minimum value of the function. That is, as long as the matrix $R$ is invertible, we have
$$\min_x f(x) = \min_z f(Rz)$$
(invertibility is needed so that $x = Rz$ is a one-to-one change of variables and $R^{-1}$ exists).
Therefore, if we only look at the minimum point, we have
$$x^* = R z^*,$$
where $z^* = \arg\min_z f(Rz)$ solves the problem in the transformed space, and we get $x^*$ back by multiplying it with $R$.
While preserving the minimum point, we have changed how the function looks everywhere else, so the condition number in this new transformed space also changes (hopefully becoming smaller).
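As a quick sanity check (the quadratic, the vector $b$, and the variable names here are my own illustrative choices, not from the notes), we can verify numerically that the minimizer of the transformed problem maps back to the original minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small SPD quadratic f(x) = 0.5 x^T A x - b^T x with known minimizer x* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)

# Any invertible R defines a transformed problem g(z) = f(Rz).
R = rng.normal(size=(2, 2))

# Minimizer of g: grad g(z) = R^T A R z - R^T b = 0  =>  z* = (R^T A R)^{-1} R^T b.
z_star = np.linalg.solve(R.T @ A @ R, R.T @ b)

print(np.allclose(R @ z_star, x_star))   # True: mapping z* back with R recovers x*
```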
Problem Setup¶
Define $g(z) = f(Rz)$. We want to solve
$$z^* = \arg\min_z g(z)$$
and map the result back to $x^* = R z^*$.
For example, take $A$ to be a symmetric positive definite matrix (all eigenvalues positive) and define
$$f(x) = \frac{1}{2} x^T A x.$$
We can actually transform $A$ into the identity $I$ as the Hessian matrix in the new space, so that we have a perfect condition number $\kappa = 1$. To do this, we want to find an $R$ such that $\nabla^2 g(z) = I$, so
$$R^T A R = I.$$
Suppose $R$ is also symmetric. Solving for $R$, we have
$$R = A^{-1/2}.$$
We can denote this matrix by $A^{-1/2}$ (the inverse of the symmetric square root of $A$). This is just notation and does not have a practical computational meaning; we never form it explicitly.
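As a small numerical illustration (the matrix $A$ and the explicit eigendecomposition below are purely illustrative; as just noted, we would not compute $A^{-1/2}$ in practice), we can check that $R = A^{-1/2}$ makes the transformed Hessian the identity:

```python
import numpy as np

# A small symmetric positive definite matrix (illustrative choice).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Form R = A^{-1/2} via the eigendecomposition A = Q diag(w) Q^T,
# so A^{-1/2} = Q diag(w^{-1/2}) Q^T.  (For illustration only.)
w, Q = np.linalg.eigh(A)
R = Q @ np.diag(w ** -0.5) @ Q.T

H = R.T @ A @ R                # Hessian of g(z) = f(Rz) when f(x) = 0.5 x^T A x
print(np.round(H, 10))         # ~ identity matrix
print(np.linalg.cond(H))       # ~ 1.0: a perfect condition number
```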
In-place Transformation¶
So far, we first transform the problem to another space, solve it in that space, and transform the solution back to the original space. Can we do everything in the original space, but pretend we are in the transformed space when we do the GD update?
Imagine we are again solving the minimization problem on the transformed function $g(z) = f(Rz)$, using the usual GD update
$$z_{t+1} = z_t - \alpha \nabla g(z_t).$$
To find out the value of $\nabla g(z_t)$, we go through a similar derivation as we did in the Gradient Descent Appendix. First note that
$$g(z_t + \eta) = f(Rz_t + R\eta) \approx f(Rz_t) + (R\eta)^T \nabla f(Rz_t) = g(z_t) + \eta^T R^T \nabla f(Rz_t).$$
We then look at the definition of the gradient of $g$:
$$g(z_t + \eta) \approx g(z_t) + \eta^T \nabla g(z_t).$$
Matching the two expressions gives $\nabla g(z_t) = R^T \nabla f(Rz_t)$.
Substituting this expression for $\nabla g(z_t)$ into the update, we get
$$z_{t+1} = z_t - \alpha R^T \nabla f(Rz_t).$$
To get back $x_{t+1}$, recall that $x_t = Rz_t$, so we just multiply both sides by $R$ and get
$$x_{t+1} = Rz_{t+1} = Rz_t - \alpha R R^T \nabla f(Rz_t) = x_t - \alpha R R^T \nabla f(x_t).$$
Therefore, this is just gradient descent with the gradient scaled by a positive semidefinite matrix $P = RR^T$. We call this matrix the preconditioner. In practice, we won't labor over finding $R$; we just find some positive semidefinite $P$, because we know any such matrix can be decomposed into the form $P = RR^T$.
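Here is a minimal sketch of the preconditioned update $x_{t+1} = x_t - \alpha P \nabla f(x_t)$ on a toy quadratic (the matrix, step sizes, and function names are illustrative assumptions, not code from the notes), comparing $P = I$ with the ideal $P = A^{-1}$:

```python
import numpy as np

A = np.array([[10.0, 0.0],
              [0.0,  1.0]])          # quadratic f(x) = 0.5 x^T A x with kappa = 10

def grad_f(x):
    return A @ x                     # gradient of f

def preconditioned_gd(P, alpha, steps=20):
    """Run x_{t+1} = x_t - alpha * P @ grad_f(x_t), starting from (1, 1)."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * (P @ grad_f(x))
    return np.linalg.norm(x)         # distance to the minimizer x* = 0

# Plain GD (P = I): the step size is limited by the largest eigenvalue of A,
# so progress along the flat direction is slow.
print(preconditioned_gd(P=np.eye(2), alpha=0.18))

# Ideal preconditioner P = A^{-1}: the problem behaves as if kappa = 1,
# so alpha = 1 is stable and, for this quadratic, converges in one step.
print(preconditioned_gd(P=np.linalg.inv(A), alpha=1.0))
```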
If we relate this to our previous simple example, the ideal preconditioner is
$$P = RR^T = A^{-1/2} A^{-1/2} = A^{-1}.$$
A weird question: despite this derivation, what happens if we use a matrix that is not positive semidefinite here? Imagine the very simple choice $P = -I$. The update then moves in the direction in which $f$ increases at each step, so it simply blows up.
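A tiny demonstration of this blow-up (same toy quadratic as above, purely illustrative):

```python
import numpy as np

A = np.array([[10.0, 0.0],
              [0.0,  1.0]])          # toy quadratic f(x) = 0.5 x^T A x
P = -np.eye(2)                       # deliberately NOT positive semidefinite

x = np.array([1.0, 1.0])
for t in range(5):
    x = x - 0.1 * (P @ (A @ x))      # same preconditioned update rule as before
    print(t, np.linalg.norm(x))      # distance to the minimizer grows every step
```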
Finding the Transformation¶
How do we find this preconditioner, then? There are two common approaches.
Use statistics from the dataset: for example, for a linear model you could precondition based on the variance of the features in your dataset (see the sketch after these two approaches). Note that this is almost the same as normalization: both want to transform the problem into an easier domain. One slight difference is that preconditioning scales the regularizer too, while normalization doesn't, but this really doesn't matter in practice.
Use information from the matrix of second partial derivatives: for example, you could use a preconditioning matrix that is a diagonal approximation of the Newton's method update at some point. This is similar to what we did in the quadratic function example. These methods are sometimes called Newton Sketch methods.
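For the first approach, here is a hedged sketch (synthetic data, illustrative scales and names) of a diagonal preconditioner built from per-feature variances for a least-squares linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data whose features live on very different scales.
n, d = 1000, 3
X = rng.normal(size=(n, d)) * np.array([1.0, 10.0, 100.0])
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w):
    """Gradient of the least-squares loss (1/n) * ||Xw - y||^2."""
    return (2.0 / n) * X.T @ (X @ w - y)

# Diagonal preconditioner from dataset statistics: one entry per feature,
# inversely proportional to that feature's variance.
p = 1.0 / X.var(axis=0)

w = np.zeros(d)
alpha = 0.1
for _ in range(500):
    w = w - alpha * p * grad(w)   # elementwise product: only O(d) extra work per step
print(w)                          # close to w_true; plain GD with this alpha would diverge
```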
Element Specific Learning Rate¶
Note that using a full preconditioner is computationally expensive: previously each update only takes $O(C)$ time, where $C$ is the cost of computing the gradient. Now that we have a matrix-vector multiplication $P \nabla f(x_t)$ with a $d \times d$ matrix, the time per update becomes $O(C + d^2)$. In addition, we have to store a $d \times d$ matrix in memory. As we discussed before, this is really bad when we have high-dimensional data.
Therefore, we want to find a $P$ that is a diagonal matrix, so that the matrix-vector multiplication reduces to an elementwise product of two vectors and the running time stays at $O(C + d)$.
If we have such a diagonal matrix, the update step on the $i$-th index will look like
$$(x_{t+1})_i = (x_t)_i - \alpha P_{ii} \cdot (\nabla f(x_t))_i.$$
We actually have an element-specific learning rate $\alpha_i = \alpha P_{ii}$. We can also rewrite the update in vector form (the vector sign denotes that $\vec{\alpha}$ is actually a vector):
$$x_{t+1} = x_t - \vec{\alpha} \odot \nabla f(x_t),$$
where $\odot$ is the elementwise product.
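A short check (illustrative names and random values) that the diagonal-matrix view and the elementwise-vector view give the same update, with the latter needing only $O(d)$ work:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x = rng.normal(size=d)               # current iterate
g = rng.normal(size=d)               # stand-in for the gradient at x
alphas = rng.uniform(0.01, 0.1, d)   # one learning rate per coordinate

# Diagonal-preconditioner view: O(d^2) matrix-vector product.
x_matrix = x - np.diag(alphas) @ g

# Element-specific learning-rate view: O(d) elementwise product.
x_vector = x - alphas * g

print(np.allclose(x_matrix, x_vector))   # True: the two updates are identical
```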
Adaptive Learning Rate¶
However, if we make $\vec{\alpha}$ a hyperparameter, we would have to choose $d$ separate hyperparameters, which is a lot. Imagine wanting the model to be optimal: a hyperparameter search over this many values is infeasible. Therefore, we want a way to adjust $\vec{\alpha}$ intelligently during training.
Maybe we can keep a running sum of the squared gradient in each dimension? That will be our topic for next class.