
Preconditioning and Element Specific Learning Rate

Cornell University

Last time, we saw how we can use momentum to speed up GD when $\kappa$ is large. We reduced the number of iterations from $\mathcal O(n \kappa \log \frac 1 \epsilon)$ to $\mathcal O(n \sqrt{\kappa} \log \frac 1 \epsilon)$. This time, we will see how we can use preconditioning and adaptive learning rates to speed up computation.

Level Sets - Visualizing the Condition Number $\kappa$

We define a level set $L_C$ of a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ as

$$L_C = \{ w \mid f(w) = C \}$$

It is just the contour line of the function $f$ at level $C$.

Take the same function as last lecture, where

$$
\begin{align}
f(w_1, w_2) &= \frac{L}{2} w_1^2 + \frac{\mu}{2} w_2^2 &&\kappa = \frac{L}{\mu} &&2C = Lw_1^2 + \mu w_2^2\\
f(w_1, w_2) &= \frac{1}{2} w_1^2 + \frac{1}{2} w_2^2 &&\kappa = 1 &&2C = w_1^2 + w_2^2 \text{ is a circle}\\
f(w_1, w_2) &= \frac{10}{2} w_1^2 + \frac{1}{2} w_2^2 &&\kappa = 10 &&2C = 10w_1^2 + w_2^2 \text{ is an ellipse}
\end{align}
$$

We say that the function is well-conditioned when the condition number is small; in that case the level sets look like circles.
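To tie the pictures to numbers, here is a minimal NumPy sketch (an illustration, not part of the lecture) that computes $\kappa$ as the ratio of the largest to the smallest eigenvalue of the Hessian of the third example above:

```python
import numpy as np

# Hessian of f(w1, w2) = (10/2) w1^2 + (1/2) w2^2 from the third example above
A = np.diag([10.0, 1.0])

eigvals = np.linalg.eigvalsh(A)        # eigenvalues of the symmetric Hessian
kappa = eigvals.max() / eigvals.min()  # condition number kappa = L / mu
print(kappa)                           # 10.0 -> elongated elliptical level sets
```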

Preconditioning

Transforming Space

One nice thing to note is that when we stretch the space, we preserve the minimum value of the function. That is, as long as the matrix $P$ is invertible (this ensures that $\forall w, \exists u \; \text{s.t. } u = P^{-1}w$), we have

$$\min_w f(w) = \min_u f(Pu)$$

Therefore, if we only look at the minimum point, we have

$$\operatorname*{argmin}_w f(w) = P\left(\operatorname*{argmin}_u f(Pu) \right)$$

where $\operatorname*{argmin}_u f(Pu)$ solves for $u$ in the transformed space, and we get back $w$ by multiplying it by $P$.

While preserving the minimum point, we change how the function looks everywhere else, so the condition number in the transformed space also changes (and hopefully becomes smaller).
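As a quick sanity check of these two identities, here is a hedged sketch: the quadratic-plus-linear objective and the particular invertible $P$ below are arbitrary choices made for illustration (a linear term is added so the minimizer is not at the origin).

```python
import numpy as np

# f(w) = (1/2) w^T A w + b^T w, a strongly convex quadratic with minimizer w* = -A^{-1} b
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([-3.0, 2.0])

# Any invertible P works; this one is an arbitrary example
P = np.array([[1.0, 0.5], [0.0, 2.0]])

# g(u) = f(Pu) = (1/2) u^T (P^T A P) u + (P^T b)^T u has minimizer u* = -(P^T A P)^{-1} P^T b
w_star = -np.linalg.solve(A, b)
u_star = -np.linalg.solve(P.T @ A @ P, P.T @ b)

print(np.allclose(w_star, P @ u_star))  # True: argmin_w f(w) = P (argmin_u f(Pu))
```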

Problem Setup

Define $g(u) = f(Pu)$. We want to solve $\operatorname*{argmin}_u g(u)$ and map the result back via $w = Pu$.

For example, take $A$ to be a symmetric positive definite matrix (so all of its eigenvalues are positive) and define

$$f(w) = \frac 1 2 w^T A w$$

We can actually transform the Hessian from $A$ to $I$ in another space, giving a perfect condition number $\kappa = 1$. To do this, we want to find a $P$ such that $P^T A P = I$, so

$$
\begin{align}
f(Pu) &= \frac 1 2 (Pu)^T A (Pu)\\
g(u) &= \frac 1 2 u^T I u
\end{align}
$$

Suppose $P$ is also symmetric. Solving for $P$, we have

$$P^{-2} = A$$

We can denote $P = A^{-\frac 1 2}$. This is just notation and does not by itself tell us how to compute $P$ in practice.
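Even though $A^{-\frac 1 2}$ is "just notation", one concrete way to realize it numerically is through the eigendecomposition of $A$. A minimal sketch, with a hypothetical positive definite $A$:

```python
import numpy as np

# A hypothetical symmetric positive definite A
A = np.array([[10.0, 2.0],
              [2.0,  1.0]])

# A = Q diag(lam) Q^T, so A^{-1/2} = Q diag(lam^{-1/2}) Q^T (symmetric, as assumed above)
lam, Q = np.linalg.eigh(A)
P = Q @ np.diag(lam ** -0.5) @ Q.T

print(np.allclose(P.T @ A @ P, np.eye(2)))  # True: the transformed Hessian is I, so kappa = 1
```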

In-place Transformation

So far, we first transform the problem to another space, solve it in that space, and transform the solution back to the original space. Can we do everything in the original space, but pretend we are in the transformed space when we do the GD update?

Imagine we are again solving the minimization problem on the transformed function $g(u)$

$$u_{t+1} = u_t - \alpha \nabla g(u_t)$$

To find the value of $\nabla g(u_t)$, we go through a similar derivation as we did in the Gradient Descent Appendix. First note

$$g(u + \eta v) = f(P (u + \eta v)) = f(Pu + \eta Pv)$$

We then apply the definition of the directional derivative of $g$:

$$\nabla_v g(u) = \lim_{\eta \to 0} \frac{d}{d \eta} g(u + \eta v) = \lim_{\eta \to 0} \frac{d}{d \eta} f(Pu + \eta Pv) = \nabla_{Pv} f(Pu) = (Pv)^T \nabla f(Pu) = v^T P^T \nabla f(Pu)$$

Since this holds for every direction $v$, we get $\nabla g(u) = P^T \nabla f(Pu)$. Substituting this into the update gives

$$u_{t+1} = u_t - \alpha P^T \nabla f(P u_t).$$

To get back $w_t$, recall $w_t = P u_t$, so we just multiply both sides by $P$ and get

$$
\begin{align}
w_{t+1} &= P u_{t+1} \\
&= P u_t - \alpha P P^T \nabla f(P u_t) \\
&= w_t - \alpha P P^T \nabla f(w_t)\\
&= w_t - \alpha R \nabla f(w_t)
\end{align}
$$

Therefore, this is just gradient descent with the gradient scaled by a positive semidefinite matrix $R = PP^T$. We call this matrix $R$ the preconditioner. In practice, we don't bother finding $P$; we just pick some positive semidefinite $R$, because any such matrix can be decomposed into the form $R = PP^T$.

Relating this back to our previous simple example, $R = P^2 = A^{-1}$.

A weird question: despite this derivation, what happens if we use a matrix that is not positive semidefinite here? Imagine the very simple choice $R = -I$. The update then pushes $w_t$ in the direction in which $f(w_t)$ increases at each step, so the iterates blow up.
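Here is a minimal sketch of the preconditioned update $w_{t+1} = w_t - \alpha R \nabla f(w_t)$ on the quadratic example, using $R = A^{-1}$ as above (with $\alpha = 1$ this happens to reach the minimizer in a single step, since it coincides with Newton's method on a quadratic):

```python
import numpy as np

# f(w) = (1/2) w^T A w with a badly conditioned (hypothetical) A
A = np.diag([10.0, 1.0])
grad_f = lambda w: A @ w           # gradient of the quadratic

R = np.linalg.inv(A)               # preconditioner R = A^{-1}
alpha = 1.0
w = np.array([5.0, 5.0])

for _ in range(5):
    w = w - alpha * R @ grad_f(w)  # preconditioned GD: w <- w - alpha R grad f(w)

print(w)                           # [0. 0.]: the minimizer, reached on the first step
```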

Finding Transformation

How do we find this $R$ then?

  1. Use statistics from the dataset: for example, for a linear model you could precondition based on the variance of the features in your dataset. Note this is almost the same as normalization: both transform the problem into an easier domain. One slight difference is that preconditioning scales the regularizer too, while normalization doesn't. This really doesn't matter in practice.

  2. Use information from the matrix of second partial derivatives: for example, you could use a preconditioning matrix that is a diagonal approximation of the Newton's method update at some point. This is similar to what we did in the quadratic function example. These methods are sometimes called Newton Sketch methods. Both options are sketched in the code below.
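As a rough sketch of both options (the data matrix, the feature scales, and the least-squares setting below are all assumptions made for illustration, not something fixed by the lecture):

```python
import numpy as np

# Hypothetical design matrix X (n examples, d features) with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])

# Option 1: dataset statistics -- precondition with inverse feature variances
R_diag_stats = 1.0 / X.var(axis=0)

# Option 2: second derivatives -- for least squares the Hessian is (1/n) X^T X,
# so use the inverse of its diagonal as a diagonal approximation
H = X.T @ X / X.shape[0]
R_diag_hessian = 1.0 / np.diag(H)

print(R_diag_stats)    # large entries for small-scale features, small entries for large-scale ones
print(R_diag_hessian)  # similar effect, from a diagonal approximation of the Hessian
```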

Element Specific Learning Rate

Note this is computationally expensive: previously each update only took $\mathcal O(d+t)$, where $t$ is the time to compute the gradient. Now that we have a matrix multiplication, the cost becomes $\mathcal O(d^2 + t)$. In addition, we have to store a $d \times d$ matrix in memory. As we discussed before, this is really bad for high-dimensional data.

Therefore, we want to find an $R$ that is a diagonal matrix, so that the matrix multiplication reduces to an elementwise product of two vectors and the running time stays at $\mathcal O(d + t)$.

If we have such a diagonal matrix, the update step on the $i$-th coordinate will look like:

$$
\begin{align}
(w_{t+1})_i &= (w_t)_i - \alpha R_{ii} (\nabla f(w_t))_i\\
&= (w_t)_i - \alpha'_i (\nabla f(w_t))_i
\end{align}
$$

We effectively have an element-specific learning rate $\alpha'_i = \alpha R_{ii}$. We can also rewrite this in vector form (the arrow denotes that $\alpha$ is now a vector, and $\odot$ denotes the elementwise product):

$$\vec{w}_{t+1} = \vec{w}_t - \vec{\alpha} \odot \nabla f(w_t)$$
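A minimal sketch of this elementwise update (the diagonal quadratic and the particular choice $\alpha'_i = 1/A_{ii}$, i.e. $\alpha = 1$ and $R_{ii} = 1/A_{ii}$, are illustrative assumptions); note that each step costs only $\mathcal O(d)$ since no $d \times d$ matrix is ever formed:

```python
import numpy as np

# Diagonal quadratic f(w) = (1/2) sum_i A_ii w_i^2, so (grad f(w))_i = A_ii * w_i
A_diag = np.array([10.0, 100.0, 1.0])
grad_f = lambda w: A_diag * w

# Element-specific learning rates alpha'_i = alpha * R_ii, stored as a length-d vector
alpha_vec = 1.0 / A_diag           # here: alpha = 1 and R_ii = 1 / A_ii

w = np.ones(3)
for _ in range(10):
    w = w - alpha_vec * grad_f(w)  # elementwise product: O(d) per step, no d x d matrix

print(w)                           # [0. 0. 0.]: the minimizer
```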

Adaptive Learning Rate

However, if we make $\vec{\alpha}$ a hyperparameter, we would have to choose $d$ hyperparameters, which is a lot. If we want our model to be optimal, we have to do a hyperparameter search, and this many hyperparameters is impossible to search through. Therefore, we want to find a way to change $\vec{\alpha}$ intelligently.

Maybe we can keep a running sum of the squared gradient in each dimension? That will be our topic for next class.