
Preconditioning and Element Specific Learning Rate

Cornell University

Last time, we saw how we can use momentum to speed up GD when $\kappa$ is large. We reduced the number of iterations from $\mathcal O(n \kappa \log \frac 1 \epsilon)$ to $\mathcal O(n \sqrt{\kappa} \log \frac 1 \epsilon)$. This time, we will see how we can use preconditioning and adaptive learning rates to speed up computation.

Level Sets - Visualizing the Condition Number $\kappa$

We define a level set $L_C$ of a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ as

$$L_C = \{ w \mid f(w) = C \}$$

It is just the contour line of the function $f$ at level $C$.

Take the same function as last lecture, where

$$
\begin{align}
f(w_1, w_2) &= \frac{L}{2} w_1^2 + \frac{\mu}{2} w_2^2 &&\kappa = \frac{L}{\mu} &&2C = Lw_1^2 + \mu w_2^2\\
f(w_1, w_2) &= \frac{1}{2} w_1^2 + \frac{1}{2} w_2^2 &&\kappa = 1 &&2C = w_1^2 + w_2^2 \text{ is a circle}\\
f(w_1, w_2) &= \frac{10}{2} w_1^2 + \frac{1}{2} w_2^2 &&\kappa = 10 &&2C = 10w_1^2 + w_2^2 \text{ is an ellipse}
\end{align}
$$

We say that the function is well-conditioned when the condition number is small; in that case the level sets look like circles.
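To tie the pictures to numbers, here is a minimal NumPy sketch (an illustration, not part of the lecture) that computes $\kappa$ as the ratio of the largest to the smallest eigenvalue of the Hessian of the third example above:

```python
import numpy as np

# Hessian of f(w1, w2) = (10/2) w1^2 + (1/2) w2^2 from the third example above
A = np.diag([10.0, 1.0])

eigvals = np.linalg.eigvalsh(A)        # eigenvalues of the symmetric Hessian
kappa = eigvals.max() / eigvals.min()  # condition number kappa = L / mu
print(kappa)                           # 10.0 -> elongated elliptical level sets
```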

Preconditioning

Transforming Space

One nice thing to note is that when we stretch the space, we preserve the minimum value of the function. That is, as long as the matrix $P$ is invertible (this ensures that $\forall w, \exists u \; \text{s.t. } u = P^{-1}w$), we have

$$\min_w f(w) = \min_u f(Pu)$$

Therefore, if we only look at the minimum point, we have

$$\operatorname*{argmin}_w f(w) = P\left(\operatorname*{argmin}_u f(Pu) \right)$$

where $\operatorname*{argmin}_u f(Pu)$ solves for $u$ in the transformed space, and we get back $w$ by multiplying it by $P$.

While preserving the minimum point, we change how the function looks everywhere else, so the condition number in the transformed space also changes (and hopefully becomes smaller).
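As a quick sanity check of these two identities, here is a hedged sketch: the quadratic-plus-linear objective and the particular invertible $P$ below are arbitrary choices made for illustration (a linear term is added so the minimizer is not at the origin).

```python
import numpy as np

# f(w) = (1/2) w^T A w + b^T w, a strongly convex quadratic with minimizer w* = -A^{-1} b
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([-3.0, 2.0])

# Any invertible P works; this one is an arbitrary example
P = np.array([[1.0, 0.5], [0.0, 2.0]])

# g(u) = f(Pu) = (1/2) u^T (P^T A P) u + (P^T b)^T u has minimizer u* = -(P^T A P)^{-1} P^T b
w_star = -np.linalg.solve(A, b)
u_star = -np.linalg.solve(P.T @ A @ P, P.T @ b)

print(np.allclose(w_star, P @ u_star))  # True: argmin_w f(w) = P (argmin_u f(Pu))
```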

Problem Setup

Define $g(u) = f(Pu)$. We want to solve $\operatorname*{argmin}_u g(u)$ and map the result back via $w = Pu$.

For example, take $A$ to be a symmetric positive definite matrix (so all of its eigenvalues are positive) and define

$$f(w) = \frac 1 2 w^T A w$$

We can actually transform the Hessian from $A$ to $I$ in another space, giving a perfect condition number $\kappa = 1$. To do this, we want to find a $P$ such that $P^T A P = I$, so

$$
\begin{align}
f(Pu) &= \frac 1 2 (Pu)^T A (Pu)\\
g(u) &= \frac 1 2 u^T I u
\end{align}
$$

Suppose $P$ is also symmetric. Solving for $P$, we have

$$P^{-2} = A$$

We can denote $P = A^{-\frac 1 2}$. This is just notation and does not by itself tell us how to compute $P$ in practice.
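Even though $A^{-\frac 1 2}$ is "just notation", one concrete way to realize it numerically is through the eigendecomposition of $A$. A minimal sketch, with a hypothetical positive definite $A$:

```python
import numpy as np

# A hypothetical symmetric positive definite A
A = np.array([[10.0, 2.0],
              [2.0,  1.0]])

# A = Q diag(lam) Q^T, so A^{-1/2} = Q diag(lam^{-1/2}) Q^T (symmetric, as assumed above)
lam, Q = np.linalg.eigh(A)
P = Q @ np.diag(lam ** -0.5) @ Q.T

print(np.allclose(P.T @ A @ P, np.eye(2)))  # True: the transformed Hessian is I, so kappa = 1
```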

In-place Transformation

So far, we first transform the problem to another space, solve it in that space, and transform the solution back to the original space. Can we do everything in the original space, but pretend we are in the transformed space when we do the GD update?

Imagine we are again solving the minimization problem on the transformed function $g(u)$

$$u_{t+1} = u_t - \alpha \nabla g(u_t)$$

To find the value of $\nabla g(u_t)$, we go through a similar derivation as we did in the Gradient Descent Appendix. First note

$$g(u + \eta v) = f(P (u + \eta v)) = f(Pu + \eta Pv)$$

We then apply the definition of the directional derivative of $g$:

$$\nabla_v g(u) = \lim_{\eta \to 0} \frac{d}{d \eta} g(u + \eta v) = \lim_{\eta \to 0} \frac{d}{d \eta} f(Pu + \eta Pv) = \nabla_{Pv} f(Pu) = (Pv)^T \nabla f(Pu) = v^T P^T \nabla f(Pu)$$

Since this holds for every direction $v$, we get $\nabla g(u) = P^T \nabla f(Pu)$. Substituting this into the update gives

$$u_{t+1} = u_t - \alpha P^T \nabla f(P u_t).$$

To get back $w_t$, recall $w_t = P u_t$, so we just multiply both sides by $P$ and get

$$
\begin{align}
w_{t+1} &= P u_{t+1} \\
&= P u_t - \alpha P P^T \nabla f(P u_t) \\
&= w_t - \alpha P P^T \nabla f(w_t)\\
&= w_t - \alpha R \nabla f(w_t)
\end{align}
$$

Therefore, this is just gradient descent with the gradient scaled by a positive semidefinite matrix $R = PP^T$. We call this matrix $R$ the preconditioner. In practice, we don't bother finding $P$; we just pick some positive semidefinite $R$, because any such matrix can be decomposed into the form $R = PP^T$.

Relating this back to our previous simple example, $R = P^2 = A^{-1}$.

A weird question: despite this derivation, what happens if we use a matrix that is not positive semidefinite here? Imagine the very simple choice $R = -I$. The update then pushes $w_t$ in the direction in which $f(w_t)$ increases at each step, so the iterates blow up.
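Here is a minimal sketch of the preconditioned update $w_{t+1} = w_t - \alpha R \nabla f(w_t)$ on the quadratic example, using $R = A^{-1}$ as above (with $\alpha = 1$ this happens to reach the minimizer in a single step, since it coincides with Newton's method on a quadratic):

```python
import numpy as np

# f(w) = (1/2) w^T A w with a badly conditioned (hypothetical) A
A = np.diag([10.0, 1.0])
grad_f = lambda w: A @ w           # gradient of the quadratic

R = np.linalg.inv(A)               # preconditioner R = A^{-1}
alpha = 1.0
w = np.array([5.0, 5.0])

for _ in range(5):
    w = w - alpha * R @ grad_f(w)  # preconditioned GD: w <- w - alpha R grad f(w)

print(w)                           # [0. 0.]: the minimizer, reached on the first step
```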

Finding Transformation

How do we find this $R$ then?

  1. Use statistics from the dataset: for example, for a linear model you could precondition based on the variance of the features in your dataset. Note this is almost the same as normalization: both transform the problem into an easier domain. One slight difference is that preconditioning scales the regularizer too, while normalization doesn't. This really doesn't matter in practice.

  2. Use information from the matrix of second partial derivatives: for example, you could use a preconditioning matrix that is a diagonal approximation of the Newton's method update at some point. This is similar to what we did in the quadratic function example. These methods are sometimes called Newton Sketch methods. Both options are sketched in the code below.
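As a rough sketch of both options (the data matrix, the feature scales, and the least-squares setting below are all assumptions made for illustration, not something fixed by the lecture):

```python
import numpy as np

# Hypothetical design matrix X (n examples, d features) with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])

# Option 1: dataset statistics -- precondition with inverse feature variances
R_diag_stats = 1.0 / X.var(axis=0)

# Option 2: second derivatives -- for least squares the Hessian is (1/n) X^T X,
# so use the inverse of its diagonal as a diagonal approximation
H = X.T @ X / X.shape[0]
R_diag_hessian = 1.0 / np.diag(H)

print(R_diag_stats)    # large entries for small-scale features, small entries for large-scale ones
print(R_diag_hessian)  # similar effect, from a diagonal approximation of the Hessian
```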

Element Specific Learning Rate

Note this is computationally expensive: previously each update only took $\mathcal O(d+t)$, where $t$ is the time to compute the gradient. Now that we have a matrix multiplication, the cost becomes $\mathcal O(d^2 + t)$. In addition, we have to store a $d \times d$ matrix in memory. As we discussed before, this is really bad for high-dimensional data.

Therefore, we want to find an $R$ that is a diagonal matrix, so that the matrix multiplication reduces to an elementwise product of two vectors and the running time stays at $\mathcal O(d + t)$.

If we have such a diagonal matrix, the update step on the $i$-th coordinate will look like:

$$
\begin{align}
(w_{t+1})_i &= (w_t)_i - \alpha R_{ii} (\nabla f(w_t))_i\\
&= (w_t)_i - \alpha'_i (\nabla f(w_t))_i
\end{align}
$$

We effectively have an element-specific learning rate $\alpha'_i = \alpha R_{ii}$. We can also rewrite this in vector form (the arrow denotes that $\alpha$ is now a vector, and $\odot$ denotes the elementwise product):

$$\vec{w}_{t+1} = \vec{w}_t - \vec{\alpha} \odot \nabla f(w_t)$$
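A minimal sketch of this elementwise update (the diagonal quadratic and the particular choice $\alpha'_i = 1/A_{ii}$, i.e. $\alpha = 1$ and $R_{ii} = 1/A_{ii}$, are illustrative assumptions); note that each step costs only $\mathcal O(d)$ since no $d \times d$ matrix is ever formed:

```python
import numpy as np

# Diagonal quadratic f(w) = (1/2) sum_i A_ii w_i^2, so (grad f(w))_i = A_ii * w_i
A_diag = np.array([10.0, 100.0, 1.0])
grad_f = lambda w: A_diag * w

# Element-specific learning rates alpha'_i = alpha * R_ii, stored as a length-d vector
alpha_vec = 1.0 / A_diag           # here: alpha = 1 and R_ii = 1 / A_ii

w = np.ones(3)
for _ in range(10):
    w = w - alpha_vec * grad_f(w)  # elementwise product: O(d) per step, no d x d matrix

print(w)                           # [0. 0. 0.]: the minimizer
```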

Adaptive Learning Rate

However, if we make $\vec{\alpha}$ a hyperparameter, we would have to choose $d$ hyperparameters, which is a lot. If we want our model to be optimal, we have to do a hyperparameter search, and this many hyperparameters is impossible to search through. Therefore, we want to find a way to change $\vec{\alpha}$ intelligently.

Maybe we can keep a running sum of the squared gradient in each dimension? That will be our topic for next class.