
SGD with Momentum

Cornell University

Recall from the previous lecture that the running time of GD on a strongly convex (or PL-condition) function depends only on the condition number $\kappa$. When the condition number is high, convergence can become slow:

$$T \ge \kappa \log\left( \frac{f(w_0) - f^*}{\epsilon} \right)$$

So how can we speed up gradient descent when the condition number is high? There are a few common solutions; here we introduce momentum. A direct analysis would be messy, so we use a very simple example to build some intuition.

Simple Quadratic Function

The simplest possible setting with a condition number larger than 1 is a 2D quadratic. To get a high condition number, we want the larger eigenvalue of the Hessian to be very big and the smaller one to be very small. Consider the following example with $a > b$, where $a$ is the curvature of the first dimension and $b$ is the curvature of the second.

$$f(w) = f(w_1, w_2) = \frac{a}{2} w_1^2 + \frac{b}{2} w_2^2 = \frac{1}{2} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}^T \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$

The second derivative matrix is the constant

$$\nabla^2 f(w) = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$$

By the definition of the condition number in linear algebra, we know $\kappa = \frac{a}{b}$. By its definition in ML optimization, we know $\kappa = \frac{L}{\mu}$, so we can simply set $a = L$ and $b = \mu$, giving

$$\begin{align} f(w) = \frac{L}{2} w_1^2 + \frac{\mu}{2} w_2^2 &= \frac{1}{2} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}^T \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \\ \nabla f(w) &= \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} w \\ \nabla^2 f(w) &= \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} \end{align}$$
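For concreteness, here is a minimal sketch of this quadratic in code; the values $L = 100$ and $\mu = 1$ are arbitrary illustrative choices (giving $\kappa = 100$):

```python
import numpy as np

L, mu = 100.0, 1.0            # assumed illustrative curvatures, kappa = L / mu = 100
A = np.diag([L, mu])          # the (constant) Hessian of f

def f(w):
    """f(w) = (L/2) w_1^2 + (mu/2) w_2^2 = (1/2) w^T A w."""
    return 0.5 * w @ A @ w

def grad_f(w):
    """grad f(w) = A w."""
    return A @ w
```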

With this, we can write our update step as

$$\begin{align} w_{t+1} &= w_t - \alpha \nabla f(w_t) \\ &= w_t - \alpha \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} w_t \\ &= \begin{bmatrix} 1 - \alpha L & 0 \\ 0 & 1 - \alpha \mu \end{bmatrix} w_t \end{align}$$

If we run this update for $T$ steps, we get (from now on, $T$ will denote the number of iterations rather than transpose):

$$\begin{align} w_{T} &= \begin{bmatrix} 1 - \alpha L & 0 \\ 0 & 1 - \alpha \mu \end{bmatrix}^T w_0 \\ &= \begin{bmatrix} (1 - \alpha L)^T & 0 \\ 0 & (1 - \alpha \mu)^T \end{bmatrix} w_0 \\ &= \begin{bmatrix} (1 - \alpha L)^{T} (w_0)_1 \\ (1 - \alpha \mu)^{T} (w_0)_2 \end{bmatrix} \end{align}$$

Feeding this into $f$, we have

$$f(w_T) = \frac{L}{2} (1 - \alpha L)^{2T} \left( (w_0)_1 \right)^2 + \frac{\mu}{2} (1 - \alpha \mu)^{2T} \left( (w_0)_2 \right)^2$$
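A quick numerical sanity check of this closed form, reusing the assumed values $L = 100$, $\mu = 1$ from the sketch above (the step size, horizon, and starting point below are arbitrary choices):

```python
import numpy as np

L, mu = 100.0, 1.0                     # same assumed curvatures as before
A = np.diag([L, mu])
alpha, T = 0.015, 50                   # arbitrary step size (with alpha < 2/L) and horizon
w0 = np.array([1.0, 1.0])              # arbitrary starting point

w = w0.copy()
for _ in range(T):
    w = w - alpha * (A @ w)            # gradient step: grad f(w) = A w

f_T = 0.5 * w @ A @ w                  # f(w_T) from actually running GD
closed_form = 0.5 * L * (1 - alpha * L) ** (2 * T) * w0[0] ** 2 \
            + 0.5 * mu * (1 - alpha * mu) ** (2 * T) * w0[1] ** 2
print(f_T, closed_form)                # the two values agree up to floating-point error
```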

Therefore, the final value of $f(w_T)$ will be dominated by the larger of the exponential terms: one of $(1 - \alpha L)^{2T}$ or $(1 - \alpha \mu)^{2T}$.

To minimize $f(w_T)$, we have to make both $|1 - \alpha L|$ and $|1 - \alpha \mu|$ small at the same time; that is, we minimize $\max(|1 - \alpha L|, |1 - \alpha \mu|)$. Note that if we only looked at the first dimension and minimized $|1 - \alpha L|$ alone, we would set $\alpha = \frac{1}{L}$. With $L > \mu$, this is always a small number. Therefore, for the first dimension, with the larger curvature $L$, we want a smaller learning rate. On the other hand, if we only looked at the second dimension, where the curvature is the smaller $\mu$, we would want a larger learning rate.

Since we have to minimize both at the same time, we are forced to choose something in the middle. The maximum is minimized when $\alpha L - 1 = 1 - \alpha \mu$, i.e. $\alpha = \frac{2}{L + \mu}$. Substituting this $\alpha$ in,

$$\max(|1 - \alpha L|, |1 - \alpha \mu|) = \frac{L - \mu}{L + \mu} = \frac{\kappa - 1}{\kappa + 1} = 1 - \frac{2}{\kappa + 1}$$

As we said, the bigger of these two terms will dominate the final value of $f(w_T)$, so

$$f(w_T) = \mathcal{O}\left( \left(1 - \frac{2}{\kappa + 1}\right)^{2T} \right)$$

We know $1 - x \le e^{-x}$, with near-equality when $x$ is small (here $x = \frac{2}{\kappa + 1}$, which is small when $\kappa$ is large), so

$$f(w_T) = \mathcal{O}\left( \exp\left(- \frac{4T}{\kappa + 1}\right) \right)$$

Therefore, even in this simplest setting, we can't get rid of the $\frac{1}{\kappa + 1}$ factor in the exponent: reaching accuracy $\epsilon$ requires on the order of $\kappa \log(1/\epsilon)$ iterations.

Polyak Momentum

We want to somehow detect high vs. low curvature during GD. The idea: along a high-curvature direction, the gradient tends to flip sign from step to step (we overshoot back and forth across the valley), while along a low-curvature direction the gradient keeps pointing the same way. Therefore, we want to make steps smaller when gradients reverse sign and larger when gradients consistently point in the same direction. Polyak momentum does this:

$$w_{t+1} = w_t - \alpha \nabla f(w_t) + \beta (w_t - w_{t-1})$$

The intuition is that the extra term $\beta (w_t - w_{t-1})$ reuses the previous step: when consecutive steps point in the same direction it lengthens the step, and when they alternate in sign it partially cancels them.

This is equivalent to the following, where we define the momentum buffer $m_{t+1} := w_{t+1} - w_t$:

$$\begin{align*} w_{t+1} &= w_t - \alpha \nabla f(w_t) + \beta (w_t - w_{t-1}) \\ w_{t+1} - w_t &= - \alpha \nabla f(w_t) + \beta (w_t - w_{t-1}) \\ m_{t+1} &= -\alpha \nabla f(w_t) + \beta m_t \\ w_{t+1} &= w_t + m_{t+1} \end{align*}$$
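Here is a minimal sketch of this momentum-buffer form of the Polyak update; the arguments grad_f, alpha, and beta are placeholders supplied by the caller:

```python
import numpy as np

def polyak_momentum(grad_f, w0, alpha, beta, num_steps):
    """Gradient descent with Polyak (heavy-ball) momentum in the buffer form above."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)                  # momentum buffer, m_0 = 0 (i.e. w_0 = w_{-1})
    for _ in range(num_steps):
        m = beta * m - alpha * grad_f(w)  # m_{t+1} = beta * m_t - alpha * grad f(w_t)
        w = w + m                         # w_{t+1} = w_t + m_{t+1}
    return w
```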

State Transition Matrix

Going back to our simple example, denote $A = \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix}$, so

$$w_{t+1} = w_t - \alpha A w_t + \beta (w_t - w_{t-1})$$

We can write this update process as a matrix operation too. The matrix here will be a block matrix, where each entry is actually a $2 \times 2$ matrix.

$$\begin{align} \begin{bmatrix} w_{t+1} \\ w_t \end{bmatrix} &= \begin{bmatrix} w_t - \alpha A w_t + \beta (w_t - w_{t-1}) \\ w_t \end{bmatrix} \\ &= \begin{bmatrix} (1 + \beta) I - \alpha A & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} w_{t} \\ w_{t-1} \end{bmatrix} \\ \begin{bmatrix} w_{T+1} \\ w_T \end{bmatrix} &= \begin{bmatrix} (1 + \beta) I - \alpha A & -\beta I \\ I & 0 \end{bmatrix}^T \begin{bmatrix} w_{1} \\ w_{0} \end{bmatrix} \end{align}$$

This block matrix as a whole is a $4 \times 4$ matrix that maps $\mathbb{R}^4$ to $\mathbb{R}^4$. If we write the initial state in an eigenbasis of this matrix (assuming it is diagonalizable),

$$\begin{bmatrix} w_{1} \\ w_{0} \end{bmatrix} = c_1 u_1 + c_2 u_2 + c_3 u_3 + c_4 u_4$$

and denote the corresponding eigenvalues of this block matrix as $\lambda_1, \lambda_2, \lambda_3, \lambda_4$, then we can write

$$\begin{bmatrix} w_{T+1} \\ w_T \end{bmatrix} = \lambda_1^T c_1 u_1 + \lambda_2^T c_2 u_2 + \lambda_3^T c_3 u_3 + \lambda_4^T c_4 u_4$$

As before, what dominates the value of $f(w_T)$ is the largest of the exponential $\lambda_i^T$ terms. Therefore, we want to minimize all of these eigenvalues at the same time.

In addition, recall the squared terms in $f(w)$:

$$f(w_T) = \frac{L}{2} (w_T)_1^2 + \frac{\mu}{2} (w_T)_2^2$$

Therefore, we actually only care about the magnitude of $w_T$. That is, we want to minimize the magnitudes of all these eigenvalues at the same time.

$$\begin{align} &\min \|w_T\| \\ \iff &\min_{\lambda_{1,2,3,4}} \; \max \left(\| \lambda_1^T c_1 u_1 \|, \|\lambda_2^T c_2 u_2\|, \|\lambda_3^T c_3 u_3\|, \|\lambda_4^T c_4 u_4\| \right) \\ \iff &\min_{\lambda_{1,2,3,4}} \max \left( |\lambda_1|^T, |\lambda_2|^T, |\lambda_3|^T, |\lambda_4|^T \right) \end{align}$$
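To make this concrete, here is a small sketch that builds the $4 \times 4$ transition matrix for given values of $\alpha, \beta, L, \mu$ (the numbers below are arbitrary assumed choices) and inspects the eigenvalue magnitudes numerically:

```python
import numpy as np

def transition_matrix(alpha, beta, L, mu):
    """The 4x4 block matrix [[(1+beta)I - alpha*A, -beta*I], [I, 0]] with A = diag(L, mu)."""
    A = np.diag([L, mu])
    I = np.eye(2)
    top = np.hstack([(1 + beta) * I - alpha * A, -beta * I])
    bottom = np.hstack([I, np.zeros((2, 2))])
    return np.vstack([top, bottom])

# Assumed illustrative values; the eigenvalue magnitudes govern the convergence rate.
M = transition_matrix(alpha=0.03, beta=0.6, L=100.0, mu=1.0)
print(np.abs(np.linalg.eigvals(M)))   # magnitudes of the four eigenvalues
```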

Analyzing Eigenvalues

We now analyze the eigenvalues of this block matrix. Writing it out in full:

$$\begin{bmatrix} 1+\beta - \alpha L & 0 & -\beta & 0 \\ 0 & 1+\beta - \alpha \mu & 0 & -\beta \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

Since we are analyzing eigenvalues and the particular basis is not important, we swap the 2nd and 3rd columns and also the 2nd and 3rd rows. This is equivalent to swapping the 2nd and 3rd basis vectors in both the domain and the codomain (a similarity transformation), so the eigenvalues are unchanged.

$$\begin{bmatrix} 1+\beta - \alpha L & -\beta & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1+\beta - \alpha \mu & -\beta \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

We write this new matrix also in block matrix form,

$$\begin{bmatrix} B & 0 \\ 0 & B \end{bmatrix}, \quad \text{where } B = \begin{bmatrix} 1+\beta - \alpha \chi & -\beta \\ 1 & 0 \end{bmatrix}, \ \chi = L \text{ or } \mu \text{ respectively}$$

This new matrix is block-diagonal, so to find its eigenvalues, we just have to find the eigenvalues of $B$ (for each choice of $\chi$).

Recall that we want to minimize all of the eigenvalues' norms at the same time. We achieve this when they all have the same norm. For $B$ specifically, it has two eigenvalues $\lambda_1, \lambda_2$, and we want $|\lambda_1| = |\lambda_2|$.

Writing out the characteristic polynomial of $B$:

$$\begin{align} \det(B - \lambda I) &= \det\left( \begin{bmatrix} 1+\beta - \alpha \chi - \lambda & -\beta \\ 1 & -\lambda \end{bmatrix} \right) \\ &= \lambda \left( \lambda - (1 + \beta - \alpha \chi) \right) + \beta \\ &= \lambda^2 - (1 + \beta - \alpha \chi) \lambda + \beta = 0 \end{align}$$

Solving this quadratic equation, we have

$$\lambda = \frac{ (1 + \beta - \alpha \chi) \pm \sqrt{ (1 + \beta - \alpha \chi)^2 - 4 \beta } }{ 2 }$$

The two solutions have the same norm when they are equal or when they form a complex-conjugate pair, i.e. when

$$(1 + \beta - \alpha \chi)^2 - 4 \beta \le 0$$

To find the exact value of $|\lambda|$, we don't have to compute the norm of a complex number directly. We just recall that the product of the eigenvalues equals the determinant of the matrix, so

$$\begin{align} |\lambda|^2 &= |\det(B)| = \beta \\ |\lambda| &= \sqrt{\beta} \end{align}$$
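A quick numerical sanity check of this fact (the values of $\alpha$, $\beta$, $\chi$ below are arbitrary, chosen so that the discriminant condition above holds):

```python
import numpy as np

# Arbitrary assumed values chosen so that (1 + beta - alpha*chi)^2 - 4*beta <= 0.
alpha, beta, chi = 0.02, 0.8, 100.0
assert (1 + beta - alpha * chi) ** 2 - 4 * beta <= 0

B = np.array([[1 + beta - alpha * chi, -beta],
              [1.0,                     0.0]])
print(np.abs(np.linalg.eigvals(B)))   # both eigenvalue moduli ...
print(np.sqrt(beta))                  # ... equal sqrt(beta)
```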

Therefore, to minimize $|\lambda|$, we need to minimize $\sqrt{\beta}$. So we have a new constrained optimization problem to solve:

$$\begin{align} &\min_{\alpha, \beta} \sqrt{\beta} \\ &\text{s.t. } (1 + \beta - \alpha L)^2 - 4 \beta \le 0, \quad (1 + \beta - \alpha \mu)^2 - 4 \beta \le 0 \end{align}$$

This can be analyzed with the Karush–Kuhn–Tucker (KKT) conditions. As in linear programming, where with $k$ variables typically $k$ of the $n$ inequality constraints hold with equality at the optimum, here the minimum is attained when both of the 2 inequality constraints hold with equality. So we solve

$$\begin{align} (1 + \beta - \alpha L)^2 - 4 \beta &= 0 \\ (1 + \beta - \alpha \mu)^2 - 4 \beta &= 0 \end{align}$$

which gives us

$$\begin{align} 2 \sqrt{\beta} &= | 1 + \beta - \alpha L | \\ &= | 1 + \beta - \alpha \mu | \end{align}$$

Since $L > \mu$, for the two absolute values to be equal we must have

$$\begin{align} -2 \sqrt{\beta} &= 1 + \beta - \alpha L \\ 2 \sqrt{\beta} &= 1 + \beta - \alpha \mu \end{align}$$

Solving for both $\alpha$ and $\beta$, we have

$$\alpha = \frac{2 + 2 \beta}{L + \mu} \quad\text{ and }\quad \sqrt{\beta} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} = 1 - \frac{2}{\sqrt{\kappa} + 1}.$$
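As a quick numerical check under the same assumed values $L = 100$, $\mu = 1$ as before, we can evaluate these formulas and verify that both constraints indeed hold with equality:

```python
import numpy as np

L, mu = 100.0, 1.0                               # same assumed values as before
kappa = L / mu
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
alpha = (2 + 2 * beta) / (L + mu)

print((1 + beta - alpha * L) ** 2 - 4 * beta)    # ~0 up to floating-point error
print((1 + beta - alpha * mu) ** 2 - 4 * beta)   # ~0 up to floating-point error
```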

Recall that the norm of $w_T$ will be dominated by the term with the largest eigenvalue magnitude (write $C = \max_i \|c_i u_i\|$):

$$\|w_T\| \approx \max_{i \in \{1,\dots,4\}} \| \lambda_i^T c_i u_i\| \le \left(\max_{i} |\lambda_i|^T\right) C = \sqrt{\beta}^{\,T} C$$

Therefore, with $C' = \frac{\max(L, \mu)}{2} C^2$, we have

$$\begin{align} f(w_T) &\le \frac{\max(L, \mu)}{2} \|w_T\|^2 \approx C' \sqrt{\beta}^{\,2T} \\ &= C' \left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^{2T} \\ &\le C' \exp\left(- \frac{4T}{\sqrt{\kappa} + 1}\right) \end{align}$$

Therefore, the function value drops below a given accuracy $\epsilon$ (remember this is a quadratic, so $f^* = 0$) once $T$ satisfies the following condition:

$$f(w_T) \le \epsilon \quad \text{whenever} \quad T \ge \frac{\sqrt{\kappa} + 1}{4} \log\left(\frac{C'}{\epsilon}\right)$$

So

$$T = \mathcal{O}\left(\sqrt{\kappa} \log(1/\epsilon)\right)$$

Recall that the corresponding requirement for vanilla GD is

$$T = \mathcal{O}\left(\kappa \log(1/\epsilon)\right)$$

Therefore, we have shown that GD with momentum converges faster than vanilla GD on this simple quadratic example. The result generalizes to some extent beyond this example, though, as discussed below, Polyak momentum's acceleration guarantee is specific to quadratics.
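A minimal sketch comparing the two on the same quadratic (everything here reuses the assumed $L = 100$, $\mu = 1$ example; the tolerance and starting point are arbitrary). The two printed iteration counts differ by roughly a factor of $\sqrt{\kappa}$:

```python
import numpy as np

# Assumed illustrative values (kappa = 100), reusing the earlier quadratic example.
L, mu = 100.0, 1.0
A = np.diag([L, mu])

def f(w):
    return 0.5 * w @ A @ w

def grad_f(w):
    return A @ w

def steps_to_tolerance(alpha, beta, eps=1e-8, w0=(1.0, 1.0), max_steps=100_000):
    """Iterations until f(w_t) <= eps; beta = 0 recovers plain GD."""
    w, m = np.array(w0), np.zeros(2)
    for t in range(max_steps):
        if f(w) <= eps:
            return t
        m = beta * m - alpha * grad_f(w)   # Polyak momentum step
        w = w + m
    return max_steps

kappa = L / mu
beta_opt = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
print(steps_to_tolerance(alpha=2 / (L + mu), beta=0.0))                        # plain GD: grows like kappa
print(steps_to_tolerance(alpha=(2 + 2 * beta_opt) / (L + mu), beta=beta_opt))  # momentum: grows like sqrt(kappa)
```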

When we use momentum with SGD, it is not guaranteed to give a better result, but it is still widely used in practice.

Nesterov Momentum

One disadvantage of Polyak momentum is that the momentum term is not guaranteed to point in the right direction. Also, it is only guaranteed to give this nice acceleration for quadratics. Therefore, we introduce Nesterov momentum, which works for general strongly convex objectives.

Polyak:

$$\begin{align*} m_{t+1} &= \beta m_t - \alpha \nabla f(w_t) \\ w_{t+1} &= w_t + m_{t+1} \end{align*}$$

Nesterov:

$$\begin{align*} m_{t+1} &= \beta m_t - \alpha \nabla f(w_t + \beta m_t) \\ w_{t+1} &= w_t + m_{t+1}. \end{align*}$$

The difference: instead of evaluating the gradient at the current position, we pretend to have already taken the momentum step and evaluate the gradient at that look-ahead point.
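A minimal sketch of the Nesterov update, mirroring the Polyak sketch earlier (again, grad_f, alpha, and beta are placeholders supplied by the caller):

```python
import numpy as np

def nesterov_momentum(grad_f, w0, alpha, beta, num_steps):
    """Gradient descent with Nesterov momentum: the gradient is evaluated at the
    look-ahead point w_t + beta * m_t instead of at w_t."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    for _ in range(num_steps):
        m = beta * m - alpha * grad_f(w + beta * m)  # look-ahead gradient
        w = w + m
    return w
```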