
Adaptive Learning Rate and Variance Reduction

Cornell University

All operations in this section are element-wise.

AdaGrad

Set the learning rate to be inversely proportional to the square root of the sum of the squares of all gradients observed so far:

\alpha' = \frac{\alpha}{\sqrt{g_1^2 + g_2^2 + \dots + g_t^2}}

Algorithm

AdaGrad is independent of which kind of Gradient Descent we use: the gradient g can be computed on a single example, a minibatch, or the whole dataset.
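For reference, here is a minimal NumPy sketch of an AdaGrad loop; the gradient oracle grad_fn, the base LR alpha, and the small stability constant eps are illustrative assumptions, not part of the lecture notation.

```python
import numpy as np

def adagrad(w, grad_fn, alpha=0.01, eps=1e-8, num_steps=1000):
    """Minimal AdaGrad sketch (grad_fn is a placeholder gradient oracle)."""
    r = np.zeros_like(w)                         # running sum of squared gradients, element-wise
    for _ in range(num_steps):
        g = grad_fn(w)                           # one example, a minibatch, or the full dataset
        r += g * g
        w = w - alpha / (np.sqrt(r) + eps) * g   # effective LR shrinks as r accumulates
    return w
```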

Advantage

  1. If we look at the i-th dimension:

    r_i \approx T\,\mathbb{E}[g_i^2] = T \left( \mathbb{E}[g_i]^2 + \mathrm{Var}(g_i) \right)

    Imagine we’re near convergence, so E[g_i]^2 should be small. The dominant term is then Var(g_i), so the learning rate becomes small when this variance term is big and big when it is small.

  2. It has a diminishing LR that is proportional to 1/√T:

    \alpha' = \frac{\alpha}{\sqrt{r}} \approx \frac{\alpha}{\sqrt{T\,\mathbb{E}[g_i^2]}}

    When AdaGrad was popular, people focused mostly on strongly convex problems. For such problems, as we discussed for SGD on PL functions, a diminishing LR proportional to 1/T was thought to be optimal, and AdaGrad sort of achieves that.

Downside

The LR diminishes monotonically, so on a non-convex problem with many peaks and valleys it will fail: such a landscape sometimes requires a high LR and sometimes a low one.

RMSProp

Modifies AdaGrad to use an exponential moving average instead of a sum, so

\alpha' \approx \frac{\alpha}{\bar{g}_{1:t}}

Algorithm

Note r is the element-wise exponential running average of g^2, so r_i ≈ E[g_i^2].

In practice, we usually set ρ = 0.99 or 0.999, so the effective window size of the moving average is roughly 100 or 1000 steps.
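Under the same assumptions as before (a placeholder grad_fn and a small stability constant eps), a minimal sketch of an RMSProp-style loop, where r holds the exponential running average of g^2:

```python
import numpy as np

def rmsprop(w, grad_fn, alpha=0.001, rho=0.99, eps=1e-8, num_steps=1000):
    """Minimal RMSProp sketch: AdaGrad with an EMA of squared gradients."""
    r = np.zeros_like(w)                          # EMA of g^2, element-wise
    for _ in range(num_steps):
        g = grad_fn(w)
        r = rho * r + (1 - rho) * g * g           # decaying average instead of a running sum
        w = w - alpha / (np.sqrt(r) + eps) * g    # LR can rise again if recent gradients shrink
    return w
```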

Advantage

More effective for non-convex optimization problems, since the LR is no longer forced to decrease monotonically.

Exponential Moving Average - Bias Term

When we take the exponential moving average r_t of a series {x_1, x_2, …, x_t} with decay rate ρ, we follow this recurrence relation:

\begin{align}
r_0 &= 0 \\
r_t &= \rho r_{t-1} + (1 - \rho) x_t
\end{align}

We can write out the closed form of r_t:

r_t = \sum_{i=1}^{t} \rho^{t-i}(1-\rho)x_i

If we assume every term of the series has the same value x and sum the geometric series,

\begin{align}
r_t &= \sum_{i=1}^{t} \rho^{t-i}(1-\rho)x \\
    &= (1-\rho)x \sum_{i=1}^{t} \rho^{t-i} \\
    &= (1-\rho)x \, \frac{1-\rho^t}{1-\rho} \\
    &= (1-\rho^t)x
\end{align}

If we take ρ = 0.999, we see that r_1 = (1-ρ^1)x = 0.001x and r_2 = (1-ρ^2)x ≈ 0.002x, which is nowhere near the average of these terms. The factor 1-ρ^t is what prevents us from getting the correct “average”. We call it the bias correction term and divide r_t by it to get the true average:

r'_t = \frac{\rho r_{t-1} + (1 - \rho) x_t}{1-\rho^t}

The original RMSProp does not do this, so it performs poorly during the first few steps.
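A quick numerical illustration of the correction on a constant series (the values ρ = 0.999 and x = 5.0 are arbitrary choices):

```python
rho, x = 0.999, 5.0
r = 0.0
for t in range(1, 6):
    r = rho * r + (1 - rho) * x       # raw EMA: only about 0.001*t*x after t steps
    r_hat = r / (1 - rho ** t)        # bias-corrected EMA
    print(t, round(r, 4), round(r_hat, 4))
# r starts at 0.005 and grows very slowly, while r_hat recovers x (up to floating point)
```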

Adam

In short, Adam uses a bias-corrected average for everything.

Adam is a combination of RMSProp and momentum. It also corrects for bias when estimating the first-order and second-order moments of the gradients.

Algorithm

Since the time step actually matters here (because of the bias correction terms), we describe this algorithm with explicit time-indexed update steps rather than in-place value assignments. Note that all operations are still element-wise, and subscripts denote the time step.

In practice, we also adjust α every so many iterations, but here we treat it as a constant for simplicity.
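For reference, a minimal sketch of a standard Adam-style loop consistent with the description above; grad_fn, eps, and the default ρ values are illustrative assumptions.

```python
import numpy as np

def adam(w, grad_fn, alpha=0.001, rho1=0.9, rho2=0.999, eps=1e-8, num_steps=1000):
    """Minimal Adam sketch: bias-corrected first and second moments of the gradient."""
    s = np.zeros_like(w)                           # EMA of g   (first moment)
    r = np.zeros_like(w)                           # EMA of g^2 (second moment)
    for t in range(1, num_steps + 1):
        g = grad_fn(w)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)                # bias correction, as derived above
        r_hat = r / (1 - rho2 ** t)
        w = w - alpha * s_hat / (np.sqrt(r_hat) + eps)
    return w
```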

Relate to RMSProp

If we set ρ_1 = 0, so we only look at the current gradient, this becomes the same as RMSProp, where w_{t+1} = w_t - (α / √r_{t+1}) g_{t+1}.

Relate to Momentum

Note with momentum, we have

\begin{align*}
m_{t+1} &= \beta m_t - \alpha g_t \\
w_{t+1} &= w_t + m_{t+1}
\end{align*}

If we discard the 1/√r term in Adam,

\begin{align*}
s_{t+1} &= \frac{\rho_1 s_t + (1 - \rho_1) g_{t+1}}{1 - \rho_1^t} \\
w_{t+1} &= w_t - \alpha s_{t+1}
\end{align*}

so Adam will become the same as Gradient Descent with momentum.

SVRG - Stochastic Variance Reduced Gradient

It is a good algorithm, but it only works on convex functions, so we won’t go into detail since deep learning problems are almost never convex. The basic idea is to reduce the variance of the gradient estimator by taking an infrequent full-gradient step.

On convex functions it is so powerful that it reduces the running time of GD from

\underset{\text{GD}}{\mathcal{O}\left(n \kappa \log \frac{1}{\epsilon}\right)} \rightarrow \underset{\text{SVRG}}{\mathcal{O}\left((n+\kappa) \log \frac{1}{\epsilon}\right)}
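Although we skip the details, a minimal sketch of the idea may help: once per epoch we compute a full gradient at a snapshot, and each stochastic step then uses that snapshot to cancel most of the noise. The per-example oracle grad_i, the full-gradient oracle full_grad, and the inner loop length m are placeholder assumptions.

```python
import numpy as np

def svrg(w, grad_i, full_grad, n, alpha=0.01, num_epochs=10, m=1000):
    """Minimal SVRG sketch: variance-reduced stochastic gradients around a snapshot."""
    for _ in range(num_epochs):
        w_snap = w.copy()                 # snapshot of the weights
        mu = full_grad(w_snap)            # infrequent full-gradient computation (O(n) work)
        for _ in range(m):
            i = np.random.randint(n)      # sample one example
            # unbiased estimator whose variance shrinks as w approaches w_snap and the optimum
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - alpha * g
    return w
```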

Polyak–Ruppert Averaging

Redefine the model weight as the average of all model weights that we have observed so far.

\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t

The idea is that when we are near convergence in SGD, we bounce around inside a noise ball around the optimum. If we take an average of the model weights at the periphery of that noise ball, we can cancel out the noise and reach the minimum.

One issue with the simple setup above: we are averaging, with equal weight, iterates from the very start of training, before we have reached the noise ball. To address this, we modify the original algorithm with an initial warm-up period during which we do not average, and only start averaging once the warm-up period is over.
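A minimal sketch of this warm-up variant, where sgd_step is a placeholder for one SGD update:

```python
import numpy as np

def polyak_average(w, sgd_step, num_steps=10000, warmup=1000):
    """Minimal sketch: equal-weight average of the iterates after a warm-up period."""
    w_bar, count = np.zeros_like(w), 0
    for t in range(num_steps):
        w = sgd_step(w)                          # one SGD update (placeholder)
        if t >= warmup:                          # only average once we are in the noise ball
            count += 1
            w_bar += (w - w_bar) / count         # running equal-weight average
    return w_bar
```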

On the other hand, if we make two other changes to the original Polyak Averaging algorithm:

  1. Instead of using an average that assigns equal weight to all model weights, we use an exponential running average instead.
  2. Instead of averaging over all iterations or using a warm-up period, we average over a running window.

It will become the same algorithm as Stochastic Weight Averaging (SWA).
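For contrast, a minimal sketch of change 1 above: an exponential running average of the weights themselves (the decay rho_w is an illustrative choice; it acts like a soft running window of roughly 1/(1 - rho_w) recent iterates). This is only a sketch of the averaging rule, not a full SWA implementation.

```python
def ema_weight_average(w, sgd_step, rho_w=0.99, num_steps=10000):
    """Minimal sketch: exponential running average of the weights (w is a NumPy array)."""
    w_bar = w.copy()
    for _ in range(num_steps):
        w = sgd_step(w)                              # one SGD update (placeholder)
        w_bar = rho_w * w_bar + (1 - rho_w) * w      # EMA ~ soft window over recent iterates
    return w_bar
```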