
Adaptive Learning Rate and Variance Reduction

Cornell University

All operations in this section are element-wise.

AdaGrad

Set the learning rate to be inversely proportional to the square root of the sum of the squares of all gradients observed so far:

\alpha' = \frac{\alpha}{\sqrt{g_1^2 + g_2^2 + \dots + g_t^2}}

Algorithm

AdaGrad is independent of which kind of Gradient Descent we use: the gradient g can be computed on a single example, a minibatch, or the whole dataset.
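For reference, here is a minimal NumPy sketch of an AdaGrad loop; the gradient oracle grad_fn, the base LR alpha, and the small stability constant eps are illustrative assumptions, not part of the lecture notation.

```python
import numpy as np

def adagrad(w, grad_fn, alpha=0.01, eps=1e-8, num_steps=1000):
    """Minimal AdaGrad sketch (grad_fn is a placeholder gradient oracle)."""
    r = np.zeros_like(w)                         # running sum of squared gradients, element-wise
    for _ in range(num_steps):
        g = grad_fn(w)                           # one example, a minibatch, or the full dataset
        r += g * g
        w = w - alpha / (np.sqrt(r) + eps) * g   # effective LR shrinks as r accumulates
    return w
```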

Advantage

  1. If we look at the i-th dimension:

    r_i \approx T\,\mathbb{E}[g_i^2] = T \left( \mathbb{E}[g_i]^2 + \mathrm{Var}(g_i) \right)

    Imagine we’re near convergence, so E[g_i]^2 should be small. The dominant term is then Var(g_i), so the learning rate becomes small when this variance term is big and big when it is small.

  2. It has a diminishing LR that is proportional to 1/√T:

    \alpha' = \frac{\alpha}{\sqrt{r}} \approx \frac{\alpha}{\sqrt{T\,\mathbb{E}[g_i^2]}}

    When AdaGrad was popular, people focused mostly on strongly convex problems. For such problems, as we discussed for SGD on PL functions, a diminishing LR proportional to 1/T was thought to be optimal, and AdaGrad sort of achieves that.

Downside

The LR diminishes monotonically, so on a non-convex problem with many peaks and valleys it will fail: such a landscape sometimes requires a high LR and sometimes a low one.

RMSProp

Modifies AdaGrad to use an exponential moving average instead of a sum, so

\alpha' \approx \frac{\alpha}{\bar{g}_{1:t}}

Algorithm

Note r is the element-wise exponential running average of g^2, so r_i ≈ E[g_i^2].

In practice, we usually set ρ = 0.99 or 0.999, so the effective window size of the moving average is roughly 100 or 1000 steps.
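Under the same assumptions as before (a placeholder grad_fn and a small stability constant eps), a minimal sketch of an RMSProp-style loop, where r holds the exponential running average of g^2:

```python
import numpy as np

def rmsprop(w, grad_fn, alpha=0.001, rho=0.99, eps=1e-8, num_steps=1000):
    """Minimal RMSProp sketch: AdaGrad with an EMA of squared gradients."""
    r = np.zeros_like(w)                          # EMA of g^2, element-wise
    for _ in range(num_steps):
        g = grad_fn(w)
        r = rho * r + (1 - rho) * g * g           # decaying average instead of a running sum
        w = w - alpha / (np.sqrt(r) + eps) * g    # LR can rise again if recent gradients shrink
    return w
```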

Advantage

More effective for non-convex optimization problems, since the LR is no longer forced to decrease monotonically.

Exponential Moving Average - Bias Term

When we take the exponential moving average r_t of a series {x_1, x_2, …, x_t} with decay rate ρ, we follow this recurrence relation:

\begin{align}
r_0 &= 0 \\
r_t &= \rho r_{t-1} + (1 - \rho) x_t
\end{align}

We can write out the closed form of r_t:

r_t = \sum_{i=1}^{t} \rho^{t-i}(1-\rho)x_i

If we assume every term of the series has the same value x and sum the geometric series,

\begin{align}
r_t &= \sum_{i=1}^{t} \rho^{t-i}(1-\rho)x \\
    &= (1-\rho)x \sum_{i=1}^{t} \rho^{t-i} \\
    &= (1-\rho)x \, \frac{1-\rho^t}{1-\rho} \\
    &= (1-\rho^t)x
\end{align}

If we take ρ = 0.999, we see that r_1 = (1-ρ^1)x = 0.001x and r_2 = (1-ρ^2)x ≈ 0.002x, which is nowhere near the average of these terms. The factor 1-ρ^t is what prevents us from getting the correct “average”. We call it the bias correction term and divide r_t by it to get the true average:

r'_t = \frac{\rho r_{t-1} + (1 - \rho) x_t}{1-\rho^t}

The original RMSProp does not do this, so it performs poorly during the first few steps.
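A quick numerical illustration of the correction on a constant series (the values ρ = 0.999 and x = 5.0 are arbitrary choices):

```python
rho, x = 0.999, 5.0
r = 0.0
for t in range(1, 6):
    r = rho * r + (1 - rho) * x       # raw EMA: only about 0.001*t*x after t steps
    r_hat = r / (1 - rho ** t)        # bias-corrected EMA
    print(t, round(r, 4), round(r_hat, 4))
# r starts at 0.005 and grows very slowly, while r_hat recovers x (up to floating point)
```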

Adam

In short, Adam uses a bias-corrected average for everything.

Adam is a combination of RMSProp and momentum. It also corrects for bias when estimating the first-order and second-order moments of the gradients.

Algorithm

Since the time step actually matters here (because of the bias correction terms), we describe this algorithm with explicit time-indexed update steps rather than in-place value assignments. Note that all operations are still element-wise, and subscripts denote the time step.

In practice, we also adjust α every so many iterations, but here we treat it as a constant for simplicity.
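For reference, a minimal sketch of a standard Adam-style loop consistent with the description above; grad_fn, eps, and the default ρ values are illustrative assumptions.

```python
import numpy as np

def adam(w, grad_fn, alpha=0.001, rho1=0.9, rho2=0.999, eps=1e-8, num_steps=1000):
    """Minimal Adam sketch: bias-corrected first and second moments of the gradient."""
    s = np.zeros_like(w)                           # EMA of g   (first moment)
    r = np.zeros_like(w)                           # EMA of g^2 (second moment)
    for t in range(1, num_steps + 1):
        g = grad_fn(w)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)                # bias correction, as derived above
        r_hat = r / (1 - rho2 ** t)
        w = w - alpha * s_hat / (np.sqrt(r_hat) + eps)
    return w
```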

Relate to RMSProp

If we set ρ_1 = 0, so we only look at the current gradient, this becomes the same as RMSProp, where w_{t+1} = w_t - (α / √r_{t+1}) g_{t+1}.

Relate to Momentum

Note with momentum, we have

\begin{align*}
m_{t+1} &= \beta m_t - \alpha g_t \\
w_{t+1} &= w_t + m_{t+1}
\end{align*}

If we discard the 1/√r term in Adam,

\begin{align*}
s_{t+1} &= \frac{\rho_1 s_t + (1 - \rho_1) g_{t+1}}{1 - \rho_1^t} \\
w_{t+1} &= w_t - \alpha s_{t+1}
\end{align*}

so Adam will become the same as Gradient Descent with momentum.

SVRG - Stochastic Variance Reduced Gradient

It is a good algorithm, but it only works on convex functions, so we won’t go into detail since deep learning problems are almost never convex. The basic idea is to reduce the variance of the gradient estimator by taking an infrequent full-gradient step.

On convex functions it is so powerful that it reduces the running time of GD from

\underset{\text{GD}}{\mathcal{O}\left(n \kappa \log \frac{1}{\epsilon}\right)} \rightarrow \underset{\text{SVRG}}{\mathcal{O}\left((n+\kappa) \log \frac{1}{\epsilon}\right)}
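Although we skip the details, a minimal sketch of the idea may help: once per epoch we compute a full gradient at a snapshot, and each stochastic step then uses that snapshot to cancel most of the noise. The per-example oracle grad_i, the full-gradient oracle full_grad, and the inner loop length m are placeholder assumptions.

```python
import numpy as np

def svrg(w, grad_i, full_grad, n, alpha=0.01, num_epochs=10, m=1000):
    """Minimal SVRG sketch: variance-reduced stochastic gradients around a snapshot."""
    for _ in range(num_epochs):
        w_snap = w.copy()                 # snapshot of the weights
        mu = full_grad(w_snap)            # infrequent full-gradient computation (O(n) work)
        for _ in range(m):
            i = np.random.randint(n)      # sample one example
            # unbiased estimator whose variance shrinks as w approaches w_snap and the optimum
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - alpha * g
    return w
```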

Polyak–Ruppert Averaging

Redefine the model weight as the average of all model weights that we have observed so far.

\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t

The idea is that when we are near convergence in SGD, we bounce around inside a noise ball around the optimum. If we take an average of the model weights at the periphery of that noise ball, we can cancel out the noise and reach the minimum.

One issue with the simple setup above: we are averaging, with equal weight, iterates from the very start of training, before we have reached the noise ball. To address this, we modify the original algorithm with an initial warm-up period during which we do not average, and only start averaging once the warm-up period is over.
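A minimal sketch of this warm-up variant, where sgd_step is a placeholder for one SGD update:

```python
import numpy as np

def polyak_average(w, sgd_step, num_steps=10000, warmup=1000):
    """Minimal sketch: equal-weight average of the iterates after a warm-up period."""
    w_bar, count = np.zeros_like(w), 0
    for t in range(num_steps):
        w = sgd_step(w)                          # one SGD update (placeholder)
        if t >= warmup:                          # only average once we are in the noise ball
            count += 1
            w_bar += (w - w_bar) / count         # running equal-weight average
    return w_bar
```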

On the other hand, if we make two other changes to the original Polyak Averaging algorithm:

  1. Instead of using an average that assigns equal weight to all model weights, we use an exponential running average instead.
  2. Instead of averaging over all iterations or using a warm-up period, we average over a running window.

It will become the same algorithm as Stochastic Weight Averaging (SWA).
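For contrast, a minimal sketch of change 1 above: an exponential running average of the weights themselves (the decay rho_w is an illustrative choice; it acts like a soft running window of roughly 1/(1 - rho_w) recent iterates). This is only a sketch of the averaging rule, not a full SWA implementation.

```python
def ema_weight_average(w, sgd_step, rho_w=0.99, num_steps=10000):
    """Minimal sketch: exponential running average of the weights (w is a NumPy array)."""
    w_bar = w.copy()
    for _ in range(num_steps):
        w = sgd_step(w)                              # one SGD update (placeholder)
        w_bar = rho_w * w_bar + (1 - rho_w) * w      # EMA ~ soft window over recent iterates
    return w_bar
```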