Adaptive Learning Rate and Variance Reduction
All operations in this section are element-wise.
AdaGrad
Set the learning rate to be inversely proportional to the square root of the sum of squares of all gradients observed so far.
Algorithm
Input: scalar global learning rate factor $\alpha$ (plus a small constant $\epsilon$ for numerical stability)
Initialize weight $w$, second moment accumulator $r = 0$
loop
- get example gradient $g$
- accumulate second moment estimate: $r \leftarrow r + g^2$
- update model: $w \leftarrow w - \frac{\alpha}{\sqrt{r} + \epsilon}\, g$
AdaGrad is independent of which flavor of Gradient Descent we use: the gradient $g$ can be computed on a single example, a minibatch, or the whole dataset.
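The loop above can be sketched in plain Python; the toy objective, hyperparameter values, and the `adagrad` helper name are illustrative assumptions, not part of the algorithm itself.

```python
import math

def adagrad(grad, w, alpha=0.5, eps=1e-8, steps=500):
    """Minimal AdaGrad sketch: accumulate squared gradients in r,
    then scale each step by alpha / (sqrt(r) + eps), element-wise."""
    r = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            r[i] += g[i] * g[i]                          # second moment accumulator
            w[i] -= alpha / (math.sqrt(r[i]) + eps) * g[i]
    return w

# Toy objective f(w) = w0^2 + w1^2, so grad(w) = (2*w0, 2*w1); minimum at (0, 0).
w = adagrad(lambda w: [2 * wi for wi in w], [3.0, -2.0])
```

Note that `r` only ever grows, which is exactly the monotonically diminishing LR discussed below.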
Advantage
If we look at the $i$-th dimension:
$$w_i \leftarrow w_i - \frac{\alpha}{\sqrt{r_i} + \epsilon}\, g_i, \qquad r_i = \sum_{k=1}^{t} g_{k,i}^2$$
Imagine we're near convergence, so the current gradient $g_i$ should be small. Then the dominant term in the denominator would be $\sqrt{\sum_k g_{k,i}^2}$, the accumulated history, so we set our learning rate to be small when this term is big, and big when this term is small.
It has a diminishing LR that is proportional to $1/\sqrt{t}$ (when the gradient magnitudes stay roughly constant).
When AdaGrad was popular, people mostly focused on strongly convex problems. For these problems, as we discussed in SGD on PL functions, people thought it was optimal to have a diminishing LR proportional to $1/t$. AdaGrad sort of achieves that.
Downside
The LR monotonically diminishes, so on a non-convex problem with a lot of peaks and valleys it will fail, because such a landscape requires sometimes a high LR and sometimes a low LR.
RMSProp
Modifies AdaGrad to use an exponential moving average instead of a sum, so
$$r \leftarrow \rho r + (1 - \rho)\, g^2$$
Algorithm
Input: scalar global learning rate factor $\alpha$ and decay rate $\rho$
Initialize weight $w$, second moment estimator $r = 0$
loop
- get example gradient $g$
- accumulate second moment estimate: $r \leftarrow \rho r + (1 - \rho)\, g^2$
- update model: $w \leftarrow w - \frac{\alpha}{\sqrt{r} + \epsilon}\, g$
Note $r$ is the exponential running average of $g^2$, so it estimates the average squared gradient over roughly the last $\frac{1}{1 - \rho}$ steps.
In practice, we usually set $\rho = 0.99$ or $\rho = 0.999$, so the window size of the moving average is roughly 100 or 1000.
Advantage
More effective for non-convex optimization problems, since the effective LR can rise again after a stretch of small gradients.
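A minimal sketch on the same kind of toy quadratic; the objective, hyperparameter values, and the `rmsprop` name are illustrative assumptions. The only change from AdaGrad is the accumulator update:

```python
import math

def rmsprop(grad, w, alpha=0.01, rho=0.9, eps=1e-8, steps=1000):
    """Minimal RMSProp sketch: the second-moment accumulator r is an
    exponential moving average instead of a sum, so the effective
    learning rate can grow again when recent gradients shrink."""
    r = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            r[i] = rho * r[i] + (1 - rho) * g[i] * g[i]   # EMA of g^2
            w[i] -= alpha / (math.sqrt(r[i]) + eps) * g[i]
    return w

# Toy objective f(w) = w0^2 + w1^2; minimum at (0, 0).
w = rmsprop(lambda w: [2 * wi for wi in w], [3.0, -2.0])
```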
Exponential Moving Average - Bias Term
When we take the exponential moving average of a series $g_1, g_2, \dots$ with decay rate $\rho$, we follow this recurrence relation:
$$r_t = \rho r_{t-1} + (1 - \rho)\, g_t, \qquad r_0 = 0$$
We can write out the closed form of $r_t$:
$$r_t = (1 - \rho) \sum_{k=1}^{t} \rho^{\,t-k}\, g_k$$
If we assume the series all have the same value $g$ and take the sum of this geometric series,
$$r_t = (1 - \rho)\, g \sum_{k=1}^{t} \rho^{\,t-k} = (1 - \rho^t)\, g$$
If we take $t = 1$, we see that $r_1 = (1 - \rho)\, g$, which for $\rho$ close to $1$ is not even near the average $g$ of these terms. The factor $1 - \rho^t$ is really annoying and prevents us from getting the correct "average". We will call it the bias correction term and divide our $r_t$ by it to get the real average:
$$\hat{r}_t = \frac{r_t}{1 - \rho^t}$$
The original RMSProp didn't do this, so it performs badly during the first few steps.
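A quick numeric check of the bias term; the series value and decay rate are illustrative:

```python
# EMA of a constant series g_t = 5 with decay rate rho = 0.99:
#   r_t = rho * r_{t-1} + (1 - rho) * g_t,  starting from r_0 = 0.
rho, g = 0.99, 5.0
r = 0.0
for t in range(1, 11):               # run 10 steps
    r = rho * r + (1 - rho) * g
# Closed form gives r_10 = (1 - rho**10) * g, far below the true average g = 5.
biased = r                           # roughly 0.48
corrected = r / (1 - rho ** 10)      # bias-corrected estimate, recovers 5
```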
Adam
In short, Adam uses a bias-corrected average for everything.
Adam is a combination of RMSProp and momentum. It also corrects for bias when estimating the first-order and second-order moments of the gradients.
Algorithm
Since time step values actually matter here, we will describe this algorithm with explicit per-step updates instead of value assignments. Note all operations are still element-wise and subscripts denote the time step.
Input: scalar global learning rate factor $\alpha$ and decay rates $\beta_1, \beta_2$
Initialize weight $w_0$, first moment estimator $m_0 = 0$, second moment estimator $v_0 = 0$, time step $t = 0$
loop:
We are currently updating for time step $t \leftarrow t + 1$
get example gradient $g_t$
accumulate first and second moment estimates with bias correction:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
update weight:
$$w_t = w_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
In practice, we also adjust $\alpha$ every some number of iterations (a learning rate schedule), but here we just treat it as a constant for simplicity.
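The loop above can be sketched as follows; the toy objective, hyperparameter values, and the `adam` helper name are illustrative assumptions:

```python
import math

def adam(grad, w, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Minimal Adam sketch: EMA first moment m (the momentum part) and EMA
    second moment v (the RMSProp part), each divided by its bias correction."""
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for t in range(1, steps + 1):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)       # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)       # bias-corrected second moment
            w[i] -= alpha / (math.sqrt(v_hat) + eps) * m_hat
    return w

# Toy objective f(w) = w0^2 + w1^2; minimum at (0, 0).
w = adam(lambda w: [2 * wi for wi in w], [3.0, -2.0])
```

Because the update is normalized by $\sqrt{\hat{v}_t}$, the step size stays on the order of $\alpha$ regardless of the gradient scale.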
Relate to RMSProp
If we set $\beta_1 = 0$ so we only look at the current gradient, this will be the same as RMSProp (plus bias correction on the second moment), where $\rho = \beta_2$.
Relate to Momentum
Note with (EMA-style) momentum, we have
$$m_t = \beta m_{t-1} + (1 - \beta)\, g_t, \qquad w_t = w_{t-1} - \alpha m_t$$
If we discard the $\sqrt{\hat{v}_t}$ term (and the bias corrections) in Adam,
$$w_t = w_{t-1} - \alpha m_t$$
so Adam will become the same as Gradient Descent with momentum.
SVRG - Stochastic Variance Reduced Gradient
It is a good algorithm, but it only works on convex functions, so we won't go into details since deep learning problems are almost never convex. The basic idea is to reduce the variance of the gradient estimators by using an infrequent full-gradient step.
It is so powerful on convex functions that it sharply reduces the number of gradient evaluations compared to full GD, e.g. from $O(n\kappa \log(1/\epsilon))$ to $O((n + \kappa)\log(1/\epsilon))$ on strongly convex problems, where $\kappa$ is the condition number.
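For intuition only, here is a minimal sketch of that basic idea; the toy least-squares problem, hyperparameters, and the `svrg` name are illustrative assumptions:

```python
import random

def svrg(grads, full_grad, w, alpha=0.1, epochs=10, m=50):
    """SVRG sketch: each outer epoch takes a snapshot w_snap and computes its
    full gradient mu once; inner steps then use the variance-reduced estimator
    g_i(w) - g_i(w_snap) + mu, which is unbiased with shrinking variance."""
    n = len(grads)
    for _ in range(epochs):
        w_snap = w
        mu = full_grad(w_snap)            # one full-gradient pass per epoch
        for _ in range(m):
            i = random.randrange(n)       # cheap stochastic inner step
            w = w - alpha * (grads[i](w) - grads[i](w_snap) + mu)
    return w

# Toy 1-D least squares: f(w) = (1/n) * sum_i (w - a_i)^2 / 2, minimized at mean(a).
a = [1.0, 2.0, 3.0, 4.0]
grads = [lambda w, ai=ai: w - ai for ai in a]
full_grad = lambda w: sum(gi(w) for gi in grads) / len(grads)
w = svrg(grads, full_grad, 0.0)
```

On this particular toy problem the per-example Hessians are identical, so the corrected estimator is exact and the iterates converge deterministically to `mean(a) = 2.5`.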
Polyak–Ruppert Averaging
Redefine the model weight as the average of all model weights that we have observed so far:
$$\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$$
The idea is that when we are near convergence in SGD, we will encounter a noise ball around the optimum. If we take an average of the model weights at the periphery of that noise ball, we can cancel out the noise and reach the minimum.
One issue with the simple setup above: we average, with equal weight, iterates from the very start of training, before we have reached the noise ball. To address this, we modify the algorithm to first use an initial warm-up period during which we do not average, and only start averaging after the warm-up period is over.
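The warm-up variant can be sketched as follows; the noise model, warm-up length, toy objective, and the `sgd_polyak` name are illustrative assumptions:

```python
import random

def sgd_polyak(grad, w, alpha=0.05, steps=2000, warmup=1000, seed=0):
    """SGD with Polyak-Ruppert averaging: after a warm-up period, report the
    running equal-weight average of the iterates instead of the last iterate."""
    rng = random.Random(seed)
    avg, count = 0.0, 0
    for t in range(steps):
        g = grad(w) + rng.gauss(0.0, 1.0)    # noisy gradient -> noise ball
        w = w - alpha * g
        if t >= warmup:                       # only average after warm-up
            count += 1
            avg += (w - avg) / count          # incremental running mean
    return w, avg

# Toy: f(w) = (w - 1)^2 / 2 with unit-variance gradient noise; optimum at w = 1.
last, averaged = sgd_polyak(lambda w: w - 1.0, 10.0)
```

The last iterate keeps bouncing inside the noise ball, while the averaged iterate sits much closer to the optimum.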
On the other hand, if we make two other changes to the original Polyak Averaging algorithm:
- Instead of an average that assigns equal weight to all model weights, we use an exponential running average.
- Instead of averaging over all iterations or using a warm-up period, we use a running window.
it will become the same algorithm as Stochastic Weight Averaging (SWA).
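Those two changes can be sketched as follows; the toy problem, decay value, and the `sgd_ema_weights` name are illustrative assumptions, not the exact SWA recipe from the literature:

```python
import random

def sgd_ema_weights(grad, w, alpha=0.05, decay=0.99, steps=2000, seed=0):
    """SGD that keeps an exponential running average of the weights: recent
    iterates dominate (an implicit running window), and old iterates decay
    away on their own, so no explicit warm-up period is needed."""
    rng = random.Random(seed)
    w_ema = w
    for _ in range(steps):
        g = grad(w) + rng.gauss(0.0, 1.0)          # noisy gradient
        w = w - alpha * g
        w_ema = decay * w_ema + (1 - decay) * w    # EMA of the iterates
    return w, w_ema

# Same toy problem: f(w) = (w - 1)^2 / 2 with gradient noise; optimum at w = 1.
last, smoothed = sgd_ema_weights(lambda w: w - 1.0, 10.0)
```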