Stochastic Gradient Descent Improved
Minibatch SGD Running Time
Recall from last time: for an $L$-smooth function $f$ satisfying the PL condition with parameter $\mu$, minibatch SGD with batch size $B$, constant step size $\alpha \le \frac{1}{L}$, and per-sample gradient variance $\sigma^2$ satisfies

$$\mathbb{E}[f(w_T)] - f^* \le (1-\alpha\mu)^T \left(f(w_0) - f^*\right) + \frac{\alpha L \sigma^2}{2\mu B}$$
Similar to our proof that GD converges linearly on PL-condition functions, set our goal as follows: for a given margin $\epsilon$, by running minibatch SGD $T$ times, we can output a prediction $w_T$ s.t. $\mathbb{E}[f(w_T)] - f^* \le \epsilon$, so we can write

$$(1-\alpha\mu)^T \left(f(w_0) - f^*\right) + \frac{\alpha L \sigma^2}{2\mu B} \le \epsilon$$
To make it easier for ourselves, we instead find at least how large $T$ needs to be so that each term is at most $\epsilon/2$:

$$(1-\alpha\mu)^T \left(f(w_0) - f^*\right) \le \frac{\epsilon}{2}, \qquad \frac{\alpha L \sigma^2}{2\mu B} \le \frac{\epsilon}{2}$$
The first expression, using $(1-\alpha\mu)^T \le e^{-\alpha\mu T}$, gives

$$T \ge \frac{1}{\alpha\mu}\log\frac{2\left(f(w_0)-f^*\right)}{\epsilon}$$
The second expression gives

$$\alpha \le \frac{\mu B \epsilon}{L\sigma^2}$$
Recall we have another constraint on $\alpha$, namely $\alpha \le \frac{1}{L}$, so

$$\alpha = \min\left(\frac{1}{L},\ \frac{\mu B \epsilon}{L\sigma^2}\right)$$
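As a sanity check, here is a tiny numeric sketch (every constant below, including the initial gap, is made up for illustration) that picks the step size this way and then solves the first expression for the number of steps:

```python
import math

# Hypothetical problem constants (not from the notes): smoothness L,
# PL constant mu, per-sample gradient variance sigma2, batch size B,
# target accuracy eps, and initial gap delta0 = f(w_0) - f*.
L, mu, sigma2, B, eps, delta0 = 10.0, 1.0, 4.0, 16, 1e-3, 1.0

# Step size: the smaller of the smoothness constraint 1/L and the
# variance constraint mu*B*eps / (L*sigma2).
alpha = min(1.0 / L, mu * B * eps / (L * sigma2))

# First term of the bound: need (1 - alpha*mu)^T * delta0 <= eps/2,
# i.e. T >= log(2*delta0/eps) / (alpha*mu), using (1-x)^T <= e^{-xT}.
T = math.ceil(math.log(2 * delta0 / eps) / (alpha * mu))

print(alpha, T)
```

With these numbers the variance constraint is the binding one, so alpha ends up far below 1/L and the step count grows accordingly.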
This is actually a pretty interesting expression. It says that if we have a big batch size and low variance, so the batch gradient is representative of the global gradient, we can set $\alpha$ to be big. Replacing this $\alpha$ in the lower bound on $T$ gives

$$T \ge \max\left(\frac{L}{\mu},\ \frac{L\sigma^2}{\mu^2 B \epsilon}\right)\log\frac{2\left(f(w_0)-f^*\right)}{\epsilon}$$
Note $\kappa = \frac{L}{\mu}$ is the condition number. Every $T$ that satisfies this condition will produce a small enough gap, so the first / smallest such $T$ is simply the ceiling of the right-hand side. Therefore, the number of steps for minibatch SGD to converge is

$$T = O\left(\left(\kappa + \frac{\kappa\sigma^2}{\mu B \epsilon}\right)\log\frac{1}{\epsilon}\right)$$
Each step, we need $O(B)$ time to compute the batch gradient (assuming $O(1)$ time to calculate each sample's gradient; this time is usually $O(d)$ but it depends, so we just ignore it for now). The running time of minibatch SGD to reach precision $\epsilon$ is therefore

$$O\left(\left(B\kappa + \frac{\kappa\sigma^2}{\mu\epsilon}\right)\log\frac{1}{\epsilon}\right)$$
Most of the time, $\frac{\kappa\sigma^2}{\mu\epsilon}$ is the bigger part, and if we also ignore the $\log\frac{1}{\epsilon}$ factor, the running time of SGD is in short

$$O\left(\frac{\kappa\sigma^2}{\mu\epsilon}\right)$$
Comparing to GD
Comparing the full version of the minibatch SGD running time, $O\left(\left(B\kappa + \frac{\kappa\sigma^2}{\mu\epsilon}\right)\log\frac{1}{\epsilon}\right)$, with GD's $O\left(n\kappa\log\frac{1}{\epsilon}\right)$, we see that the $n$ is basically replaced with $\frac{\sigma^2}{\mu\epsilon}$.
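To make the comparison concrete, here is a toy calculation (every constant is hypothetical) of the dominant per-log-factor cost of each method:

```python
# Hypothetical constants: dataset size n, condition number kappa = L/mu,
# gradient variance sigma2, PL constant mu, target accuracy eps.
n, kappa, sigma2, mu, eps = 10**6, 100, 1.0, 0.1, 1e-2

gd_cost = n * kappa                      # GD: O(n * kappa) per log(1/eps)
sgd_cost = kappa * sigma2 / (mu * eps)   # SGD: O(kappa * sigma2 / (mu * eps))

print(gd_cost, sgd_cost)
```

Here $n = 10^6$ dwarfs $\sigma^2/(\mu\epsilon) = 10^3$, so minibatch SGD wins by orders of magnitude; shrinking $\epsilon$ flips the comparison back toward GD.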
So there are some situations where minibatch SGD is clearly more suitable:

- huge $n$: GD runs too slowly because every step calculates the gradient on the whole dataset
- moderate precision: $\epsilon$ is not too small, since SGD's running time grows with $\frac{1}{\epsilon}$
In what situation does GD run better? Maybe when $\sigma^2$ is too large, so the average batch gradient is not representative of the whole dataset? Is this true, though? Think of a dataset with a very large variance (maybe all the data points are uniformly distributed over some space): a randomly sampled data point's gradient $\nabla f_i$ will have almost nothing to do with the full gradient $\nabla f$. If this is the case, even GD cannot learn much, and this learning / optimization problem is not well posed in the first place.
Another Advantage of SGD - Hardware
Minibatch SGD runs significantly faster than GD not only because of the decrease in computation cost, but also because when $n$ is too large, the full-gradient computation blows up the memory, forcing a lot of memory swapping and other overhead.
Minibatch SGD is faster than vanilla SGD because minibatch can utilize parallelism of the hardware.
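To see the parallelism point concretely, here is a minimal NumPy sketch (synthetic least-squares data; all names and constants are mine) where the whole batch gradient is a single vectorized matrix product rather than a Python loop over samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 1000, 5, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                      # noiseless labels for a clean test

def batch_grad(w, idx):
    # One vectorized pass over the batch: the hardware-friendly part.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
alpha = 0.05
for _ in range(2000):
    idx = rng.integers(0, n, size=B)   # with replacement, as in the analysis
    w -= alpha * batch_grad(w, idx)

print(np.linalg.norm(w - w_true))
```

The matrix product `Xb.T @ (Xb @ w - yb)` is exactly the kind of operation BLAS kernels and GPUs parallelize well, which is why a batch of 32 is barely more expensive in wall-clock time than a single sample.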
Minibatch SGD in Practice
When analyzing convergence and running time, we drew the batch with replacement. However, in practice, drawing the batch without replacement gives better results.
The best practice is random reshuffling, where we randomly shuffle the whole dataset and divide it into several minibatches, so there are no duplicate elements across the batches. This method makes our previous convergence proof absolutely not hold, because everything in that proof was built on the assumption that each batch is drawn independently, so all those expectations depend only on $w_t$. Now, however, the expectation depends on a whole lot of stuff - what the previous batch was, the order in which points were drawn, ...
The term related to random shuffling is epoch, which means 1 pass through the whole dataset.
We have other methods for doing minibatch SGD:
- sample without replacement within each batch, so there can be duplicates across different batches
- shuffle-once: shuffle the dataset only once before training starts, so every epoch uses the same order
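The random-reshuffle scheme can be sketched as follows (a toy NumPy version; the function name is mine): shuffle the indices once per epoch, then slice off consecutive minibatches, so no index repeats within the epoch:

```python
import numpy as np

def reshuffle_batches(n, batch_size, rng):
    # One epoch: shuffle all indices once, then cut into minibatches.
    perm = rng.permutation(n)
    return [perm[i:i + batch_size] for i in range(0, n, batch_size)]

rng = np.random.default_rng(0)
batches = reshuffle_batches(n=10, batch_size=3, rng=rng)

# Every index appears exactly once across the epoch's batches.
flat = np.concatenate(batches)
print(sorted(flat.tolist()))
```

Calling `reshuffle_batches` again at the start of the next epoch reshuffles; calling it only once up front would give the shuffle-once variant.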
Improve: Non-Constant Learning Rate on PL Functions
Now instead of having a constant learning rate $\alpha$, we will have an adaptive (usually diminishing) learning rate $\alpha_t$ specific to each step:

$$w_{t+1} = w_t - \alpha_t \nabla f_{B_t}(w_t)$$

Writing $\Delta_t = \mathbb{E}[f(w_t)] - f^*$, the one-step bound from before becomes

$$\Delta_{t+1} \le (1 - \alpha_t\mu)\,\Delta_t + \frac{\alpha_t^2 L \sigma^2}{2B}$$

Think of the right-hand side as a function of $\alpha_t$. Since we want $\Delta_{t+1}$ to be as small as possible, we take the derivative of the right-hand side with respect to $\alpha_t$ and solve for where it equals zero, to see the value of $\alpha_t$ at which it reaches its minimum:

$$-\mu\Delta_t + \frac{\alpha_t L \sigma^2}{B} = 0 \quad\Longrightarrow\quad \alpha_t = \frac{\mu B \Delta_t}{L\sigma^2}$$
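As a quick numerical check (all constants made up), the one-step bound is a parabola in the step size, and a brute-force scan lands on the same minimizer as the closed form:

```python
# Per-step bound as a function of the step size alpha (constants hypothetical).
mu, L, sigma2, B, delta = 0.5, 4.0, 2.0, 8, 0.1

def bound(alpha):
    # (1 - alpha*mu) * Delta_t + alpha^2 * L * sigma2 / (2B)
    return (1 - alpha * mu) * delta + alpha**2 * L * sigma2 / (2 * B)

# Closed form from setting the derivative to zero: mu*B*Delta_t / (L*sigma2).
alpha_star = mu * B * delta / (L * sigma2)

# Numerically scan a grid: no alpha on it should beat the closed form.
grid = [i * 1e-4 for i in range(2000)]
best = min(grid, key=bound)
print(alpha_star, best)
```

The scan and the formula agree to within the grid resolution, which is all this toy check is meant to show.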
Substituting this $\alpha_t$ into the inequality above gives

$$\Delta_{t+1} \le \Delta_t - \frac{\mu^2 B}{2L\sigma^2}\Delta_t^2 = \Delta_t\left(1 - \frac{\mu^2 B}{2L\sigma^2}\Delta_t\right)$$

and taking its inverse,

$$\frac{1}{\Delta_{t+1}} \ge \frac{1}{\Delta_t}\cdot\frac{1}{1 - \frac{\mu^2 B}{2L\sigma^2}\Delta_t}$$
Then we use the fact that $\frac{1}{1-x} \ge 1+x$ for $x < 1$:

$$\frac{1}{\Delta_{t+1}} \ge \frac{1}{\Delta_t}\left(1 + \frac{\mu^2 B}{2L\sigma^2}\Delta_t\right) = \frac{1}{\Delta_t} + \frac{\mu^2 B}{2L\sigma^2}$$

We can say $\frac{\mu^2 B}{2L\sigma^2}\Delta_t < 1$ because we can make $\Delta_t$ really small. To do this, at the beginning of training we first run minibatch SGD at a constant step size for several epochs. During these constant-step runs we don't care about this property, so nothing is violated, and at the same time they make $\Delta_t$ relatively small, so we are ready to switch to a diminishing step size while making sure $\frac{\mu^2 B}{2L\sigma^2}\Delta_t < 1$ holds.
Now look at $T$ iterations: telescoping the inequality,

$$\frac{1}{\Delta_T} \ge \frac{1}{\Delta_0} + \frac{\mu^2 B}{2L\sigma^2}\,T \quad\Longrightarrow\quad \Delta_T \le \frac{2L\sigma^2}{\mu^2 B}\cdot\frac{1}{T} = O\!\left(\frac{1}{T}\right)$$
Now we have sort of a value for $\Delta_t$, or at least we know how it changes: $\Delta_t = O\!\left(\frac{1}{t}\right)$. Recall we solved for the value of $\alpha_t$ that makes the next iteration as good as possible: $\alpha_t = \frac{\mu B \Delta_t}{L\sigma^2} \propto \Delta_t$.
Therefore, we conclude that we should make $\alpha_t$ proportional to the inverse of the step number: $\alpha_t = \Theta\!\left(\frac{1}{t}\right)$.
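A tiny simulation of this schedule (my own 1-D toy problem, f(w) = w²/2 with Gaussian gradient noise, not from the notes): with step size 1/(t+1), quadrupling the number of iterations should roughly quarter the expected gap:

```python
import random

random.seed(0)
sigma = 1.0

def run_sgd(T):
    # f(w) = w^2 / 2, so f* = 0; stochastic gradient is w + noise.
    w = 5.0
    for t in range(T):
        alpha = 1.0 / (t + 1)          # diminishing step, proportional to 1/t
        g = w + random.gauss(0.0, sigma)
        w -= alpha * g
    return 0.5 * w * w                 # gap f(w_T) - f*

def avg_gap(T, trials=2000):
    # Average the gap over many runs to estimate E[f(w_T)] - f*.
    return sum(run_sgd(T) for _ in range(trials)) / trials

g100, g400 = avg_gap(100), avg_gap(400)
print(g100, g400)
```

On this toy problem the averaged gap at 400 steps comes out close to a quarter of the gap at 100 steps, matching the O(1/T) rate (up to sampling noise).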