
Accelerate DNN Training

Cornell University

Describing Capacity

Capacity of a model informally refers to the ability of the model to fit a wide range of possible functions.

Representational Capacity of a model refers to the set of functions the model can approximate well in theory. Deep neural networks have very high representational capacity; in fact, they are universal approximators.

Effective Capacity of a model refers to the set of functions the model can approximate well in practice, given a specific learning algorithm and dataset.

For a convex optimization problem, a model's effective capacity is just its representational capacity, since every local minimum is a global minimum. But for nonconvex problems, a model's effective capacity also depends on whether the dataset is representative and whether the learning algorithm is powerful enough. These two factors can have a great effect on effective capacity, even though they have zero effect on representational capacity.

Early Stopping

We stop training when the validation loss starts to increase (a sign that the model is beginning to overfit), even if the training loss is still decreasing.
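A minimal sketch of what this looks like in a training loop. The helpers `train_one_epoch` and `evaluate`, the `patience` value, and the checkpoint path are assumptions for illustration, not something fixed by these notes:

```python
import torch

best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5  # how many epochs with no improvement we tolerate (assumed value)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_loss = evaluate(model, val_loader)           # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best.pt")    # remember the best model so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss stopped improving: stop training early
```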

Dropout

During training, we randomly drop some neurons (force their activation values to 0) so they contribute nothing to the computation. For these dropped neurons, we also propagate a gradient of 0 backward.

In this way, we force all neurons to participate in training. Imagine we have two neurons A and B: B always outputs a constant, and A does all the work of changing to approximate the target values. If we now force A to output 0, B can no longer remain constant; it also has to participate in learning.
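A minimal sketch of the usual implementation. The rescaling of survivors by 1/(1-p), known as inverted dropout, is an implementation detail beyond the description above; note how the same mask that zeroes an activation in the forward pass also zeroes its gradient in the backward pass:

```python
import torch

def dropout_train(x, p=0.5):
    # Keep each activation with probability 1 - p; drop (zero) it with probability p.
    mask = (torch.rand_like(x) >= p).float()
    # Inverted dropout: rescale survivors by 1/(1-p) so the expected activation
    # matches evaluation time, when nothing is dropped.
    return x * mask / (1 - p)

x = torch.randn(4, 8, requires_grad=True)
y = dropout_train(x)
y.sum().backward()
# Dropped entries receive exactly zero gradient: the mask also blocks the backward pass.
print((x.grad == 0).any())
```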

Batch Normalization

To avoid large variance in the output of each neuron, we constrain the outputs of each neuron over a minibatch to have zero mean and unit standard deviation.

Think of one specific neuron, and let $u = [u_1, u_2, \dots, u_B] \in \mathbb{R}^B$ be its outputs for all samples in this batch. Let

$$BN(u) = \frac{u - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $u = [u_1, u_2, \dots, u_B]$. After the BN layer, the output has zero mean and unit standard deviation.

This is how we calculate BN during training. During training, we also keep exponential running averages of $\mu$ and $\sigma^2$, so that during evaluation we can use these averages as the batch statistics. Batch Norm is usually applied before the activation.
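A small sketch of this training-time computation for one minibatch. The function name, the `momentum` value, and the `eps` term (added for numerical stability, not part of the formula above) are assumptions for illustration:

```python
import torch

def batchnorm_train(u, running_mean, running_var, momentum=0.1, eps=1e-5):
    # u: (B, num_neurons) pre-activations of one layer for one minibatch.
    mu = u.mean(dim=0)                        # per-neuron mean over the batch
    var = u.var(dim=0, unbiased=False)        # per-neuron variance over the batch
    u_hat = (u - mu) / torch.sqrt(var + eps)  # zero mean, unit std over the batch

    # Exponential running averages of mu and sigma^2, used in place of
    # the batch statistics at evaluation time.
    running_mean.mul_(1 - momentum).add_(momentum * mu)
    running_var.mul_(1 - momentum).add_(momentum * var)
    return u_hat
```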

There are two problems with Batch Norm:

  1. We effectively reduce the representational capacity of our model by constraining the mean to be exactly 0 and the standard deviation to be exactly 1. Note that previously each neuron could have a different mean and standard deviation, but now they all have the same.

    We try to solve this problem by adding an affine term that assigns a new, possibly non-zero mean $\beta$ and a new, possibly non-unit standard deviation $\gamma$ to this neuron's activation values on the batch. Here $\gamma$ and $\beta$ are learnable parameters. (Note that $\mu, \sigma, \gamma, \beta$ are all scalars here; a short sketch using PyTorch's built-in layer follows this list.)

    $$BN(u) = \frac{u - \mu}{\sigma} \gamma + \beta$$

    However, doesn't this defeat the purpose of normalizing in the first place? Even so, the equation above is still the common practice, though we do not fully understand why it works.

  2. All previous proofs depend on the fact that the gradient computed for each sample in the minibatch is independent of the other samples. By introducing the batch mean $\mu$ and standard deviation $\sigma$, we make the per-sample updates depend on one another, since the mean and standard deviation depend on all samples in the batch. This nullifies the proofs we did previously that ensure good properties of SGD. Despite this, it performs well in practice and people use it.
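For reference, this is how the learnable $\gamma$ and $\beta$ from point 1 appear in PyTorch's built-in layer ($\gamma$ is stored as `weight`, $\beta$ as `bias`), and how `.train()` / `.eval()` switch between batch statistics and the running averages:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=16)    # one gamma and one beta per neuron/feature
print(bn.weight.shape, bn.bias.shape)   # gamma (initialized to 1) and beta (initialized to 0)

x = torch.randn(32, 16)

bn.train()   # normalize with batch statistics; update running_mean / running_var
_ = bn(x)

bn.eval()    # normalize with the stored running averages instead
_ = bn(x)
```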

Residual Block / Skip Connection

```
L0 ---- L1 ---- sigma1 ---- L2 ---- sigma2 --+--
     |                                       |
     -----------------------------------------
```

The L blocks here are computation layers and sigma is the non-linear activation layer.

So why do we add this skip connection? Imagine L2 does something that prevents learning (say all of its weights go to 0, so during backpropagation the gradients flowing through it to earlier layers also go to 0). By adding the L0 value, we ensure that even if L2 does not work, the whole network can still work: in this particular example, the gradient through the L0 branch can remain non-zero even when everything in between has gradient 0.

The classic ResNet employs this architecture:

```
L0 ---- conv1 ---- BN1 ---- sigma1 ---- conv2 ---- BN2 --+-- sigma2
     |                                                   |
     -----------------------------------------------------
```
```python
# From https://github.com/akamaster/pytorch_resnet_cifar10/blob/master/resnet.py, class BasicBlock
def forward(self, x):
    out = F.relu(self.bn1(self.conv1(x)))  # conv1 -> BN1 -> sigma1
    out = self.bn2(self.conv2(out))        # conv2 -> BN2
    out += self.shortcut(x)                # add the skip connection from L0
    out = F.relu(out)                      # sigma2
    return out
```

People don't usually relate this to an RNN, because an RNN takes in a sequence of inputs and each token is passed through the same weights / the same model.