
Accelerate DNN Training

Cornell University

Describing Capacity

Capacity of a model informally refers to the ability of the model to fit a wide range of possible functions.

Representational Capacity of a model refers to the set of functions the model can approximate well in theory. Deep neural networks have very high representational capacity; in fact, they are universal approximators.

Effective Capacity of a model refers to the set of functions the model can approximate well in practice, given a specific learning algorithm and dataset.

For a convex optimization problem, a model's effective capacity is just its representational capacity, since every local minimum is a global minimum. But for nonconvex problems, a model's effective capacity also depends on whether the dataset is representative and whether the learning algorithm is powerful enough. These two factors can have a great effect on effective capacity, even though they have zero effect on representational capacity.

Early Stopping

We stop training when the validation loss starts to increase (a sign that the model is beginning to overfit), even if the training loss is still decreasing.
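A minimal sketch of what this looks like in a training loop. The helpers `train_one_epoch` and `evaluate`, the `patience` value, and the checkpoint path are assumptions for illustration, not something fixed by these notes:

```python
import torch

best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5  # how many epochs with no improvement we tolerate (assumed value)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_loss = evaluate(model, val_loader)           # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best.pt")    # remember the best model so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss stopped improving: stop training early
```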

Dropout

During training, we randomly drop some neurons (force their activation values to 0) so they contribute nothing to the computation. For these dropped neurons, we also propagate a gradient of 0 backward.

In this way, we force all neurons to participate in training. Imagine we have two neurons A and B: B always outputs a constant, and A does all the work of changing to approximate the target values. If we now force A to output 0, B can no longer remain constant; it also has to participate in learning.
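A minimal sketch of the usual implementation. The rescaling of survivors by 1/(1-p), known as inverted dropout, is an implementation detail beyond the description above; note how the same mask that zeroes an activation in the forward pass also zeroes its gradient in the backward pass:

```python
import torch

def dropout_train(x, p=0.5):
    # Keep each activation with probability 1 - p; drop (zero) it with probability p.
    mask = (torch.rand_like(x) >= p).float()
    # Inverted dropout: rescale survivors by 1/(1-p) so the expected activation
    # matches evaluation time, when nothing is dropped.
    return x * mask / (1 - p)

x = torch.randn(4, 8, requires_grad=True)
y = dropout_train(x)
y.sum().backward()
# Dropped entries receive exactly zero gradient: the mask also blocks the backward pass.
print((x.grad == 0).any())
```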

Batch Normalization

To avoid large variance in the output of each neuron, we constrain the outputs of each neuron over a minibatch to have zero mean and unit standard deviation.

Think of one specific neuron, and let $u = [u_1, u_2, \dots, u_B] \in \mathbb{R}^B$ be its outputs for all samples in this batch. Let

$$BN(u) = \frac{u - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $u = [u_1, u_2, \dots, u_B]$. After the BN layer, the output has zero mean and unit standard deviation.

This is how we calculate BN during training. During training, we also keep exponential running averages of $\mu$ and $\sigma^2$, so that during evaluation we can use these averages as the batch statistics. Batch Norm is usually applied before the activation.
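A small sketch of this training-time computation for one minibatch. The function name, the `momentum` value, and the `eps` term (added for numerical stability, not part of the formula above) are assumptions for illustration:

```python
import torch

def batchnorm_train(u, running_mean, running_var, momentum=0.1, eps=1e-5):
    # u: (B, num_neurons) pre-activations of one layer for one minibatch.
    mu = u.mean(dim=0)                        # per-neuron mean over the batch
    var = u.var(dim=0, unbiased=False)        # per-neuron variance over the batch
    u_hat = (u - mu) / torch.sqrt(var + eps)  # zero mean, unit std over the batch

    # Exponential running averages of mu and sigma^2, used in place of
    # the batch statistics at evaluation time.
    running_mean.mul_(1 - momentum).add_(momentum * mu)
    running_var.mul_(1 - momentum).add_(momentum * var)
    return u_hat
```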

There are two problems with Batch Norm:

  1. We effectively reduce the representational capacity of our model by constraining the mean to be exactly 0 and the standard deviation to be exactly 1. Note that previously each neuron could have a different mean and standard deviation, but now they all have the same.

    We try to solve this problem by adding an affine term that assigns a new, possibly non-zero mean $\beta$ and a new, possibly non-unit standard deviation $\gamma$ to this neuron's activation values on the batch. Here $\gamma$ and $\beta$ are learnable parameters. (Note that $\mu, \sigma, \gamma, \beta$ are all scalars here; a short sketch using PyTorch's built-in layer follows this list.)

    $$BN(u) = \frac{u - \mu}{\sigma} \gamma + \beta$$

    However, doesn't this defeat the purpose of normalizing in the first place? Even so, the equation above is still the common practice, though we do not fully understand why it works.

  2. All previous proofs depend on the fact that the gradient computed for each sample in the minibatch is independent of the other samples. By introducing the batch mean $\mu$ and standard deviation $\sigma$, we make the per-sample updates depend on one another, since the mean and standard deviation depend on all samples in the batch. This nullifies the proofs we did previously that ensure good properties of SGD. Despite this, it performs well in practice and people use it.
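For reference, this is how the learnable $\gamma$ and $\beta$ from point 1 appear in PyTorch's built-in layer ($\gamma$ is stored as `weight`, $\beta$ as `bias`), and how `.train()` / `.eval()` switch between batch statistics and the running averages:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=16)    # one gamma and one beta per neuron/feature
print(bn.weight.shape, bn.bias.shape)   # gamma (initialized to 1) and beta (initialized to 0)

x = torch.randn(32, 16)

bn.train()   # normalize with batch statistics; update running_mean / running_var
_ = bn(x)

bn.eval()    # normalize with the stored running averages instead
_ = bn(x)
```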

Residual Block / Skip Connection

```
L0 ---- L1 ---- sigma1 ---- L2 ---- sigma2 --+--
     |                                       |
     -----------------------------------------
```

The L blocks here are computation layers and sigma is the non-linear activation layer.

So why do we add this skip connection? Imagine L2 does something that prevents learning (say all of its weights go to 0, so during backpropagation the gradients flowing through it to earlier layers also go to 0). By adding the L0 value, we ensure that even if L2 does not work, the whole network can still work: in this particular example, the gradient through the L0 branch can remain non-zero even when everything in between has gradient 0.

The classic ResNet employs this architecture:

```
L0 ---- conv1 ---- BN1 ---- sigma1 ---- conv2 ---- BN2 --+-- sigma2
     |                                                   |
     -----------------------------------------------------
```
```python
# From https://github.com/akamaster/pytorch_resnet_cifar10/blob/master/resnet.py, class BasicBlock
def forward(self, x):
    out = F.relu(self.bn1(self.conv1(x)))  # conv1 -> BN1 -> sigma1
    out = self.bn2(self.conv2(out))        # conv2 -> BN2
    out += self.shortcut(x)                # add the skip connection from L0
    out = F.relu(out)                      # sigma2
    return out
```

People don't usually relate this to an RNN, because an RNN takes in a sequence of inputs and each token is passed through the same weights / the same model.