Yao Lirong's Blog

Mixed-Precision Training

2024/03/01

Mixed-precision training was introduced in joint research by Nvidia and Baidu. The blog post from Nvidia gives a nice summary of how it's done and why it works. Nvidia also covers the same points in more depth in their tutorial on training with mixed precision.

I decided to learn this as I was reading nanoGPT’s code:

torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn

Benefits

  • Decrease the required amount of memory: FP32 -> FP16 (see the quick check after this list)
  • Shorten the training or inference time:
    • memory bandwidth: half precision halves the number of bytes that need to be accessed, reducing the time spent in memory-limited operations
    • arithmetic bandwidth: half-precision arithmetic is inherently faster than single-precision arithmetic on modern GPUs
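
As a quick check of the memory claim, here is a minimal sketch (assuming PyTorch is available) that compares how many bytes the same tensor occupies in FP32 and in FP16:

import torch

x_fp32 = torch.randn(1024, 1024)        # default dtype is torch.float32
x_fp16 = x_fp32.half()                  # cast to torch.float16

bytes_fp32 = x_fp32.element_size() * x_fp32.nelement()  # 4 bytes per element
bytes_fp16 = x_fp16.element_size() * x_fp16.nelement()  # 2 bytes per element
print(bytes_fp32, bytes_fp16)           # 4194304 vs 2097152: memory is halved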

3 Techniques in Original Paper

  • Accumulation into FP32: FP16 products are accumulated into FP32, then the result is converted to FP16 for storage

  • Loss Scaling (Gradient Scaling): There are four types of tensors encountered when training DNNs: activations, activation gradients, weights, and weight gradients. In practice activations, weights, and weight gradients can be represented in half precision. However, for some networks the activation gradients are too small for the half-precision range and underflow to zero.

    Therefore, we need to scale up the activation gradients. This can be done by simply multiplying the training loss by a scale factor. This adds just a single multiplication, and by the chain rule it ensures that all the gradients are scaled up by the same factor at no additional cost (see the PyTorch sketch after this list).

  • FP32 Master Copy of Weights: Weight gradient magnitudes are much smaller than the corresponding weights, especially after multiplication by the learning rate, so in FP16 the update can round away and no update takes place.

    The remedy is to keep a master copy of the weights in single precision but do the forward and backward computation in half precision, updating this master copy at the end of each iteration.
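
In PyTorch, the last two techniques come packaged in torch.cuda.amp: autocast runs the forward pass in FP16 where it is safe while the optimizer keeps updating the FP32 master weights, and GradScaler implements loss scaling. Below is a minimal sketch of a training loop; model, optimizer, and loader are placeholders for your own objects:

import torch

scaler = torch.cuda.amp.GradScaler()        # handles loss (gradient) scaling

for inputs, targets in loader:              # placeholder data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # forward pass in FP16 where safe
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)
    scaler.scale(loss).backward()           # multiply the loss by the scale factor
    scaler.step(optimizer)                  # unscale gradients, skip the step on inf/nan
    scaler.update()                         # adjust the scale factor dynamically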

More Recent Update

In the NVIDIA Ampere GPU architecture, Nvidia introduced TensorFloat32 (TF32), which keeps the FP32 range (8-bit exponent) and the FP16 precision (10-bit mantissa). With the additional sign bit, it is a 19-bit representation of floats.

On an A100, TF32 brings an 8x speedup compared to FP32, while FP16/BF16 brings a 16x speedup. Therefore, mixed-precision training with a native 16-bit format (FP16/BF16) is still the fastest option.

TF32 is exposed only as a Tensor Core operation mode, not as a data type. Internally, all storage in memory and all other operations remain entirely in FP32; only convolutions and matrix multiplications convert their inputs to TF32 right before the multiplication. Therefore, it provides neither the memory benefits nor the native arithmetic speedup of the 16-bit formats. Its benefit is that it brings Tensor Core acceleration to single-precision DL workloads without requiring any changes to model scripts.

It needs to be noted that TF32 gives less accurate results than full FP32 computation. Therefore, PyTorch set torch.backends.cuda.matmul.allow_tf32 = False by default starting from 1.12. Read more in PyTorch's official comparison of speed and numerical stability.
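
As an illustration of this accuracy trade-off (a sketch, not PyTorch's official benchmark; it assumes a CUDA GPU with Ampere or newer), one can compare an FP32 matmul against an FP64 reference with TF32 off and on:

import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
ref = a.double() @ b.double()                  # FP64 reference result

torch.backends.cuda.matmul.allow_tf32 = False  # strict FP32 matmul
err_fp32 = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 Tensor Core matmul
err_tf32 = (a @ b - ref).abs().max().item()

print(err_fp32, err_tf32)                      # the TF32 error is noticeably larger

Recent PyTorch versions also expose torch.set_float32_matmul_precision, a higher-level knob for the same trade-off.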

Best Practices

PyTorch offers developers some official best-practice tips. Please do check them out here.
