Mixed-precision training was introduced in [Nvidia and Baidu's research](https://arxiv.org/abs/1710.03740). [The blog post from Nvidia](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/) gives a nice summary of how it's done and why it works. Nvidia also covers the same points in more depth in their tutorial on [training with mixed precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html).
I decided to learn this as I was reading [nanoGPT's code](https://github.com/karpathy/nanoGPT/blob/325be85d9be8c81b436728a420e85796c57dba7e/sample.py#L28-L29):

```python
torch.backends.cuda.matmul.allow_tf32 = True  # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True  # allow tf32 on cudnn
```

## Benefits

- Decrease the required amount of memory: FP32 -> FP16.
- Shorten training or inference time:
  - Memory bandwidth: half precision halves the number of bytes that need to be accessed, reducing the time spent in memory-limited operations.
  - Arithmetic bandwidth: half-precision arithmetic is inherently faster than single-precision arithmetic.

## 3 Techniques in the Original Paper

- **Accumulation into FP32**: accumulate products in FP32, then convert the result to FP16 for storage.

- **Loss scaling (gradient scaling)**: There are four types of tensors encountered when training DNNs: activations, activation gradients, weights, and weight gradients. In practice, activations, weights, and weight gradients can be represented in half precision. However, for some networks the activation gradients are too small to be represented in the half-precision range and underflow to zero.

  The remedy is to scale up the activation gradients. This can be done by simply multiplying the training loss by a scale factor: it adds just a single multiplication, and by the chain rule it ensures that all gradients are scaled up at no additional cost. The gradients are unscaled again before the weight update.

- **FP32 master copy of weights**: Weight gradient magnitudes are much smaller than the corresponding weights, especially after multiplication with the learning rate, so in FP16 some updates simply never take place.

  The remedy is to store the weights in single precision but do the computation in half precision, and to update this master copy of the weights after each iteration. (A minimal PyTorch training loop that applies both loss scaling and an FP32 master copy is shown at the end of this post.)

## More Recent Update

In the NVIDIA Ampere GPU architecture, Nvidia introduced [TensorFloat32 (TF32)](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/) with FP32 range (8-bit exponent) and FP16 precision (10-bit mantissa). With the additional sign bit, it is a novel 19-bit representation of floats.

On an A100, it brings an 8x speedup compared to FP32, while FP16/BF16 brings a 16x speedup. Therefore, mixed-precision training with a native 16-bit format (FP16/BF16) is still the fastest option.

TF32 is only exposed as a Tensor Core operation mode, not a type. Internally, all storage in memory and all other operations remain entirely in FP32; only convolutions and matrix multiplications convert their inputs to TF32 right before the multiplication. Therefore, it does not provide the memory benefits or the native arithmetic speedup brought by a 16-bit format. Its benefit is that it brings Tensor Core acceleration to single-precision DL workloads without requiring any changes to model scripts.

It needs to be noted that TF32 gives less accurate computation results.
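On an Ampere (or newer) GPU you can see this for yourself by comparing a large FP32 matrix multiplication, with and without TF32, against an FP64 reference. The snippet below is only an illustrative sketch: the matrix size is arbitrary and a CUDA device with TF32 support is assumed.

```python
import torch

torch.manual_seed(0)
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
ref = a.double() @ b.double()  # FP64 reference result

torch.backends.cuda.matmul.allow_tf32 = False  # plain FP32 matmul
err_fp32 = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 Tensor Core matmul
err_tf32 = (a @ b - ref).abs().max().item()

print(f"max abs error vs FP64 -- FP32: {err_fp32:.6f}, TF32: {err_tf32:.6f}")
```

On Ampere hardware the TF32 error comes out noticeably larger, which is the trade-off for the Tensor Core speedup.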
Because of this accuracy difference, PyTorch decided to set `torch.backends.cuda.matmul.allow_tf32` to `False` by default starting from [1.12](https://github.com/huggingface/transformers/issues/16588). Read more in [PyTorch's official comparison](https://pytorch.org/docs/stable/notes/cuda.html#tf32-on-ampere) of speed and numerical stability.

## Best Practices

PyTorch gave developers some [official best-practices tips](https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/#best-practices). Please do check them out.
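As a rough sketch of what those recommendations look like in practice, here is a minimal training loop using `torch.autocast` together with `torch.cuda.amp.GradScaler`. The weights and optimizer state stay in FP32 (the FP32 master copy from the paper), autocast runs eligible ops in FP16, and `GradScaler` performs dynamic loss scaling. The model, data, and hyperparameters are made up for illustration, and a CUDA device is assumed.

```python
import torch
import torch.nn.functional as F

device = "cuda"
model = torch.nn.Linear(128, 10).to(device)      # weights stay in FP32 (master copy)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()             # dynamic loss scaling

for step in range(100):
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad(set_to_none=True)

    # Forward pass: eligible ops run in FP16, precision-sensitive ops stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = F.cross_entropy(logits, y)

    scaler.scale(loss).backward()   # scale the loss, then backprop
    scaler.step(optimizer)          # unscale gradients; skip the step if they overflowed
    scaler.update()                 # adjust the scale factor for the next iteration
```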