<p>The paper Decoupled Weight Decay Regularization introduces AdamW, which has remained the state-of-the-art optimizer since then. It investigates why Adam with L2 regularization sometimes performs worse than SGD with L2 regularization. It demonstrates that weight decay and L2 regularization, two things people usually treat as interchangeable, are not the same, and it shows that decoupled weight decay is the better choice.</p>
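<p>To make the difference concrete, here is a minimal sketch of a single update step (my own simplification, not the paper's pseudocode): with L2 regularization the penalty gradient passes through Adam's adaptive per-parameter scaling, while decoupled weight decay shrinks the weight directly. The toy gradient and the <code>adam_step</code> helper are illustrative stand-ins.</p>

<pre>import torch

# Minimal sketch of one update step; adam_step and the toy gradient are
# simplified stand-ins, not the paper's exact pseudocode.
def adam_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Plain Adam update on whatever gradient g it is given."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    return w - lr * m_hat / (v_hat.sqrt() + eps), m, v

w = torch.randn(4)
m, v = torch.zeros(4), torch.zeros(4)
g = w.clone()              # pretend gradient of the loss
lr, wd, t = 1e-3, 0.1, 1

# Adam + L2 regularization: the penalty wd * w is folded into the gradient,
# so it gets rescaled by Adam's per-parameter denominator like everything else.
w_l2, _, _ = adam_step(w, g + wd * w, m, v, t, lr=lr)

# AdamW (decoupled weight decay): the gradient goes through Adam untouched
# and the decay shrinks the weight directly, outside the adaptive scaling.
w_adam, _, _ = adam_step(w, g, m, v, t, lr=lr)
w_adamw = w_adam - lr * wd * w
</pre>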

<p>Weight decay and L2 regularization are equivalent in SGD when the L2 coefficient is set to $\lambda' = \frac{\lambda}{\alpha}$ (where $\alpha$ is the learning rate and $\lambda$ the weight decay coefficient), which is the common practice. The situation is more complicated with adaptive gradient algorithms like Adam: Adam performs much better with decoupled weight decay, and the authors propose AdamW (Adam with decoupled weight decay) as the new state-of-the-art optimizer. All the conclusions and main findings can be found in the first two pages of the paper, mostly in the Introduction section. I did not read the math.</p>

<p><a href="https://www.fast.ai/posts/2018-07-02-adam-weight-decay.html">This blog post from fast.ai</a> demonstrates how the two methods differ in code, which is a bit easier to follow than the paper, since the paper doesn't provide such a comparison.</p>

<h2 id="weight-decay-in-transformers">Weight Decay in Transformers</h2>

<p>AdamW is the go-to optimizer for LLMs these days. Researchers chose it because LLMs are hard to train and rarely overfit, and Adam is the best choice when convergence speed matters (<a href="https://www.zhihu.com/question/519307910/answer/2384626354">reference</a>). People have also found that AdamW usually performs best with a large weight decay coefficient such as 0.05 or 0.1 (<a href="https://www.zhihu.com/question/536185388">Zhihu question</a>, <a href="https://arxiv.org/abs/2010.11929">ViT paper: Training &amp; Fine-tuning section</a>).</p>

<p>When we apply weight decay in transformers, we apply it to all layers except LayerNorm layers and biases.</p>

<p>In <a href="https://github.com/karpathy/nanoGPT/blob/325be85d9be8c81b436728a420e85796c57dba7e/model.py#L268-L271">nanoGPT</a>, Karpathy filters them out using:</p>

<pre># (param_dict maps parameter names to the model's parameters that require grad)
# create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
# i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0}
]
</pre>
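<p>For completeness, roughly the step that follows in nanoGPT: the groups are handed to <code>torch.optim.AdamW</code>, and each group's <code>weight_decay</code> entry overrides the optimizer-wide default, so only the 2D tensors are decayed. The learning rate and betas below are placeholder values.</p>

<pre># Roughly the step that follows (lr/betas are placeholders; weight_decay
# above would typically be set to a large value such as 0.1).
optimizer = torch.optim.AdamW(optim_groups, lr=3e-4, betas=(0.9, 0.95))
</pre>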
<p>One caveat is that, <a href="https://github.com/karpathy/nanoGPT/commit/7fe4a099ad2a4654f96a51c0736ecf347149c34c#diff-fada037ad086638e65c7ae77e3d223963e9afaa26326aab0ea718f4013176e43L282">in earlier versions</a>, Karpathy did NOT weight decay embeddings:</p>
<pre>blacklist_weight_modules = (torch.nn.LayerNorm, LayerNorm, torch.nn.Embedding)

elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
    # weights of blacklist modules will NOT be weight decayed
    no_decay.add(fpn)
</pre>
<p>I couldn't find clear guidance on whether embeddings should be decayed when training a transformer, but <a href="https://github.com/huggingface/transformers/blob/66ce9593fdb8e340df546ddd0774eb444f17a12c/src/transformers/trainer.py#L979-L988">Hugging Face's Transformers implementation</a> also decays embeddings, in line with Karpathy's latest implementation:</p>
<pre># get_parameter_names(model, ALL_LAYERNORM_LAYERS) returns the names of all
# parameters that are not inside a LayerNorm module
decay_parameters = get_parameter_names(model, ALL_LAYERNORM_LAYERS)
decay_parameters = [name for name in decay_parameters if "bias" not in name]
</pre>
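<p>The surviving names are then used to build two parameter groups, much like the nanoGPT version. Below is a paraphrased sketch rather than the exact Trainer code; <code>model</code> and <code>args.weight_decay</code> are stand-ins.</p>

<pre># Paraphrased sketch: parameters whose names survived the filter get
# weight decay, everything else (LayerNorm weights and biases) gets 0.0.
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if n in decay_parameters and p.requires_grad],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if n not in decay_parameters and p.requires_grad],
        "weight_decay": 0.0,
    },
]
</pre>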