The paper Decoupled Weight Decay Regularization mainly introduces AdamW, which has been a de-facto standard optimizer ever since. It investigates why Adam with L2 regularization sometimes performs worse than SGD with L2 regularization. It demonstrates that weight decay and L2 regularization, two things people usually treat as identical, are only equivalent for plain SGD and diverge for adaptive methods like Adam. And it shows that decoupled weight decay is the better choice.
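To make the distinction concrete, here is a minimal single-tensor sketch (not the paper's reference implementation; bias correction is omitted and names like `lr`, `wd`, `m`, `v` are just illustrative) contrasting Adam with an L2 penalty against AdamW's decoupled decay:

```python
import torch

def adam_l2_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 regularization: the penalty is folded into the gradient,
    so it gets rescaled by the adaptive denominator like everything else."""
    grad = grad + wd * w                       # L2 term enters the gradient
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    w -= lr * m / (v.sqrt() + eps)
    return w

def adamw_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: weight decay is decoupled from the gradient and applied
    directly to the weights, untouched by the adaptive scaling."""
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    w -= lr * m / (v.sqrt() + eps)
    w -= lr * wd * w                           # decoupled decay
    return w
```

The only difference is where the `wd * w` term lands, but that is exactly what breaks the equivalence for Adam.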
Mixed-precision training was introduced in joint research from Nvidia and Baidu. Nvidia's blog post gives a nice summary of how it is done and why it works, and their tutorial on training with mixed precision covers the same points in more depth.
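In PyTorch this usually boils down to the autocast-plus-loss-scaling recipe below; a minimal sketch where the model, optimizer, and `loader` are placeholder stand-ins:

```python
import torch

# Placeholder model / optimizer; only the AMP plumbing matters here.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling

for inputs, targets in loader:                # `loader` is assumed to exist
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # ops run in FP16/FP32 as appropriate
        outputs = model(inputs.cuda())
        loss = torch.nn.functional.cross_entropy(outputs, targets.cuda())
    scaler.scale(loss).backward()             # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                    # unscale grads, skip step on inf/nan
    scaler.update()                           # adjust the scale factor
```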
We borrow the parameter and compute estimates for decoder-only Transformer models from Section 2.1 of OpenAI's paper Scaling Laws for Neural Language Models.
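As a quick sketch of those counting rules (assuming the standard settings d_attn = d_model and d_ff = 4·d_model, and ignoring embeddings and biases):

```python
def transformer_params(n_layer, d_model, d_attn=None, d_ff=None):
    """Approximate non-embedding parameter count from Scaling Laws Sec. 2.1:
    N ~= 2 * d_model * n_layer * (2 * d_attn + d_ff)."""
    d_attn = d_attn if d_attn is not None else d_model    # standard choice
    d_ff = d_ff if d_ff is not None else 4 * d_model      # standard choice
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

def train_flops_per_token(n_params):
    """C ~= 6N FLOPs per training token (forward + backward)."""
    return 6 * n_params

# With the standard settings this reduces to 12 * n_layer * d_model**2,
# e.g. 48 layers at d_model = 1600 gives roughly 1.5B non-embedding parameters.
print(transformer_params(n_layer=48, d_model=1600))   # 1474560000
```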
The official PyTorch documentation explains this concept only briefly; we go into more detail here.
I realized long ago that I needed a more official, more academic homepage in addition to a personal blog, but I didn't find the time to set one up until I started working. After this switch, I have my personal homepage built with al-folio on Jekyll, and my blog still running on archer in Hexo.
This blog post is adapted from the wonderful work of Christopher Olah, ex-OpenAI researcher and Anthropic co-founder. I removed the parts that are common knowledge for a CS student and added some of my own notes & explanations.
The article A Guide to Parameter-efficient Fine-tuning (PEFT) gives a very good summary with nice drawings. Its explanation differs from the original paper in a few places, but the basic architecture is all there.
I was trying out sampling in the MMM music generation model today and ran into the problem described in this issue I opened. I have no experience writing C in Python with ctypes, so I figured: why not ask the magic conch shell, ChatGPT?
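For context, calling into C from Python with ctypes looks roughly like the sketch below; it uses the standard C library rather than anything from MMM, and the `find_library` lookup is just one common way to locate it:

```python
import ctypes
import ctypes.util

# Load the standard C library (name resolution differs across platforms).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare strlen's signature so ctypes converts arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello ctypes"))   # -> 12
```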