The paper Decoupled Weight Decay Regularization mainly introduces AdamW, which has been a de-facto standard optimizer ever since. It investigates why Adam with L2 regularization sometimes performs worse than SGD with L2 regularization. It demonstrates that weight decay and L2 regularization, two things people usually treat as identical, are only equivalent for plain SGD and diverge for adaptive methods like Adam. And it shows that decoupled weight decay is the better choice.
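To make the distinction concrete, here is a minimal single-tensor sketch (not the paper's reference implementation; bias correction is omitted and names like `lr`, `wd`, `m`, `v` are just illustrative) contrasting Adam with an L2 penalty against AdamW's decoupled decay:

```python
import torch

def adam_l2_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 regularization: the penalty is folded into the gradient,
    so it gets rescaled by the adaptive denominator like everything else."""
    grad = grad + wd * w                       # L2 term enters the gradient
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    w -= lr * m / (v.sqrt() + eps)
    return w

def adamw_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: weight decay is decoupled from the gradient and applied
    directly to the weights, untouched by the adaptive scaling."""
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    w -= lr * m / (v.sqrt() + eps)
    w -= lr * wd * w                           # decoupled decay
    return w
```

The only difference is where the `wd * w` term lands, but that is exactly what breaks the equivalence for Adam.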
Mixed-precision training was introduced in joint research from Nvidia and Baidu. Nvidia's blog post gives a nice summary of how it is done and why it works, and their tutorial on training with mixed precision covers the same points in more depth.
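In PyTorch this usually boils down to the autocast-plus-loss-scaling recipe below; a minimal sketch where the model, optimizer, and `loader` are placeholder stand-ins:

```python
import torch

# Placeholder model / optimizer; only the AMP plumbing matters here.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling

for inputs, targets in loader:                # `loader` is assumed to exist
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # ops run in FP16/FP32 as appropriate
        outputs = model(inputs.cuda())
        loss = torch.nn.functional.cross_entropy(outputs, targets.cuda())
    scaler.scale(loss).backward()             # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                    # unscale grads, skip step on inf/nan
    scaler.update()                           # adjust the scale factor
```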
We borrow the parameter and compute estimates for decoder-only Transformer models from Section 2.1 of OpenAI's paper Scaling Laws for Neural Language Models.
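As a quick sketch of those counting rules (assuming the standard settings d_attn = d_model and d_ff = 4·d_model, and ignoring embeddings and biases):

```python
def transformer_params(n_layer, d_model, d_attn=None, d_ff=None):
    """Approximate non-embedding parameter count from Scaling Laws Sec. 2.1:
    N ~= 2 * d_model * n_layer * (2 * d_attn + d_ff)."""
    d_attn = d_attn if d_attn is not None else d_model    # standard choice
    d_ff = d_ff if d_ff is not None else 4 * d_model      # standard choice
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

def train_flops_per_token(n_params):
    """C ~= 6N FLOPs per training token (forward + backward)."""
    return 6 * n_params

# With the standard settings this reduces to 12 * n_layer * d_model**2,
# e.g. 48 layers at d_model = 1600 gives roughly 1.5B non-embedding parameters.
print(transformer_params(n_layer=48, d_model=1600))   # 1474560000
```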
The official PyTorch documentation explains this concept only briefly; we go into more detail here.
I realized long ago that I needed a more official, more academic homepage in addition to a personal blog, but I didn't find the time to set one up until I started working. After this switch, I have my personal homepage built with al-folio on Jekyll, and my blog still running on archer in Hexo.
This blog post is adapted from the wonderful work of Christopher Olah, ex-OpenAI researcher and Anthropic co-founder. I removed the parts that are common knowledge for a CS student and added some of my own notes & explanations.
The article A Guide to Parameter-efficient Fine-tuning (PEFT) gives a very good summary with nice drawings. Its explanation differs from the original paper in a few places, but the basic architecture is all there.
I was trying out sampling in the MMM music generation model today and ran into the problem described in this issue I opened. I have no experience writing C in Python with ctypes, so I figured: why not ask the magic conch shell, ChatGPT?
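For context, calling into C from Python with ctypes looks roughly like the sketch below; it uses the standard C library rather than anything from MMM, and the `find_library` lookup is just one common way to locate it:

```python
import ctypes
import ctypes.util

# Load the standard C library (name resolution differs across platforms).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare strlen's signature so ctypes converts arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello ctypes"))   # -> 12
```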