We borrow the parameter-count results for decoder-only transformer models from Section 2.1 of OpenAI's paper Scaling Laws for Neural Language Models.
We use the following notation:
- $L$ = number of layers of transformer blocks ($N$ in Attention is All You Need)
- $d_{model}$ = dimension of the input & output of a transformer block, also the output of the text encoder and the input of the decoder
- $d_{ff}$ = dimension of the feed-forward network's inner layer. We define the feed-forward network as `fc1 = fc(d_model, d_ff)`, `fc2 = fc(d_ff, d_model)`
- $d_{attn}$ = dimension of the multi-head attention output. (In Attention is All You Need there are $h$ heads; queries and keys have dimension $d_k$, values have dimension $d_v$, and in practice we usually have $d_k = d_v$. The $d_{attn}$ used here is defined as $d_k \times h$.)
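To make the notation concrete, here is a minimal PyTorch sketch (the variable names are my own, not from either paper) of the weight matrices counted in the table below; biases are omitted because the table only counts the weight matrices:

```python
import torch.nn as nn

d_model, d_attn, d_ff = 512, 512, 512 * 4

# Q/K/V projections, written with all heads fused: each has shape (d_model, d_attn).
w_q = nn.Linear(d_model, d_attn, bias=False)
w_k = nn.Linear(d_model, d_attn, bias=False)
w_v = nn.Linear(d_model, d_attn, bias=False)
# Output projection W^O from the concatenated heads back to d_model.
w_o = nn.Linear(d_attn, d_model, bias=False)
# Feed-forward network as defined above: fc1 expands to d_ff, fc2 maps back to d_model.
fc1 = nn.Linear(d_model, d_ff, bias=False)
fc2 = nn.Linear(d_ff, d_model, bias=False)

per_block = sum(p.numel() for m in (w_q, w_k, w_v, w_o, fc1, fc2) for p in m.parameters())
assert per_block == 2 * d_model * (2 * d_attn + d_ff)  # matches the per-layer total in the table below
```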
Part | Parameters | Explanation |
---|---|---|
Embed | $n_{vocab} \times d_{model} + n_{ctx} \times d_{model}$ | One word embedding matrix (mapping each token to its embedding) and one positional embedding matrix |
Attention: Q, K, V Matrices | $3 L d_{model} d_{attn}$ | $W^Q$ has shape $(d_{model}, d_{attn})$; there are also $W^K$ and $W^V$ |
Attention: Multi-head Projection | $L d_{attn} d_{model}$ | After we concatenate the outputs from all heads, there is one projection from the all-head output to the final output. This is that matrix; it was defined as $W^O$ in Attention is All You Need Section 3.2.2. |
Feedforward Network | $2 L d_{model} d_{ff}$ | Explained in the definition of $d_{ff}$ above. |
Total (Non-Embedding) | $2 L d_{model} (2 d_{attn} + d_{ff})$ | Sum of the three per-layer terms above. |
If we have the standard $d_{attn} = d_{model} = d_{ff}/4$, the non-embedding parameter count simplifies to $N = 12 L d_{model}^2$.
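As a quick sanity check on that simplification, here is the closed form in plain Python (the function name is mine):

```python
def non_embedding_params(L, d_model, d_attn, d_ff):
    # 3 L d_model d_attn (Q, K, V) + L d_attn d_model (W^O) + 2 L d_model d_ff (FFN)
    return 2 * L * d_model * (2 * d_attn + d_ff)

# Standard shape: d_attn = d_model and d_ff = 4 * d_model  =>  N = 12 L d_model^2
L, d_model = 6, 512
assert non_embedding_params(L, d_model, d_model, 4 * d_model) == 12 * L * d_model ** 2
```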
Putting this into practice, let's make a rough estimate of the number of parameters the vanilla transformer has. Per Table 3 of Attention is All You Need, the vanilla transformer base has $L = 6$, $d_{model} = 512$, $d_{ff} = 2048$, $d_{attn} = h \times d_k = 8 \times 64 = 512$, and $n_{vocab} = 37000$. I didn't find info about $n_{ctx}$, but it is probably 512.
Note that, unlike OpenAI's favorite decoder-only transformers, the vanilla transformer has an encoder-decoder architecture, and each decoder block has an additional (cross-)attention block. Therefore, the encoder has $2 L d_{model} (2 d_{attn} + d_{ff})$ parameters in total, the decoder has $2 L d_{model} (4 d_{attn} + d_{ff})$, and the embedding part has $n_{vocab} \times d_{model} + n_{ctx} \times d_{model}$. The final result is $\sim 63 \times 10^6$. I tried hard to figure out where my count went off from the paper's $65 \times 10^6$ but had no luck; adding the LayerNorm parameters still didn't even out the numbers. But it's close enough, so I'll call it a day.
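For reference, here is that arithmetic as a short script (variable names are mine; like the table above, it treats the positional embedding as a learned $n_{ctx} \times d_{model}$ matrix, and $n_{ctx} = 512$ is an assumption):

```python
L, d_model, d_ff, d_attn = 6, 512, 2048, 512
n_vocab, n_ctx = 37_000, 512  # n_ctx = 512 is a guess; the paper doesn't state it

encoder = 2 * L * d_model * (2 * d_attn + d_ff)  # self-attention + FFN per layer
decoder = 2 * L * d_model * (4 * d_attn + d_ff)  # extra cross-attention per layer
embed = n_vocab * d_model + n_ctx * d_model      # word + positional embeddings

print(encoder, decoder, embed, encoder + decoder + embed)
# 18874368 25165824 19206144 63246336  ->  ~63e6, vs. the paper's ~65e6
```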