We borrow the results for decoder-only transformer models from OpenAI's paper [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361), Section 2.1.
We use the following notation:

- $L$ = number of layers of transformer blocks ($N$ in [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf))
- $d_{model}$ = dimension of the input and output of a transformer block; also the output of the text encoder and the input of the decoder
- $d_{ff}$ = dimension of the feed-forward network's bottleneck. We define the feed-forward network as `fc1 = fc(d_model, d_ff), fc2 = fc(d_ff, d_model)` (see the sketch after this list)
- $d_{attn}$ = dimension of the multi-head attention output. In [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) there are $h$ heads, queries and keys have dimension $d_k$, and values have dimension $d_v$; in practice we usually have $d_k = d_v$. The $d_{attn}$ used here is defined as $d_k \times h$
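To make the notation concrete, here is a minimal sketch of the weight matrices inside a single transformer block, assuming PyTorch `nn.Linear` layers without bias terms (biases and LayerNorm are ignored throughout, as in the counts below); summing their sizes gives the per-layer non-embedding count $2 d_{model}(2 d_{attn} + d_{ff})$ that appears in the table below.

```python
# A minimal sketch (my assumption: PyTorch, bias-free Linear layers, all heads
# folded into one matrix per projection) of the weights counted per block.
import torch.nn as nn

d_model, d_attn, d_ff = 512, 512, 2048  # transformer-base sizes

block = nn.ModuleDict({
    # Attention: Q, K, V matrices, each of shape (d_model, d_attn)
    "w_q": nn.Linear(d_model, d_attn, bias=False),
    "w_k": nn.Linear(d_model, d_attn, bias=False),
    "w_v": nn.Linear(d_model, d_attn, bias=False),
    # Attention: multi-head projection W^O, shape (d_attn, d_model)
    "w_o": nn.Linear(d_attn, d_model, bias=False),
    # Feed-forward network: fc1 = fc(d_model, d_ff), fc2 = fc(d_ff, d_model)
    "fc1": nn.Linear(d_model, d_ff, bias=False),
    "fc2": nn.Linear(d_ff, d_model, bias=False),
})

n_per_layer = sum(p.numel() for p in block.parameters())
assert n_per_layer == 2 * d_model * (2 * d_attn + d_ff)
print(n_per_layer)  # 3,145,728 per layer for the base config
```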
| Part | Parameters | Explanation |
|------|------------|-------------|
| Embed | $n_{vocab} \times d_{model} + n_{ctx} \times d_{model}$ | One word-embedding matrix (mapping each token to its embedding) and one positional-embedding matrix |
| Attention: Q, K, V matrices | $L \cdot 3 d_{model} d_{attn}$ | $W^Q$ has shape $(d_{model}, d_{attn})$; there are also $W^K$ and $W^V$ |
| Attention: multi-head projection | $L \cdot d_{attn} d_{model}$ | After we concatenate the outputs of all heads, one projection maps the concatenated output to the final output. This is that matrix, defined as $W^O$ in [Attention Is All You Need, Section 3.2.2](https://arxiv.org/pdf/1706.03762.pdf) |
| Feed-forward network | $L \cdot 2 d_{model} d_{ff}$ | Explained in the definition of $d_{ff}$ above |
| Total (non-embedding) | $2 L d_{model} (2 d_{attn} + d_{ff})$ | |

If we have the standard $d_{attn} = d_{model} = d_{ff}/4$, we get $N = 12 L d_{model}^2$ non-embedding parameters.

To put this into practice, let's calculate a rough estimate of the number of parameters in the vanilla transformer. For the transformer base, per [Attention Is All You Need, Table 3](https://arxiv.org/pdf/1706.03762.pdf), $L = 6$, $d_{model} = 512$, $d_{ff} = 2048$, $d_{attn} = h \times d_k = 8 \times 64 = 512$, and $n_{vocab} = 37000$. I didn't find any info about $n_{ctx}$, but it is probably 512.

Note that, unlike OpenAI's favorite decoder-only transformer, the vanilla transformer has an encoder-decoder architecture, and each decoder block has an additional attention block. Therefore, the encoder has a total of $2 L d_{model}(2 d_{attn} + d_{ff})$ parameters, the decoder has a total of $2 L d_{model}(4 d_{attn} + d_{ff})$ parameters, and the embedding part has a total of $n_{vocab} \times d_{model} + n_{ctx} \times d_{model}$ parameters. The final result is $\sim 63 \times 10^6$. I tried hard to figure out where I went off from the paper's $65 \times 10^6$ but had no luck; adding the LayerNorm parameters still didn't even out the numbers. But it's close enough, so I'll call it a day.
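As a sanity check on the arithmetic above, here is the same estimate as a tiny script (assuming a learned positional embedding of length $n_{ctx} = 512$, and again ignoring biases and LayerNorm):

```python
# Rough parameter count for the vanilla transformer base (encoder-decoder),
# using the formulas above; biases and LayerNorm are ignored.
L, d_model, d_ff, d_attn = 6, 512, 2048, 512
n_vocab, n_ctx = 37000, 512  # n_ctx = 512 is my guess, as noted above

embed   = n_vocab * d_model + n_ctx * d_model    # token + positional embeddings
encoder = 2 * L * d_model * (2 * d_attn + d_ff)  # one attention block per layer
decoder = 2 * L * d_model * (4 * d_attn + d_ff)  # self- plus cross-attention per layer

total = embed + encoder + decoder
print(f"{total:,}")  # 63,246,336, i.e. ~63e6 vs. the paper's 65e6
```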