We borrow the results for decoder-only transformer models from OpenAI's paper [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361), Section 2.1.
We use the following notation:

- $L$ = number of layers of transformer blocks ($N$ in [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf))
- $d_{model}$ = dimension of the input and output of a transformer block; also the output of the text encoder and the input of the decoder
- $d_{ff}$ = dimension of the feed-forward network's bottleneck. We define the feed-forward network as `fc1 = fc(d_model, d_ff), fc2 = fc(d_ff, d_model)` (see the sketch after this list)
- $d_{attn}$ = dimension of the multi-head attention output. In [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) there are $h$ heads, queries and keys have dimension $d_k$, and values have dimension $d_v$; in practice we usually have $d_k = d_v$. The $d_{attn}$ used here is defined as $d_k \times h$
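To make the notation concrete, here is a minimal sketch of the weight matrices inside a single transformer block, assuming PyTorch `nn.Linear` layers without bias terms (biases and LayerNorm are ignored throughout, as in the counts below); summing their sizes gives the per-layer non-embedding count $2 d_{model}(2 d_{attn} + d_{ff})$ that appears in the table below.

```python
# A minimal sketch (my assumption: PyTorch, bias-free Linear layers, all heads
# folded into one matrix per projection) of the weights counted per block.
import torch.nn as nn

d_model, d_attn, d_ff = 512, 512, 2048  # transformer-base sizes

block = nn.ModuleDict({
    # Attention: Q, K, V matrices, each of shape (d_model, d_attn)
    "w_q": nn.Linear(d_model, d_attn, bias=False),
    "w_k": nn.Linear(d_model, d_attn, bias=False),
    "w_v": nn.Linear(d_model, d_attn, bias=False),
    # Attention: multi-head projection W^O, shape (d_attn, d_model)
    "w_o": nn.Linear(d_attn, d_model, bias=False),
    # Feed-forward network: fc1 = fc(d_model, d_ff), fc2 = fc(d_ff, d_model)
    "fc1": nn.Linear(d_model, d_ff, bias=False),
    "fc2": nn.Linear(d_ff, d_model, bias=False),
})

n_per_layer = sum(p.numel() for p in block.parameters())
assert n_per_layer == 2 * d_model * (2 * d_attn + d_ff)
print(n_per_layer)  # 3,145,728 per layer for the base config
```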
| Part | Parameters | Explanation |
|------|------------|-------------|
| Embed | $n_{vocab} \times d_{model} + n_{ctx} \times d_{model}$ | One word-embedding matrix (mapping each token to its embedding) and one positional-embedding matrix |
| Attention: Q, K, V matrices | $L \cdot 3 d_{model} d_{attn}$ | $W^Q$ has shape $(d_{model}, d_{attn})$; there are also $W^K$ and $W^V$ |
| Attention: multi-head projection | $L \cdot d_{attn} d_{model}$ | After we concatenate the outputs of all heads, one projection maps the concatenated output to the final output. This is that matrix, defined as $W^O$ in [Attention Is All You Need, Section 3.2.2](https://arxiv.org/pdf/1706.03762.pdf) |
| Feed-forward network | $L \cdot 2 d_{model} d_{ff}$ | Explained in the definition of $d_{ff}$ above |
| Total (non-embedding) | $2 L d_{model} (2 d_{attn} + d_{ff})$ | |

If we have the standard $d_{attn} = d_{model} = d_{ff}/4$, we get $N = 12 L d_{model}^2$ non-embedding parameters.

To put this into practice, let's calculate a rough estimate of the number of parameters in the vanilla transformer. For the transformer base, per [Attention Is All You Need, Table 3](https://arxiv.org/pdf/1706.03762.pdf), $L = 6$, $d_{model} = 512$, $d_{ff} = 2048$, $d_{attn} = h \times d_k = 8 \times 64 = 512$, and $n_{vocab} = 37000$. I didn't find any info about $n_{ctx}$, but it is probably 512.

Note that, unlike OpenAI's favorite decoder-only transformer, the vanilla transformer has an encoder-decoder architecture, and each decoder block has an additional attention block. Therefore, the encoder has a total of $2 L d_{model}(2 d_{attn} + d_{ff})$ parameters, the decoder has a total of $2 L d_{model}(4 d_{attn} + d_{ff})$ parameters, and the embedding part has a total of $n_{vocab} \times d_{model} + n_{ctx} \times d_{model}$ parameters. The final result is $\sim 63 \times 10^6$. I tried hard to figure out where I went off from the paper's $65 \times 10^6$ but had no luck; adding the LayerNorm parameters still didn't even out the numbers. But it's close enough, so I'll call it a day.
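As a sanity check on the arithmetic above, here is the same estimate as a tiny script (assuming a learned positional embedding of length $n_{ctx} = 512$, and again ignoring biases and LayerNorm):

```python
# Rough parameter count for the vanilla transformer base (encoder-decoder),
# using the formulas above; biases and LayerNorm are ignored.
L, d_model, d_ff, d_attn = 6, 512, 2048, 512
n_vocab, n_ctx = 37000, 512  # n_ctx = 512 is my guess, as noted above

embed   = n_vocab * d_model + n_ctx * d_model    # token + positional embeddings
encoder = 2 * L * d_model * (2 * d_attn + d_ff)  # one attention block per layer
decoder = 2 * L * d_model * (4 * d_attn + d_ff)  # self- plus cross-attention per layer

total = embed + encoder + decoder
print(f"{total:,}")  # 63,246,336, i.e. ~63e6 vs. the paper's 65e6
```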