<p>This article, <a href="https://www.leewayhertz.com/parameter-efficient-fine-tuning/">A Guide to Parameter-efficient Fine-tuning (PEFT)</a>, gives a very good summary with nice drawings. There are some differences between its explanation and the original papers, but the basic architecture is all good.</p>
<h2 id="prompt-tuning">Prompt Tuning</h2>
<p>Prefix tuning, prompt tuning, and p-tuning all prepend some vectors as prefixes / soft prompts to the vector inputs of transformers. Their goal is to find a context that steers the language model toward generating text that solves a particular task. A minimal sketch of the shared idea follows.</p>
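<p>Here is a minimal PyTorch sketch of the soft-prompt idea (my own illustration; the three methods above differ in where and how such vectors are injected and trained):</p>
<pre><code>import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend n_tokens trainable 'soft prompt' vectors to the input embeddings.

    Only self.prompt is trained; the base language model stays frozen.
    """
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model)
        batch = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Output: (batch, n_tokens + seq_len, d_model)
        return torch.cat([prompt, input_embeds], dim=1)

# Example (hypothetical numbers): prompts = SoftPrompt(n_tokens=20, d_model=768)
# embeds = prompts(token_embeddings)  # then fed to the frozen transformer
</code></pre>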
<h2 id="prompt-tuning">Prompt Tuning</h2><p>Prefix tuning, prompt tuning, and p-tuning all prepend some vectorsas prefixes / soft prompts to the vector inputs to transformers. Theirgoal is to find a context that steers the language model towardgenerating text that solves a particular task.</p><h2 id="adapter"><ahref=”https://arxiv.org/pdf/1902.00751.pdf”>Adapter</a></h2><blockquote><p>Before Adapter, when performing vanilla fine-tuning, a modificationis made to the top layer of the network because the label spaces andlosses for the upstream and downstream tasks differ. Now, Adaptermodules perform more general architectural modifications to re-purpose apretrained network for a downstream task: it injects new layers into theoriginal network.</p><p>In standard fine-tuning, the new top-layer and the original weightsare co-trained. In contrast, in Adapter tuning, the parameters of theoriginal network are frozen and therefore may be shared by manytasks.</p></blockquote><figcaption aria-hidden="true">Adapter Architecture</figcaption><blockquote><p>Left: We add the adapter module twice to each Transformer layer:after the projection following multiheaded attention and after the twofeed-forward layers.</p><p>Right: The adapter consists of a bottleneck which contains fewparameters relative to the attention and feedforward layers in theoriginal model. The adapter also contains a skip-connection. Duringadapter tuning, the green layers are trained on the downstream data,this includes the adapter, the layer normalization parameters, and thefinal classification layer</p></blockquote><p>Therefore, we can denote the Adapter layer as:</p><p><spanclass=”math display”>y = B(σ(Ax)) + x</span></p><p>Define bottleneck dimension <spanclass=”math inline”>r</span> (to be consistent with LoRA, inthe original Adapter paper, this was <spanclass=”math inline”>m</span>), so $A\in \R^{r \times d}, B \in \R^{d \times r}$. Including biases, weadd a total <spanclass=”math inline”>2dr + d + r</span>parameters with <spanclass=”math inline”>r ≪ d</span>.</p><p>In initialization, we initialize the adapters to a near-identityfunction, so original network is unaffected when training starts.</p><p>Adapter achieves similar results with only 1% needed parameters ascompared to full fine-tuning.</p><h2 id="lora-low-rank-adaptation-of-llm"><ahref=”https://arxiv.org/abs/2106.09685”>LoRA: Low-Rank Adaptation ofLLM</a></h2><p>Before LoRA, SOTA techniques have some drawbacks:</p><ul><li>Adapter Layers Introduce Inference Latency: Adapter layers have fewparameters, but “large neural networks rely on hardwareparallelism to keep the latency low, and adapter layershave to be processed sequentially. This makes adifference in the online inference setting where the batch size istypically as small as one.” I actually don’t understand how toparallelize an LLM inference even if without Adapter</li><li>Directly Optimizing the Prompt is Hard: Prompt tuning and prefixtuning both require adding a prefix (to either the input vector or tothe hidden vector in middle). In this way, it “reduces thesequence length available to process a downstream task, whichwe suspect makes tuning the prompt less performant compared to othermethods.” Its performance changes non-monotonically in trainableparameters, too. So it’s hard to optimize.</li></ul><p>LoRA’s architecture is simply a matrix multiplication - Adapterwithout non-linearity or skip connection. 
So instead of <spanclass=”math inline”>y = B(σ(Ax)) + x</span>,we do <spanclass=”math inline”>Δy = BAx</span>.There is one fundamental difference though: Adapter is an extra layeradded into the original network, but LoRA is a layeradded along side with the original network. That’s whyI have a delta in LoRA’s formula.</p><p>In fact, LoRA was specifically designed for low-rank matrixmultiplication decomposition, note for any pretrained weight <spanclass=”math inline”>$W_0 \in \R^{d \times k}$</span> with <spanclass=”math inline”>y = W0x</span>and we want to update this weight matrix in fine-tuning: <spanclass=”math inline”>W = W0 + ΔW</span>,we define <spanclass=”math inline”>ΔW = BA</span>,so <spanclass=”math inline”>W = W0 + BA</span>.This simple design yields a unimaginable good result: note matrix <spanclass=”math inline”>BA</span> is also of dimension$\R^{d \times k}$. Therefore, we candirectly add this result matrix to the original matrix and inferencingwith LoRA, when we add this matrix in, gives us no additionalinference latency.</p><p>Similar to Adapter, LoRA uses a random Gaussian initialization for Aand zero for B, so <spanclass=”math inline”>ΔWx = BAx</span>is zero at the beginning of training and <spanclass=”math inline”>W = W0 + ΔW</span>yields the same result (identity) as before.</p><p>Thanks to the matrix decomposition nature, we can apply LoRA to anymatrix multiplication in principle. However, we found that intransformers, imagine if we have 8 ranks to distribute, when <spanclass=”math inline”>r(Wq) = r(Wv) = 4</span>or <spanclass=”math inline”>r(Wq) = r(Wk) = r(Wv) = r(Wo) = 2</span>gives best result. The author also found that the fine-tune matrixactually has a very low rank, so in practice even if we set <spanclass=”math inline”>r = 1</span> can give good enough results.Go to Section 7 for more interesting experiments they conducted.</p><p>In diffusion models, we use LoRA on the stable diffusion’s crossattention layer.</p>