Yao Lirong's Blog

Fine-Tuning LLMs: Prompt Tuning, Adapter, LoRA

2023/11/20

The article A Guide to Parameter-efficient Fine-tuning (PEFT) gives a very good summary with nice drawings. There are some differences between its explanation and the original papers, but the basic architectures are presented correctly.

Prompt Tuning

Prefix tuning, prompt tuning, and p-tuning all prepend some vectors as prefixes / soft prompts to the vector inputs to transformers. Their goal is to find a context that steers the language model toward generating text that solves a particular task.
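
To make this concrete, here is a minimal PyTorch sketch of soft-prompt tuning. The class name, the number of prompt vectors, and the "initialize from random vocabulary embeddings" trick are my own illustrative choices, not taken from any one of these papers:

```python
import torch
import torch.nn as nn

class SoftPromptEmbedding(nn.Module):
    """Prepends n_prompt trainable vectors (soft prompts) to the token embeddings."""
    def __init__(self, embed: nn.Embedding, n_prompt: int = 20):
        super().__init__()
        self.embed = embed                                 # frozen pretrained embedding table
        self.embed.weight.requires_grad_(False)
        # illustrative init: copy embeddings of randomly chosen vocabulary tokens
        idx = torch.randint(embed.num_embeddings, (n_prompt,))
        self.soft_prompt = nn.Parameter(embed.weight[idx].clone())  # the only trainable params

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        tok = self.embed(input_ids)                                     # (B, T, d)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return torch.cat([prompt, tok], dim=1)                          # (B, n_prompt + T, d)
```

The rest of the transformer stays frozen; only `soft_prompt` receives gradients, and the prepended vectors act as the learned context that steers generation.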

Adapter

Before Adapters, vanilla fine-tuning modified the top layer of the network, because the label spaces and losses of the upstream and downstream tasks differ. Adapter modules perform a more general architectural modification to re-purpose a pretrained network for a downstream task: they inject new layers into the original network.

In standard fine-tuning, the new top-layer and the original weights are co-trained. In contrast, in Adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.

Adapter Architecture

Left: We add the adapter module twice to each Transformer layer: after the projection following multiheaded attention and after the two feed-forward layers.

Right: The adapter consists of a bottleneck which contains few parameters relative to the attention and feedforward layers in the original model. The adapter also contains a skip-connection. During adapter tuning, the green layers are trained on the downstream data; this includes the adapter, the layer normalization parameters, and the final classification layer.

Therefore, we can denote the Adapter layer as:

$y = B(\sigma(Ax)) + x$

Define the bottleneck dimension $r$ (the original Adapter paper calls this $m$; we use $r$ to be consistent with LoRA), so $A \in \R^{r \times d}$ and $B \in \R^{d \times r}$. Including biases, we add a total of $2dr + d + r$ parameters, with $r \ll d$.

At initialization, we set the adapters to a near-identity function, so the original network is unaffected when training starts.
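
A minimal PyTorch sketch of such an adapter module, assuming a GELU non-linearity and the near-identity initialization described above (the exact init scale is an illustrative choice):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: y = x + B(sigma(A x)), inserted after a sublayer."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)   # A: d -> r  (r*d weights + r biases)
        self.up = nn.Linear(r, d)     # B: r -> d  (d*r weights + d biases)
        self.act = nn.GELU()
        # near-identity init: with the up-projection at zero, y ≈ x at the start
        nn.init.normal_(self.down.weight, std=1e-3)   # illustrative scale
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))    # skip-connection
```

Counting the parameters of `down` and `up` recovers the $2dr + d + r$ total from above.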

Adapters achieve results similar to full fine-tuning while training only about 1% of the parameters.

LoRA: Low-Rank Adaptation of LLM

Before LoRA, the SOTA techniques had some drawbacks:

  • Adapter Layers Introduce Inference Latency: Adapter layers have few parameters, but “large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one.” I actually don’t understand how to parallelize LLM inference even without Adapters.
  • Directly Optimizing the Prompt Is Hard: Prompt tuning and prefix tuning both require adding a prefix (to either the input embeddings or to the hidden vectors in the middle layers). This “reduces the sequence length available to process a downstream task, which we suspect makes tuning the prompt less performant compared to other methods.” Its performance also changes non-monotonically in the number of trainable parameters, so it is hard to optimize.

LoRA’s architecture is simply a matrix multiplication: an Adapter without the non-linearity or skip connection. So instead of $y = B(\sigma(Ax)) + x$, we compute $\Delta y = BAx$. There is one fundamental difference though: an Adapter is an extra layer inserted into the original network, while LoRA is a layer added alongside the original network. That’s why there is a delta in LoRA’s formula.

In fact, LoRA was specifically designed as a low-rank decomposition of the weight update. For any pretrained weight $W_0 \in \R^{d \times k}$ with $y = W_0 x$ that we want to update during fine-tuning, $W = W_0 + \Delta W$, we define $\Delta W = BA$, so $W = W_0 + BA$. This simple design yields a remarkably useful property: the matrix $BA$ also has dimension $\R^{d \times k}$, so we can add it directly into the original matrix, and inference with the merged weights incurs no additional latency.

Similar to Adapter, LoRA uses a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W x = BAx$ is zero at the beginning of training and $W = W_0 + \Delta W$ behaves exactly like the original $W_0$.
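
A minimal PyTorch sketch of this idea. The class name, the `merge` helper, and the exact Gaussian scale are my own illustrative choices; the $\alpha / r$ scaling follows the paper’s convention:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # W0 stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.empty(r, d_in))   # Gaussian init (scale is illustrative)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init -> ΔW = 0 at the start
        nn.init.normal_(self.A, std=1.0 / r)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W0 x + (BA) x, computed on a branch alongside the frozen path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    @torch.no_grad()
    def merge(self):
        """Fold ΔW = BA into W0 so inference has no extra latency."""
        self.base.weight += self.scale * (self.B @ self.A)
```

During training only `A` and `B` receive gradients; calling `merge()` afterwards collapses the branch back into a single matrix multiplication.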

Since it is just a low-rank decomposition of a weight update, LoRA can in principle be applied to any matrix multiplication. In Transformers, however, the authors found that, given a fixed budget of ranks to distribute (say 8), spreading it as $r(W_q) = r(W_v) = 4$ or $r(W_q) = r(W_k) = r(W_v) = r(W_o) = 2$ gives the best results. They also found that the fine-tuning update $\Delta W$ has a very low intrinsic rank, so in practice even $r = 1$ can give good enough results. See Section 7 of the paper for more interesting experiments they conducted.
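
For illustration only, restricting LoRA to the query and value projections is a one-liner with the HuggingFace peft library. The model name and the module names `q_proj` / `v_proj` are assumptions for a LLaMA-style checkpoint; adjust them to whatever your model actually calls its attention projections:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the LoRA A/B matrices are trainable
```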

In diffusion models, LoRA is typically applied to Stable Diffusion’s cross-attention layers.
