This paper and its Appendix serve as a good summary of SOTA quantization techniques and their results.
Why should we quantize? The overall computation latency – the time it takes from start to finish of a computation – is mainly determined by two factors: (1) how long it takes to load the data from main memory into caches and registers, and (2) how long it takes to perform the computation. For large models, the first factor usually dominates, so reducing the time spent loading data from main memory is often the best way to reduce overall latency. Such reductions come mainly from caching and from lower-precision numbers, which shrink the number of bytes that have to be moved.
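To put rough numbers on this (my own back-of-the-envelope figures, not from the paper): a 7B-parameter model stored in 16-bit floats is about 14 GB of weights. On a GPU with, say, 900 GB/s of memory bandwidth, merely streaming those weights once takes roughly 15 ms, while the same weights at 4 bits occupy about 3.5 GB and stream in roughly 4 ms; at batch size 1 the arithmetic itself is rarely the bottleneck.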
Note, though, that to do the actual computation we dequantize the weight once it is in the cache and perform a 16-bit floating-point multiplication with the 16-bit input. That is because no CPU / GPU natively supports computation in these unusual data types.
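As a minimal sketch of this flow (my own illustration, not any particular library's kernel; the weights, scale, and shapes below are made up):

```python
import numpy as np

# Quantized weights live in memory as int8 plus a single fp16 scale.
W_q = np.array([[12, -98, 54], [127, 3, -45]], dtype=np.int8)
scale = np.float16(0.021)                           # e.g. absmax(W) / 127, stored next to W_q
x = np.array([0.5, -1.25, 2.0], dtype=np.float16)   # 16-bit input activations

# At compute time: dequantize in fast memory, then do an ordinary fp16 matmul.
W_fp16 = W_q.astype(np.float16) * scale
y = W_fp16 @ x
```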
In this work, the author used blocking, a zero-shot quantization method, for 3, 4, 5, 6, 7, and 8 bits. By plotting perplexity against the total number of bits in the model, they found that lowering the bit precision generally improves scaling. However, this trend stops across all models at 3-bit precision, where performance degrades. Therefore, 4-bit precision is optimal for almost all models at all scales, with few exceptions.
BLOOM and BLOOMZ show almost the same quantization behavior, indicating that fine-tuning an existing model does not change its quantization properties.
Data types: The quantile quantization and float data types provide better scaling than integer and dynamic exponent quantization. The author concluded that quantile quantization is the best.
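As a rough sketch of what quantile quantization means (my own illustration of the idea, not the paper's code): the code values of the data type are placed at equally spaced quantiles of the weight distribution, so every code is used about equally often.

```python
import numpy as np

def quantile_codebook(w, bits=4):
    """Place the 2**bits code values at equally spaced quantiles of w."""
    k = 2 ** bits
    return np.quantile(w, (np.arange(k) + 0.5) / k)

def quantize_to_codebook(w, codebook):
    """Store each weight as the index of its nearest code value."""
    return np.abs(w[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

w = np.random.randn(4096).astype(np.float32)   # toy weight vector
codebook = quantile_codebook(w, bits=4)
idx = quantize_to_codebook(w, codebook)
w_hat = codebook[idx]                          # dequantized approximation of w
```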
In this quantization paper, the author discovered emergent outlier features in transformers that totally wreck quantization, and plotted a scaling law for these features. Based on this, he proposed LLM.int8(), a quantization method with no performance degradation. I didn't read the paper, but looked at the author's [blog post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/) instead.
First, how do we quantize a number? Imagine the following example: you have a data type I5 with values [0, 1, 2, 3, 4, 5] and a data type I3 with values [0, 2, 4]. We want to quantize the I5 vector [3, 1, 2, 3] to I3:
- Map from the original domain to the unit domain [-1, 1]:
  - find the absolute maximum, 3 = max(abs([3, 1, 2, 3])), and divide the vector by it to get [1.0, 0.33, 0.66, 1.0]
- Map from the unit domain [-1, 1] to the quantized domain:
  - multiply by the range of the target data type I3, which is 4: [1.0, 0.33, 0.66, 1.0] -> [4.0, 1.33, 2.66, 4.0]
- Round to the nearest representable number in the quantized domain:
  - [4.0, 1.33, 2.66, 4.0] -> [4, 2, 2, 4]
To dequantize, we reverse this process:
- Map from the quantized domain to the unit domain [-1, 1]:
  - divide the vector by the range 4 to get [1.0, 0.5, 0.5, 1.0]
- Map from the unit domain [-1, 1] to the original domain:
  - multiply by the stored absolute maximum 3: [1.0, 0.5, 0.5, 1.0] -> [3.0, 1.5, 1.5, 3.0]
- Round to the nearest representable number in the original domain:
  - [3.0, 1.5, 1.5, 3.0] -> [3, 2, 2, 3]
We see that the quantize/dequantize round trip introduced one error at the second element: the original 1 came back as 2.
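The same round trip in a few lines of NumPy (my own rendering of the worked example above, using round-to-nearest everywhere):

```python
import numpy as np

I3 = np.array([0, 2, 4])                        # representable values of the 3-level type
x = np.array([3, 1, 2, 3], dtype=np.float32)    # the I5 vector to quantize

absmax = np.abs(x).max()                        # 3
scaled = x / absmax * I3.max()                  # ≈ [4.0, 1.33, 2.66, 4.0]
q = I3[np.abs(scaled[:, None] - I3[None, :]).argmin(axis=1)]  # nearest I3 value: [4, 2, 2, 4]

dq = np.rint(q / I3.max() * absmax)             # back to the original domain: [3, 2, 2, 3]
```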
Since we scale by the absolute maximum, it is obvious that a single outlier will stretch the scale and cause more errors for all the other values. So the authors go on to study the distribution of such outliers, which they call emergent outlier features.
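A quick way to see this (my own toy experiment, not from the paper): absmax-quantize a random vector to int8 with and without a single large outlier and compare the round-trip error.

```python
import numpy as np

def absmax_int8_roundtrip(x):
    """Absmax-quantize x to int8, then dequantize it again."""
    scale = np.abs(x).max() / 127
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
x_outlier = x.copy()
x_outlier[0] = 60.0                    # a single emergent-outlier-sized value

print(np.abs(absmax_int8_roundtrip(x) - x).mean())                  # small error
print(np.abs(absmax_int8_roundtrip(x_outlier) - x_outlier).mean())  # much larger error
```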
The authors explain these outlier features as a selection mechanism: the large outlier values let the layer pick out a single feature, while the remaining small values push the unimportant features down towards zero.
The authors also found a very interesting emergent phenomenon, which I quote directly below:
> However, this full “coordination” through a single dimension only happens after the phase shift. Before the phase shift, in transformers with less than 6.7B parameters some layers disagree which dimension to use for these large features (no prominent outliers).
>
> …
>
> The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:
>
> - Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.
Note that model perplexity, rather than mere model size, determines the phase shift.
<ahref="https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html#quantization-with-ggml">MLabonne</a>did a great explanation to how GGML did quantization.
GGML quantizes weights in a rather naive way. Basically, it groups blocks of values and rounds them to a lower precision. Some techniques, like Q4_K_M and Q5_K_M, implement a higher precision for critical layers. In this case, every weight is stored in 4-bit precision, with the exception of half of the attention.wv and feed_forward.w2 tensors. Experimentally, this mixed precision proves to be a good tradeoff between accuracy and resource usage.
If we look into the [ggml-quants.h file](https://github.com/ggerganov/ggml/blob/63d8fce8b57c5e97dd1d42b0d7b8c734df1f263c/src/ggml-quants.h), we can see how the blocks are defined. For example, the block_q4_0 structure is defined as:
```c
#define QK4_0 32
typedef struct {
    ggml_fp16_t d;          // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;
```
In GGML, weights are processed in blocks, each consisting of 32 values. For each block, a scale factor (delta) is derived from the largest weight value. All weights in the block are then scaled, quantized, and packed efficiently for storage (as nibbles).
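Here is a simplified Python rendering of that scheme (my own sketch of the idea, not the actual ggml kernel, which stores the scale as fp16 and handles edge cases differently):

```python
import numpy as np

BLOCK = 32  # Q4_0-style blocks of 32 weights

def quantize_block(x):
    """Quantize one block of 32 floats to packed 4-bit codes plus a scale."""
    amax = x[np.abs(x).argmax()]              # signed value with the largest magnitude
    d = amax / -8 if amax != 0 else 1.0       # scale ("delta"): maps values into [-8, 7]
    q = np.clip(np.rint(x / d) + 8, 0, 15).astype(np.uint8)   # unsigned 4-bit codes
    packed = q[: BLOCK // 2] | (q[BLOCK // 2 :] << 4)          # two nibbles per byte
    return np.float16(d), packed

def dequantize_block(d, packed):
    """Recover approximate floats from the scale and the packed nibbles."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    return np.concatenate([lo, hi]).astype(np.float32) * np.float32(d)

x = np.random.randn(BLOCK).astype(np.float32)
d, packed = quantize_block(x)
x_hat = dequantize_block(d, packed)            # x_hat ≈ x, stored in 32 * 0.5 + 2 bytes
```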
Oobabooga, the author of [text generation webui](https://github.com/oobabooga/text-generation-webui), did an [in-depth survey](https://oobabooga.github.io/blog/posts/perplexities/) of inference time, model size, and VRAM usage for different quantization formats (GPTQ, GGUF, EXL2, …).