K-bit Inference Scaling Laws
This paper and its Appendix serve as a good summary of SOTA quantization techniques and their results.
Why should we quantize? The overall computation latency, the time it takes from start to finish of a computation, is mainly determined by two factors: (1) how long it takes to load the data from main memory into caches and registers, and (2) how long it takes to perform the computation. Reducing the time spent loading data from main memory is therefore often the best way to reduce overall latency. Such reductions are achieved mainly through caching and lower-precision numbers.
Note, though, that to do the computation we dequantize the weights in the cache and perform a 16-bit floating-point multiplication with the 16-bit inputs. That's because no CPU / GPU natively supports arithmetic in these exotic data types.
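For a rough sense of scale, here is a back-of-the-envelope sketch (illustrative numbers only, assuming a hypothetical 7B-parameter model):

```python
# Bytes that must be streamed from memory for the weights of a hypothetical 7B model.
params = 7_000_000_000
for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB. When inference is
# memory-bandwidth bound, this translates roughly into a proportional speedup.
```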
In this work, the authors used blockwise quantization (blocking), a zero-shot quantization method, at 3, 4, 5, 6, 7, and 8 bits. By plotting perplexity against the total number of model bits, they found that lowering the bit precision generally improves scaling. However, this trend stops across all models at 3-bit precision, where performance degrades. Therefore, 4-bit precision is optimal for almost all models at all scales, with few exceptions.
BLOOM and BLOOMZ show almost the same quantization behavior, indicating that fine-tuning an existing model does not change its quantization properties.
Data types: quantile quantization and float data types provide better scaling than integer and dynamic exponent quantization. The authors conclude that quantile quantization is the best.
- Zero-shot quantization: quantizes a model directly without any additional information. It can be used immediately, which makes it easy to apply, but zero-shot methods often fail at lower precisions.
- One-shot quantization: needs a mini-batch of data for quantization. More accurate; GPTQ, for example, optimizes the rounding during quantization using a mini-batch of data. But these methods are also more complex and may require hours of optimization before a model can be used.
LLM.int8()
In this quantization paper, the authors discovered emergent outlier features in transformers that completely wreck quantization. They also plotted a scaling law for these emergent outlier features, and building on that they proposed LLM.int8(), a quantization method with no performance degradation. I didn't read the paper, but looked at the author's blog post instead.
How Quantization Works
First, how do we quantize a number? Imagine the following example: you have a data type I5 with values [0, 1, 2, 3, 4, 5] and a data type I3 with values [0, 2, 4]. We want to quantize the I5 vector [3, 1, 2, 3] to I3:
- Map from the original domain to the unit domain [-1, 1]: find the absolute maximum, 3 = max(abs([3, 1, 2, 3])), and divide the vector by 3 to get [1.0, 0.33, 0.66, 1.0].
- Map from the unit domain [-1, 1] to the quantized domain: multiply by the range of the target data type I3, which is 4: [1.0, 0.33, 0.66, 1.0] -> [4.0, 1.33, 2.66, 4.0].
- Round to the nearest representable number in the quantized domain: [4.0, 1.33, 2.66, 4.0] -> [4, 2, 2, 4].
To dequantize, we reverse this process:
- Map from the quantized domain to the unit domain [-1, 1]: divide the vector by the range 4 to get [1.0, 0.5, 0.5, 1.0].
- Map from the unit domain [-1, 1] to the original domain: multiply by the stored absolute maximum 3: [1.0, 0.5, 0.5, 1.0] -> [3.0, 1.5, 1.5, 3.0].
- Round to the nearest representable number in the original domain (rounding ties up): [3.0, 1.5, 1.5, 3.0] -> [3, 2, 2, 3].
We see that the quantize-and-dequantize round trip introduced an error at the second element (1 became 2).
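Here is a minimal Python sketch of this absmax round trip (the I5/I3 grids are from the example above; the helper names are just for illustration):

```python
def snap(value, levels):
    """Round to the nearest representable level (ties go to the larger level)."""
    return min(levels, key=lambda l: (abs(value - l), -l))

def absmax_quantize(xs, levels):
    absmax = max(abs(x) for x in xs)      # e.g. 3 for [3, 1, 2, 3]
    scale = max(levels) / absmax          # maps the unit domain onto the target range
    return [snap(x * scale, levels) for x in xs], absmax

def absmax_dequantize(qs, absmax, levels, out_levels):
    scale = max(levels) / absmax
    return [snap(q / scale, out_levels) for q in qs]

I5 = [0, 1, 2, 3, 4, 5]
I3 = [0, 2, 4]

q, m = absmax_quantize([3, 1, 2, 3], I3)
print(q)                                  # [4, 2, 2, 4]
print(absmax_dequantize(q, m, I3, I5))    # [3, 2, 2, 3]  <- error at the second element
```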
Emergent Outlier Features
Since we scale by the absolute maximum, an outlier obviously squeezes all the other values together and causes larger rounding errors. So the authors went on to study how such outliers are distributed. They call them emergent outlier features.
The authors explain these outlier features as a mechanism for selecting a single feature: the large values pick out the important feature, while the corresponding small values push the unimportant ones down toward zero.
The authors also found a very interesting emergent phenomenon, which I quote directly below:
However, this full “coordination” through a single dimension only happens after the phase shift. Before the phase shift, in transformers with less than 6.7B parameters some layers disagree which dimension to use for these large features (no prominent outliers).
…
The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:
- Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.
Note that model perplexity, rather than mere model size, determines the phase shift.
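To make "treat the outlier features separately" concrete, here is a rough Python sketch of the mixed-precision decomposition idea behind LLM.int8(). The 6.0 threshold is the paper's default; the function name and structure are mine, not the actual bitsandbytes implementation:

```python
import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """Sketch: route outlier feature dimensions through fp16, the rest through int8."""
    # Input feature dimensions whose magnitude exceeds the threshold anywhere in the
    # batch are treated as "outlier features".
    outlier_cols = np.any(np.abs(x) > threshold, axis=0)

    # Regular part: vector-wise absmax quantization to int8, integer matmul, dequantize.
    x_reg, w_reg = x[:, ~outlier_cols], w[~outlier_cols, :]
    sx = np.abs(x_reg).max(axis=1, keepdims=True) / 127.0   # one scale per row of x
    sw = np.abs(w_reg).max(axis=0, keepdims=True) / 127.0   # one scale per column of w
    xq = np.round(x_reg / sx).astype(np.int8)
    wq = np.round(w_reg / sw).astype(np.int8)
    out_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)) * sx * sw

    # Outlier part: kept in 16-bit floating point.
    out_fp16 = x[:, outlier_cols].astype(np.float16) @ w[outlier_cols, :].astype(np.float16)

    return out_int8 + out_fp16

x = np.random.randn(4, 512).astype(np.float32)    # activations (batch x features)
w = np.random.randn(512, 256).astype(np.float32)  # weight matrix
y = mixed_precision_matmul(x, w)                  # close to x @ w
```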
GGML
MLabonne wrote a great explanation of how GGML does quantization.
GGML quantizes weights in a rather naive way. Basically, it groups blocks of values and rounds them to a lower precision. Some techniques, like Q4_K_M and Q5_K_M, use a higher precision for critical layers. With Q4_K_M, for example, every weight is stored in 4-bit precision, except for half of the attention.wv and feed_forward.w2 tensors. Experimentally, this mixed precision proves to be a good tradeoff between accuracy and resource usage.
If we look into the ggml-quants.h file, we can see how the blocks are defined. For example, the block_q4_0 structure is defined (modulo small differences between ggml versions) as:
```c
#define QK4_0 32
typedef struct {
    ggml_fp16_t d;          // delta (per-block scale factor)
    uint8_t qs[QK4_0 / 2];  // nibbles / quants: two 4-bit values packed per byte
} block_q4_0;
```
In GGML, weights are processed in blocks of 32 values. For each block, a scale factor (delta) is derived from the largest absolute weight value. All weights in the block are then scaled, quantized, and packed efficiently into nibbles for storage.
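As a simplified illustration of that block scheme, here is a Python sketch of the idea only; ggml's real Q4_0 code chooses the scale and rounds slightly differently:

```python
def quantize_block_q4(block):
    """Simplified 4-bit block quantization: 32 floats -> (scale, 16 packed bytes)."""
    assert len(block) == 32
    absmax = max(abs(v) for v in block) or 1.0
    scale = absmax / 7.0                                 # map into the signed 4-bit range
    quants = [max(-8, min(7, round(v / scale))) for v in block]
    packed = bytes(((q0 + 8) | ((q1 + 8) << 4))          # two 4-bit values per byte
                   for q0, q1 in zip(quants[0::2], quants[1::2]))
    return scale, packed                                 # 4 bits per weight + one scale per block

def dequantize_block_q4(scale, packed):
    out = []
    for byte in packed:
        out.append(((byte & 0x0F) - 8) * scale)          # low nibble
        out.append(((byte >> 4) - 8) * scale)            # high nibble
    return out

scale, packed = quantize_block_q4([0.1 * i for i in range(32)])
approx = dequantize_block_q4(scale, packed)              # close to the original block
```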
Oobabooga, the author of text-generation-webui, did an in-depth survey of inference time, model size, and VRAM usage across different quantization formats (GPTQ, GGUF, EXL2, …).