Tests
- How does quantisation affect model output? - 15 basic tests on different quant levels
- A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time.
- A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities
To test:
- Memory high-water mark (see the measurement sketch after this list)
- Prompt/Inference speed
- Perplexity
- Standardized benchmark numbers (OpenAI-compatible API layer w/ lm-eval? see the lm-eval sketch below)
- Sweep of the fastest tests (WikiText, C4, Winogrande, PIQA, ARC-e) - maybe LAMBADA and CoQA? - run standard timings first
- EvalPlus (code!)
- KL divergence between full-precision and quantized outputs (sketch below): https://www.reddit.com/r/LocalLLaMA/comments/1816h1x/how_much_does_quantization_actually_impact_models/
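A minimal sketch of collecting the first three numbers in one pass with transformers. The model name and context length are placeholders (swap in each quant under test), and the perplexity loop is the usual non-overlapping-window recipe, no stride/overlap:

```python
# Sketch only: placeholder model, non-overlapping perplexity windows.
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in each quant under test
CTX = 2048

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.cuda()

torch.cuda.reset_peak_memory_stats()
nlls, n_tokens = [], 0
start = time.perf_counter()
with torch.no_grad():
    for i in range(0, ids.size(1) - CTX + 1, CTX):
        chunk = ids[:, i : i + CTX]
        out = model(chunk, labels=chunk)      # HF shifts labels internally
        nlls.append(out.loss.float() * chunk.size(1))
        n_tokens += chunk.size(1)
elapsed = time.perf_counter() - start

print(f"perplexity:       {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.3f}")
print(f"prompt-eval rate: {n_tokens / elapsed:.1f} tok/s")
print(f"peak VRAM:        {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```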
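For the standardized numbers, the lm-evaluation-harness Python API (v0.4-era, as I understand it; task names vary by version, so check them against your install) can run the fast sweep directly:

```python
# Hedged sketch: lm-evaluation-harness >= 0.4 API; model and tasks are placeholders.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["wikitext", "winogrande", "piqa", "arc_easy"],  # the fast sweep above
    batch_size=8,
)
print(json.dumps(results["results"], indent=2, default=str))
```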
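The KL-divergence measurement from the linked thread can be sketched as: run identical tokens through a full-precision reference and a quantized variant, then average KL(reference || quantized) over the next-token distributions. Both checkpoints below are placeholders:

```python
# Sketch only: both checkpoints are placeholders; the quantized one needs its
# matching loader (e.g. auto-gptq via transformers) installed.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

REF = "meta-llama/Llama-2-7b-hf"    # full-precision reference (placeholder)
QNT = "TheBloke/Llama-2-7B-GPTQ"    # quantized variant (placeholder)

tok = AutoTokenizer.from_pretrained(REF)
ref = AutoModelForCausalLM.from_pretrained(REF, torch_dtype=torch.float16).cuda()
qnt = AutoModelForCausalLM.from_pretrained(QNT, device_map="cuda")

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    p_log = F.log_softmax(ref(ids).logits.float(), dim=-1)  # log P (reference)
    q_log = F.log_softmax(qnt(ids).logits.float(), dim=-1)  # log Q (quantized)

# F.kl_div(input=log Q, target=log P, log_target=True) computes KL(P || Q) pointwise.
kl = F.kl_div(q_log, p_log, log_target=True, reduction="none").sum(-1).mean()
print(f"mean KL(ref || quant): {kl.item():.5f} nats/token")
```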
Formats
QuIP#
EXL2 (ExLlamaV2)
- https://github.com/turboderp/exllamav2
- Based on GPTQ, but iteratively chooses among quantization levels by measuring error against calibration data, mixing bit depths across the model to hit an arbitrary average bits-per-weight target (toy sketch below)
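A toy illustration of that selection idea only, not ExLlamaV2's actual code or API: given hypothetical calibration errors measured at each candidate bit depth, greedily spend the bit budget where it buys the most error reduction until the average lands on the target.

```python
# Toy sketch only; the per-option errors are hypothetical measurements.
def allocate_bits(errors: dict[str, dict[float, float]], target_bpw: float) -> dict[str, float]:
    """errors[layer][bpw] = calibration error at that bit depth."""
    choice = {layer: min(opts) for layer, opts in errors.items()}  # start cheapest
    budget = target_bpw * len(errors) - sum(choice.values())
    while budget > 0:
        best = None  # (error reduction per extra bit, layer, new bpw, cost)
        for layer, opts in errors.items():
            for bpw, err in opts.items():
                cost = bpw - choice[layer]
                if 0 < cost <= budget:
                    gain = (opts[choice[layer]] - err) / cost
                    if best is None or gain > best[0]:
                        best = (gain, layer, bpw, cost)
        if best is None:
            break
        _, layer, bpw, cost = best
        choice[layer] = bpw
        budget -= cost
    return choice

# Sensitive layers get more bits; the average still lands on the 4.0 bpw target.
errs = {
    "attn.0": {2.0: 2.00, 4.0: 0.30, 6.0: 0.05},
    "mlp.0":  {2.0: 0.30, 4.0: 0.25, 6.0: 0.20},
}
print(allocate_bits(errs, target_bpw=4.0))  # -> {'attn.0': 6.0, 'mlp.0': 2.0}
```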
OmniQuant
Here are my docs for how to run it: OmniQuant
- https://github.com/OpenGVLab/OmniQuant
- https://arxiv.org/abs/2308.13137
- Reported to outperform GPTQ (OPTQ), AWQ, and SmoothQuant
- MLC compatible
QuIP
- https://github.com/jerry-chee/QuIP
- https://github.com/AlpinDale/QuIP-for-Llama
- https://arxiv.org/abs/2307.13304
- Evaluated not just on perplexity but also on benchmark accuracy
- 3-bit almost matches FP16
SqueezeLLM
- https://github.com/SqueezeAILab/SqueezeLLM
- https://arxiv.org/abs/2306.07629
AWQ
- https://github.com/mit-han-lab/llm-awq
- https://arxiv.org/abs/2306.00978
- https://github.com/casper-hansen/AutoAWQ/
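For reference, quantizing with AutoAWQ roughly follows the recipe from its README (paths are placeholders, and the quant_config keys may shift between releases):

```python
# Hedged sketch per the AutoAWQ README; paths and config values are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # placeholder source model
quant_path = "llama-2-7b-awq"             # placeholder output dir
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```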