Tests
- How does quantisation affect model output? - 15 basic tests on different quant levels
- A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time.
- A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities
To test:
- Memory high-water mark (see the measurement sketch after this list)
- Prompt/Inference speed
- Perplexity
- Standardized benchmark numbers (OpenAI-compatible API layer w/ lm-eval? see the lm-eval sketch below)
- Sweep of the fastest tests (WikiText, C4, Winogrande, PIQA, ARC-e) - maybe LAMBADA and CoQA? - run standard timings first
- EvalPlus (code!)
- KL divergence between full-precision and quantized outputs (sketch below): https://www.reddit.com/r/LocalLLaMA/comments/1816h1x/how_much_does_quantization_actually_impact_models/
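A minimal sketch of collecting the first three numbers in one pass with transformers. The model name and context length are placeholders (swap in each quant under test), and the perplexity loop is the usual non-overlapping-window recipe, no stride/overlap:

```python
# Sketch only: placeholder model, non-overlapping perplexity windows.
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in each quant under test
CTX = 2048

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.cuda()

torch.cuda.reset_peak_memory_stats()
nlls, n_tokens = [], 0
start = time.perf_counter()
with torch.no_grad():
    for i in range(0, ids.size(1) - CTX + 1, CTX):
        chunk = ids[:, i : i + CTX]
        out = model(chunk, labels=chunk)      # HF shifts labels internally
        nlls.append(out.loss.float() * chunk.size(1))
        n_tokens += chunk.size(1)
elapsed = time.perf_counter() - start

print(f"perplexity:       {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.3f}")
print(f"prompt-eval rate: {n_tokens / elapsed:.1f} tok/s")
print(f"peak VRAM:        {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```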
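For the standardized numbers, the lm-evaluation-harness Python API (v0.4-era, as I understand it; task names vary by version, so check them against your install) can run the fast sweep directly:

```python
# Hedged sketch: lm-evaluation-harness >= 0.4 API; model and tasks are placeholders.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["wikitext", "winogrande", "piqa", "arc_easy"],  # the fast sweep above
    batch_size=8,
)
print(json.dumps(results["results"], indent=2, default=str))
```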
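The KL-divergence measurement from the linked thread can be sketched as: run identical tokens through a full-precision reference and a quantized variant, then average KL(reference || quantized) over the next-token distributions. Both checkpoints below are placeholders:

```python
# Sketch only: both checkpoints are placeholders; the quantized one needs its
# matching loader (e.g. auto-gptq via transformers) installed.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

REF = "meta-llama/Llama-2-7b-hf"    # full-precision reference (placeholder)
QNT = "TheBloke/Llama-2-7B-GPTQ"    # quantized variant (placeholder)

tok = AutoTokenizer.from_pretrained(REF)
ref = AutoModelForCausalLM.from_pretrained(REF, torch_dtype=torch.float16).cuda()
qnt = AutoModelForCausalLM.from_pretrained(QNT, device_map="cuda")

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    p_log = F.log_softmax(ref(ids).logits.float(), dim=-1)  # log P (reference)
    q_log = F.log_softmax(qnt(ids).logits.float(), dim=-1)  # log Q (quantized)

# F.kl_div(input=log Q, target=log P, log_target=True) computes KL(P || Q) pointwise.
kl = F.kl_div(q_log, p_log, log_target=True, reduction="none").sum(-1).mean()
print(f"mean KL(ref || quant): {kl.item():.5f} nats/token")
```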
Formats
QuIP#
EXL2 (ExLlamaV2)
- https://github.com/turboderp/exllamav2
- Based on GPTQ, but iteratively chooses among quantization levels by measuring error against calibration data, mixing bit depths across the model to hit an arbitrary average bits-per-weight target (toy sketch below)
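A toy illustration of that selection idea only, not ExLlamaV2's actual code or API: given hypothetical calibration errors measured at each candidate bit depth, greedily spend the bit budget where it buys the most error reduction until the average lands on the target.

```python
# Toy sketch only; the per-option errors are hypothetical measurements.
def allocate_bits(errors: dict[str, dict[float, float]], target_bpw: float) -> dict[str, float]:
    """errors[layer][bpw] = calibration error at that bit depth."""
    choice = {layer: min(opts) for layer, opts in errors.items()}  # start cheapest
    budget = target_bpw * len(errors) - sum(choice.values())
    while budget > 0:
        best = None  # (error reduction per extra bit, layer, new bpw, cost)
        for layer, opts in errors.items():
            for bpw, err in opts.items():
                cost = bpw - choice[layer]
                if 0 < cost <= budget:
                    gain = (opts[choice[layer]] - err) / cost
                    if best is None or gain > best[0]:
                        best = (gain, layer, bpw, cost)
        if best is None:
            break
        _, layer, bpw, cost = best
        choice[layer] = bpw
        budget -= cost
    return choice

# Sensitive layers get more bits; the average still lands on the 4.0 bpw target.
errs = {
    "attn.0": {2.0: 2.00, 4.0: 0.30, 6.0: 0.05},
    "mlp.0":  {2.0: 0.30, 4.0: 0.25, 6.0: 0.20},
}
print(allocate_bits(errs, target_bpw=4.0))  # -> {'attn.0': 6.0, 'mlp.0': 2.0}
```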
OmniQuant
Here are my docs for how to run it: OmniQuant
- https://github.com/OpenGVLab/OmniQuant
- https://arxiv.org/abs/2308.13137
- Reported to outperform GPTQ (OPTQ), AWQ, and SmoothQuant
- MLC compatible
QuIP
- https://github.com/jerry-chee/QuIP
- https://github.com/AlpinDale/QuIP-for-Llama
- https://arxiv.org/abs/2307.13304
- Evaluated not just on perplexity but also on benchmark accuracy
- 3-bit almost matches FP16
SqueezeLLM
- https://github.com/SqueezeAILab/SqueezeLLM
- https://arxiv.org/abs/2306.07629
AWQ
- https://github.com/mit-han-lab/llm-awq
- https://arxiv.org/abs/2306.00978
- https://github.com/casper-hansen/AutoAWQ/
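For reference, quantizing with AutoAWQ roughly follows the recipe from its README (paths are placeholders, and the quant_config keys may shift between releases):

```python
# Hedged sketch per the AutoAWQ README; paths and config values are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # placeholder source model
quant_path = "llama-2-7b-awq"             # placeholder output dir
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```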