Tests

To test:

cuDNN 9: https://developer.nvidia.com/blog/accelerating-transformers-with-nvidia-cudnn-9/

Formats

vLLM: AQLM, SqueezeLLM, AWQ, GPTQ (Colab: https://colab.research.google.com/drive/1GhV5pntgqbiLoefd8nC3060cbhSoiChz?usp=sharing)
HF transformers: AQLM, HQQ, AWQ, GPTQ
llama.cpp: iXL, QuIP#

https://github.com/ggerganov/llama.cpp/discussions/5063
https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-8383732
https://www.reddit.com/r/LocalLLaMA/comments/1clinlb/comment/l2ukxnt/?context=3

AQLM
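
Pre-quantized AQLM checkpoints load through transformers once the aqlm kernels are installed. Minimal sketch (untested; the model id is just an example from the ISTA-DASLab hub):

```python
# pip install aqlm[gpu] transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example 2-bit AQLM checkpoint; any AQLM repo id works the same way.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # take the dtype stored in the checkpoint
    device_map="auto",   # spread layers over available GPUs
)

inputs = tokenizer("AQLM compresses weights into", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```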

vLLM

FP8 E4M3 KV cache: https://docs.vllm.ai/en/latest/quantization/fp8_e4m3_kvcache.html
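
Minimal sketch of turning the FP8 KV cache on (per the doc above; without calibrated scaling factors vLLM uses a default scale, which can cost accuracy):

```python
# FP8 (E4M3) KV cache in vLLM: roughly halves KV-cache memory vs. FP16.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # any supported model id
    kv_cache_dtype="fp8",                   # store K/V as FP8 E4M3
)
params = SamplingParams(temperature=0.8, max_tokens=64)
print(llm.generate(["The FP8 KV cache lets you"], params)[0].outputs[0].text)
```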

Sparsification

https://docs.neuralmagic.com/get-started/optimize/
https://github.com/neuralmagic/sparseml
SparseGPT paper: https://arxiv.org/pdf/2301.00774
https://github.com/IST-DASLab/sparsegpt
https://docs.neuralmagic.com/products/nm-vllm/
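
SparseGPT prunes one-shot, layer by layer, using approximate second-order information, so accuracy survives ~50% unstructured sparsity. For intuition, the naive baseline it improves on is plain magnitude pruning, which stock PyTorch already does (generic illustration, not SparseGPT itself):

```python
# Plain unstructured magnitude pruning with stock PyTorch: the naive
# baseline that SparseGPT's Hessian-aware one-shot pruning improves on.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(4096, 4096)

prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero smallest 50% by |w|
prune.remove(layer, "weight")  # bake the mask into the weight tensor

print(f"sparsity: {(layer.weight == 0).float().mean().item():.2%}")  # ~50.00%
```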

SqueezeLLM
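
Note: sensitivity-weighted non-uniform quantization (k-means over the weights) plus a dense-and-sparse split that keeps outlier weights in fp16.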

QuIP#
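
Note: incoherence processing via randomized Hadamard transforms plus E8 lattice codebooks; aimed at 2-bit.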

HQQ

https://mobiusml.github.io/hqq_blog/
https://github.com/mobiusml/hqq
https://www.reddit.com/r/LocalLLaMA/comments/1b33rj8/better_hqq_quantized_mixtral_models_2bit_and_3bit/
https://huggingface.co/docs/transformers/main/en/quantization/hqq
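
transformers exposes HQQ through HqqConfig (the HF doc above); no calibration data needed. Minimal sketch, untested:

```python
# On-the-fly HQQ quantization at model load via transformers.
# pip install hqq  (backend kernels); HQQ is calibration-free.
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit weights, groups of 64

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model id
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
```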

EXL2 (ExLlamaV2)
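
Note: mixes quantization levels within and across layers, so you target a fractional average bits per weight (e.g. ~4.65 bpw) instead of a fixed integer width.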

OmniQuant

Here are my docs for how to run it: OmniQuant

QuIP

AWQ
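
Pre-quantized AWQ repos load straight through transformers if autoawq is installed. Minimal sketch (the repo id is just an example):

```python
# Load a pre-quantized AWQ checkpoint; transformers picks up the AWQ
# quantization_config stored in the repo. pip install autoawq
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # example repo id
    torch_dtype=torch.float16,                # AWQ kernels run in fp16
    device_map="cuda",
)
```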

GGML k-quants
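
k-quants (Q2_K through Q6_K) work in 256-weight super-blocks with per-block scales/mins; files come out of llama.cpp's quantize tool. Sketch of driving it from Python (binary name and paths are assumptions for your build):

```python
# Drive llama.cpp's quantizer from Python: fp16 GGUF in, Q4_K_M out.
import subprocess

subprocess.run(
    [
        "./llama-quantize",   # plain "quantize" in older llama.cpp builds
        "model-f16.gguf",     # input: full/half-precision GGUF
        "model-Q4_K_M.gguf",  # output: k-quant file
        "Q4_K_M",             # medium 4-bit k-quant preset
    ],
    check=True,
)
```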

SmoothQuant
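
The core trick, as I read the paper: migrate activation outliers into the weights with a per-channel scale s_j = max|X_j|^a / max|W_j|^(1-a), so both sides quantize cleanly to INT8 (W8A8). Sketch of just that scaling step:

```python
# SmoothQuant's per-channel smoothing: fold s into the weight columns and
# 1/s into whatever produces the activations, then quantize both to INT8.
import torch

def smooth_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act_absmax: calibration max |activation| per input channel, shape [in]
    # weight: nn.Linear weight, shape [out, in]
    w_absmax = weight.abs().amax(dim=0)  # per-input-channel max |W|
    return act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)

W = torch.randn(4096, 4096)
act_absmax = torch.rand(4096) * 10  # stand-in calibration stats

s = smooth_scales(act_absmax, W)
W_smoothed = W * s  # columns scaled up; activations get divided by s upstream
```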

SpQR
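
Note: near-lossless ~3-4 bit; isolates outlier weights in higher precision and quantizes the rest in small groups with quantized scales.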

GPTQ/OPTQ
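
transformers can quantize at load time via GPTQConfig (AutoGPTQ backend), calibrating on C4. Minimal sketch, untested:

```python
# One-shot GPTQ quantization at load time via transformers + AutoGPTQ.
# pip install auto-gptq optimum
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,  # quantizes layer by layer during load
)
model.save_pretrained("opt-125m-gptq")  # reload later without re-quantizing
```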