SmoothQuant: https://github.com/mit-han-lab/smoothquant
Sparse fine-tuning + DeepSparse (Neural Magic): https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/
Future Project
- Measure the metrics below (perplexity, KL divergence, benchmarks) for different quants
Perplexity
- https://oobabooga.github.io/blog/posts/perplexities/
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
- https://www.reddit.com/r/LocalLLaMA/comments/145tf00/analysis_of_sizeperplexity_tradeoff_for/
- https://www.reddit.com/r/LocalLLaMA/comments/16nmyqq/apples_to_apples_comparison_for_quantizations_of/
- https://www.reddit.com/r/LocalLLaMA/comments/13l0j7m/a_comparative_look_at_ggml_quantization_and/
KL Divergence
- https://www.reddit.com/r/LocalLLaMA/comments/1816h1x/how_much_does_quantization_actually_impact_models/
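Both metrics come out of the same forward passes, so here is one hedged sketch, assuming a Hugging Face transformers setup; the reference checkpoint, quantized checkpoint, eval text file, and 2048-token window are all placeholders, not fixed choices.

```python
# Sketch: perplexity of a quantized model plus token-level KL divergence against
# the full-precision reference. Model names, the eval text file, and the
# 2048-token window are assumptions for illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

REF_NAME = "meta-llama/Llama-2-7b-hf"        # full-precision reference (assumption)
QUANT_NAME = "TheBloke/Llama-2-7B-GPTQ"      # quantized variant (assumption)

tok = AutoTokenizer.from_pretrained(REF_NAME)
ref = AutoModelForCausalLM.from_pretrained(REF_NAME, torch_dtype=torch.float16, device_map="auto")
qnt = AutoModelForCausalLM.from_pretrained(QUANT_NAME, device_map="auto")

text = open("wiki.test.raw").read()          # any held-out text works
ids = tok(text, return_tensors="pt").input_ids[:, :2048].to(ref.device)

with torch.no_grad():
    ref_logits = ref(ids).logits.float()
    qnt_logits = qnt(ids.to(qnt.device)).logits.float().to(ref_logits.device)

# Perplexity of the quantized model: exp of the mean next-token negative log-likelihood.
nll = F.cross_entropy(qnt_logits[:, :-1, :].reshape(-1, qnt_logits.size(-1)),
                      ids[:, 1:].reshape(-1))
print("quantized perplexity:", torch.exp(nll).item())

# KL(ref || quant) per token position: how far the quantized output distribution
# drifts from the reference, independent of how well either predicts the test text.
ref_logp = F.log_softmax(ref_logits, dim=-1)
qnt_logp = F.log_softmax(qnt_logits, dim=-1)
kl_per_token = (ref_logp.exp() * (ref_logp - qnt_logp)).sum(-1)
print("mean KL(ref || quant):", kl_per_token.mean().item())
```

Perplexity scores the quantized model against the test text; the KL term scores it against the full-precision model's distribution, which is roughly the framing used in the linked KL-divergence thread.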
Performance Benchmarking
- HumanEval
- lm-eval-harness (see the sketch after this list)
- https://www.reddit.com/r/LocalLLaMA/comments/13yehfn/new_quantization_method_awq_outperforms_gptq_in/
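A minimal invocation sketch for the lm-eval-harness item: the checkpoint, task list, and batch size are placeholders, and the `simple_evaluate()` call follows the harness's documented Python entry point (v0.4.x), so check it against the installed version.

```python
# Sketch: scoring a quantized checkpoint with lm-eval-harness.
# Checkpoint, tasks, and batch size are placeholders; simple_evaluate()
# is the documented Python entry point in lm-eval v0.4.x and may differ elsewhere.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=TheBloke/Llama-2-7B-GPTQ",  # quantized checkpoint (placeholder)
    tasks=["hellaswag", "arc_challenge"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```

The CLI (`lm_eval --model hf --model_args pretrained=... --tasks ... --batch_size ...`) should give the same numbers; HumanEval needs a code-generation harness with execution enabled rather than this log-likelihood path.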
Quant methods: GPTQ, AWQ, K-quants (llama.cpp), EXL2, SqueezeLLM, OmniQuant, QuIP, SpQR, HQQ
BitNet
Q3_K_M is 3.91 bpw
Q4_K_M is 4.85 bpw
Q5_K_M is 5.69 bpw
Q6_K is 6.59 bpw
Q8_0 is 8.50 bpw
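Those bpw figures translate to file size as bytes ≈ parameters × bpw / 8. A quick sketch, assuming a 7B-parameter model purely for illustration (real GGUF files add a little metadata on top):

```python
# Sketch: estimate quantized file size from parameter count and bits per weight.
# The 7e9 parameter count is an assumption for illustration; actual files carry
# extra metadata, so treat the result as a lower bound.
PARAMS = 7_000_000_000

for name, bpw in [("Q3_K_M", 3.91), ("Q4_K_M", 4.85), ("Q5_K_M", 5.69),
                  ("Q6_K", 6.59), ("Q8_0", 8.50)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.2f} GiB")
```

For example, 7e9 × 4.85 / 8 ≈ 4.2 GB, which lines up with typical Q4_K_M 7B file sizes.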