To make a model that is both SOTA and fast:
- Distillation
  - Distilling Step-by-Step: https://arxiv.org/abs/2305.02301
  - https://github.com/google-research/distilling-step-by-step
  - https://snorkel.ai/llm-distillation-techniques-to-explode-in-importance-in-2024/
  - https://snorkel.ai/llm-distillation-demystified-a-complete-guide/
  - https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs
  - https://github.com/predibase/llm_distillation_playbook
  - MiniLLM: https://arxiv.org/abs/2306.08543
  - Survey on knowledge distillation of LLMs: https://arxiv.org/abs/2402.13116
  - https://arxiv.org/pdf/2402.15000
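The common core of the distillation resources above is training a small student on a teacher's temperature-softened output distribution. A minimal framework-free sketch of the classic soft-target loss (toy logits; function names are my own, and in practice this term is mixed with a hard-label cross-entropy loss):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# A student that matches the teacher has zero loss; a mismatched one does not.
loss_match = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_off = distillation_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1])
```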
- Sparsification
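Sparsification (pruning) zeroes out low-importance weights so the model is cheaper to store and, with sparse kernels or structured patterns, faster to serve. A toy unstructured magnitude-pruning sketch (illustrative only, not any specific library's API):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    # Threshold = magnitude of the k-th smallest weight; ties at the
    # threshold are also dropped in this simple version.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Prune half the weights: the three smallest magnitudes become zero.
pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.02], sparsity=0.5)
```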
- Speculative Decoding Techniques
  - https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/
  - Speculative decoding (Leviathan et al.): https://arxiv.org/abs/2211.17192
  - Medusa: https://github.com/FasterDecoding/Medusa
  - Medusa paper: https://arxiv.org/abs/2401.10774
  - https://arxiv.org/abs/2402.05109
  - https://arxiv.org/pdf/2403.09919
  - S3D (Skippy Simultaneous Speculative Decoding)
  - Lookahead Decoding
  - Ouroboros (Speculative + Lookahead Decoding)
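The speculative variants above all share one draft-then-verify loop: a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is kept, so outputs match the target model exactly while amortizing its cost. A greedy toy sketch (the model stubs and function names are hypothetical stand-ins; a real implementation verifies all draft positions in one batched forward pass):

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding with a draft window of k tokens.

    target_next / draft_next: callables mapping a token sequence to the
    next token (stand-ins for the large and small models).
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft model speculates k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Target model verifies each drafted position; keep the
        #    longest prefix where both models agree.
        accepted = 0
        for i in range(k):
            if target_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. On mismatch (or full acceptance) the target's own next
        #    token comes "for free" from the verification pass.
        if len(tokens) < len(prompt) + max_new:
            tokens.append(target_next(tokens))
    return tokens[: len(prompt) + max_new]

# Toy "models": the target counts upward; the draft agrees but caps at 5,
# so early steps accept whole windows and later steps fall back to step 3.
target = lambda seq: seq[-1] + 1
draft = lambda seq: min(seq[-1] + 1, 5)
out = speculative_decode(target, draft, [0], max_new=8)
```

With greedy decoding the output is token-for-token identical to running the target model alone; the draft model only changes how many target calls are needed.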