Speed collection: https://huggingface.co/collections/leonardlin/speed-6583d7a3b02f38ef348139ef
Good recent summary of techniques: https://vgel.me/posts/faster-inference/
Exponentially Faster Language Modelling
- https://arxiv.org/abs/2311.10770
- “we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference”
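The speedup in that paper comes from fast feedforward (FFF) networks: the dense FFN is replaced by a balanced binary tree, inference descends one root-to-leaf path, and only a single leaf "expert" is evaluated, so cost scales with tree depth rather than layer width. A minimal NumPy sketch of that conditional descent (parameter shapes and names here are hypothetical, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 8, 3            # input width, tree depth -> 2**depth leaves

# Toy parameters: one decision vector per internal node,
# one small output expert per leaf (shapes are illustrative).
node_w = rng.standard_normal((2**depth - 1, d))
leaf_w = rng.standard_normal((2**depth, d, d))

def fff_forward(x):
    """Descend the decision tree, evaluating only one leaf expert.

    A dense mixture over 2**depth experts would touch all of them;
    here inference costs `depth` dot products plus one leaf matmul.
    """
    node = 0
    for _ in range(depth):
        go_right = node_w[node] @ x > 0           # hard branch at inference
        node = 2 * node + (2 if go_right else 1)  # heap-style child index
    leaf = node - (2**depth - 1)                  # leaf offset among 2**depth leaves
    return leaf_w[leaf] @ x

y = fff_forward(rng.standard_normal(d))
print(y.shape)  # (8,)
```

The reported CPU speedups come from exactly this sparsity: a BERT-scale FFF layer touches ~log₂(n) of its n neurons per token instead of all of them.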
Lookahead Decoding (parallel Jacobi-style decoding; generates and verifies n-gram guesses without a separate draft model) https://lmsys.org/blog/2023-11-21-lookahead-decoding/
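The core of lookahead-style decoding is guess-and-verify: candidate n-grams (in the real method, harvested from Jacobi iterations) are checked against the model's own greedy continuation, and the longest confirmed prefix is accepted, yielding multiple tokens per step. A toy sketch of just the verification step, with a stand-in "model" (the real algorithm batches these checks in one forward pass):

```python
# Toy greedy "model": next token is (last + 1) mod 10.
# Stands in for an LLM's argmax over the vocabulary.
def next_token(seq):
    return (seq[-1] + 1) % 10

def verify(seq, gram):
    """Return how many leading tokens of the candidate n-gram the model confirms."""
    accepted = 0
    for tok in gram:
        if next_token(seq + gram[:accepted]) != tok:
            break
        accepted += 1
    return accepted

seq = [3, 4]
pool = [[5, 6, 9], [5, 6, 7], [8, 8, 8]]   # hypothetical n-gram guess pool
best = max(pool, key=lambda g: verify(seq, g))
n = verify(seq, best)
seq += best[:n]                            # accept 3 tokens in one "step"
print(seq)  # [3, 4, 5, 6, 7]
```

Because every accepted token is re-checked against the model's own greedy choice, the output distribution is unchanged; only wall-clock decoding steps shrink.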
Medusa (adds extra decoding heads that predict several future tokens at once, verified with tree attention) https://sites.google.com/view/medusa-llm
Groq (custom LPU inference hardware)
- HN discussion: https://news.ycombinator.com/item?id=38739199
- Demo: https://chat.groq.com/
- ISCA 2022 paper: https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper2022_ASoftwareDefinedTensorStreamingMultiprocessorForLargeScaleMachineLearning-1.pdf
- BittWare hardware: https://www.bittware.com/products/groq/