See: https://huggingface.co/collections/leonardlin/speed-6583d7a3b02f38ef348139ef
- 2024-03-19 Exploring LLMs Speed Benchmarks: Independent Analysis
- Simple vLLM, TGI, TensorRT, DeepSpeed-MII, CTranslate2 comparison
- 2024-01-07 MK1 Flywheel Unlocks the Full Potential of AMD Instinct for LLM Inference
- Closed source, but interesting analysis of MI210 optimization
- Discussion
- 2023-11 llama.cpp Mac performance
- 2023-08-14 Aman Sanger (cursor.so) comparing high batch throughput
- 2023-08-11 Optimizing latency
- MLC, CTranslate2, vLLM, TGI
- A6000
- batch 1 but focused on serving
- 2023-08-09 [Survey] Supported Hardwares and Speed
- MLC LLM speeds for all their hardware (SOTA batch 1 perf)
- https://github.com/mlc-ai/llm-perf-bench
- MLC LLM vs ExLlama, llama.cpp
- 2023-08-09 Making AMD GPUs competitive for LLM inference
- 2023-07-31 7 Frameworks for Serving LLMs
- vLLM, TGI, CTranslate2, DeepSpeed-MII, OpenLLM, Ray Serve, MLC LLM
- 2023-07-06 LLaMa 65B GPU benchmarks - great benchmark and writeups
- 3090 v 4090 v A6000 v A6000 ADA
- ExLlama, ExLlama_HF, llama.cpp
My testing:
- 2023-08-16 CPU shootoff
- 7940HS v 5950X v 1260P v M2
- 2023-08-03 Inference Engine Shootout
- MLC v llama.cpp v ExLlama
- 2023-07-28 3090 and 4090 Power Limit performance
- You can shave 50-100W off the power limit (PL) and retain 97% of performance
More:
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
- LMDeploy / Turbomind
For batch=1, you can roughly estimate inference performance:
- A rough 20:1 FLOP:byte ratio marks the crossover between being compute limited and memory-bandwidth limited; batch=1 token generation falls well on the memory-bandwidth-bound side
- Use an MT/s to GB/s unit converter to get memory bandwidth, or compute it directly: memory channels × 8 bytes (64-bit per channel) × MT/s ÷ 1000 = GB/s
- Divide memory bandwidth (GB/s) by the memory the model occupies (GB) to get a ballpark batch=1 tokens/s estimate, since every generated token has to read all of the weights once
- For a q4 quant, you can also ballpark by dividing bandwidth by the parameter count (in billions)
- As an example, a 4090 w/ 1,008 GB/s of memory bandwidth running a 7B q4 model would be expected to get around 150 or 144 t/s depending on which estimate you use, pretty close to benchmark results (see the sketch below)
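Below is a minimal sketch of that arithmetic in Python. The dual-channel DDR5-5600 system and the ~4.1 GB 7B q4 model are illustrative assumptions, not measurements; the size-based number is an upper bound (sustained real-world bandwidth is lower than the spec-sheet figure), while the q4 parameter-count rule of thumb tends to land closer to actual benchmark results.

```python
# Back-of-the-envelope batch=1 decode estimates, as described above.
# Sketch only: the DDR5-5600 config and ~4.1 GB 7B q4 model are illustrative assumptions.


def ddr_bandwidth_gbs(channels: int, mts: int) -> float:
    """GB/s = channels x 8 bytes per transfer (64-bit channel) x MT/s / 1000."""
    return channels * 8 * mts / 1000


def tps_from_model_size(bandwidth_gbs: float, model_gb: float) -> float:
    """Estimate 1: each generated token streams all weights once, so t/s ~ bandwidth / model size."""
    return bandwidth_gbs / model_gb


def tps_q4_from_params(bandwidth_gbs: float, params_b: float) -> float:
    """Estimate 2 (q4 rule of thumb): t/s ~ bandwidth (GB/s) / parameter count (billions)."""
    return bandwidth_gbs / params_b


if __name__ == "__main__":
    # Dual-channel DDR5-5600: 2 x 8 x 5600 / 1000 = 89.6 GB/s
    ddr5 = ddr_bandwidth_gbs(channels=2, mts=5600)
    print(f"Dual-channel DDR5-5600: {ddr5:.1f} GB/s")

    # ~4.1 GB of q4 7B weights -> upper bound of ~22 t/s from that memory bus
    print(f"7B q4 on that CPU (upper bound): {tps_from_model_size(ddr5, 4.1):.0f} t/s")

    # RTX 4090: 1008 GB/s, 7B model -> 1008 / 7 = 144 t/s
    print(f"7B q4 on a 4090 (q4 rule of thumb): {tps_q4_from_params(1008, 7):.0f} t/s")
```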