See: https://huggingface.co/collections/leonardlin/speed-6583d7a3b02f38ef348139ef
- 2024-03-19 Exploring LLMs Speed Benchmarks: Independent Analysis
- Simple vLLM, TGI, TensorRT, DeepSpeed-MII, CTranslate2 comparison
- 2024-01-07 MK1 Flywheel Unlocks the Full Potential of AMD Instinct for LLM Inference
- Closed source, but interesting analysis of MI210 optimization
- Discussion
- 2023-11 llama.cpp Mac performance
- 2023-08-14 Aman Sanger (cursor.so) comparing high batch throughput
- 2023-08-11 Optimizing latency
- MLC, CTranslate2, vLLM, TGI
- A6000
- batch 1 but focused on serving
- 2023-08-09 [Survey] Supported Hardwares and Speed
- MLC LLM speeds for all their hardware (SOTA batch 1 perf)
- https://github.com/mlc-ai/llm-perf-bench
- MLC LLM vs ExLlama, llama.cpp
- 2023-08-09 Making AMD GPUs competitive for LLM inference
- 2023-07-31 7 Frameworks for Serving LLMs
- vLLM, TGI, CTranslate2, DeepSpeed-MII, OpenLLM, Ray Serve, MLC LLM
- 2023-07-06 LLaMa 65B GPU benchmarks - great benchmark and writeups
- 3090 v 4090 v A6000 v A6000 ADA
- ExLlama, ExLlama_HF, llama.cpp
My testing:
- 2023-08-16 CPU shootoff
- 7940HS v 5950X v 1260P v M2
- 2023-08-03 Inference Engine Shootout
- MLC v llama.cpp v ExLlama
- 2023-07-28 3090 and 4090 Power Limit performance
- You can shave 50-100W off the power limit (PL) and retain 97% of performance
More:
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
- LMDeploy / Turbomind
For batch=1, you can roughly estimate inference performance:
- A rough 20:1 FLOP:byte ratio marks the crossover between being compute limited and memory-bandwidth limited; batch=1 token generation falls well on the memory-bandwidth-bound side
- Use an MT/s to GB/s unit converter to get memory bandwidth, or compute it directly: memory channels × 8 bytes (64-bit per channel) × MT/s ÷ 1000 = GB/s
- Divide memory bandwidth (GB/s) by the memory the model occupies (GB) to get a ballpark batch=1 tokens/s estimate, since every generated token has to read all of the weights once
- For a q4 quant, you can also ballpark by dividing bandwidth by the parameter count (in billions)
- As an example, a 4090 w/ 1,008 GB/s of memory bandwidth running a 7B q4 model would be expected to get around 150 or 144 t/s depending on which estimate you use, pretty close to benchmark results (see the sketch below)
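Below is a minimal sketch of that arithmetic in Python. The dual-channel DDR5-5600 system and the ~4.1 GB 7B q4 model are illustrative assumptions, not measurements; the size-based number is an upper bound (sustained real-world bandwidth is lower than the spec-sheet figure), while the q4 parameter-count rule of thumb tends to land closer to actual benchmark results.

```python
# Back-of-the-envelope batch=1 decode estimates, as described above.
# Sketch only: the DDR5-5600 config and ~4.1 GB 7B q4 model are illustrative assumptions.


def ddr_bandwidth_gbs(channels: int, mts: int) -> float:
    """GB/s = channels x 8 bytes per transfer (64-bit channel) x MT/s / 1000."""
    return channels * 8 * mts / 1000


def tps_from_model_size(bandwidth_gbs: float, model_gb: float) -> float:
    """Estimate 1: each generated token streams all weights once, so t/s ~ bandwidth / model size."""
    return bandwidth_gbs / model_gb


def tps_q4_from_params(bandwidth_gbs: float, params_b: float) -> float:
    """Estimate 2 (q4 rule of thumb): t/s ~ bandwidth (GB/s) / parameter count (billions)."""
    return bandwidth_gbs / params_b


if __name__ == "__main__":
    # Dual-channel DDR5-5600: 2 x 8 x 5600 / 1000 = 89.6 GB/s
    ddr5 = ddr_bandwidth_gbs(channels=2, mts=5600)
    print(f"Dual-channel DDR5-5600: {ddr5:.1f} GB/s")

    # ~4.1 GB of q4 7B weights -> upper bound of ~22 t/s from that memory bus
    print(f"7B q4 on that CPU (upper bound): {tps_from_model_size(ddr5, 4.1):.0f} t/s")

    # RTX 4090: 1008 GB/s, 7B model -> 1008 / 7 = 144 t/s
    print(f"7B q4 on a 4090 (q4 rule of thumb): {tps_q4_from_params(1008, 7):.0f} t/s")
```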