NOTE: This document tries to avoid using the term “performance” since in ML research the term performance typically refers to measuring model quality/capabilities.

This is a cheat sheet for running a simple benchmark on consumer hardware for LLM inference using the most popular end-user inference engine, llama.cpp, and its included llama-bench tool. Feel free to skip to the HOWTO section if you want.

  • If you're looking for more information and industrial-scale benchmarking, the best place to start is probably Stas Bekman's Machine Learning Engineering Open Book, particularly the Compute/Accelerator section. Last year I did a long writeup on tuning vLLM for MI300X that links to many resources and is a good starting point as well. I recommend using mamf-finder.py for hardware testing, and vLLM/SGLang's benchmark_serving implementations to generate throughput, TTFT, and TPOT measurements at different concurrencies.

Background and Definitions

Large Language Model (LLM) Basics

  • Parameters: LLMs are basically a big pile of numbers (matrices). They come in different sizes, given as a parameter count - when you see 7B, 8B, or 14B, that is an approximate count of how many billions of parameters the model has. Dense models activate all parameters for every token generated; Mixture of Experts (MoE) models typically only activate a fraction of their parameters.
  • Quantization - Model weights (the parameters) used to be stored as FP32 (4 bytes per weight), then FP16/BF16 (2 bytes). Commercially, FP8 and INT8 are quite common, and FP4 and INT4 are emerging. At home, "Q4" (roughly 4-bit) is most often used, but there are even smaller quants that are usable these days (down to ~1.58 bits). Note that quality loss is not linear - a good Q4 quant can be close to the original, and PTQ quants with the proper calibration set can even surpass unquantized quality. Most home users will probably be running Q4 quants, which generally lose only a few percentage points in quality while taking almost 4 times less memory and running roughly 4 times faster than the FP16/BF16 full-precision versions (see the back-of-the-envelope calculation after this list).
  • Weights, Activations, Computational Precision - when we talk about precision, there are actually differences between the weights, the activations, and the actual computational precision. Just mentioning this to head off some confusion.
  • Almost all popular desktop tools like LM Studio, Ollama, Jan, and AnythingLLM run llama.cpp as their inference backend. llama.cpp has many backends - Metal for Apple Silicon, CUDA, HIP (ROCm), Vulkan, and SYCL among them (for Intel GPUs, Intel maintains a fork with an IPEX-LLM backend that performs much better than the upstream SYCL version).
    • AMD GPUs - the most comprehensive guide on running AI/ML software on AMD GPUs
    • Intel GPUs - some notes and testing w/ Intel Arc GPUs
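
As a back-of-the-envelope illustration of the memory math above (the parameter count and bits-per-weight below are assumed round numbers, not measurements of any particular model or quant):

# weight memory (GB) ≈ params (billions) × bits per weight / 8
echo "8 * 16 / 8" | bc -l    # 8B model at FP16/BF16 (16 bits/weight) -> ~16 GB
echo "8 * 4.5 / 8" | bc -l   # 8B model at a typical Q4 (~4.5 bits/weight) -> ~4.5 GB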

HOWTO

Testing Notes

Simple Benchmark

# Default
build/bin/llama-bench -m $MODEL_PATH

# w/ Flash Attention enabled
build/bin/llama-bench -m $MODEL_PATH -fa 1
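
# Optional: change the prompt/generation lengths and repetition count
# (defaults are pp512 / tg128 with 5 repetitions; the values below are arbitrary examples)
build/bin/llama-bench -m $MODEL_PATH -p 2048 -n 256 -r 5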
  • This runs prompt processing (compute-limited) pp512 and token generation (memory-bandwidth-limited) tg128 tests 5 times each and outputs a nice table for you - with this you can repeatedly test and compare these numbers in a much more reliable fashion
    • pp512 measures how fast you process context/existing conversation history - e.g., if you have 4000 tokens of context and your pp is 100 tok/s, you will have to wait 40s before any new tokens start generating.
    • tg128 measures how fast your device will generate new tokens for a single user (batch size = 1, bs=1)
  • Generally GGUF models take up about the GGUF file size in memory for the weights, with additional memory required for the KV cache depending on your max context size. You can therefore do a rough tok/s estimate by dividing memory bandwidth (MBW) by the GGUF size (see the worked example after this list).
  • There are a lot of additional options - the big one is -ngl if you have limited device memory and need to limit how many layers are loaded onto the device; if you get out-of-memory errors, lower the layer count until it all fits, and the rest will be offloaded to system memory (example command after this list)
  • If you have multiple devices, you may need to set GGML_VK_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES to select which ones you want to use/test (example below). Typically multi-device inference will run at roughly the average of your different device speeds
  • It's best to benchmark headless devices of course, but if you have to use a GPU that is also driving a display, try to have it doing as little as possible. It'd be best to SSH in remotely to do your testing.
  • nvtop (which works with Nvidia and AMD GPUs) is a useful tool for realtime monitoring and debugging. nvidia-smi and rocm-smi can be used for logging runs (especially memory high-water marks and power consumption)
  • Flash Attention lowers memory usage for context and also slightly increases speed on CUDA, but it can make other devices dramatically slower (due to limitations of llama.cpp's current FA implementation, not anything intrinsic to the hardware or the Flash Attention algorithm)
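
A quick sketch of the rough speed estimate mentioned above, using made-up numbers (a device with ~1000 GB/s of memory bandwidth and a ~4.5 GB Q4 GGUF):

# theoretical bs=1 token generation ceiling ≈ memory bandwidth (GB/s) / GGUF size (GB)
echo "1000 / 4.5" | bc -l    # ≈ 222 tok/s upper bound; real-world tg will be lower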

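Example invocations for the -ngl and device-selection notes above (the layer count and device indices are placeholders - adjust them for your own hardware):

# offload only the first 20 layers to the device; lower -ngl until you stop getting OOM errors
build/bin/llama-bench -m $MODEL_PATH -ngl 20

# benchmark one GPU at a time (use the environment variable that matches your backend)
CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m $MODEL_PATH
GGML_VK_VISIBLE_DEVICES=0 build/bin/llama-bench -m $MODEL_PATH
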
More comprehensive benchmarks

  • Use nvidia-smi and rocm-smi to track power usage while you benchmark
  • You can adjust power limits and see how inference speed is affected (see the sketch after this list)
  • llama-bench is good for a basic repeatable test of max throughput, but you probably want to use something like vLLM's benchmark_serving for testing Time To First Token (TTFT) and Time Per Output Token (TPOT)
  • As mentioned, llama.cpp also supports speculative decoding, which is worth benchmarking separately if you plan to use it
  • If you want to test multi-user or batched throughput (e.g., you just want to process a lot of text), llama.cpp's kernels are usually not very good at this and you'll want to check out vLLM, SGLang, TGI, etc.
  • If you have >130GB of memory, you can give some of these R1 quants a try: https://unsloth.ai/blog/deepseekr1-dynamic
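
A hedged sketch of the power logging and power limiting mentioned above - exact query fields and flags vary by driver and ROCm version, so check nvidia-smi --help-query-gpu and rocm-smi --help on your system:

# log power draw and VRAM use once per second while a benchmark runs (NVIDIA)
nvidia-smi --query-gpu=power.draw,memory.used --format=csv -l 1 > gpu_log.csv

# rough AMD equivalent
rocm-smi --showpower --showmeminfo vram

# lower the board power limit (in watts, needs root), then re-run llama-bench to compare
sudo nvidia-smi -pl 250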