See: https://huggingface.co/collections/leonardlin/speed-6583d7a3b02f38ef348139ef

My testing:

More:

You can roughly estimate batch=1 performance:

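  • Assume a roughly 20:1 FLOP/byte ratio as the dividing line between compute-limited and memory-bandwidth-limited work; batch=1 decoding is typically memory-bandwidth limited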
  • Use an MT/s to GB/s unit converter to get memory bandwidth: memory channels x MT/s x bytes per transfer (8 for a 64-bit channel) / 1000 = GB/s
    • You can divide memory bandwidth (GB/s) by the memory used for a model (GB) to get a ballpark estimate of batch=1 perf in t/s
    • For a q4 quant, you can also ballpark w/ # of parameters.
    • As an example, a 4090 w/ 1,008 GB/s of memory bandwidth would be expected to get around 150 or 144 t/s, depending on which estimate you use, which is pretty close to benchmark results
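
The ballpark math above can be sketched in a few lines of Python. This is a rough sketch, not a benchmark tool. The function names are made up for illustration, and the 64-bit channel width and the dual-channel DDR5-6000 example are assumptions. The parameter-count shortcut is read here as bandwidth divided by parameters in billions, an interpretation inferred from the 144 t/s figure (1,008 / 7); the 7 is an assumed model size that the notes above do not state.

```python
def memory_bandwidth_gbs(channels: int, mts: float, bus_width_bits: int = 64) -> float:
    """Peak bandwidth in GB/s: channels x MT/s x bytes per transfer / 1000.

    bus_width_bits=64 is an assumption (a typical DDR channel).
    """
    return channels * mts * (bus_width_bits / 8) / 1000


def tokens_per_second(bandwidth_gbs: float, model_gb: float) -> float:
    """Batch=1 ballpark: each generated token reads the model's memory once."""
    return bandwidth_gbs / model_gb


def q4_tokens_per_second(bandwidth_gbs: float, params_billions: float) -> float:
    """q4 shortcut, read as bandwidth / parameters in billions (assumed reading)."""
    return bandwidth_gbs / params_billions


if __name__ == "__main__":
    # Hypothetical dual-channel DDR5-6000 system.
    print(memory_bandwidth_gbs(channels=2, mts=6000))

    # 4090 figure from the notes, with an assumed 7B-parameter model.
    print(q4_tokens_per_second(1008, 7))
```

Treat the result as an upper-end ballpark: real throughput also depends on the backend, context length, and how much of the peak bandwidth the kernels actually reach.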

https://github.com/apoorvumang/prompt-lookup-decoding