My testing:


For batch=1 inference, you can roughly estimate throughput:

  • A roughly 20:1 FLOP/byte ratio marks the crossover between being compute-limited and memory-bandwidth-limited; batch=1 token generation sits well below that, so it is bandwidth-bound
  • Use an MT/s to GB/s unit converter to get memory bandwidth: memory channels x bus width (bytes per transfer) x MT/s = GB/s
    • You can divide memory bandwidth by memory used for a model to get a ballpark estimate of batch=1 perf
    • For a q4 quant, you can also ballpark the model size (and hence t/s) from the parameter count.
    • As an example, a 4090 w/ 1,008 GB/s of memory bandwidth would be expected to get around 150 or 144 t/s depending on which estimate you use, which is pretty close to benchmark results
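The estimate above can be sketched in a few lines. The model size (7 GB, roughly a 7B-parameter q4 quant) and the 8-bytes-per-transfer bus width are illustrative assumptions, not figures from the original:

```python
def bandwidth_gbps(channels: int, mts: int, bus_width_bytes: int = 8) -> float:
    """Memory bandwidth in GB/s: channels x transfers/s x bytes per transfer.

    bus_width_bytes=8 assumes a standard 64-bit DDR channel.
    """
    return channels * mts * bus_width_bytes / 1000


def est_tokens_per_sec(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Batch=1 generation must stream the full model weights once per token,
    so throughput is roughly bandwidth divided by model size."""
    return bandwidth_gbs / model_size_gb


# Dual-channel DDR5-6000: 2 x 6000 x 8 / 1000 = 96 GB/s
print(bandwidth_gbps(channels=2, mts=6000))   # 96.0

# 4090 (1008 GB/s) running a hypothetical ~7 GB q4 model
print(est_tokens_per_sec(1008, 7.0))          # 144.0 t/s
```

This is an upper bound: real throughput lands somewhat lower once KV cache reads, activation traffic, and kernel overhead are counted.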