Ubuntu 24.04 LTS, ROCm 6.2

(base) lhl@rocm:~/llama.cpp$ CUDA_VISIBLE_DEVICES=0 ./llama-bench -m /models/gguf/llama-2-7b.Q4_K_M.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W7900, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | ROCm       |  99 |         pp512 |  2845.90 ± 11.00 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | ROCm       |  99 |         tg128 |     78.92 ± 0.12 |

build: 96355290 (3141)
(base) lhl@rocm:~/llama.cpp$ CUDA_VISIBLE_DEVICES=0 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W7900, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         pp512 | 2837.83 ± 136.68 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         tg128 |     94.46 ± 0.07 |

build: 96355290 (3141)

w/ latest build

(base) lhl@rocm:~/llama.cpp$ CUDA_VISIBLE_DEVICES=0 ./llama-bench -m /models/gguf/llama-2-7b.Q4_K_M.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W7900, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | ROCm       |  99 |         pp512 |  2877.31 ± 14.47 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | ROCm       |  99 |         tg128 |     79.44 ± 0.13 |

build: c02b0a8a (3512)
(base) lhl@rocm:~/llama.cpp$ CUDA_VISIBLE_DEVICES=0 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W7900, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         pp512 |  2907.80 ± 22.70 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         tg128 |     95.01 ± 0.05 |

build: c02b0a8a (3512)