
AMD GPUs

As of August 2023, AMD's ROCm GPU compute software stack is available for Linux or Windows.

Linux

Testing was done with a Radeon VII (16GB HBM2 VRAM, gfx906) on Arch Linux.

Officially supported GPUs for ROCm 5.6 are: Radeon VII, Radeon Pro VII, V620, W6800, and the Instinct MI50, MI100, MI210, MI250, and MI250X.

RDNA3 (e.g., 7900 XT, XTX)

These are not officially supported with ROCm 5.6 (some support is coming in 5.7 in Fall 2023); however, you may be able to get them working for certain tasks (Stable Diffusion, LLM inferencing).

AMD APU

Performance of a 65W 7940HS with 64GB of DDR5-5600 (83GB/s theoretical memory bandwidth): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1041125589

  • On small (7B) models that fit within the UMA VRAM, ROCm performance is very similar to my M2 MBA's Metal performance. Inference is barely faster than CLBlast/CPU though (~10% faster).
  • On a big (70B) model that doesn't fit into the allocated VRAM, ROCm inference is slower than CPU with -ngl 0 (CLBlast crashes), and CPU performance is about as expected: roughly 1.3 t/s inferencing a Q4_K_M. Besides being slower, the ROCm version also caused amdgpu exceptions that killed Wayland 2 out of 3 times (running Linux 6.5.4, ROCm 5.6.1, Mesa 23.1.8).

Note: the BIOS lets me allocate up to 8GB for VRAM (UMA_SPECIFIED GART), and ROCm does not support GTT (which would be about 35GB of the 64GB if it did; still not enough for a 70B Q4_0, not that you'd want to run one at those speeds).

Vulkan drivers can use GTT memory dynamically, but with MLC LLM the Vulkan version is 35% slower than CPU-only llama.cpp. Also, the maximum GART+GTT is still too small for 70B models.
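If you want to see how memory is actually split on an APU, the amdgpu driver exposes the pool sizes through sysfs; a quick sketch (card0 is an assumption and may be a different index on your system):

# sizes are reported in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total   # UMA carve-out (GART)
cat /sys/class/drm/card0/device/mem_info_gtt_total    # GTT pool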

Arch Linux Setup

Install ROCm:

# all the amd gpu compute stuff
yay -S rocm-hip-sdk rocm-ml-sdk rocm-opencl-sdk

# third party monitoring
yay -S amdgpu_top radeontop
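Once installed, it's worth a quick sanity check that ROCm actually sees the card (rocminfo and rocm-smi should come in with the packages above):

# the GPU should show up as an agent (gfx906 for a Radeon VII)
rocminfo | grep -i gfx
# basic temperature/clock/VRAM reporting
rocm-smi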

Install conda (mamba):

yay -S mambaforge
/opt/mambaforge/bin/mamba init fish

Create an environment:

mamba create -n llm
mamba activate llm
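The environment starts out empty; here's a sketch of pulling in Python and pip for the PyTorch/exllama steps later (the Python version is just a suggestion):

mamba install -n llm python=3.10 pip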

OC

We have some previously established known-good memory timings for our Radeon VII card:

sudo sh -c 'echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level'
sudo sh -c 'echo 8 > /sys/class/drm/card0/device/pp_dpm_sclk'
sudo amdmemorytweak --gpu 0 --ref 7500 --rtp 6 --rrds 3 --faw 12 --ras 19 --rc 30 --rcdrd 11 --rp 11
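To hand clock control back to the driver afterwards (the amdmemorytweak timings shouldn't persist across a reboot, as far as I know):

sudo sh -c 'echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level'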

llama.cpp

Let's first try llama.cpp:

mkdir ~/llm
cd ~/llm
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CLBLAST=1 make
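The CLBlast build assumes the OpenCL headers and CLBlast library are already installed; on Arch, something like this should cover it (package names from memory, adjust as needed):

yay -S clblast opencl-headers ocl-icd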

We're benchmarking with a recent llama-13b q4_0 fine-tune (Nous Hermes).

Here are the results from the 2023-06-29 commit 96a712c:

NOTE: We use -ngl 99 to ensure all layers are offloaded to the GPU.

CUDA_VISIBLE_DEVICES=0 ./main -m ../models/nous-hermes-13b.ggmlv3.q4_0.bin -ngl 99 -n 2048 --ignore-eos

main: build = 762 (96a712c)
main: seed  = 1688035176
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx906:sramecc+:xnack-'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from ../models/nous-hermes-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 2223.88 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 8416 MB
llama_new_context_with_model: kv self size  =  400.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 2048, n_keep = 0

...

llama_print_timings:        load time =  6946.39 ms
llama_print_timings:      sample time =  2172.76 ms /  2048 runs   (    1.06 ms per token,   942.58 tokens per second)
llama_print_timings: prompt eval time = 51096.76 ms /  1801 tokens (   28.37 ms per token,    35.25 tokens per second)
llama_print_timings:        eval time = 308604.23 ms /  2040 runs   (  151.28 ms per token,     6.61 tokens per second)
llama_print_timings:       total time = 362807.86 ms

We get 6.61 t/s.

rocm-smi looks something like:

GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK    Fan     Perf    PwrCap  VRAM%  GPU%
0    59.0c           112.0W  1801Mhz  800Mhz  44.71%  manual  250.0W   49%   40%
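To get a similar snapshot while generation is running, you can just poll it:

watch -n 1 rocm-smi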

llama.cpp HIP fork

Now let's see if HIPified CUDA makes a difference, using SlyEcho's fork.

Here are the results from a 13B q4_0 on 2023-06-29, commit 04419f1:

git clone https://github.com/SlyEcho/llama.cpp llama.cpp-hip
cd llama.cpp-hip
git fetch
make -j8 LLAMA_HIPBLAS=1
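If the build picks up the wrong compiler or GPU architecture, it may help to point it at the ROCm toolchain and your gfx target explicitly (variable names have changed between revisions of the Makefile, so treat these as assumptions):

# gfx906 = Radeon VII; substitute your card's target (may be spelled GPU_TARGETS in some revisions)
make -j8 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx906 ROCM_PATH=/opt/rocm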

CUDA_VISIBLE_DEVICES=0 ./main -m ../models/nous-hermes-13b.ggmlv3.q4_0.bin -ngl 99 -n 2048 --ignore-eos

main: build = 821 (04419f1)
main: seed  = 1688034262
ggml_init_cublas: found 1 CUDA devices:
  Device 0: AMD Radeon VII
llama.cpp: loading model from ../models/nous-hermes-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2135.99 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9016 MB
llama_new_context_with_model: kv self size  =  400.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 2048, n_keep = 0

...

llama_print_timings:        load time =  4049.27 ms
llama_print_timings:      sample time =  1307.03 ms /  2048 runs   (    0.64 ms per token,  1566.91 tokens per second)
llama_print_timings: prompt eval time = 17486.67 ms /  1801 tokens (    9.71 ms per token,   102.99 tokens per second)
llama_print_timings:        eval time = 157571.58 ms /  2040 runs   (   77.24 ms per token,    12.95 tokens per second)
llama_print_timings:       total time = 176912.26 ms

We get 12.95 t/s, almost 2X faster than the OpenCL version. If you are using llama.cpp on AMD GPUs, I think it's safe to say you should definitely use this HIP fork.

Note: the 4-core Zen Ryzen 5 2400G CPU gets about 2.2 t/s, so the GPU is about 6X faster.

exllama

ROCm support was merged 2023-06-07.

We run a 13B 4-bit GPTQ model (Manticore-13B-GPTQ) on 2023-06-29 with commit 93d50d1:

# make sure we have git-lfs working
yay -S git-lfs
git lfs install
# in models folder
git clone https://huggingface.co/TheBloke/Manticore-13B-GPTQ

git clone https://github.com/turboderp/exllama
cd exllama

# install ROCm PyTorch https://pytorch.org/get-started/locally/
pip3 install torch --index-url https://download.pytorch.org/whl/rocm5.4.2
pip install -r requirements.txt
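Before benchmarking, it's worth confirming that the ROCm build of PyTorch actually sees the card (ROCm devices are exposed through the CUDA API):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"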

python test_benchmark_inference.py -d ~/llm/models/Manticore-13B-GPTQ/ -p
Successfully preprocessed all matching files.
 -- Tokenizer: /home/lhl/llm/models/Manticore-13B-GPTQ/tokenizer.model
 -- Model config: /home/lhl/llm/models/Manticore-13B-GPTQ/config.json
 -- Model: /home/lhl/llm/models/Manticore-13B-GPTQ/Manticore-13B-GPTQ-4bit-128g.no-act-order.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- --rmsnorm_no_half2
 -- --rope_no_half2
 -- --matmul_no_half2
 -- --silu_no_half2
 -- Options: ['perf']
 ** Time, Load model: 6.86 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 6,873.52 MB - [cuda:1] 0.00 MB
 -- Warmup pass 1...
 ** Time, Warmup: 0.36 seconds
 -- Warmup pass 2...
 ** Time, Warmup: 4.43 seconds
 -- Inference, first pass.
 ** Time, Inference: 4.52 seconds
 ** Speed: 425.13 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 9.92 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 19.21 tokens/second
 ** VRAM, Inference: [cuda:0] 2,253.20 MB - [cuda:1] 0.00 MB
 ** VRAM, Total: [cuda:0] 9,126.72 MB - [cuda:1] 0.00 MB

These results are actually a regression from commit dd63e07 (which was about 15 t/s). At 9.92 t/s, the llama.cpp HIP fork is now 30% faster.

RTX 4090 Comparison

As a point of comparison, an RTX 4090 running llama.cpp built with make LLAMA_CUBLAS=1 gets about 72 t/s:

./main -m /data/ai/models/llm/manticore/Manticore-13B-Chat-Pyg.ggmlv3.q4_0.bin -ngl 99 -n 2048 --ignore-eos

...

llama_print_timings:        load time =  3569.39 ms
llama_print_timings:      sample time =   930.53 ms /  2048 runs   (    0.45 ms per token,  2200.89 tokens per second)
llama_print_timings: prompt eval time =  2608.07 ms /  1801 tokens (    1.45 ms per token,   690.55 tokens per second)
llama_print_timings:        eval time = 28273.11 ms /  2040 runs   (   13.86 ms per token,    72.15 tokens per second)
llama_print_timings:       total time = 32225.03 ms

exllama performs about on par with llama.cpp here, and we get 74.79 t/s:

python test_benchmark_inference.py -p -d /models/llm/manticore/manticore-13b-chat-pyg-GPTQ
 -- Tokenizer: /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/tokenizer.model
 -- Model config: /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/config.json
 -- Model: /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/Manticore-13B-Chat-Pyg-GPTQ-4bit-128g.no-act-order.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perf']
 ** Time, Load model: 3.98 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 6,873.52 MB
 -- Warmup pass 1...
 ** Time, Warmup: 1.55 seconds
 -- Warmup pass 2...
 ** Time, Warmup: 0.07 seconds
 -- Inference, first pass.
 ** Time, Inference: 0.25 seconds
 ** Speed: 7600.98 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 74.79 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 99.17 tokens/second
 ** VRAM, Inference: [cuda:0] 1,772.79 MB
 ** VRAM, Total: [cuda:0] 8,646.31 MB

Recommendation

Radeon VII 16GB cards are going for about $250-$300 on eBay (equivalent to an Instinct MI50, which varies a lot in price; the MI60 and MI100 are also similar-generation cards but with 32GB of RAM).

For the performance, you're much better off paying about $200 (alt) for an Nvidia Tesla P40 24GB (1080Ti class but with more RAM) or about $700 for an RTX 3090 24GB. The P40 can reportedly run 13b models at about 15 tokens/s, over 2X faster than a Radeon VII and with lots more software support. Also, 24GB cards support 30b models, which 16GB cards can't do.

Bonus: GTX 1080 Ti Comparison

I dug out my old GTX 1080 Ti and installed it to get ballpark numbers vs the P40.

We are running llama.cpp with the same checkout (2023-06-29 commit 96a712c). The GPU refactor no longer has a CUDA kernel for the 1080 Ti, so I've used LLAMA_CLBLAST=1 instead; it still runs faster than the older (un-optimized) CUDA version (previous tests ran at about 5.8 t/s).

./main -m /data/ai/models/llm/manticore/Manticore-13B-Chat-Pyg.ggmlv3.q4_0.bin -ngl 99 -n 2048 --ignore-eos
main: build = 762 (96a712c)
main: seed  = 1688074299
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce GTX 1080 Ti'
ggml_opencl: device FP16 support: false
llama.cpp: loading model from /data/ai/models/llm/manticore/Manticore-13B-Chat-Pyg.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 2223.88 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 8416 MB
llama_new_context_with_model: kv self size  =  400.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 2048, n_keep = 0

...

llama_print_timings:        load time =  1459.18 ms
llama_print_timings:      sample time =   884.79 ms /  2048 runs   (    0.43 ms per token,  2314.67 tokens per second)
llama_print_timings: prompt eval time = 31935.74 ms /  1801 tokens (   17.73 ms per token,    56.39 tokens per second)
llama_print_timings:        eval time = 220695.57 ms /  2040 runs   (  108.18 ms per token,     9.24 tokens per second)
llama_print_timings:       total time = 253862.42 ms

Using CLBlast, we get 9.24 t/s, which is faster than the Radeon VII's OpenCL result but slower than its HIP result.

exllama is no longer very happy with Pascal cards, although there are reports that GPTQ-for-LLaMa/AutoGPTQ can output about 20 t/s: https://github.com/turboderp/exllama/issues/75

ROCm Resources

Broader ROCm setup is outside the scope of this guide (maybe someone with experience can make a dedicated page and refactor).

Windows

llama.cpp

For an easy time, go to llama.cpp's release page and download a bin-win-clblast version.

In the Windows terminal, run it with -ngl 99 to offload all the layers to the GPU.

.\main.exe -m model.bin -ngl 99

On a Radeon 7900XT, you should get about double the performance of CPU-only execution.

Compile for ROCm

This was last updated 2023-09-03, so things might change, but here's how I was able to get things working on Windows.

Requirements

Instructions

First, launch "x64 Native Tools Command Prompt" from the Windows Menu (you can hit the Windows key and just start typing x64 and it should pop up).

# You should probably change to a folder you want first for grabbing the source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Make a build folder
mkdir build
cd build

# Make sure the HIP stuff gets picked up
cmake.exe .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=on  -DCMAKE_C_COMPILER="clang.exe" -DCMAKE_CXX_COMPILER="clang++.exe" -DAMDGPU_TARGETS="gfx1100" -DCMAKE_PREFIX_PATH="C:\Program Files\AMD\ROCm\5.5"

# This should build binaries in a bin/ folder
cmake.exe --build .

That's it, now you have compiled executables in build/bin.

Start a new terminal to run llama.cpp.

# You can do this in the GUI search for "environment variable" as well
setx /M PATH "C:\Program Files\AMD\ROCm\5.5\bin;%PATH%"

# Or for session
set PATH="C:\Program Files\AMD\ROCm\5.5\bin;%PATH%"

If you only set the global PATH, you may need to start a new shell before running things in the llama.cpp checkout. You can double-check it's working by printing the path (echo %PATH%) or just running hipInfo or another exe from the ROCm bin folder.

NOTE: If your PATH is wonky for some reason, you may get missing .dll errors. You can either fix that, or, if all else fails, copy the missing files from "C:\Program Files\AMD\ROCm\5.5\bin" into the build/bin folder, since life is too short.
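If you go the copy route, something like this from inside build/bin does it (assumes the default ROCm 5.5 install path):

# run from inside the build/bin folder
copy "C:\Program Files\AMD\ROCm\5.5\bin\*.dll" .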

Results

Here are my llama-bench results running a llama2-7b q4_0 and q4_K_M:

C:\Users\lhl\Desktop\llama.cpp\build\bin>llama-bench.exe -m ..\..\meta-llama-2-7b-q4_0.gguf -p 3968 -n 128 -ngl 99
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XT, compute capability 11.0
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| LLaMA v2 7B mostly Q4_0        |   3.56 GiB |     6.74 B | ROCm       |  99 | pp 3968    |    882.92 ± 1.10 |
| LLaMA v2 7B mostly Q4_0        |   3.56 GiB |     6.74 B | ROCm       |  99 | tg 128     |     94.55 ± 0.07 |

build: 69fdbb9 (1148)


C:\Users\lhl\Desktop\llama.cpp\build\bin>llama-bench.exe -m ..\..\meta-llama-2-7b-q4_K_M.gguf -p 3968 -n 128 -ngl 99
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XT, compute capability 11.0
| model                            |       size |     params | backend    | ngl | test       |              t/s |
| -------------------------------- | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| LLaMA v2 7B mostly Q4_K - Medium |   3.80 GiB |     6.74 B | ROCm       |  99 | pp 3968    |    858.74 ± 1.32 |
| LLaMA v2 7B mostly Q4_K - Medium |   3.80 GiB |     6.74 B | ROCm       |  99 | tg 128     |     78.78 ± 0.04 |

build: 69fdbb9 (1148)

Unsupported Architectures

On Windows, it may not be possible to apply an HSA_OVERRIDE_GFX_VERSION override. In that case, these instructions for compiling custom kernels may help: https://www.reddit.com/r/LocalLLaMA/comments/16d1hi0/guide_build_llamacpp_on_windows_with_amd_gpus_and/
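For reference, on Linux the override is just an environment variable set before launching the binary; 10.3.0 here is the value commonly used to map RDNA2 cards onto the supported gfx1030 kernels (an example, not a recommendation for any specific GPU):

# model path is a placeholder
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -m model.gguf -ngl 99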

Misc

Here's a ROCm fork of DeepSpeed, part of a set of ROCm Docker containers: https://github.com/ascent-tek/rocm_containers/blob/main/README_DeepSpeed.md