As of August 2023, AMD's ROCm GPU compute software stack is available for Linux and Windows. It's best to check the latest docs for up-to-date information:
Hardware
These are the latest officially supported cards:
- https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
- https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html
If you have a card from a supported family, you can usually set `HSA_OVERRIDE_GFX_VERSION` to the closest supported version (e.g., if you have a gfx1031 card you can try `HSA_OVERRIDE_GFX_VERSION=10.3.0`) and get things working.
Here's also an interesting 2024-06 writeup of supporting mixed architecture ROCm overrides: https://adamniederer.com/blog/rocm-cross-arch.html
RDNA3 (eg 7900 XT, XTX)
As of ROCm 5.7, Radeon RX 7900 XTX, XT, and PRO W7900 are officially supported and many old hacks are no longer necessary:
- https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html
- https://are-we-gfx1100-yet.github.io/
- https://news.ycombinator.com/item?id=36574179
- I posted my 7900XT/XTX results on Reddit, some conversation here: https://www.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/
AMD APU
Compatible iGPUs include the Radeon 780M (gfx1103) on Phoenix and Hawk Point (7X40 and 8X40) APUs and the Radeon 890M (gfx1150) on Strix Point (Ryzen AI) APUs. You typically need to set the `HSA_OVERRIDE_GFX_VERSION=11.0.0` environment variable to make sure these use the right kernels. See also:
- https://github.com/lamikr/rocm_sdk_builder - make a custom ROCm build for your GPU
- https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU - for Windows users, there are pre-built ROCmlibs for many officially unsupported architectures here
Performance of a 65W 7940HS w/ 64GB of DDR5-5600 (83GB/s theoretical memory bandwidth): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1041125589
- On small (7B) models that fit within the UMA VRAM, ROCm performance is very similar to my M2 MBA's Metal performance. Inference is barely faster than CLBlast/CPU though (~10% faster).
- On a big (70B) model that doesn't fit into allocated VRAM, ROCm inferences slower than CPU w/ -ngl 0 (CLBlast crashes), and CPU perf is about as expected: about 1.3 t/s inferencing a Q4_K_M. Besides being slower, the ROCm version also caused amdgpu exceptions that killed Wayland 2/3 times (I'm running Linux 6.5.4, ROCm 5.6.1, mesa 23.1.8).
Note: the BIOS allows me to set up to 8GB for VRAM (UMA_SPECIFIED GART); ROCm does not support GTT (which would be about 35GB of the 64GB if it did - still not enough for a 70B Q4_0, not that you'd want to at those speeds).
Vulkan drivers can use GTT memory dynamically, but w/ MLC LLM, the Vulkan version is 35% slower than CPU-only llama.cpp. Also, the max GART+GTT is still too small for 70B models.
- It may be possible to unlock more UMA/GART memory: https://winstonhyypia.medium.com/amd-apu-how-to-modify-the-dedicated-gpu-memory-e27b75905056
- There is a custom allocator that may allow PyTorch to use GTT memory (only useful for PyTorch inferencing, obviously): https://github.com/pomoke/torch-apu-helper
- A writeup of someone playing around w/ ROCm and SD on an older APU: https://www.gabriel.urdhr.fr/2022/08/28/trying-to-run-stable-diffusion-on-amd-ryzen-5-5600g/

I was a bit curious what performance looks like in 2024-09 - using the same model so you can compare how much a year's worth of development makes:
# ROCm
β― ./llama-bench -m /data/ai/models/llm/gguf/meta-llama-2-7b-q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 262.87 ± 1.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | tg128 | 19.57 ± 0.02 |
build: faac0bae (3841)
# CPU
β― ./llama-bench -m /data/ai/models/llm/gguf/meta-llama-2-7b-q4_0.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp512 | 61.84 ± 0.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 14.42 ± 0.02 |
build: faac0bae (3841)
- On my system, OpenBLAS was slower than the regular CPU version
AMD NPU (RyzenAI)
The AMD NPUs - starting with the 10 TOPS version in the 7X40 (Phoenix Point), the 16 TOPS version in the 8X40 (Hawk Point), and the 50 TOPS version in the Ryzen AI 3XX (Strix Point) - are variants of the Xilinx Vitis platform, which AMD has labeled "Ryzen AI." It has its own drivers and software stack (separate from ROCm). Maybe it'll get folded in one day? Who knows.
- https://ryzenai.docs.amd.com/en/latest/
- https://github.com/amd/RyzenAI-SW (includes a list of software projects)
I won't be spending too much time on this since the 7940HS I have is only 10 TOPS, which is pretty useless, but here are some links and resources:
- LLMs on RyzenAI with Pytorch
- RyzenAI-SW llama.cpp fork
- For upstream, see: https://github.com/ggerganov/llama.cpp/issues/1499
- Optimum-AMD - a HF package for getting NPU acceleration w/ transformers (and ONNX runtime for ROCm)
- Two Japanese Linux setup blogs (somehow Japanese devs must have more patience than English-speaking ones?)
- AMD recently released a small language model, AMD-135M (2024-09 announcement), that includes a speculative decoding implementation tested on a 7940HS for CPU and NPU. The implementation and benchmarks may be of interest:
- https://community.amd.com/t5/ai/amd-unveils-its-first-small-language-model-amd-135m/ba-p/711368
- https://github.com/AMD-AIG-AIMA/AMD-LLM
- https://github.com/AMD-AIG-AIMA/AMD-LLM/blob/main/speculative_decoding/codellama_spec.py
- https://github.com/AMD-AIG-AIMA/AMD-LLM?tab=readme-ov-file#speculative-decoding
Radeon VII
We have some previous known good memory timings for an old Radeon VII card:
While launched relatively recently (2019), the Radeon VII (gfx906; Radeon Pro VII, MI50) has been deprecated in ROCm, which according to AMD means:
The current ROCm release has limited support for this hardware. Existing features and capabilities are maintained, but no new features or optimizations will be added. A future ROCm release will remove support.
This is a shame because while it's a bit weak on compute (27 FP16 TFLOPS), it has 16GB HBM2 w/ 1.02 TB/s of MBW, which is not too shabby for inference.
RDNA3 (navi3x) on Linux
I have several gfx1100 RDNA3 cards, so this will be the most detailed section of my guide. Some of it may be applicable to GPUs from other generations, but that likely won't be fully tested.
Driver and ROCm Setup
Arch Linux
Arch Linux setup is fairly straightforward (it can be easier than the official install!) but is community supported via rocm-arch. If you're running an Arch system already, this should be fine, but if you're running a system dedicated to ML, you should probably prefer Ubuntu LTS for official support.
Install ROCm:
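A minimal sketch, assuming the ROCm packages in Arch's extra repo (rocm-arch also covers AUR builds):
# HIP and OpenCL stacks plus rocminfo for verification
sudo pacman -S rocm-hip-sdk rocm-opencl-sdk rocminfo
# Let your user talk to the GPU, then log out/in
sudo usermod -aG render,video $USER
# Verify the card shows up
rocminfo | grep gfx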
Install conda (mamba)
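A sketch using Miniforge (which bundles mamba); the installer URL follows the conda-forge "latest download" pattern:
# Download and run the Miniforge installer
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b
# Hook conda/mamba into your shell
~/miniforge3/bin/conda init bash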
Create Environment
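For example (the environment name and Python version are arbitrary):
# Dedicated environment for ROCm ML work
mamba create -n rocm python=3.11 -y
mamba activate rocm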
Ubuntu LTS
Ubuntu is the most well documented of the officially supported distros:
- https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/index.html
- I recommend using the latest LTS (22.04.4) with the HWE kernel
- The install documents are pretty much complete
- You can now use `apt install rocm` to install "everything" (except the drivers; you'll still need `amdgpu-dkms` first).
- Be sure also to look at the "post-install instructions"
HWE Kernel
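A sketch for 22.04 (the HWE meta-package name tracks the release):
# Install the hardware enablement (newer) kernel, then reboot
sudo apt install linux-generic-hwe-22.04
sudo reboot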
Prereqs
Install
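A sketch of the native-package flow from the linked docs (repo setup omitted; the amdgpu-install route works too):
# Kernel driver first, then the ROCm userspace meta-package
sudo apt install amdgpu-dkms
sudo apt install rocm
# Post-install: GPU access for your user, then re-login
sudo usermod -aG render,video $USER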
cmath
You may run into some compile errors; you will need `libstdc++-12-dev` in Ubuntu:
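For example:
sudo apt install libstdc++-12-dev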
LLM Inferencing
llama.cpp
llama.cpp has ROCm support built-in now (2023-08):
- https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#hipblas
- You can use `LLAMA_HIP_UMA=1` for unified memory on APUs, but it'll be slower, so don't use it if you don't need it
- Use `uname -a`, `dkms status`, and `apt list | grep rocm | grep '\[installed\]'` to get version numbers of the kernel and libs
- If you can't get ROCm working, Vulkan is a universal/easy option and should still give decent gains over CPU inference
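A build sketch for gfx1100 (the flag names have changed over time, so treat this as an example and check the linked build docs for the current ones):
# hipBLAS/ROCm build of llama.cpp targeting RDNA3
HIPCXX=/opt/rocm/llvm/bin/clang++ cmake -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
# Quick check (model path is a placeholder)
./build/bin/llama-bench -m model.gguf -ngl 99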
2024-09 Update: llama.cpp ROCm inference speeds have basically not changed all year, so I haven't gone and done updates. CUDA is a bit faster w/ FA and Graph support, so it has an even bigger lead. There's been some discussion/code with optimizations, but so far those haven't been merged:
I was curious just how much performance might be available from optimizations; here's an analysis of the 4090 vs 3090 vs 7900 XTX as of 2024-10-04: https://chatgpt.com/share/66ff502b-72fc-8012-95b4-902be6738665
Let's run some testing with TheBloke/Llama-2-7B-GGUF (Q4_0).
7900 XT + 7900 XTX used together segfaulted on `b7e7982 (1787)` (tested 2024-01-08) but ran with `6db2b41a (1988)` (tested 2024-01-28).
7900 XT:
- last tested: 2024-01-28

7900 XTX:
- last tested: 2024-01-28
While the Radeon 7900 XTX has theoretically competitive memory bandwidth and compute, in practice, with ROCm 6.0, hipBLAS still falls behind cuBLAS in llama.cpp:
|  | 7900 XT | 7900 XTX | RTX 3090 | RTX 4090 |
| --- | --- | --- | --- | --- |
| Memory GB | 20 | 24 | 24 | 24 |
| Memory BW GB/s | 800 | 960 | 936.2 | 1008 |
| Memory BW % | -16.7% | 0% | -2.5% | +5.0% |
| FP32 TFLOPS | 51.48 | 61.42 | 35.58 | 82.58 |
| FP16 TFLOPS | 103.0 | 122.8 | 71/142* | 165.2/330.3* |
| FP16 TFLOPS % | -16.1% | 0% | +15.6%* | +169.0%* |
| Prompt tok/s | 2366 | 2576 | 3251 | 5415 |
| Prompt % | -8.2% | 0% | +26.2% | +110.2% |
| Inference tok/s | 97.2 | 119.1 | 134.5 | 158.4 |
| Inference % | -18.4% | 0% | +12.9% | +33.0% |
- Tested 2024-01-28 with llama.cpp `6db2b41a (1988)` and the latest ROCm (`dkms amdgpu/6.3.6-1697589.22.04`, `rocm 6.0.0.60000-91~22.04`) and CUDA (`dkms nvidia/545.29.06, 6.7.0-arch3-1`, `nvcc cuda_12.3.r12.3/compiler.33492891_0`) on similar platforms (5800X3D for Radeons, 5950X for RTXs)
- RTX cards have much better FP16/BF16 Tensor FLOPS performance, which the inference engines take advantage of. FP16 FLOPS (32-bit/16-bit accumulation numbers) sourced from Nvidia docs (3090, 4090)
Vulkan and CLBlast
|  | 5800X3D CPU | 7900 XTX CLBlast | 7900 XTX Vulkan | 7900 XTX ROCm |
| --- | --- | --- | --- | --- |
| Prompt tok/s | 24.5 | 219 | 758 | 2550 |
| Inference tok/s | 10.7 | 35.4 | 52.3 | 119.0 |
- Tested 2024-01-29 with llama.cpp `d2f650cb (1999)` and the latest drivers on a 5800X3D w/ DDR4-3600 system, with CLBlast (`libclblast-dev 1.5.2-2`), Vulkan (`mesa-vulkan-drivers 23.0.4-0ubuntu1~22.04.1`), and ROCm (`dkms amdgpu/6.3.6-1697589.22.04`, `rocm 6.0.0.60000-91~22.04`)
Radeon VII
The Radeon VII was a Vega 20 XT (GCN 5.1) card released in February 2019 at $700. It has 16GB of HBM2 memory with 1024GB/s of memory bandwidth and 26.88 TFLOPS of FP16. Honestly, while the prefill probably doesn't have much more that could be squeezed out of it, I would expect that with optimization you could double inference performance (if you could use all of its memory bandwidth).
Radeon Vega VII
- Tested 2024-02-02 on a Ryzen 5 2400G system with `rocm-core 5.7.1-1`
System Info
ExLlamaV2
We'll use `main` on TheBloke/Llama-2-7B-GPTQ for testing (GS128, No Act Order).
Install is straightforward:
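A sketch of what that looks like against a ROCm PyTorch install (upstream repo path):
# ExLlamaV2 builds its extension against the installed (ROCm) PyTorch
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .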
7900 XT
7900 XTX
Running with both GPUs works, although it defaults to loading everything onto one. If you force the VRAM split, interestingly, you can get batch=1 inference to perform slightly better:
The ROCm kernel is very un-optimized vs the CUDA version, but you can see that while inference performance is much lower than llama.cpp's, prompt processing remains ExLlama's strength (this is especially important for long-context scenarios like long, multi-turn conversations or RAG).
|  | 7900 XT | 7900 XTX | RTX 3090 | RTX 4090 |
| --- | --- | --- | --- | --- |
| Memory GB | 20 | 24 | 24 | 24 |
| Memory BW GB/s | 800 | 960 | 936.2 | 1008 |
| FP32 TFLOPS | 51.48 | 61.42 | 35.58 | 82.58 |
| FP16 TFLOPS | 103.0 | 122.8 | 35.58 | 82.58 |
| Prompt tok/s | 3457 | 3928 | 5863 | 13955 |
| Prompt % | -12.0% | 0% | +49.3% | +255.3% |
| Inference tok/s | 57.9 | 61.2 | 116.5 | 137.6 |
| Inference % | -5.4% | 0% | +90.4% | +124.8% |
- Tested 2024-01-08 with ExLlamaV2 `3b0f523` and the latest ROCm (`dkms amdgpu/6.3.6-1697589.22.04`, `rocm 6.0.0.60000-91~22.04`) and CUDA (`dkms nvidia/545.29.06, 6.6.7-arch1-1`, `nvcc cuda_12.3.r12.3/compiler.33492891_0`) on similar platforms (5800X3D for Radeons, 5950X for RTXs)
MLC
Setup
- https://llm.mlc.ai/docs/install/mlc_llm.html#install-mlc-packages
- https://llm.mlc.ai/docs/get_started/quick_start.html
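An install sketch per the linked docs; the wheel names and ROCm suffix change with releases, so double-check the install page:
# Prebuilt ROCm nightly wheels from the MLC wheels index
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-rocm62 mlc-ai-nightly-rocm62
# Sanity check the CLI
mlc_llm chat --help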
Make a model: https://llm.mlc.ai/docs/compilation/compile_models.html
vLLM
vLLM has ROCm support for specific hardware, which now includes gfx1100.
- https://docs.vllm.ai/en/stable/getting_started/amd-installation.html

Note: there is a Triton/FA bug:
- https://github.com/vllm-project/vllm/issues/4514

You may be able to work around this with the latest version of PyTorch and Triton (w/ aotriton support) - TBC
# We want the nightly PyTorch
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2
# May need to copy amd_smi folder locally if you have permission issues installing
pip install /opt/rocm/share/amd_smi
# Dependencies
pip install --upgrade numba scipy huggingface-hub[cli]
pip install "numpy<2"
pip install -r requirements-rocm.txt
# Undocumented
pip install setuptools_scm
# requires newer cmake than Ubuntu 24.04 LTS provides
mamba install cmake -y
# Build vLLM for RDNA3
PYTORCH_ROCM_ARCH="gfx1100" python setup.py develop
# Test
vllm serve facebook/opt-125m
CTranslate2
This is most notably required for faster-whisper (and whisperX)
- Feature request: AMD GPU support with oneDNN AMD support #1072 - the most detailed discussion for AMD support in the CTranslate2 repo
- https://github.com/arlo-phoenix/CTranslate2-rocm - arlo-phoenix created a hipified fork that can run whisperX. Performance appears about 60% faster than whisper.cpp
- CTranslate2: Efficient Inference with Transformer Models on AMD GPUs - a recent (2024-10-24) ROCm Blogs post on how upstream support might work?
Training
In Feb 2024 I wrote up some notes:
In June 2024 I did a trainer performance shootoff of torchtune vs axolotl (trl) vs unsloth with a 3090, 4090, and W7900:
I noticed that AMD has added a lot of simple tutorials in the ROCm docs:
- https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/index.html
- https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html
- https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/multi-gpu-fine-tuning-and-inference.html
axolotl
This has been my preferred trainer for a while: https://github.com/axolotl-ai-cloud/axolotl It leverages trl and layers a bunch of optimizations, yaml configs, etc.
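For reference, the usual flow looks something like this (the example config path is from the axolotl repo; on ROCm you'd skip CUDA-specific extras like flash-attn):
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip install -e .
# Train from a YAML config
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml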
lightning
I haven't used https://github.com/Lightning-AI/pytorch-lightning but here's the Lightning example from: https://github.com/Lightning-AI/pytorch-lightning?tab=readme-ov-file#pytorch-lightning-example
Here's what the W7900 looked like (after 1 epoch):
See also: https://lightning.ai/docs/pytorch/stable/starter/introduction.html
torchtune
There was an issue w/ hipblaslt in PyTorch when I was trying to get it working that required manual futzing w/ compiles and `.so` files, but since PyTorch now auto-falls back, it should run w/o hassle. Here's the related issue:
Simple test run:
pip install torchao torchtune
tune download meta-llama/Llama-2-7b-chat-hf --output-dir /tmp/Llama-2-7b-hf --hf-token $(cat ~/.cache/huggingface/token)
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
# on a W7900 this should take about 6GB of VRAM and about 15h estimated time
unsloth (NOT WORKING)
Unsloth https://github.com/unslothai/unsloth depends on:
- PyTorch
- Triton
- xformers or flash attention
- bitsandbytes
As of 2024-09, there is a working upstream xformers library (see below); however, it's sadly missing support for this function in the ROCm backend:
NotImplementedError: Could not run 'xformers::efficient_attention_forward_ck' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'xformers::efficient_attention_forward_ck' is only available for these backends: [CPU, PrivateUse3, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
Libraries and Frameworks
These are probably going to be most useful if you are a developer or training.
AMD's ROCm docs have a list as well; however, the docs don't necessarily apply to RDNA3 (since they're AMD CK focused, which has no RDNA3 kernels! *sad trombone*)
PyTorch
PyTorch supports ROCm natively and without code changes (`torch.cuda` just uses ROCm instead). It just needs to be installed for the ROCm platform:

NOTE: if you want aotriton/FA support you will need PyTorch 2.5.0+, so you may need to install the Preview (Nightly) build instead of Stable (2024-09)
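A sketch of both options (these are the index URLs PyTorch publishes for ROCm; the ROCm version suffix changes over time):
# Stable ROCm build
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
# Or the nightly, for aotriton/FA support (see the note above)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2
# Sanity check: ROCm devices are exposed through the torch.cuda API
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"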
Triton
Triton also has native ROCm support and you probably can install it and get everything working.
- https://github.com/triton-lang/triton

There is, however, a ROCm fork from which some fixes get upstreamed:
- https://github.com/ROCm/triton
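Usually nothing extra is needed, since the ROCm PyTorch wheels pull in a matching pytorch-triton-rocm. A quick check, assuming that setup:
# Triton comes in via the ROCm PyTorch wheel (pytorch-triton-rocm)
python -c "import triton; print(triton.__version__)"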
bitsandbytes
In 2024-08 an official multi-backend-refactor branch had ROCm support:
- https://github.com/bitsandbytes-foundation/bitsandbytes/tree/multi-backend-refactor

As of the end of 2024-09 it appears ROCm support has been folded into the main branch:
- https://github.com/bitsandbytes-foundation/bitsandbytes

Uh, actually not quite… I still had to build my own from the multi-backend branch:
git clone --depth 1 -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git bnb && cd bnb
pip install -r requirements-dev.txt
# If you don't do this it won't find the version and build will fail!
git fetch --tags
# We just want gfx1100
cmake -DCOMPUTE_BACKEND=hip -S . -DBNB_ROCM_ARCH="gfx1100"
make
pip install .
# this has to be a bug, tries to use rocm62.so no matter what
ln -s bitsandbytes/libbitsandbytes_rocm62.so bitsandbytes/libbitsandbytes_rocm61.so
# test
cd ..
python -c "import bitsandbytes; print(bitsandbytes.__version__)"
You can see some previous discussion here:
- https://github.com/TimDettmers/bitsandbytes/issues/107
- https://github.com/TimDettmers/bitsandbytes/pull/756
- https://github.com/TimDettmers/bitsandbytes/discussions/990
- https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6/tree/rocm
xformers
The upstream xformers has added experimental ROCm support. Here's how I was able to get it working:
# install
pip install -U xformers --index-url https://download.pytorch.org/whl/rocm6.1
# requires https://github.com/ROCm/amdsmi
# you will get an error if you follow the README instructions
apt install amd-smi-lib
# you need to copy it locally since you don't have root permissions to write there
cp -r /opt/rocm/share/amd_smi ~/amd_smi
cd ~/amd_smi
pip install .
If everything worked, then you should have a working xformers:
$ python -m xformers.info
xFormers 0.0.28.post1
memory_efficient_attention.ckF: available
memory_efficient_attention.ckB: available
memory_efficient_attention.ck_decoderF: available
memory_efficient_attention.ck_splitKF: available
memory_efficient_attention.cutlassF: unavailable
memory_efficient_attention.cutlassB: unavailable
memory_efficient_attention.fa2F@0.0.0: unavailable
memory_efficient_attention.fa2B@0.0.0: unavailable
memory_efficient_attention.fa3F@0.0.0: unavailable
memory_efficient_attention.fa3B@0.0.0: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
sequence_parallel_fused.write_values: available
sequence_parallel_fused.wait_values: available
sequence_parallel_fused.cuda_memset_32b_async: available
sp24.sparse24_sparsify_both_ways: available
sp24.sparse24_apply: available
sp24.sparse24_apply_dense_output: available
sp24._sparse24_gemm: available
sp24._cslt_sparse_mm@0.0.0: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.4.1+rocm6.1
pytorch.cuda: available
gpu.compute_capability: 11.0
gpu.name: AMD Radeon PRO W7900
dcgm_profiler: unavailable
build.info: available
build.cuda_version: None
build.hip_version: 6.1.40093-bd86f1708
build.python_version: 3.11.10
build.torch_version: 2.4.1+rocm6.1
build.env.TORCH_CUDA_ARCH_LIST:
build.env.PYTORCH_ROCM_ARCH: None
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: -allow-unsupported-compiler
build.env.XFORMERS_PACKAGE_FROM: wheel-v0.0.28.post1
There is a ROCm fork but it does not work w/ RDNA3:
- https://github.com/ROCm/xformers/issues/9
- Depends on CK which does not have RDNA3 support:
pip install -U xformers --index-url https://download.pytorch.org/whl/rocm6.1
pip install amdsmi
2024-02-17: The ROCm/xformers fork defaults to a `main` branch, which compiles but is basically upstream. All the work is done on branches (`develop` seems to be the main one), which sadly… doesn't compile due to mismatching header files from Composable Kernels.
Note: vLLM has its own 0.0.23 with a patch to install, but it still dies w/ RDNA3
# xformers
git clone https://github.com/ROCm/xformers
cd xformers
git fetch
git branch -a
git checkout develop
git submodule update --init --recursive
python setup.py install
python -c 'import xformers; print(xformers.__version__)'
Notes:
- Discord
- HN
Flash Attention 2
This issue in the ROCm/aotriton project: Memory Efficient Flash Attention for gfx1100 (7900xtx) is probably the best place to read the story on Flash Attention. As of 2024-09, this support has now been upstreamed to PyTorch 2.5.0+ (you may need to use the nightly if the stable version is not there yet). (original pull, merged pull)
You might also need to set the environment variable `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`.
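A quick way to confirm the flash backend is actually available is to force it via the SDPA context manager (a minimal sketch; shapes are arbitrary and a PyTorch 2.5+/nightly ROCm build is assumed):
# Forces the flash attention SDPA backend; raises if it isn't available
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python -c "
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
q, k, v = (torch.randn(1, 8, 1024, 64, device='cuda', dtype=torch.float16) for _ in range(3))
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print('flash sdpa ok:', out.shape)
"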
Using the `examples/benchmark.py` from pytorch-labs/attention-gym, we are able to test this:
$ TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python attention-gym/examples/benchmark.py
Using the default sparsity block size: 128
===========================================================================================
                                        Causal Mask
===========================================================================================
Correctness check passed ✅
+---------------+----------------+-------------------+----------------+-------------------+
| Operation | FW Time (ms) | FW FLOPS (TF/s) | BW Time (ms) | BW FLOPS (TF/s) |
+===============+================+===================+================+===================+
| causal FA2 | 150.677 | 14.59 | 764.289 | 7.19 |
+---------------+----------------+-------------------+----------------+-------------------+
| F.sdpa + mask | 363.346 | 6.15 | 1946.23 | 2.87 |
+---------------+----------------+-------------------+----------------+-------------------+
| flexattention | 245.548 | 9.1 | 428.728 | 13.02 |
+---------------+----------------+-------------------+----------------+-------------------+
Block Mask:
BlockMask(shape=(1, 1, 8192, 8192), sparsity=49.22%,
(0, 0)
[lower-triangular block mask visualization omitted]
)
===========================================================================================
                                         Alibi Mod
===========================================================================================
+---------------+----------------+-------------------+----------------+-------------------+
| Operation | FW Time (ms) | FW FLOPS (TF/s) | BW Time (ms) | BW FLOPS (TF/s) |
+===============+================+===================+================+===================+
| causal FA2 | 155.3 | 14.16 | 798.569 | 6.88 |
+---------------+----------------+-------------------+----------------+-------------------+
| F.sdpa + mask | 375.784 | 11.7 | 2022.57 | 5.44 |
+---------------+----------------+-------------------+----------------+-------------------+
| flexattention | 561.904 | 7.83 | 740.779 | 14.84 |
+---------------+----------------+-------------------+----------------+-------------------+
Block Mask:
None
===========================================================================================
                                   Sliding Window 1024
===========================================================================================
Traceback (most recent call last):
File "/home/lhl/xformers/attention-gym/examples/benchmark.py", line 256, in <module>
main(**vars(args))
File "/home/lhl/xformers/attention-gym/examples/benchmark.py", line 234, in main
available_examples[ex]()
File "/home/lhl/xformers/attention-gym/examples/benchmark.py", line 216, in <lambda>
"sliding_window": lambda: test_mask(mask_mod=generate_sliding_window(window_size=1024)),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lhl/xformers/attention-gym/examples/benchmark.py", line 140, in test_mask
torch.testing.assert_close(flex, sdpa_mask, atol=1e-1, rtol=1e-2)
File "/home/lhl/miniforge3/envs/xformers/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1530, in assert_close
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 116391936 / 134217728 (86.7%)
Greatest absolute difference: nan at index (0, 0, 1088, 0) (up to 0.1 allowed)
Greatest relative difference: nan at index (0, 0, 1088, 0) (up to 0.01 allowed)
It looks like Liger has been doing some independent work with Triton kernels that seems to provide a big speedup as well, so it may be worth taking a look at this at some point: https://github.com/linkedin/Liger-Kernel/pull/275
Working Flash Attention is one of the longest running issues for RDNA3. Here are some issues to peruse for more context:
- https://github.com/vllm-project/vllm/issues/4514
- https://github.com/ROCm/flash-attention/issues/27
- https://github.com/linkedin/Liger-Kernel/issues/126
- https://github.com/pytorch/pytorch/issues/112997
NOTE: ROCm support was merged into the official FA2 implementation in 2024-08 but does not support RDNA3: https://github.com/Dao-AILab/flash-attention/pull/1010
TensorFlow (SHOULD WORK?)
I don't really use TensorFlow, so this is untested, but recent reports are that it should work:
- https://www.reddit.com/r/ROCm/comments/1ahkay9/tensorflow_on_gfx1101_navi32_7800_xt/
- https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/tensorflow-install.html
- Try out: https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/
- Can run the script, and it says it's using ROCm Fusion, but it runs on CPU?
Apparently you need to build your own TF for `gfx1100` support…
- https://gist.github.com/briansp2020/1e8c3e5735087398ebfd9514f26a0007
- https://cprimozic.net/notes/posts/setting-up-tensorflow-with-rocm-on-7900-xtx/
- https://gist.github.com/BloodBlight/0d36b33d215056395f34db26fb419a63

Life is short, putting this off for later…
vLLM
vLLM has official RDNA3 (gfx1100 at least) support https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm
This seems to run but with some caveats:
- Triton flash should be used by default, but you can set `VLLM_USE_TRITON_FLASH_ATTN=0` if you need to work around this
- Basically no quantization works for AMD; FP8 is only for MI300+
Windows
I don't use Windows for AI/ML, so this doc is going to be rather sporadically updated (if at all).
llama.cpp
For an easy time, go to llama.cpp's release page and download:
- the "hip" version if your GPU is supported (gfx1100, gfx1101, gfx1030, etc)
- the "vulkan" or "openblas" version as a fallback if not
Modern versions of llama.cpp should automatically load layers into GPU memory, but you can specify something like `-ngl 99` to force it if necessary.
Compile for ROCm
This was last updated 2023-09-03, so things might have changed, but here's how I was able to do my own compile on Windows.
Requirements
- You'll need Microsoft Visual Studio installed. Install it with the basic C++ environment.
- Follow AMD's directions and install the ROCm software for Windows.
- You'll need `git` if you want to pull the latest from the repo (you can either get the official Windows installer or use a package manager like Chocolatey to `choco install git`). Note: as an alternative, you could just download the Source code .zip from https://github.com/ggerganov/llama.cpp/releases/
Instructions
First, launch βx64 Native Tools Command Promptβ from the Windows Menu (you can hit the Windows key and just start typing x64 and it should pop up).
That's it; now you have compiled executables in `build/bin`.
Start a new terminal to run llama.cpp
If you set just the global PATH, you may need to start a new shell before running this in the llama.cpp checkout. You can double-check it's working by outputting the path with `echo %PATH%`, or just by running `hipInfo` or another exe in the ROCm bin folder.
NOTE: If your PATH is wonky for some reason you may get missing .dll errors. You can either fix that, or if all else fails, copy the missing files from `C:\Program Files\AMD\ROCm\5.5\bin` into the `build/bin` folder, since life is too short.
Results
Here's my `llama-bench` results running a llama2-7b q4_0 and q4_K_M:
Unsupported Architectures
On Windows, it may not be possible to apply an `HSA_OVERRIDE_GFX_VERSION` override. In that case, these instructions for compiling custom kernels may help: https://www.reddit.com/r/LocalLLaMA/comments/16d1hi0/guide_build_llamacpp_on_windows_with_amd_gpus_and/
Misc Resources
Here's a ROCm fork of DeepSpeed (2023-09):
In 2023-07, Casey Primozic did some testing/benchmarking of the 7900 XTX (TensorFlow, TinyGrad):
I have a document that I updated from April-June 2024 focused on the W7900 (the RDNA3 gfx1100 workstation version of the 7900 XTX), but I'm folding all up-to-date info back into this doc: