For the latest Strix Halo / AMD Ryzen AI Max+ 395 with Radeon 8060S (gfx1151) support, check out:

For some discussion and additional benchmarks on this testing, see:

Testing Checklist

  • ROCm 6.5 (custom compiles)
    • rocBLAS
    • hipBLAS
    • hipBLASLt
    • rocWMMA
    • AOTriton
  • llama.cpp
    • backend/compile scripts - HIP vs HIP WMMA vs Vulkan
    • speculative decoding
      • ShareGPT benchmark scripts
      • 70B
      • 25-32B Dense
  • Torch w/ AOTriton FA w/ hipBLASLt
    • vLLM
    • SGLang
    • trl
    • Axolotl
    • torchtune

System Info

❯ lsb_release -a
LSB Version:    n/a
Distributor ID: Fedora
Description:    Fedora Linux 43 (Workstation Edition Prerelease)
Release:        43
Codename:       n/a

❯ uname -a
Linux cluster1 6.15.0-0.rc3.20250422gita33b5a08cbbd.29.fc43.x86_64

❯ python -c "import torch; print(f'PyTorch version: {torch.__version__}\nCUDA available: {torch.cuda.is_available()}\nDevice count: {torch.cuda.device_count()}')"
PyTorch version: 2.5.0a0
CUDA available: True
Device count: 1

❯ python env-info.py
=== System Information ===
Os Info: Fedora Linux 43 (Workstation Edition Prerelease)
Kernel: Linux cluster1 6.15.0-0.rc3.20250422gita33b5a08cbbd.29.fc43.x86_64
Memory Info: Total Memory: 120554 MB

=== GPU Information ===
CUDA: Not found
ROCm: ROCM-SMI version: 3.0.0+unknown
ROCM-SMI-LIB version: 7.3.0
PyTorch CUDA Available: True
PyTorch CUDA Version: N/A
PyTorch HIP Version: 6.3.42134-0

GPU Count: 1
GPU 0: AMD Radeon Graphics

=== Package Versions ===
triton: 3.3.0
torch: 2.5.0a

❯ hipconfig -l
/usr/lib64/rocm/llvm/bin

❯ hipconfig -R
/usr

PyTorch Setup

Despite the first Ryzen AI Max+ processor launching on February 25, 2025 with the Asus ROG Flow Z13, as of May 2025 ROCm support is still relatively poor (ROCm #4499).

The important components can be built, but there are still performance regressions w/ the gfx1151 kernels:

I prefer to use Mamba envs, so if you have a working system PyTorch, you can do a slightly (very) janky workaround and symlink the system PyTorch into your venv's site-packages:

torch -> /usr/lib64/python3.13/site-packages/torch
torch-2.5.0a0+gitunknown-py3.13.egg-info -> /usr/lib64/python3.13/site-packages/torch-2.5.0a0+gitunknown-py3.13.egg-info
torchgen -> /usr/lib64/python3.13/site-packages/torchgen
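
For example, creating those links from inside the venv's site-packages directory (the venv path here is hypothetical; adjust to your env):

cd ~/miniforge3/envs/torch/lib/python3.13/site-packages
ln -s /usr/lib64/python3.13/site-packages/torch .
ln -s /usr/lib64/python3.13/site-packages/torch-2.5.0a0+gitunknown-py3.13.egg-info .
ln -s /usr/lib64/python3.13/site-packages/torchgen .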

It is possible to use these self-contained 3.11 wheels w/ AOTriton SDPA (but seemingly no hipBLASLt): https://github.com/ROCm/TheRock/discussions/655

Docker on Fedora

We can use scottt’s Docker image: https://github.com/ROCm/TheRock/discussions/244

# Grab image
podman pull docker.io/scottt/therock:pytorch-vision-dev-f41

# Run the image with GPU device access
podman run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --privileged \
  docker.io/scottt/therock:pytorch-vision-dev-f41

Peak Performance

RDNA3 has a theoretical 512 FP16 FLOPS/clock/CU.

A Ryzen AI Max 395's Radeon 8060S has 40 CUs at a max clock of 2.9GHz, so it should have a peak of 59.392 FP16 TFLOPS:

512 * 40 * 2.9e9 / 1e12 = 59.392 FP16 TFLOPS

This assumes you are using optimized libraries like rocWMMA (requires ROCm 6.4) or hipBLASLt; otherwise your peak TFLOPS will likely be half of that.
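
The same formula at the ~2.8GHz clock actually observed under load gives the 57.344 TFLOPS ceiling used for the efficiency numbers below:

python -c 'print(512 * 40 * 2.9e9 / 1e12)'   # 59.392 TFLOPS @ 2.9 GHz boost
python -c 'print(512 * 40 * 2.8e9 / 1e12)'   # 57.344 TFLOPS @ observed 2.8 GHz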

Currently, my test system’s results are much lower, however.

There is no official ROCm build for gfx1151, so I am benchmarking with a custom Fedora gfx1151 build of PyTorch (2.5) on ROCm 6.3, which only has the rocBLAS TensileLibraries available for gfx1151.

rocm_bandwidth_test

We get about 84 GB/s from CPU to GPU, and 212 GB/s within the GPU:

git clone https://github.com/ROCm/rocm_bandwidth_test
cd rocm_bandwidth_test
cmake -B build && cmake --build build


❯ build/rocm-bandwidth-test
....................
          RocmBandwidthTest Version: 2.6.0

          Launch Command is: build/rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)


          Device: 0,  AMD Eng Sample: 100-000001243-50_Y
          Device: 1,  AMD Radeon Graphics,  GPU-XX,  c2:0.0

          Inter-Device Access

          D/D       0         1

          0         1         1

          1         1         1


          Inter-Device Numa Distance

          D/D       0         1

          0         0         20

          1         20        0


          Unidirectional copy peak bandwidth GB/s

          D/D       0           1

          0         N/A         84.364

          1         84.147      212.419


          Bidirectional copy peak bandwidth GB/s

          D/D       0           1

          0         N/A         83.489

          1         83.489      N/A

mamf-finder

Using my mamf-finder repo, it takes just under 35 hours (!) to run the full sweep:

Warming up the accelerator for 30 secs ... /home/lhl/mamf-finder/mamf-finder/./mamf-finder.py:252: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /builddir/build/BUILD/python-torch-2.5.1-build/pytorch-v2.5.1/aten/src/ATen/Context.cpp:296.)
  torch.mm(A, B, out=C)
accelerator warmup finished

Tried  3375 shapes => the best outcomes were:
mean:   5.0 TFLOPS @ 4096x9216x1024 (MxNxK)
median: 5.0 TFLOPS @ 12288x3072x1024 (MxNxK)
max:    5.1 TFLOPS @ 11264x3072x1024 (MxNxK)

Elapsed time: 1 day, 10:40:32

As you can see, the max performance is 5.1 BF16 TFLOPS. At the 2.8GHz clock I’m getting, that’s an 8.9% efficiency (57.344 max theoretical).

in Docker

We get much better results using the scottt docker image:

[root@4b8fdc8ee74c mamf-finder]# ./test-node.sh
Starting multi-GPU test at Sat May  3 06:17:41 UTC 2025
Waiting for all GPU tests to complete...
Starting tests for GPU 0
Running bfloat16 test on GPU 0
/share/libdrm/amdgpu.ids: No such file or directory

Benchmark started on 2025-05-03 06:17:43

** Command line:
/usr/bin/python ./mamf-finder.py --dtype bfloat16 --m_range 0 16384 1024 --n_range 0 16384 1024 --k_range 0 16384 1024 --output_file=./gpu0-bfloat16-2025-05-03-06-17-41.txt

** Dtype: torch.bfloat16

** Platform/Device info:
Linux 4b8fdc8ee74c 6.15.0-0.rc3.20250422gita33b5a08cbbd.29.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Apr 22 15:25:32 UTC 2025 x86_64
_CudaDeviceProperties(name='AMD Radeon Graphics', major=11, minor=5, gcnArchName='gfx1151', total_memory=104906MB, multi_processor_count=20, uuid=58580000-0000-0000-0000-000000000000, L2_cache_size=2MB)

** Critical software versions:
torch=2.6.0a0+git90b83a9
hip=6.4.43480-9f04e2822, cuda=None

** Additional notes:
benchmark version: 2


--------------------------------------------------------------------------------


Warming up the accelerator for 30 secs ... accelerator warmup finished
   2916 |   25.6(mean)   25.6(median)   25.8(max) @ 13312x15360x6144     | best:   34.9(mean)   34.9(median)   36.9(max) TFLOPS

Tried  3375 shapes => the best outcomes were:
mean:   35.1 TFLOPS @ 15360x3072x1024 (MxNxK)
median: 35.1 TFLOPS @ 15360x3072x1024 (MxNxK)
max:    36.9 TFLOPS @ 6144x3072x3072 (MxNxK)

Elapsed time: 6:04:34

At a 2.8GHz clock and a max of 36.9 TFLOPS, that is a much more respectable 64.4% efficiency.

attention-gym

Install

git clone https://github.com/pytorch-labs/attention-gym
cd attention-gym
pip install -e ".[dev]"
pip install -e ".[viz]"

Performance bug? https://github.com/ROCm/MIOpen/pull/3685

# TORCH_BLAS_PREFER_HIPBLASLT=0 
HIPBLASLT_TENSILE_LIBPATH=/opt/rocm/lib/hipblaslt/library TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python examples/benchmark.py

aotriton

We need to build and install aotriton:

mkdir -p /share/libdrm
cp /opt/rocm/lib/rocm_sysdeps/share/libdrm/amdgpu.ids /share/libdrm/
dnf install gcc gcc-c++ make cmake
dnf install python3-devel
export HIP_PLATFORM=amd
export GPU_TARGETS=gfx1151
git clone https://github.com/ROCm/aotriton
cd aotriton
git submodule sync && git submodule update --init --recursive --force
mkdir build && cd build

#build
cmake .. -DCMAKE_INSTALL_PREFIX=./install_dir -DCMAKE_BUILD_TYPE=Release -DAOTRITON_GPU_BUILD_TIMEOUT=0 -DAOTRITON_TARGET_ARCH=gfx1151 -G Ninja
ninja install

# make sure pyaotriton linked
ln -s /home/lhl/aotriton/build/install_dir/lib/pyaotriton.cpython-313-x86_64-linux-gnu.so /usr/local/lib/python3.13/site-packages/

# make sure c lib is linked
export LD_LIBRARY_PATH=/opt/rocm/lib:/home/lhl/aotriton/build/install_dir/lib:/opt/rocm/lib:

python -c 'import pyaotriton'
  • takes about 1h wall time to build (27h CPU)

llama.cpp

Efficiency

2025-05-03: Currently, the Vulkan backend seems significantly faster than the HIP/ROCm backend on every single llama-bench tested model.

2025-05-12: In long context, if you can get HIP to build with rocWMMA then token generation stays high performance w/ FA while it drops significantly with Vulkan.

Interestingly, Vulkan w/ or w/o FA seems to use roughly the same reported memory, which doesn’t seem right at all, but that’s what the numbers seem to say.

The HIP version performs far below what you’d expect in terms of tok/TFLOP efficiency for prompt processing even vs other RDNA3 architectures: https://www.reddit.com/r/LocalLLaMA/comments/1ghvwsj/llamacpp_compute_and_memory_bandwidth_efficiency/

  • gfx1103 Radeon 780M iGPU gets 14.51 tok/TFLOP. At that efficiency you'd expect about the 850 tok/s that the Vulkan backend delivers. The HIP backends deliver about 350 tok/s, roughly 40% of that efficiency.
  • gfx1100 Radeon 7900 XTX gets 25.12 tok/TFLOP. At that efficiency you’d expect almost 1500 tok/s, almost double what the Vulkan backend delivers, and >4X what the current HIP backend delivers.

Testing a similar system with Linux 6.14 vs 6.15 showed a 15% performance difference so it’s possible future driver updates will improve/fix Strix Halo’s ROCm/HIP compute efficiency problems.

Memory bandwidth efficiency seems better. At 50 tok/s with a 3.56 GB quant, that's about 180 GB/s, which is close to the rocm_bandwidth_test peak of 212 GB/s. For HIP and Vulkan we are seeing 70.8-73.3% MBW efficiency (vs the 256 GB/s theoretical peak), which is actually quite good and in line with previously tested RDNA3 APUs.
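
A quick back-of-envelope check (token generation streams roughly the full 3.56 GB of Q4_0 weights once per token):

python -c 'print(50 * 3.56)'         # ~178 GB/s effective bandwidth
python -c 'print(50 * 3.56 / 256)'   # ~0.70 of the 256 GB/s theoretical peak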

How bad is the perf? Testing with the standard TheBloke/Llama-2-7B-GGUF (Q4_0), the HIP backend barely outperforms the CPU backend (!!!) for prompt processing. Interestingly, despite MBW being theoretically the same, the CPU tg is much worse:

| Run         |   pp512 (t/s) |  tg128 (t/s) | Max Mem (MiB) |
| ----------- | ------------: | -----------: | ------------: |
| CPU         | 294.64 ± 0.58 | 28.94 ± 0.04 |               |
| CPU + FA    | 294.36 ± 3.13 | 29.42 ± 0.03 |               |
| Vulkan      | 881.71 ± 1.71 | 52.22 ± 0.05 |          3923 |
| Vulkan + FA | 884.20 ± 6.23 | 52.73 ± 0.07 |          3923 |
| HIP         | 348.96 ± 0.31 | 48.72 ± 0.01 |          4219 |
| HIP + FA    | 331.96 ± 0.41 | 45.78 ± 0.02 |          4245 |
| WMMA        | 322.63 ± 1.34 | 48.40 ± 0.02 |          4218 |
| WMMA + FA   | 343.91 ± 0.60 | 50.88 ± 0.01 |          4218 |

If you have hipBLASLt installed, you should be able to basically match Vulkan pp512 performance with ROCBLAS_USE_HIPBLASLT=1 (as long as rocBLAS is compiled with hipBLASLt support, which it should be by default now):
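
A sketch of the invocation (the model path is assumed; not from the original benchmark runs):

ROCBLAS_USE_HIPBLASLT=1 llama.cpp-hip/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf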

Building

Vulkan

git clone https://github.com/ggml-org/llama.cpp llama.cpp-vulkan
cmake -B build -DGGML_VULKAN=ON -DGGML_RPC=ON && cmake --build build --config Release -j 32
  • takes about 1.5 minutes to build

Qwen 3 MoE

Currently there is a bug where the batch size has to be below 360 to prevent a crash; 256 (a multiple of 64) has tested as the best performer:

❯ llama.cpp-vulkan/build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Q4_K_M.gguf -b 256
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan,RPC |  99 |     256 |           pp512 |        144.36 ± 0.54 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan,RPC |  99 |     256 |           tg128 |         74.76 ± 0.07 |

build: d24d5928 (5255)

UPDATE: This no longer crashes, but -b 256 still performs better on Vulkan; prompt processing is almost 2X faster:

❯ build/bin/llama-bench -b 256 -m ~/models/Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan,RPC |  99 |     256 |           pp512 |        116.69 ± 0.22 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan,RPC |  99 |     256 |           tg128 |         74.77 ± 0.12 |

build: 43dfd741 (5338)

❯ build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan,RPC |  99 |           pp512 |         69.31 ± 0.10 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan,RPC |  99 |           tg128 |         74.90 ± 0.10 |

build: 43dfd741 (5338)

For the HIP backend, -b 256 slows things down though, so this appears to be a Vulkan-only optimization.

Flash Attention

Measuring memory usage with rocm-smi:

initial=$(rocm-smi --showmeminfo vram --csv | awk -F, 'NR==2{print int($3/1048576)}')
max=$initial
while sleep 1; do
  cur=$(rocm-smi --showmeminfo vram --csv | awk -F, 'NR==2{print int($3/1048576)}')
  (( cur > max )) && max=$cur
  printf "\r%s  used=%4d MiB  Δ=%4d MiB  peak=%4d MiB  Δpeak=%4d MiB " \
    "$(date +%T)" "$cur" "$((cur-initial))" "$max" "$((max-initial))"
done

And here’s an APU friendly version (measures GTT) using amdgpu_top:

initial=$(amdgpu_top -d | awk '/^[[:space:]]*GTT/{print int($4)}')
max=$initial
while sleep 1; do
  cur=$(amdgpu_top -d | awk '/^[[:space:]]*GTT/{print int($4)}')
  (( cur > max )) && max=$cur
  printf "\r%s  used=%4d MiB  Δ=%4d MiB  peak=%4d MiB  Δpeak=%4d MiB " \
    "$(date +%T)" "$cur" "$((cur-initial))" "$max" "$((max-initial))"
done

We compile the latest HEAD b5343 and test as usual with TheBloke/Llama-2-7B-GGUF (Q4_0).

pp512/tg128

At the standard pp512/tg128 tests we see that, as tested before, Vulkan continues to stomp all over the HIP backend, and that WMMA makes basically no difference:

| Run         |   pp512 (t/s) |  tg128 (t/s) | Max Mem (MiB) |
| ----------- | ------------: | -----------: | ------------: |
| Vulkan      | 881.71 ± 1.71 | 52.22 ± 0.05 |          3923 |
| Vulkan + FA | 884.20 ± 6.23 | 52.73 ± 0.07 |          3923 |
| HIP         | 348.96 ± 0.31 | 48.72 ± 0.01 |          4219 |
| HIP + FA    | 331.96 ± 0.41 | 45.78 ± 0.02 |          4245 |
| WMMA        | 322.63 ± 1.34 | 48.40 ± 0.02 |          4218 |
| WMMA + FA   | 343.91 ± 0.60 | 50.88 ± 0.01 |          4218 |

pp8192/tg8192

But when we switch to longer context, we see something interesting happen. WMMA + FA basically loses no performance at this longer context length!

Vulkan + FA still has better pp, but tg is significantly lower. More data points would be better, but it seems like Vulkan performance may continue to decrease as context extends, while the HIP+rocWMMA backend should hold up better.

| Run         |  pp8192 (t/s) |  tg8192 (t/s) | Max Mem (MiB) |
| ----------- | ------------: | ------------: | ------------: |
| Normal      | 245.59 ± 0.10 | 12.43 ± 0.006 |        +10591 |
| Normal + FA | 190.86 ± 0.49 | 30.01 ± 0.007 |         +8089 |
| WMMA        | 230.10 ± 0.70 | 12.37 ± 0.006 |        +10590 |
| WMMA + FA   | 368.77 ± 1.22 | 50.97 ± 0.007 |         +8062 |
| Vulkan      | 487.69 ± 0.83 |   7.54 ± 0.02 |    7761 +1180 |
| Vulkan + FA | 490.18 ± 4.89 |  32.03 ± 0.01 |    7767 +1180 |
  • You need to have rocWMMA installed - Arch has a package, or you will need to build it: https://github.com/ROCm/rocWMMA
  • You should then rebuild with -DGGML_HIP_ROCWMMA_FATTN=ON (see the sketch below)
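
A minimal sketch of that rebuild, reusing the gfx1151 HIP configure line from the build section below:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGGML_RPC=ON -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)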

Building rocWMMA Version

Fetch a gfx1151-aware rocWMMA

This is needed if you have an old rocWMMA that does not have gfx1151 support merged:

git clone https://github.com/ROCm/rocWMMA ~/llama.cpp/rocWMMA   # PR #538 included
  • The Fedora package is too old and aborts at compile-time

Make hipcc prefer the new headers

We need to give precedence to the new includes:


export CPATH=$HOME/llama.cpp/rocWMMA/library/include:$CPATH
# – or –
export HIPCC_COMPILE_FLAGS_APPEND="-I$HOME/llama.cpp/rocWMMA/library/include"
  • These env-vars are honoured by every hipcc invocation, putting your copy ahead of /usr/include/rocwmma. Can be done w/o root.

Stage the ROCm CMake Build-Tools locally

This is for when your Fedora install doesn't have rocm-cmake (grr):

git clone https://github.com/ROCm/rocm-cmake ~/src/rocm-cmake
cmake -S ~/src/rocm-cmake -B ~/src/rocm-cmake/build \
      -DCMAKE_INSTALL_PREFIX=$HOME/rocm
cmake --install ~/src/rocm-cmake/build
 
export CMAKE_PREFIX_PATH=$HOME/rocm:$CMAKE_PREFIX_PATH
  • Provides ROCmCMakeBuildToolsConfig.cmake, satisfying find_package() without sudo.

Stub out the legacy MFMA Flash-Attention kernel

This isn’t used but causes compile issues, so we zero it out/skip it.

// ggml/src/ggml-cuda/fattn-wmma-f16.cu (replacement)
#include "common.cuh"
#include "fattn-common.cuh"
 
extern "C" __global__ void flash_attn_ext_f16_stub() { /* noop */ }
 
void ggml_cuda_flash_attn_ext_wmma_f16(ggml_backend_cuda_context & ctx,
                                       ggml_tensor * dst) {
    GGML_UNUSED(ctx);
    GGML_UNUSED(dst);
}
  • gfx1151 lacks MFMA; compiling the original file fails. The stub keeps the symbol so the project still links

Configure and build llama.cpp for gfx1151

HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)"     \
cmake -S . -B build            \
      -DGGML_HIP=ON            \
      -DGGML_RPC=ON            \
      -DAMDGPU_TARGETS=gfx1151 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

RPC

Build llama.cpp-hip w/ RPC to run multinode:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_RPC=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release     && cmake --build build --config Release -- -j 32

cmake -B build -DGGML_VULKAN=ON -DGGML_RPC=ON && cmake --build build --config Release -j 32

When running llama-cli, by default it adds itself to the nodes, so you don't have to run a separate RPC instance. llama-bench does not, so you should run an RPC server on the host machine as well.
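
For example (ports and node IPs assumed, mirroring the Maverick run below):

# run a local RPC server alongside the remote nodes
llama.cpp-vulkan/build/bin/rpc-server -p 50052 -H 127.0.0.1 &
llama.cpp-vulkan/build/bin/llama-bench -m ~/models/model.gguf \
  --rpc localhost:50052,192.168.128.12:50052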

Vulkan has less default memory available than ROCm for some reason!

amdgpu_top / rocm-smi show 14/110000 MiB in use, so 108000 is probably pretty safe:

lhl@cluster2:~/llama.cpp$ llama.cpp-vulkan/build/bin/rpc-server -p 50052 -H 0.0.0.0

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
Starting RPC server v2.0.0
  endpoint       : 0.0.0.0:50052
  local cache    : n/a
  backend memory : 78794 MB


lhl@cluster2:~/llama.cpp$ llama.cpp-hip/build/bin/rpc-server -p 50052 -H 0.0.0.0

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Starting RPC server v2.0.0
  endpoint       : 0.0.0.0:50052
  local cache    : n/a
  backend memory : 104742 MB

# Manually specify memory
lhl@cluster2:~/llama.cpp$ llama.cpp-vulkan/build/bin/rpc-server -p 50052 -H 0.0.0.0 -m 108000

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
Starting RPC server v2.0.0
  endpoint       : 0.0.0.0:50052
  local cache    : n/a
  backend memory : 108000 MB

Llama 4 Maverick

This is a big MoE model we run to test: https://huggingface.co/unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF (Q4_K_XL = 243 GB)

❯ llama.cpp-vulkan/build/bin/llama-bench -m ~/models/Llama-4-Maverick-17B-128E-Instruct-UD-Q4_K_XL-00001-of-00005.gguf --rpc localhost:50052,192.168.128.12:50052,192.168.128.14:50052
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama4 17Bx128E (Maverick) Q4_K - Medium | 216.18 GiB |   400.71 B | Vulkan,RPC |  99 |           pp512 |         57.93 ± 0.96 |
| llama4 17Bx128E (Maverick) Q4_K - Medium | 216.18 GiB |   400.71 B | Vulkan,RPC |  99 |           tg128 |         16.30 ± 0.14 |

Improving Performance

rocBLAS w/ hipBLASLt

First, test whether rocBLAS dispatches to hipBLASLt (compare the output with and without the env var):

rocblas-bench -f gemm_ex -m 1024 -n 1024 -k 1024 \
              --a_type f16_r --b_type f16_r --c_type f16_r \
              --d_type f16_r --compute_type f16_r | grep -i BLASLT

export ROCBLAS_USE_HIPBLASLT=1

rocblas-bench -f gemm_ex -m 1024 -n 1024 -k 1024 \
              --a_type f16_r --b_type f16_r --c_type f16_r \
              --d_type f16_r --compute_type f16_r | grep -i BLASLT

If the results don't change, your rocBLAS likely wasn't built with hipBLASLt support, so build it from source:

git clone https://github.com/ROCm/rocBLAS
cd rocBLAS

# May need to edit install.sh to `elevate_if_not_root dnf install -y ${package_dependencies} --skip-unavailable`
# Also if the paths are wrong...
sudo ln -s /opt/rocm/lib/llvm/bin/amdclang++ /opt/rocm/bin/amdclang++
sudo ln -s /opt/rocm/lib/llvm/bin/amdclang     /opt/rocm/bin/amdclang

dnf install libdrm-devel -y
ldconfig -p | grep -E 'libdrm(_amdgpu)?\.so$'

# -c still gives errors, so we just skip the client for now
HIP_PLATFORM=amd ./install.sh -id -j$(nproc) -a gfx1151

# If we want to point to the new rocBLAS
export LD_LIBRARY_PATH=/home/lhl/rocBLAS/build/release/rocblas-install/lib:$LD_LIBRARY_PATH

TODO: Speculative Decode

https://github.com/ggml-org/llama.cpp/issues/12968 https://github.com/hjc4869/llama.cpp https://x.com/hjc4869/status/1913562550064799896

RDNA3 gets a sizable performance uplift with speculative decoding on 4-bit models (--draft-max 3 --draft-min 3), and you'll most likely get 8-12 t/s for a 70-72B dense model.
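
A minimal sketch (the model paths and main/draft pairing are hypothetical; tune the draft parameters per model):

llama.cpp-hip/build/bin/llama-server \
  -m  ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -md ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --draft-max 3 --draft-min 3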

Sweep https://github.com/AUGMXNT/speed-benchmarking/tree/main/llama.cpp-code

TODO: Voicechat

https://github.com/AUGMXNT/speed-benchmarking

Building PyTorch

We want hipBLASLt for general performance and AOTriton for FA2. We should also be able to build with a gfx1151-compatible CK, but that's probably not so useful.

Compile

hipBLASLt

We are in an env where we have hipBLASLt already, but if you're building it yourself:

git clone https://github.com/ROCm/hipBLASLt
cd hipBLASLt
python3 -m pip install -r tensilelite/requirements.txt
# You may need to comment out the PyYAML install
# Also if the paths are wrong...
sudo ln -s /opt/rocm/lib/llvm/bin/amdclang++ /opt/rocm/bin/amdclang++
sudo ln -s /opt/rocm/lib/llvm/bin/amdclang     /opt/rocm/bin/amdclang
export HIP_PLATFORM=amd
export HIPBLASLT_ENABLE_MARKER=0
./install.sh -idc -a gfx1151

# Test if it's working
./hipblaslt-test
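
Once built and installed, it can be picked up at runtime the same way as tested elsewhere in this doc:

# rocBLAS route (as in the rocblas-bench test above)
export ROCBLAS_USE_HIPBLASLT=1
# PyTorch route (checked by the test scripts below)
export TORCH_BLAS_PREFER_HIPBLASLT=1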

aotriton

See the aotriton section above, this gets built to /home/lhl/aotriton/build/install_dir which you can just point to or you can download the latest release for your version of ROCm: https://github.com/ROCm/aotriton/releases

Composable Kernel (CK)

git clone https://github.com/ROCm/composable_kernel.git
mkdir composable_kernel/build
cd composable_kernel/build

cmake \
        -D CMAKE_PREFIX_PATH=/opt/rocm \
        -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
        -D CMAKE_BUILD_TYPE=Release \
        -D GPU_TARGETS="gfx1151" \
        -D HIP_PLATFORM=amd \
        ..

# About 15 minutes
time make -j

time make -j install

PyTorch

# Enable ROCm (HIP) build and disable CUDA
export USE_ROCM=1
export USE_CUDA=0

# DISABLE KINETO
export USE_KINETO=OFF

# still needed for ROCM_ROCTX_LIB
dnf install roctracer-devel
ln -s /opt/rocm/lib/librocprofiler-sdk-roctx.so /opt/rocm/lib/libroctx64.so

# Will complain about tracing which we're not building...
export BUILD_TEST=OFF

# Needed
dnf install libdrm-devel

# for benchmark.h? - or export BUILD_TEST=OFF
dnf install google-benchmark-devel

# Enable AOTriton integration (FlashAttention kernels) - flag changed w/ 2.8?
export USE_AOTRITON=1
export BUILD_AOTRITON=1

# Specify target GPU architectures for ROCm (limit to gfx1151 for Strix Halo)
export PYTORCH_ROCM_ARCH="gfx1151"

# Point to pre-installed AOTriton (adjust the path to your AOTriton install dir)
export AOTRITON_INSTALLED_PREFIX="/home/lhl/aotriton/build/install_dir"

# Add ROCm and custom library paths to CMake search path
export CMAKE_PREFIX_PATH="/opt/rocm:${CMAKE_PREFIX_PATH}"

# Ensure ROCm libs (and any custom build libs) are in the runtime library path
export LD_LIBRARY_PATH="/opt/rocm/lib:${AOTRITON_INSTALLED_PREFIX}/lib:${LD_LIBRARY_PATH}"

export CXXFLAGS="$CXXFLAGS -Wno-unused-function -Wno-error=unused-function -Wno-error=deprecated-declarations -Wno-error=switch -Wno-error=unused-local-typedefs  -Wno-error=calloc-transposed-args -Wno-array-bound -Wno-error=array-bound"
# export CXXFLAGS="$CXXFLAGS -Wno-error"
# export CCFLAGS="$CFLAGS -Wno-error"
# export HIPCC_FLAGS="$HIPCC_FLAGS -Wno-error"   # for hipcc-compiled kernels


We need to add defined(__gfx1151__) || to third_party/composable_kernel/include/ck/ck.hpp.
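
Illustrative only (the exact guard and its neighboring architectures vary by CK version, so inspect the file first):

grep -n 'gfx11' third_party/composable_kernel/include/ck/ck.hpp
# then append defined(__gfx1151__) || to the matching #if defined(...) guard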

# Before we start compiling we need to hipify:
python tools/amd_build/build_amd.py

# see below for rocm-cmake

# see below for rocm-core


# If using CI, modify for STATIC benchmarks OFF
time .ci/pytorch/build.sh

# or just try to directly run:
# cmake3 --build . --target install --config Release

# To get things working/installed properly...
python setup.py develop && python -c "import torch"


# Does this work?
python -c 'import torch,os; print(torch.version.hip, torch.cuda.get_device_name(0))'

# python - <<'PY'
import torch
print("HIP runtime:", torch.version.hip)
print("Device:", torch.cuda.get_device_name(0))
PY
HIP runtime: 6.4.43480-9f04e2822
Device: AMD Radeon Graphics

rocm-cmake

git clone https://github.com/ROCm/rocm-cmake ~/src/rocm-cmake
cmake -S ~/src/rocm-cmake -B ~/src/rocm-cmake/build \
      -DCMAKE_INSTALL_PREFIX=$HOME/rocm
cmake --install ~/src/rocm-cmake/build

export CMAKE_PREFIX_PATH=$HOME/rocm:$CMAKE_PREFIX_PATH

rocm-core

git clone https://github.com/ROCm/rocm-core.git
mkdir -p rocm-core/build
cmake -S rocm-core -B rocm-core/build \
      -DCMAKE_INSTALL_PREFIX=$HOME/rocm \
      -DROCM_VERSION=6.4.0          # match the HIP version you’re using
cmake --build rocm-core/build -j$(nproc)
cmake --install rocm-core/build

Testing PyTorch

# python ../env-info.py
=== System Information ===
Kernel: Linux 2a571ed8a21f 6.15.0-0.rc3.20250422gita33b5a08cbbd.29.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Apr 22 15:25:32 UTC 2025 x86_64 GNU/Linux
Cpu Info: CPU: AMD Eng Sample (x32)
Memory Info: Total Memory: 120554 MB

=== GPU Information ===
CUDA: Not found
ROCm: ROCM-SMI version: 3.0.0+c865ebb
ROCM-SMI-LIB version: 7.5.0
PyTorch CUDA Available: True
PyTorch CUDA Version: N/A
PyTorch HIP Version: 6.4.43480-9f04e2822

GPU Count: 1
GPU 0: AMD Radeon Graphics

=== Package Versions ===
triton: 3.3.0
torch: 2.8.0a0+git8511d21
torchao: Not installed
transformers: Not installed
flash_attn: Not installed
xformers: Not installed
deepspeed: Not installed
accelerate: Not installed
bitsandbytes: Not installed
axolotl: Not installed
torchtune: Not installed
# python 02-test-aotriton.py
Triton version: 3.3.0
Driver info: ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', 'active', 'default', 'reset_active', 'set_active']
PyAOTriton imported successfully!
PyAOTriton contents: ['DType', 'HipMemory', 'Stream', 'T0', 'T1', 'T2', 'T4', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'get_name_suffix', 'hipDeviceSynchronize', 'hipErrorAlreadyAcquired', 'hipErrorAlreadyMapped', 'hipErrorArrayIsMapped', 'hipErrorAssert', 'hipErrorCapturedEvent', 'hipErrorContextAlreadyCurrent', 'hipErrorContextAlreadyInUse', 'hipErrorContextIsDestroyed', 'hipErrorCooperativeLaunchTooLarge', 'hipErrorDeinitialized', 'hipErrorECCNotCorrectable', 'hipErrorFileNotFound', 'hipErrorGraphExecUpdateFailure', 'hipErrorHostMemoryAlreadyRegistered', 'hipErrorHostMemoryNotRegistered', 'hipErrorIllegalAddress', 'hipErrorIllegalState', 'hipErrorInitializationError', 'hipErrorInsufficientDriver', 'hipErrorInvalidConfiguration', 'hipErrorInvalidContext', 'hipErrorInvalidDevice', 'hipErrorInvalidDeviceFunction', 'hipErrorInvalidDevicePointer', 'hipErrorInvalidGraphicsContext', 'hipErrorInvalidHandle', 'hipErrorInvalidImage', 'hipErrorInvalidKernelFile', 'hipErrorInvalidMemcpyDirection', 'hipErrorInvalidPitchValue', 'hipErrorInvalidResourceHandle', 'hipErrorInvalidSource', 'hipErrorInvalidSymbol', 'hipErrorInvalidValue', 'hipErrorLaunchFailure', 'hipErrorLaunchOutOfResources', 'hipErrorLaunchTimeOut', 'hipErrorMapBufferObjectFailed', 'hipErrorMapFailed', 'hipErrorMemoryAllocation', 'hipErrorMissingConfiguration', 'hipErrorNoBinaryForGpu', 'hipErrorNoDevice', 'hipErrorNotFound', 'hipErrorNotInitialized', 'hipErrorNotMapped', 'hipErrorNotMappedAsArray', 'hipErrorNotMappedAsPointer', 'hipErrorNotReady', 'hipErrorNotSupported', 'hipErrorOperatingSystem', 'hipErrorOutOfMemory', 'hipErrorPeerAccessAlreadyEnabled', 'hipErrorPeerAccessNotEnabled', 'hipErrorPeerAccessUnsupported', 'hipErrorPriorLaunchFailure', 'hipErrorProfilerAlreadyStarted', 'hipErrorProfilerAlreadyStopped', 'hipErrorProfilerDisabled', 'hipErrorProfilerNotInitialized', 'hipErrorRuntimeMemory', 'hipErrorRuntimeOther', 'hipErrorSetOnActiveProcess', 'hipErrorSharedObjectInitFailed', 'hipErrorSharedObjectSymbolNotFound', 'hipErrorStreamCaptureImplicit', 'hipErrorStreamCaptureInvalidated', 'hipErrorStreamCaptureIsolation', 'hipErrorStreamCaptureMerge', 'hipErrorStreamCaptureUnjoined', 'hipErrorStreamCaptureUnmatched', 'hipErrorStreamCaptureUnsupported', 'hipErrorStreamCaptureWrongThread', 'hipErrorTbd', 'hipErrorUnknown', 'hipErrorUnmapFailed', 'hipErrorUnsupportedLimit', 'hipError_t', 'hipSuccess', 'kBFloat16', 'kFloat16', 'kFloat32', 'kInt16', 'kInt32', 'kInt64', 'kInt8', 'kUInt16', 'kUInt32', 'kUInt64', 'kUInt8', 'kUnknown', 'v2']
Using device: cuda
Kernel executed successfully: True
LD_LIBRARY_PATH: /opt/rocm/lib:/home/lhl/aotriton/build/install_dir/lib:/opt/rocm/lib:
# python 03-test_aotriton_pytorch.py
PyTorch version: 2.8.0a0+git8511d21
CUDA available: True
ROCm version: 6.4.43480-9f04e2822

Environment variables:
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: Not set
LD_LIBRARY_PATH: /opt/rocm/lib:/home/lhl/aotriton/build/install_dir/lib:/opt/rocm/lib:
PYTORCH_ROCM_ARCH: gfx1151

pyaotriton imported successfully
torch.ops.aotriton is available
Registered aten ops: 834

Testing scaled_dot_product_attention...
Success! Result shape: torch.Size([1, 1, 128, 64])
# python 04-test_attention_small.py
Testing with sizes: batch=1, heads=1, seq_len=128, head_dim=64
Basic attention success! Result shape: torch.Size([1, 1, 128, 64])
AOTriton attention success! Result shape: torch.Size([1, 1, 128, 64])
# python 05-attention-bench.py
╔═════════════════════════════════════════════════════════════════════════════════════════╗
║                                  AOTriton Status Check                                  ║
╚═════════════════════════════════════════════════════════════════════════════════════════╝
PyTorch version: 2.8.0a0+git8511d21
CUDA available: True
ROCm version: 6.4.43480-9f04e2822
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: 1
pyaotriton imported successfully
torch.ops.aotriton is available

╔═════════════════════════════════════════════════════════════════════════════════════════╗
║                           Testing Tiny: B=1, H=1, S=128, D=64                           ║
╚═════════════════════════════════════════════════════════════════════════════════════════╝
Estimated memory per QKV tensor: 0.00 GB
Total QKV memory: 0.00 GB
+--------------+----------------+-------------------+----------------+-------------------+
| Operation    |   FW Time (ms) |   FW FLOPS (TF/s) |   BW Time (ms) |   BW FLOPS (TF/s) |
+==============+================+===================+================+===================+
| Causal FA2   |         0.0886 |              0.02 |         0.125  |              0.04 |
+--------------+----------------+-------------------+----------------+-------------------+
| Regular SDPA |         0.0689 |              0.03 |         0.1241 |              0.04 |
+--------------+----------------+-------------------+----------------+-------------------+

╔═════════════════════════════════════════════════════════════════════════════════════════╗
║                           Testing Small: B=2, H=4, S=512, D=64                          ║
╚═════════════════════════════════════════════════════════════════════════════════════════╝
Estimated memory per QKV tensor: 0.00 GB
Total QKV memory: 0.00 GB
+--------------+----------------+-------------------+----------------+-------------------+
| Operation    |   FW Time (ms) |   FW FLOPS (TF/s) |   BW Time (ms) |   BW FLOPS (TF/s) |
+==============+================+===================+================+===================+
| Causal FA2   |         0.5034 |              0.53 |         0.6336 |              1.06 |
+--------------+----------------+-------------------+----------------+-------------------+
| Regular SDPA |         0.4589 |              0.58 |         0.6298 |              1.07 |
+--------------+----------------+-------------------+----------------+-------------------+

╔═════════════════════════════════════════════════════════════════════════════════════════╗
║                          Testing Medium: B=4, H=8, S=1024, D=64                         ║
╚═════════════════════════════════════════════════════════════════════════════════════════╝
Estimated memory per QKV tensor: 0.00 GB
Total QKV memory: 0.01 GB
+--------------+----------------+-------------------+----------------+-------------------+
| Operation    |   FW Time (ms) |   FW FLOPS (TF/s) |   BW Time (ms) |   BW FLOPS (TF/s) |
+==============+================+===================+================+===================+
| Causal FA2   |        16.2504 |              0.26 |        16.0349 |              0.67 |
+--------------+----------------+-------------------+----------------+-------------------+
| Regular SDPA |        15.5982 |              0.28 |        16.0953 |              0.67 |
+--------------+----------------+-------------------+----------------+-------------------+

╔═════════════════════════════════════════════════════════════════════════════════════════╗
║                          Testing Large: B=8, H=16, S=2048, D=64                         ║
╚═════════════════════════════════════════════════════════════════════════════════════════╝
Estimated memory per QKV tensor: 0.03 GB
Total QKV memory: 0.09 GB
+--------------+----------------+-------------------+----------------+-------------------+
| Operation    |   FW Time (ms) |   FW FLOPS (TF/s) |   BW Time (ms) |   BW FLOPS (TF/s) |
+==============+================+===================+================+===================+
| Causal FA2   |        151.853 |              0.45 |        131.531 |              1.31 |
+--------------+----------------+-------------------+----------------+-------------------+
| Regular SDPA |        120.143 |              0.57 |        131.255 |              1.31 |
+--------------+----------------+-------------------+----------------+-------------------+

╔═════════════════════════════════════════════════════════════════════════════════════════╗
║                         Testing XLarge: B=16, H=16, S=4096, D=64                        ║
╚═════════════════════════════════════════════════════════════════════════════════════════╝
Estimated memory per QKV tensor: 0.12 GB
Total QKV memory: 0.38 GB
Memory access fault by GPU node-1 (Agent handle: 0x55b017570c40) on address 0x7fcd499e6000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

# python 06-check-hipblaslt.py
=== Environment ===
LD_LIBRARY_PATH: /opt/rocm/lib:/home/lhl/aotriton/build/install_dir/lib:/opt/rocm/lib:
PYTORCH_ROCM_ARCH: gfx1151
HSA_OVERRIDE_GFX_VERSION: Not set

=== PyTorch Info ===
PyTorch version: 2.8.0a0+git8511d21
CUDA available: True
ROCm version: 6.4.43480-9f04e2822

=== GPU Info ===
GPU: AMD Radeon Graphics
GPU Architecture: ['gfx1151']
rocminfo: Name:                    gfx1151
rocminfo: Name:                    amdgcn-amd-amdhsa--gfx1151
rocminfo: Name:                    amdgcn-amd-amdhsa--gfx11-generic

=== hipBLASLt Check ===
hipBLASLt linking: libhipblaslt.so.0 => /opt/rocm/lib/libhipblaslt.so.0 (0x00007f14e2e05000)
hipBLASLt linking: libhipblaslt-d.so.0 => /opt/rocm/lib/libhipblaslt-d.so.0 (0x00007f14dbdcb000)

hipBLASLt directory contents:
  library/
    TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1151.dat
    TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx1151.dat
    TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_gfx1151.dat
    TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx1151.dat
    TensileLibrary_BB_BB_HA_Bias_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1151.dat
    ... and 43 more files
# python 08-test-hipblaslt-perf.py
Environment check:
PYTORCH_ROCM_ARCH: gfx1151
HIPBLASLT_TENSILE_LIBPATH: /opt/rocm/lib/hipblaslt/library
TORCH_BLAS_PREFER_HIPBLASLT: 1
Testing GEMM performance...
GEMM 4096x4096x4096: 21.613 ms, 6.36 TFLOPS

Testing Attention performance...
Attention 8x16x2048x64: 150.233 ms, 0.91 TFLOPS

WARNING: GEMM performance is low. hipBLASLt may not be properly configured.
Check that:
1. The correct architecture kernels are in /opt/rocm/lib/hipblaslt/library
2. HIPBLASLT_TENSILE_LIBPATH is set correctly
3. Your GPU architecture matches the available kernels

### They are...
# /opt/rocm/lib/hipblaslt/library/TensileLibrary_lazy_gfx1151.dat
# /opt/rocm/lib/hipblaslt/library/extop_gfx1151.co
# python 09-test-attention-backend.py
=== Environment ===
PYTORCH_ROCM_ARCH: gfx1151
HIPBLASLT_TENSILE_LIBPATH: /opt/rocm/lib/hipblaslt/library
TORCH_BLAS_PREFER_HIPBLASLT: 1
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: 1

=== Backend Check ===
PyTorch version: 2.8.0a0+git8511d21
CUDA available: True
Current device: 0
Device name: AMD Radeon Graphics

=== AOTriton Check ===
pyaotriton imported successfully
torch.ops.aotriton is available
AOTriton ops: ['__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dir', 'name']

=== SDPA Backends ===
Flash SDPA enabled: True
Memory efficient SDPA enabled: True
Math SDPA enabled: True

=== Testing Attention Variants ===

1. Standard SDPA (no causal):
Standard SDPA: 118.740 ms, 1.16 TFLOPS

2. Causal SDPA:
Causal SDPA: 149.004 ms, 0.92 TFLOPS

3. SDPA with attn_mask:
SDPA with mask: 149.673 ms, 0.92 TFLOPS

4. Force Flash Attention backend:
/usr/lib64/python3.13/contextlib.py:109: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py:95: UserWarning: Memory efficient kernel not used because: (Triggered internally at /home/lhl/torch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:859.)
  lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py:95: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at /home/lhl/torch/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:550.)
  lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py:95: UserWarning: Flash attention kernel not used because: (Triggered internally at /home/lhl/torch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:861.)
  lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py:95: UserWarning: Flash attention was not compiled for current AMD GPU architecture. Attempting to run on architecture gfx1151 (Triggered internally at /home/lhl/torch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:241.)
  lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py:95: UserWarning: CuDNN attention kernel not used because: (Triggered internally at /home/lhl/torch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:863.)
  lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py:95: UserWarning: Torch was not compiled with cuDNN attention. (Triggered internally at /home/lhl/torch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:618.)
  lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
Traceback (most recent call last):
  File "/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py", line 93, in <module>
    test_attention_variant(
    ~~~~~~~~~~~~~~~~~~~~~~^
        "Flash Attention",
        ^^^^^^^^^^^^^^^^^^
        lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py", line 58, in test_attention_variant
    _ = func(q, k, v)
  File "/home/lhl/strix-halo-testing/flash-attention/09-test-attention-backend.py", line 95, in <lambda>
    lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.

# python 10-test_aotriton_direct.py
=== AOTriton Direct Test ===
AOTriton module: <module 'pyaotriton' from '/usr/local/lib/python3.13/site-packages/pyaotriton.cpython-313-x86_64-linux-gnu.so'>
AOTriton directory: ['DType', 'HipMemory', 'Stream', 'T0', 'T1', 'T2', 'T4', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'get_name_suffix', 'hipDeviceSynchronize', 'hipErrorAlreadyAcquired', 'hipErrorAlreadyMapped', 'hipErrorArrayIsMapped', 'hipErrorAssert', 'hipErrorCapturedEvent', 'hipErrorContextAlreadyCurrent', 'hipErrorContextAlreadyInUse', 'hipErrorContextIsDestroyed', 'hipErrorCooperativeLaunchTooLarge', 'hipErrorDeinitialized', 'hipErrorECCNotCorrectable', 'hipErrorFileNotFound', 'hipErrorGraphExecUpdateFailure', 'hipErrorHostMemoryAlreadyRegistered', 'hipErrorHostMemoryNotRegistered', 'hipErrorIllegalAddress', 'hipErrorIllegalState', 'hipErrorInitializationError', 'hipErrorInsufficientDriver', 'hipErrorInvalidConfiguration', 'hipErrorInvalidContext', 'hipErrorInvalidDevice', 'hipErrorInvalidDeviceFunction', 'hipErrorInvalidDevicePointer', 'hipErrorInvalidGraphicsContext', 'hipErrorInvalidHandle', 'hipErrorInvalidImage', 'hipErrorInvalidKernelFile', 'hipErrorInvalidMemcpyDirection', 'hipErrorInvalidPitchValue', 'hipErrorInvalidResourceHandle', 'hipErrorInvalidSource', 'hipErrorInvalidSymbol', 'hipErrorInvalidValue', 'hipErrorLaunchFailure', 'hipErrorLaunchOutOfResources', 'hipErrorLaunchTimeOut', 'hipErrorMapBufferObjectFailed', 'hipErrorMapFailed', 'hipErrorMemoryAllocation', 'hipErrorMissingConfiguration', 'hipErrorNoBinaryForGpu', 'hipErrorNoDevice', 'hipErrorNotFound', 'hipErrorNotInitialized', 'hipErrorNotMapped', 'hipErrorNotMappedAsArray', 'hipErrorNotMappedAsPointer', 'hipErrorNotReady', 'hipErrorNotSupported', 'hipErrorOperatingSystem', 'hipErrorOutOfMemory', 'hipErrorPeerAccessAlreadyEnabled', 'hipErrorPeerAccessNotEnabled', 'hipErrorPeerAccessUnsupported', 'hipErrorPriorLaunchFailure', 'hipErrorProfilerAlreadyStarted', 'hipErrorProfilerAlreadyStopped', 'hipErrorProfilerDisabled', 'hipErrorProfilerNotInitialized', 'hipErrorRuntimeMemory', 'hipErrorRuntimeOther', 'hipErrorSetOnActiveProcess', 'hipErrorSharedObjectInitFailed', 'hipErrorSharedObjectSymbolNotFound', 'hipErrorStreamCaptureImplicit', 'hipErrorStreamCaptureInvalidated', 'hipErrorStreamCaptureIsolation', 'hipErrorStreamCaptureMerge', 'hipErrorStreamCaptureUnjoined', 'hipErrorStreamCaptureUnmatched', 'hipErrorStreamCaptureUnsupported', 'hipErrorStreamCaptureWrongThread', 'hipErrorTbd', 'hipErrorUnknown', 'hipErrorUnmapFailed', 'hipErrorUnsupportedLimit', 'hipError_t', 'hipSuccess', 'kBFloat16', 'kFloat16', 'kFloat32', 'kInt16', 'kInt32', 'kInt64', 'kInt8', 'kUInt16', 'kUInt32', 'kUInt64', 'kUInt8', 'kUnknown', 'v2']

AOTriton v2 available: ['CppTune', 'CppTuneSpecialKernelIndex', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'flash', 'kDefault', 'kSkipGPUCall']

=== Torch Ops ===
torch.ops.aotriton: ['__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dir', 'name']
No flash_attention found in aotriton ops

=== Testing AOTriton Functions ===
Found AOTriton flash attention functions!
Direct AOTriton call failed: attn_fwd(): incompatible function arguments. The following argument types are supported:
    1. (q: pyaotriton.T4, k: pyaotriton.T4, v: pyaotriton.T4, b: pyaotriton.T4, sm_scale: float, softmax_lse: pyaotriton.T2, out: pyaotriton.T4, dropout_p: float, philox_seed: pyaotriton.T0, philox_offset1: pyaotriton.T0, philox_offset2: int, philox_seed_output: pyaotriton.T0, philox_offset_output: pyaotriton.T0, encoded_softmax: pyaotriton.T4, is_causal: bool, atomic_for_causal: pyaotriton.T0, stream: pyaotriton.Stream = None, extargs: pyaotriton.v2.flash.FwdExtraArguments = <pyaotriton.v2.flash.FwdExtraArguments object at 0x7f1e90857470>) -> pyaotriton.hipError_t

Invoked with: tensor([[[[ 8.7842e-01,  2.9517e-01, -1.3975e+00,  ..., -5.7520e-01,
            7.1777e-01, -5.8887e-01],
          ...]]]], device='cuda:0', dtype=torch.float16), ...
          [ 0.3455, -1.0078, -2.2500,  ..., -0.1362,  0.4175, -2.5391],
          [-0.2849, -0.9790, -0.1232,  ...,  0.7690, -0.0687,  0.6992],
          ...,
          [ 1.5488, -0.9888, -0.7573,  ..., -0.6802,  1.7705,  0.5068],
          [-0.1108,  1.5000,  0.1373,  ..., -0.7476, -0.0363, -1.0791],
          [ 0.2421,  1.4014, -0.7021,  ...,  0.3547, -0.5068, -0.7393]],

         [[ 0.2507,  0.5610, -0.3145,  ...,  0.1754,  0.6875, -1.1465],
          [ 2.3809, -0.4871, -0.3687,  ...,  0.0426,  0.3660,  0.5264],
          [ 0.5962,  0.0942,  0.3777,  ...,  1.1758,  1.0400, -0.9888],
          ...,
          [-0.0923,  1.1201, -0.6548,  ...,  1.0742,  2.3008, -0.0726],
          [ 0.5532,  0.7227,  0.7153,  ..., -0.6240,  0.0735, -0.2072],
          [-0.0040,  1.7100,  0.3816,  ..., -0.8926, -1.1201,  0.3584]],

         [[ 0.6475,  0.0412, -0.1018,  ..., -1.2344, -0.8154,  1.0732],
          [-1.5967,  0.4148, -0.4739,  ...,  1.2676,  0.1183,  0.3384],
          [ 0.6641,  0.4724,  0.0482,  ...,  2.4902,  0.8418,  0.8423],
          ...,
          [ 0.3525,  0.5088,  1.2031,  ..., -0.1844,  0.7842,  0.9062],
          [-0.1414, -0.3110,  0.0466,  ...,  1.2637, -0.1755, -0.0563],
          [ 1.0723, -2.1602,  0.7920,  ...,  1.0166,  0.1201,  1.6133]]]],
       device='cuda:0', dtype=torch.float16), None, 0.0, True, None

=== Checking PyTorch Backend Registration ===
No AOTriton attention ops found in PyTorch registry

=== Relevant Environment Variables ===
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: 1
PYTORCH_ROCM_ARCH: gfx1151
HSA_OVERRIDE_GFX_VERSION: Not set
ROCM_ARCH: Not set
HIP_VISIBLE_DEVICES: Not set
PYTORCH_ROCM_AOTRITON_PREFER_DEFAULT: Not set

=== Debugging PyTorch SDPA Selection ===
SDPA execution completed

=== Architecture Detection ===
Device properties: _CudaDeviceProperties(name='AMD Radeon Graphics', major=11, minor=5, gcnArchName='gfx1151', total_memory=104906MB, multi_processor_count=20, uuid=58580000-0000-0000-0000-000000000000, pci_bus_id=194, pci_device_id=0, pci_domain_id=0, L2_cache_size=2MB)
rocminfo: Name:                    gfx1151
rocminfo: Name:                    amdgcn-amd-amdhsa--gfx1151
rocminfo: Name:                    amdgcn-amd-amdhsa--gfx11-generic
# python 11-test_manual_aotriton.py
=== Manual AOTriton Test ===
AOTriton functions:
v2 module: ['CppTune', 'CppTuneSpecialKernelIndex', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'flash', 'kDefault', 'kSkipGPUCall']
flash module: ['BwdExtraArguments', 'FusedBwdExtraArguments', 'FwdExtraArguments', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'attn_bwd', 'attn_bwd_compact_varlen', 'attn_bwd_fused', 'attn_fwd', 'attn_fwd_compact_varlen', 'check_gpu', 'debug_fill_dropout_rng', 'debug_fill_dropout_rng_tensor', 'debug_simulate_encoded_softmax']
Successfully imported attention functions!

Testing AOTriton attention...
AOTriton call failed: no signature found for builtin <built-in method attn_fwd of PyCapsule object at 0x7fa509606840>

=== PyTorch Integration Check ===
torch.ops.aotriton exists
Available ops: ['__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dir', 'name']

=== Manual Registration Check ===
# python 14-simple_aotriton_test.py
=== Simple AOTriton Test ===
Error creating AOTriton types: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. pyaotriton.HipMemory()

Invoked with: kwargs: ptr=140181988442112, size=2048

=== Checking for compiled kernels ===
Error checking GPU: check_gpu(): incompatible function arguments. The following argument types are supported:
    1. (stream: pyaotriton.Stream) -> pyaotriton.hipError_t

Invoked with:

=== Environment Check ===
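
The "no signature found" / "incompatible constructor arguments" errors above suggest the Python-side calls don't match the pybind11 signatures this AOTriton build actually exports. Since pybind11 embeds the C++ signature in each builtin's docstring, a quick probe can dump what the bindings really expect (a debugging sketch; module/attribute names are taken from the dir() listings above):

```python
# probe_aotriton.py - print the signatures this pyaotriton build exposes
# (sketch; names observed in the dir() output above)
import pyaotriton
from pyaotriton.v2 import flash

for name in ("attn_fwd", "attn_bwd", "check_gpu"):
    fn = getattr(flash, name, None)
    if fn is not None:
        print(f"--- {name} ---")
        print(fn.__doc__)  # pybind11 builtins carry the C++ signature here

# the helper types referenced by the error messages live at the top level
print(pyaotriton.HipMemory.__init__.__doc__)
print(pyaotriton.Stream.__doc__)
```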

Docker Files

  • We run our commands from the rocm-TheRock repo root, otherwise relative paths break

Initial env setup:

mamba activate
mamba install cmake ninja patchelf
pip install uv
uv pip install meson

git clone https://github.com/scottt/rocm-TheRock
python ./build_tools/fetch_sources.py
cmake -B build -GNinja . -DTHEROCK_AMDGPU_TARGETS=gfx1151
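
From here the build itself would presumably be driven through CMake/Ninja as usual (a sketch; the job count is an arbitrary choice, lower it if link steps OOM):

```bash
# Drive the Ninja build through CMake
cmake --build build -- -j16
```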

rocm-dev

We need to build rocm-dev first:

❯ sh build-rocm-docker.sh
# podman build --build-arg FEDORA_VER=41 -t rocm-dev:41 -f dockerfiles/pytorch-dev/rocm_fedora.Dockerfile .
...

--> 34544a2de4e0
[3/3] STEP 5/5: RUN printf "export PATH=/opt/rocm/bin:$PATH\n" > /etc/profile.d/rocm.sh
[3/3] COMMIT rocm-dev:41
--> 758b36e33cae
Successfully tagged localhost/rocm-dev:41
758b36e33cae4706e6a7030b6ae2604d6655da5c4a6305bfada0ca04339a5f98

lhl in 🌐 cluster1 in rocm-TheRock on  gfx1151 [?] via △ v4.0.2 took 1h1m10s

Tag it for pytorch-dev:

podman tag localhost/rocm-dev:41 rocm-dev-f41:latest

Test it:

podman run -it --rm rocm-dev-f41:latest bash
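
Inside the container, a quick sanity check that the GPU is actually visible (using the same tools as elsewhere in these notes):

```bash
# Should list gfx1151 agents and show the device
rocminfo | grep -i gfx
rocm-smi
```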

pytorch-dev

❯ git diff dockerfiles/
diff --git a/dockerfiles/pytorch-dev/pytorch_dev_fedora.Dockerfile b/dockerfiles/pytorch-dev/pytorch_dev_fedora.Dockerfile
index 462af8c..46e58c2 100644
--- a/dockerfiles/pytorch-dev/pytorch_dev_fedora.Dockerfile
+++ b/dockerfiles/pytorch-dev/pytorch_dev_fedora.Dockerfile
@@ -1,6 +1,11 @@
+# https://github.com/scottt/rocm-TheRock/blob/gfx1151/dockerfiles/pytorch-dev/pytorch_dev_fedora.Dockerfile
+
 ARG FEDORA_VER=41
 FROM rocm-dev-f${FEDORA_VER} AS build

+ENV AMDGPU_TARGETS=gfx1151
+ENV AOTRITON_BUILD_FROM_SOURCE=1
+
 # pytorch-fetch
 RUN --mount=type=cache,id=pytorch-f${FEDORA_VER},target=/therock \
        mkdir -p /therock/pytorch
@@ -9,6 +14,7 @@ RUN --mount=type=cache,id=pytorch-f${FEDORA_VER},target=/therock \
        --mount=type=bind,target=/therock/src,rw \
        python3 /therock/src/external-builds/pytorch/ptbuild.py \
                checkout \
+                --pytorch-ref v2.7.0 \
                --repo /therock/pytorch \
                --depth 1 \
                --jobs 10 \
@@ -24,6 +30,7 @@ RUN --mount=type=cache,id=pytorch-f${FEDORA_VER},target=/therock \
        --mount=type=bind,target=/therock/src,rw \
        python3 /therock/src/external-builds/pytorch/ptbuild.py \
                checkout \
+                --pytorch-ref v2.7.0 \
                --repo /therock/pytorch  \
                --depth 1  \
                --jobs 10
❯ sh build-pytorch-dev.sh
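
Once that image builds, a smoke test along these lines should confirm the wheel imports and sees the GPU (the pytorch-dev-f41 tag is an assumption; substitute whatever the build script actually produces):

```bash
# Hypothetical tag; pass the GPU through as with the rocm-dev image
podman run -it --rm --device=/dev/kfd --device=/dev/dri pytorch-dev-f41:latest \
  python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```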

Working Notes

cluster1
checkout rock
make changes to rocm, add build scripts, have run scripts

Compare to: https://github.com/ROCm/TheRock/blob/main/dockerfiles/pytorch-dev/pytorch_dev_ubuntu_24.04.Dockerfile#L20

https://github.com/ROCm/TheRock/discussions/244

PR to make CK optional: https://github.com/pytorch/pytorch/pull/152951 (-DUSE_ROCM_CK_GEMM=ON)

Or use ROCm ≥ 6.5, where CK includes RDNA 3 defines - see ROCm issue #4499 for progress. https://github.com/ROCm/ROCm/issues/4499 https://github.com/ROCm/composable_kernel/issues/775
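
If building against that PR, the toggle would presumably be passed at configure time (hypothetical; flag name from the PR, exact wiring may differ):

```bash
# Hypothetical configure-time toggle per pytorch#152951
cmake -B build -GNinja . -DUSE_ROCM_CK_GEMM=ON   # or OFF to skip CK GEMM paths
```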

TODO: PyTorch Dependent

  • vLLM
  • SGLang
  • torchtune

Other Reviews

David Huang maintains a llama.cpp fork with AMD-specific optimizations ahead of upstreaming: https://github.com/hjc4869/llama.cpp. He has run some sweeps/testing on a 60W Ryzen AI Max+ 395 (HP ZBook Ultra G1a) built with the gfx1100 GPU_TARGET:

Jack Stone (a Chinese hardware reviewer on YouTube) reviewed the GMK EVO-X2 MiniPC, including a section running tests with LM Studio on Windows - much less technical detail, but still possibly of interest:

NPU

See also: https://github.com/ggml-org/llama.cpp/issues/1499

Benchmarks WIP

| Run                         |    pp512 (t/s) |  tg128 (t/s) | Max Mem (MiB) |
| --------------------------- | -------------: | -----------: | ------------: |
| CPU                         |  294.64 ± 0.58 | 28.94 ± 0.04 |               |
| CPU + FA                    |  294.36 ± 3.13 | 29.42 ± 0.03 |               |
| HIP                         |  348.96 ± 0.31 | 48.72 ± 0.01 |          4219 |
| HIP + FA                    |  331.96 ± 0.41 | 45.78 ± 0.02 |          4245 |
| HIP + FA + hipBLASLt        |  765.54 ± 1.84 | 45.05 ± 0.01 |               |
| HIP + WMMA                  |  322.63 ± 1.34 | 48.40 ± 0.02 |          4218 |
| HIP + WMMA + FA             |  343.91 ± 0.60 | 50.88 ± 0.01 |          4218 |
| HIP + WMMA + FA + hipBLASLt |  986.12 ± 1.44 | 50.58 ± 0.01 |               |
| Vulkan                      |  881.71 ± 1.71 | 52.22 ± 0.05 |          3923 |
| Vulkan + FA                 |  884.20 ± 6.23 | 52.73 ± 0.07 |          3923 |
  • hipBLASLt runs with ROCBLAS_USE_HIPBLASLT=1
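
For reference, a row like HIP + WMMA + FA + hipBLASLt corresponds to an invocation along these lines (model path illustrative):

```bash
ROCBLAS_USE_HIPBLASLT=1 llama.cpp-rocwmma/build/bin/llama-bench -fa 1 -m model.gguf
```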

[root@01880d41c33b llama.cpp]# ROCBLAS_USE_HIPBLASLT=1 llama.cpp-hip/build/bin/llama-bench -m /home/lhl/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf
/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |           pp512 |        122.68 ± 0.58 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |           tg128 |         12.30 ± 0.02 |

build: c753d7be (5392)
[root@01880d41c33b llama.cpp]# time llama.cpp-hip/build/bin/llama-bench -m /home/lhl/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf
/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
HW Exception by GPU node-1 (Agent handle: 0x12eed60) reason :GPU Hang
Aborted (core dumped)

real    1m20.327s
user    0m5.927s
sys     0m46.562s

Measuring System Memory Usage
```bash
# Poll `free` once per second and track used memory plus the peak delta
initial=$(free --mebi | awk '/^Mem:/ {print $3}'); max=$initial
while sleep 1; do
  cur=$(free --mebi | awk '/^Mem:/ {print $3}')
  (( cur > max )) && max=$cur
  printf "\r%s  used=%6d MiB  Δ=%6d MiB  peak=%6d MiB  Δpeak=%6d MiB " \
    "$(date +%T)" "$cur" "$((cur-initial))" "$max" "$((max-initial))"
done
```
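
The peak GTT/GART numbers quoted below can be tracked the same way from amdgpu's sysfs counters (a sketch; the card index is an assumption, adjust for your system):

```bash
# Poll amdgpu GTT/VRAM usage once per second; mem_info_* counters are in bytes
dev=/sys/class/drm/card0/device
while sleep 1; do
  gtt=$(( $(cat $dev/mem_info_gtt_used) / 1048576 ))
  vram=$(( $(cat $dev/mem_info_vram_used) / 1048576 ))
  printf '\rGTT=%6d MiB  VRAM=%6d MiB ' "$gtt" "$vram"
done
```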

CPU

lhl in 🌐 cluster1 in ~/llama.cpp took 3m5s
❯ time ./llama.cpp-cpu/build/bin/llama-bench -fa 1 -m ~/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf
| model                          |       size |     params | backend    | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | CPU        |      16 |  1 |           pp512 |         32.58 ± 0.13 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | CPU        |      16 |  1 |           tg128 |          9.03 ± 0.01 |

build: c753d7be (5392)

real    2m49.389s
user    43m50.907s
sys     0m4.578s


01:54:19  used=  2029 MiB  Δ=    55 MiB  peak= 10578 MiB  Δpeak=  8604 MiB

Vulkan - hard crashes (full system lockup) without -b 256

$ llama.cpp-vulkan/build/bin/llama-bench -fa 1 -m ~/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf -b 256
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_batch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | Vulkan,RPC |  99 |     256 |  1 |           pp512 |         23.79 ± 0.10 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | Vulkan,RPC |  99 |     256 |  1 |           tg128 |         16.09 ± 0.01 |

build: 3cc1f1f1 (5393)

01:58:50  used=91437 MiB  Δ=91423 MiB  peak=91437 MiB  Δpeak=91423 MiB
01:58:47  used=7359 MiB  Δ=7204 MiB  peak=8192 MiB  Δpeak=8037 MiB

HIP - also crashes?

...
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      ROCm0 compute buffer size =   304.75 MiB
llama_context:  ROCm_Host compute buffer size =     9.01 MiB
llama_context: graph nodes  = 6116
llama_context: graph splits = 2
attach_threadpool: call
set_n_threads: n_threads = 16, n_threads_batch = 16
HW Exception by GPU node-1 (Agent handle: 0xfe79af0) reason :GPU Hang
Aborted (core dumped)

-b 256

| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |     256 |           pp512 |         65.34 ± 0.18 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |     256 |           tg128 |         10.55 ± 0.01 |

hipBLASLt b256

[root@7a4039b9c011 llama.cpp]# ROCBLAS_USE_HIPBLASLT=1 llama.cpp-hip/build/bin/llama-bench -m /home/lhl/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf -b 256
/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |     256 |           pp512 |         87.62 ± 0.40 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |     256 |           tg128 |         10.61 ± 0.03 |

build: c753d7be (5392)

hipBLASLt b128

[root@7a4039b9c011 llama.cpp]# ROCBLAS_USE_HIPBLASLT=1 llama.cpp-hip/build/bin/llama-bench -m /home/lhl/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf -b 128
/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |     128 |           pp512 |         62.87 ± 0.35 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |     128 |           tg128 |         10.61 ± 0.01 |

build: c753d7be (5392)

hipBLASLt - best pp

[root@7a4039b9c011 llama.cpp]# ROCBLAS_USE_HIPBLASLT=1 llama.cpp-hip/build/bin/llama-bench -m /home/lhl/models/Qwen3-235B-A22B-UD-Q3_K_XL-00001-of-00003.gguf
/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |           pp512 |        120.46 ± 0.39 |
| qwen3moe 235B.A22B Q3_K - Medium |  96.59 GiB |   235.09 B | ROCm,RPC   |  99 |           tg128 |         10.63 ± 0.03 |

build: c753d7be (5392)

Qwen3-30B-A3B UD-Q4_K_XL:

| Run                         |    pp512 (t/s) |  tg128 (t/s) |
| --------------------------- | -------------: | -----------: |
| CPU                         |  252.15 ± 2.95 | 44.05 ± 0.08 |
| Vulkan                      |   70.03 ± 0.18 | 75.32 ± 0.08 |
| Vulkan b128                 |  163.78 ± 1.03 | 69.32 ± 0.05 |
| Vulkan b256                 |  118.78 ± 0.64 | 74.76 ± 0.07 |
| HIP + FA                    |  357.32 ± 0.84 |        51.26 |
| HIP + FA + hipBLASLt        |  510.86 ± 3.30 | 51.18 ± 0.01 |
| HIP + WMMA + FA + hipBLASLt |  547.84 ± 3.23 | 60.29 ± 0.03 |
  • hipBLASLt runs with ROCBLAS_USE_HIPBLASLT=1

Qwen 3 32B Q8_0

(peak GTT 35 MiB, peak GART 33386 MiB):
❯ time llama.cpp-rocwmma/build/bin/llama-bench -fa 1 -m ~/models/Qwen3-32B-Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm,RPC   |  99 |  1 |           pp512 |         77.43 ± 0.05 |
| qwen3 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm,RPC   |  99 |  1 |           tg128 |          6.43 ± 0.00 |

build: 09232370 (5348)

real    2m25.304s
user    2m18.208s
sys     0m3.982s

For pp8192 (peak GTT 33 MiB, peak GART 35306 MiB):

❯ time llama.cpp-rocwmma/build/bin/llama-bench -fa 1 -m ~/models/Qwen3-32B-Q8_0.gguf -p 8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm,RPC   |  99 |  1 |          pp8192 |         75.68 ± 0.23 |
| qwen3 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm,RPC   |  99 |  1 |           tg128 |          6.42 ± 0.00 |

build: 09232370 (5348)

real    12m33.586s
user    11m48.942s
sys     0m4.186s

I won’t wait around for the full 128K-context run (at ~75 tok/s, a single pass will take about 30 minutes), but from running it I can report that memory usage peaks at GTT 35 MiB and GART 66156 MiB, so it easily fits; with such poor pp performance, though, it probably isn’t very pleasant or generally useful.

load_tensors:        ROCm0 model buffer size = 32410.82 MiB
load_tensors:   CPU_Mapped model buffer size =   788.24 MiB
llama_kv_cache_unified:      ROCm0 KV buffer size = 32768.00 MiB
llama_kv_cache_unified: KV self size  = 32768.00 MiB, K (f16): 16384.00 MiB, V (f16): 16384.00 MiB
| qwen3 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm,RPC   |  99 |   16384 |  1 |        pp131072 |         75.80 ± 0.00 |
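
That KV size is easy to sanity-check: with f16 K/V the cache is 2 × n_layers × n_ctx × n_kv_heads × head_dim × 2 bytes. Assuming Qwen3-32B's config is 64 layers, 8 KV heads, head_dim 128 (my assumption of its config), the numbers match exactly:

```python
# KV cache size check for f16 K/V at 131072 context
# (assumed Qwen3-32B config: 64 layers, 8 KV heads, head_dim 128)
n_layers, n_kv_heads, head_dim, n_ctx = 64, 8, 128, 131072
bytes_per_elem = 2  # f16
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_bytes / 2**20)  # 32768.0 MiB, matching llama_kv_cache_unified above
```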

70B

❯ time llama.cpp-vulkan/build/bin/llama-bench -fa 1 -m ~/models/shisa-v2-llama3.3-70b.i1-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | Vulkan,RPC |  99 |  1 |           pp512 |         77.28 ± 0.69 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | Vulkan,RPC |  99 |  1 |           tg128 |          5.02 ± 0.00 |

build: 9a390c48 (5349)

real    3m0.783s
user    0m38.376s
sys     0m8.628s

HIP

❯ time llama.cpp-rocwmma/build/bin/llama-bench -fa 1 -m ~/models/shisa-v2-llama3.3-70b.i1-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm,RPC   |  99 |  1 |           pp512 |         34.36 ± 0.02 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm,RPC   |  99 |  1 |           tg128 |          4.70 ± 0.00 |

build: 09232370 (5348)

real    3m53.133s
user    3m34.265s
sys     0m4.752s