April 17, 2026
llama.cpp-hip-turboquant
Arch Linux PKGBUILD and minimal patch to build the
turbo-tan/llama.cpp-tq3 fork
with the HIP/ROCm backend, enabling TurboQuant-quantized models
(TQ3_1S, TQ3_4S) and TQ3_0 KV cache on AMD GPUs.
The upstream fork is CUDA-only. This patch adds only the HIP shims required
for the GPU kernels; the TurboQuant logic itself is unchanged.
Validated on
- GPU: AMD Radeon RX 7900 XTX (gfx1100)
- ROCm: 7.2.0 / HIP 7.2.26043
- OS: Manjaro Linux (kernel 6.19)
- Model: YTan2000/Qwen3.6-35B-A3B-TQ3_4S + BF16 mmproj
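To compare a machine against the validated GPU target, the architecture name can be read from rocminfo. The snippet below is only a sketch: list_gfx_targets is a hypothetical helper name, and it assumes rocminfo (shipped with ROCm) is on PATH.

```shell
# Hypothetical helper: extract the unique gfx targets from
# rocminfo-style text read on stdin.
list_gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Query the local machine only if rocminfo is available.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | list_gfx_targets
else
  echo "rocminfo not found; is ROCm installed?" >&2
fi
```

On a setup like the validated one, the output should include gfx1100. Other targets have not been tested with this package.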
Measured performance (llama-server, 65k context)
| Config | Prompt | Generation |
|---|---|---|
| -ctk q4_0 -ctv tq3_0 -fa on | 238 t/s | 78.2 t/s |
| Default f16 KV cache | 159 t/s | 89.6 t/s |
VRAM usage: ~15 / 24 GB (full 35B model on GPU).
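For reference, a launch matching the first table row might look like the sketch below. The KV cache flags come from the table. Everything else is an assumption: the model and mmproj paths are passed in by the caller, -c 65536 stands in for the 65k context, and -ngl 99 is one common way to offload all layers. start_server and LLAMA_SERVER are hypothetical names used only for this illustration.

```shell
# Binary to run; override LLAMA_SERVER to point at a different build.
LLAMA_SERVER="${LLAMA_SERVER:-llama-server}"

# Hypothetical wrapper: $1 = model GGUF path, $2 = mmproj GGUF path.
start_server() {
  "$LLAMA_SERVER" -m "$1" --mmproj "$2" \
    -c 65536 -ngl 99 \
    -ctk q4_0 -ctv tq3_0 -fa on
}

# Usage (paths are placeholders):
#   start_server /path/to/model-TQ3_4S.gguf /path/to/mmproj-BF16.gguf
```

Dropping the -ctk, -ctv and -fa arguments gives the default f16 KV cache configuration from the second row.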
What the patch changes
112 lines, 4 files in ggml/src/ggml-cuda/:
- vendors/hip.h: variadic __shfl_*_sync macros (3/4-arg), __ballot_sync, cudaEventCreate / cudaEventElapsedTime shims.
- tq3-native.cuh: conditional HIP vs CUDA include for fp16.h and GGML_COMMON_DECL_*.
- tq3-native.cu: include order (common.cuh before tq3-native.cuh).
- vecdotq.cuh: replaces 4× __dp4a with ggml_cuda_dp4a (portable wrapper that maps to v_dot4c on RDNA3).
Install
git clone <this-repo> llama.cpp-hip-turboquant
cd llama.cpp-hip-turboquant
makepkg -si
The package conflicts with llama.cpp and llama.cpp-hip (both ship
/usr/bin/llama-server). pacman will offer to remove the conflicting package
during installation.
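Before running makepkg, it can help to see which of the conflicting packages is currently installed. The following is a small sketch, assuming an Arch-based system with pacman available; check_conflicts is a hypothetical helper name.

```shell
# Hypothetical helper: print every package name given as an argument
# that pacman reports as installed.
check_conflicts() {
  for pkg in "$@"; do
    if pacman -Q "$pkg" >/dev/null 2>&1; then
      echo "installed: $pkg"
    fi
  done
}

# Only query when pacman exists on this system.
if command -v pacman >/dev/null 2>&1; then
  check_conflicts llama.cpp llama.cpp-hip
fi
```

If either name is printed, expect pacman to ask about removing that package when this one is installed.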
Limitations
- GGML_RPC=OFF — the RPC backend is disabled because of a static_assert on
GGML_OP_COUNT that changed upstream. Patch separately if you need it.
- llama-cli: add --no-warmup to avoid a rare kernel hang during
warmup. llama-server is not affected.
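The llama-cli workaround above can be sketched as a small wrapper. Only --no-warmup comes from this README; the model path, the prompt, and -ngl 99 are placeholders, and run_cli and LLAMA_CLI are hypothetical names for this illustration.

```shell
# Binary to run; override LLAMA_CLI to point at a different build.
LLAMA_CLI="${LLAMA_CLI:-llama-cli}"

# Hypothetical wrapper: $1 = model GGUF path, $2 = prompt text.
# --no-warmup skips the warmup run where the hang was observed.
run_cli() {
  "$LLAMA_CLI" -m "$1" -ngl 99 --no-warmup -p "$2"
}

# Usage (path is a placeholder):
#   run_cli /path/to/model-TQ3_4S.gguf "Hello"
```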
Supported types in this build
| Type | ID | Role |
|---|---|---|
| TQ3_1S | 44 | Weights |
| TQ3_4S | 46 | Weights (4 bpw) |
| TQ3_0 | 200 | KV cache |
The TURBO2_0 / TURBO3_0 / TURBO4_0 / TQ4_1S variants from the domvox fork
(https://github.com/domvox/llama.cpp-turboquant-hip) are not supported,
because the two KV cache designs have diverged.
Credits
- Upstream TurboQuant (CUDA): https://github.com/turbo-tan/llama.cpp-tq3
- Original HIP port (different KV cache design):
https://github.com/domvox/llama.cpp-turboquant-hip
- Base PKGBUILD: llama.cpp-hip by Orion-zhen / txtsd on AUR
- Paper: https://arxiv.org/abs/2504.19874
License
MIT (same as upstream llama.cpp).