Inference
2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)
This article covers the setup and performance tuning of two AMD Radeon AI PRO R9700 GPUs (RDNA4, gfx1201) running vLLM 0.22.1. The main fix targets the long-context decode cliff: throughput fell from ~100 tok/s at 8K context to just 14 tok/s at 79K context, which the authors attribute to unoptimized ROCm attention paths. Switching to AITER Unified Attention mitigated the drop and improved decode efficiency at large context sizes, which matters to anyone tuning LLM inference on AMD hardware.
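To put the quoted figures in perspective, here is a minimal arithmetic sketch that converts the two throughput numbers into per-token latency and a slowdown factor. It uses only the approximate values from the summary. Reading "8K" and "79K" as 8,000 and 79,000 tokens is an assumption, and the constant and function names are illustrative, not taken from the article.

```python
# Approximate figures quoted in the summary (before the attention fix).
# Assumption: "8K" and "79K" are read as 8,000 and 79,000 tokens.
SHORT_CTX_TOKENS = 8_000
LONG_CTX_TOKENS = 79_000
SHORT_CTX_TOK_S = 100.0  # ~100 tok/s decode at 8K context
LONG_CTX_TOK_S = 14.0    # 14 tok/s decode at 79K context


def ms_per_token(tok_per_s: float) -> float:
    """Convert decode throughput (tokens/s) to latency per token (ms)."""
    return 1000.0 / tok_per_s


# How much slower decode became, and how much the context grew.
slowdown = SHORT_CTX_TOK_S / LONG_CTX_TOK_S
context_growth = LONG_CTX_TOKENS / SHORT_CTX_TOKENS

print(f"latency at 8K:  {ms_per_token(SHORT_CTX_TOK_S):.1f} ms/token")
print(f"latency at 79K: {ms_per_token(LONG_CTX_TOK_S):.1f} ms/token")
print(f"decode slowdown: {slowdown:.2f}x for {context_growth:.2f}x more context")
```

In other words, each generated token took roughly seven times longer at the larger context, which is the gap the attention backend change was meant to close.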
vllm long-context decode