Inference
2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp
This article describes a dual-GPU setup that uses two Gigabyte Radeon AI PRO R9700 cards to run the 27-billion-parameter Qwen 3.6 model (Q8, MTP) under llama.cpp. Reported decode rates are 46-67 tokens per second across context sizes up to 102k tokens, and prefill throughput is approximately 1,200-1,500 tokens per second for prompts under 10k tokens. For practitioners, the setup shows that high-VRAM GPUs can handle large-context inference, and the article also covers how to tune token generation and memory management.
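To put the throughput figures above in terms of wall-clock time, here is a minimal back-of-the-envelope sketch. It assumes prefill and decode rates stay constant over a request, which real runs will not match exactly. The function name `estimate_latency_s` and the 8,000-token prompt with a 500-token reply are hypothetical, chosen only for illustration; the rates are the reported ranges.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds, total_seconds).

    Assumes constant throughput for each phase, which is a simplification.
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


# Hypothetical request: 8,000-token prompt, 500-token reply.
# Rates are the upper and lower ends of the reported ranges.
best = estimate_latency_s(8000, 500, prefill_tps=1500, decode_tps=67)
worst = estimate_latency_s(8000, 500, prefill_tps=1200, decode_tps=46)

for label, (prefill_s, decode_s, total_s) in (("best", best), ("worst", worst)):
    print(f"{label}: prefill {prefill_s:.1f}s, decode {decode_s:.1f}s, total {total_s:.1f}s")
```

Under these assumptions the example request takes roughly 13 to 18 seconds end to end, with decode time exceeding prefill time in both cases.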
qwen, multi_gpu