Inference
Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It
This article covers tuning the Qwen3.6-35B-A3B model on a single RTX 3090 GPU, aiming for the best quality and speed at a context length of at least 128k. Benchmarks on two llama.cpp forks (ik_llama and spiritbuun) show that ik_llama with the I-Compact APEX model reaches the highest decoding speed (~146 tokens per second, TPS), while spiritbuun with the I-Quality model stays close (~137 TPS) and scores better on quality metrics. The results are useful for practitioners who want to maximize performance and efficiency when deploying large language models on limited hardware.
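To put the two quoted decode speeds in perspective, here is a minimal sketch of the arithmetic. It uses only the approximate figures from the summary (~146 and ~137 TPS); the helper names and the 1,000-token response length are assumptions chosen for illustration, not values from the benchmarks.

```python
# Approximate decode speeds quoted in the summary (tokens per second).
DECODE_TPS = {
    "ik_llama + I-Compact": 146.0,
    "spiritbuun + I-Quality": 137.0,
}


def relative_slowdown(fast_tps: float, slow_tps: float) -> float:
    """Fraction of decode throughput given up by picking the slower setup."""
    return (fast_tps - slow_tps) / fast_tps


def decode_seconds(num_tokens: int, tps: float) -> float:
    """Time to generate num_tokens at a constant decode rate (ignores prefill)."""
    return num_tokens / tps


fast = DECODE_TPS["ik_llama + I-Compact"]
slow = DECODE_TPS["spiritbuun + I-Quality"]

# Hypothetical response length, used only to make the gap concrete.
response_tokens = 1000

print(f"Throughput given up: {relative_slowdown(fast, slow):.1%}")
for name, tps in DECODE_TPS.items():
    print(f"{name}: {decode_seconds(response_tokens, tps):.2f} s per {response_tokens} tokens")
```

At these rates the higher-quality configuration costs roughly 6% of decode throughput, which works out to under half a second on a 1,000-token response.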
Qwen, RTX_3090, optimization