Inference · Reddit r/LocalLLaMA · 13 d ago

Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It

The post covers tuning the Qwen3.6-35B-A3B model on a single RTX 3090, aiming for high quality and speed at a context length of at least 128k. Benchmarks on two llama.cpp forks (ik_llama and spiritbuun) show ik_llama with the I-Compact APEX model reaching the highest decoding speed (~146 TPS), while spiritbuun's I-Quality model stays close (~137 TPS) and scores better on quality metrics. The results are useful for practitioners who want to maximize performance and efficiency when deploying large language models on limited hardware.
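To put the two quoted decode speeds in perspective, here is a minimal Python sketch of the arithmetic. The two TPS values are the approximate figures from the summary; the function names and the 1000-token workload are illustrative assumptions, and the estimate ignores prompt processing (prefill) time.

```python
# Approximate decode speeds quoted in the summary, in tokens per second.
IK_LLAMA_I_COMPACT_TPS = 146.0
SPIRITBUUN_I_QUALITY_TPS = 137.0


def relative_speedup(fast_tps: float, slow_tps: float) -> float:
    """Fractional speed advantage of the faster configuration."""
    return fast_tps / slow_tps - 1.0


def decode_seconds(num_tokens: int, tps: float) -> float:
    """Time to generate num_tokens at a steady decode rate (prefill ignored)."""
    return num_tokens / tps


if __name__ == "__main__":
    speedup = relative_speedup(IK_LLAMA_I_COMPACT_TPS, SPIRITBUUN_I_QUALITY_TPS)
    print(f"I-Compact decode advantage: {speedup:.1%}")

    # Hypothetical workload: a 1000-token response.
    for name, tps in [
        ("ik_llama I-Compact", IK_LLAMA_I_COMPACT_TPS),
        ("spiritbuun I-Quality", SPIRITBUUN_I_QUALITY_TPS),
    ]:
        print(f"{name}: {decode_seconds(1000, tps):.2f} s per 1000 tokens")
```

Under these assumptions the gap works out to roughly 6 to 7 percent, or under half a second per 1000 generated tokens, which is the margin to weigh against the reported quality difference.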

Qwen · RTX_3090 · optimization
