Inference
This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b
The Qwen3.6-27B model has achieved a significant increase in token generation speed, doubling from previous benchmarks to 38.6 tokens per second on an RTX 3090, while also reducing VRAM usage from 21GB to 17.5GB. This optimization includes a native 256K context and maintains high accuracy across various tasks with a needle recall of 88-100% at 6% residency, ensuring that practitioners can leverage improved performance without compromising on resource requirements. The changes are documented in the GitHub repository linked in the article, emphasizing the potential for enhanced efficiency in AI applications.
token_speedkv_cacheqwen