ai-digest.dev
last updated 3 h ago
InferenceReddit r/LocalLLaMA 14 d ago

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

The GLM-5.2 model, featuring 744 billion parameters and utilizing a 2-bit quantization, achieves a decoding speed of approximately 7.3 tokens per second when run on a setup with four RTX 3090 GPUs and 192GB of RAM. Key findings indicate that decoding performance is primarily limited by CPU compute when offloading experts, rather than memory bandwidth, with an observed 22% speed increase when doubling CPU threads from 6 to 12. This information is crucial for practitioners as it highlights the importance of CPU resources and expert distribution for optimizing performance in large language model deployments.

glm-5.2performance3090relevance 0.00 · engagement 0.00
Read at source ↗← all news
GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster — AI News Digest