Inference
GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)
GLM-5.2, in the unsloth/GLM-5.2-GGUF release at UD-IQ1_M quantization, was benchmarked on llama.cpp on a system with an RTX 5090 and an RTX 3090 Ti. Prefill ran at approximately 579 tokens per second at 8k context and fell to approximately 324 tokens per second at 57k context. Decode held steady at about 10.6 tokens per second over a run of 580+ tokens. The setup used a 128k context with a q8_0 KV cache.
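To put the reported rates in wall-clock terms, the sketch below converts them into approximate wait times. It is only back-of-the-envelope arithmetic: it assumes "8k" and "57k" mean roughly 8,000 and 57,000 tokens, and that each reported rate is an average over the whole prompt or generation. The helper name `seconds_for` is made up for illustration and is not part of llama.cpp.

```python
# Throughput figures as reported in the summary (tokens per second).
PREFILL_TPS_8K = 579.0
PREFILL_TPS_57K = 324.0
DECODE_TPS = 10.6


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time in seconds to process `tokens` at a constant average rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Assumption: "8k" and "57k" are treated as round 8,000 and 57,000 tokens;
# the original post may have used slightly different prompt lengths.
scenarios = [
    ("prefill, 8k-token prompt", 8_000, PREFILL_TPS_8K),
    ("prefill, 57k-token prompt", 57_000, PREFILL_TPS_57K),
    ("decode, 580 generated tokens", 580, DECODE_TPS),
]

for label, tokens, tps in scenarios:
    print(f"{label}: {seconds_for(tokens, tps):.1f} s")
```

Under those assumptions, an 8k prompt takes roughly 14 seconds to ingest, a 57k prompt close to 3 minutes, and a 580-token reply a little under a minute to generate.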
GLM-5.2 speed_test