Inference
GLM 5.2, what speeds are we getting locally?
Community members are collecting local performance reports for the GLM 5.2 model, asking each poster to state the inference engine, system specifications, quantization method, context size, and tokens per second. One user running llama.cpp on six RTX 3090 GPUs with an i7-13700K processor reported 7.8 tokens/sec for generation at a 90K context size with Q8_0 KV cache quantization. Reports like this give practitioners reference points for comparing configurations when tuning local deployments of large language models.
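For anyone contributing numbers, the sketch below shows one way to record the requested fields and derive tokens per second from a token count and elapsed time. The class and field names are hypothetical, and the token count and duration are illustrative values chosen only to reproduce the reported 7.8 tokens/sec; they are not from the original report. Reading "90K" as 90,000 tokens is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class LocalBenchmarkReport:
    """One community-style benchmark report (hypothetical field names)."""
    engine: str
    hardware: str
    kv_cache_quant: str
    context_tokens: int
    generated_tokens: int
    generation_seconds: float

    def tokens_per_second(self) -> float:
        # Generation speed = tokens produced / wall-clock time spent generating.
        if self.generation_seconds <= 0:
            raise ValueError("generation_seconds must be positive")
        return self.generated_tokens / self.generation_seconds

    def summary(self) -> str:
        # Single-line format that is easy to paste into a thread.
        return (
            f"{self.engine} | {self.hardware} | KV {self.kv_cache_quant} | "
            f"ctx {self.context_tokens} | {self.tokens_per_second():.1f} tok/s"
        )


# Illustrative values: 780 tokens in 100 s is assumed, not measured.
report = LocalBenchmarkReport(
    engine="llama.cpp",
    hardware="6x RTX 3090, i7-13700K",
    kv_cache_quant="Q8_0",
    context_tokens=90_000,  # assumes "90K" means 90,000 tokens
    generated_tokens=780,
    generation_seconds=100.0,
)
print(report.summary())
```

Keeping prompt-processing time out of `generation_seconds` matters here, since generation speed and prompt-processing speed are usually reported separately.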
glm, inference, performance