Inference
GLM 5.2 Speedup PR for Mac Studio
GLM 5.2 has been optimized for Mac Studio: prefill now exceeds 100 tokens per second (t/s), and the 4-bit quantized model can run with context sizes over 100,000 tokens. The changes are described in a pull request by the creator of oMLX.
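For a sense of scale, here is a back-of-the-envelope sketch of what those two figures imply together. It assumes a constant prefill rate, which is a simplification (throughput usually changes as the context grows), and the function name `prefill_seconds` is hypothetical, not something from the PR.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate time to process a prompt, assuming a constant prefill rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# Round illustrative values taken from the figures quoted above:
# a 100,000-token prompt at 100 t/s prefill.
seconds = prefill_seconds(100_000, 100.0)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

At exactly 100 t/s, a 100,000-token prompt would take about 1,000 seconds to prefill; since the reported rate exceeds 100 t/s, this is best read as a rough upper estimate under the constant-rate assumption.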