Inference
Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization
The paper presents a quantile-robust scaling policy for SmoothRot-style transforms, aimed at improving post-training quantization (PTQ) of large language models (LLMs). Tested on LLaMA-3.2-1B under W4A4 quantization (4-bit weights, 4-bit activations), the method reduces selected-layer error by 18.5% relative to the SmoothRot baseline and also improves full-layer mean error. For practitioners, it offers a way to reduce quantization error in LLMs without modifying the model architecture.
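The summary does not spell out the scaling policy, so the following is only a minimal NumPy sketch of what a quantile-based scale combined with a rotation could look like. It assumes a SmoothQuant-style scale formula with the per-channel activation maximum replaced by a high quantile, and a normalized Hadamard matrix as the rotation. The function names, `alpha`, and `q` are hypothetical illustration choices, not the paper's, and nothing here is trained, whereas the title indicates the paper learns its scales.

```python
import numpy as np

def quantile_smooth_scales(acts, weight, alpha=0.5, q=0.999, eps=1e-8):
    """Per-input-channel scales. ASSUMPTION: SmoothQuant-style formula with the
    activation max replaced by the q-th quantile of |activation|."""
    act_stat = np.quantile(np.abs(acts), q, axis=0)   # shape (in_features,)
    w_stat = np.abs(weight).max(axis=0)               # weight is (out, in)
    return np.maximum(act_stat, eps) ** alpha / np.maximum(w_stat, eps) ** (1 - alpha)

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (orthogonal); n must be a power of two."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x, bits=4):
    """Symmetric round-to-nearest fake quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def layer_error(acts, weight, s, R, bits=4):
    """Relative output error of a linear layer after smoothing by s, rotating
    by R, and fake-quantizing both activations and weights."""
    ref = acts @ weight.T
    x_t = (acts / s) @ R       # smoothed, rotated activations
    w_t = (weight * s) @ R     # inverse-scaled, rotated weights
    out = fake_quant(x_t, bits) @ fake_quant(w_t, bits).T
    return np.linalg.norm(out - ref) / np.linalg.norm(ref)

# Synthetic data with one outlier channel, purely for illustration.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 16))
acts[:, 3] *= 50.0
weight = rng.normal(size=(8, 16))

s = quantile_smooth_scales(acts, weight)
R = hadamard(16)
err = layer_error(acts, weight, s, R)
```

Because the scales cancel between activations and weights and the rotation is orthogonal, the transformed layer is mathematically identical to the original before quantization; only the quantization error changes. The synthetic data above says nothing about the reported 18.5% figure.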
quantization, LLM, activation