Inference
Optimal Post-Training Quantization Scales and Where to Find Them
The paper introduces PiSO (Piecewise Scale Optimization), a post-training quantization (PTQ) algorithm that computes optimal channel-wise weight scales from calibration data instead of relying on simple heuristics. It partitions the scale search space so that the search stays efficient, and it extends to group-wise quantization with error correction strategies. Experiments on Llama and Qwen models show significant improvements in perplexity and zero-shot accuracy, particularly at lower bit-widths, which makes the method relevant to practitioners who need to preserve model quality under quantization constraints.
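To make the idea of calibration-driven channel-wise scales concrete, here is a minimal sketch. It is not PiSO: it swaps in a naive grid search over candidate scales and compares the result with the common absmax heuristic. The function names, the symmetric round-to-nearest quantizer, the candidate range, and the random calibration data are all assumptions made for illustration.

```python
import numpy as np

def quantize(w, scale, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def output_error(X, w, w_hat):
    """Squared error of one output channel on the calibration inputs X."""
    d = X @ (w - w_hat)
    return float(d @ d)

def absmax_scale(w, bits):
    """Heuristic scale: map the largest weight magnitude to the largest level."""
    qmax = 2 ** (bits - 1) - 1
    return max(float(np.abs(w).max()), 1e-12) / qmax

def grid_search_scale(X, w, bits, num_candidates=64):
    """Naive stand-in for an optimal search: try shrunken absmax scales.

    The last candidate equals the absmax scale itself, so the result is
    never worse than the heuristic on the calibration data.
    """
    base = absmax_scale(w, bits)
    candidates = base * np.linspace(0.3, 1.0, num_candidates)
    errors = [output_error(X, w, quantize(w, s, bits)) for s in candidates]
    return float(candidates[int(np.argmin(errors))])

# Hypothetical calibration data and weights (random, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))   # calibration activations: samples x in_features
W = rng.normal(size=(8, 32))     # weights: out_channels x in_features
bits = 3

heuristic_err, searched_err = [], []
for w in W:  # one scale per output channel
    s_heur = absmax_scale(w, bits)
    s_best = grid_search_scale(X, w, bits)
    heuristic_err.append(output_error(X, w, quantize(w, s_heur, bits)))
    searched_err.append(output_error(X, w, quantize(w, s_best, bits)))

print("absmax total error:  ", sum(heuristic_err))
print("searched total error:", sum(searched_err))
```

A grid search only evaluates a fixed set of candidates, so it can miss the true minimizer of the calibration error; the summary describes PiSO as partitioning the search space to find optimal scales instead.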
quantization, large language models, compression