Inference
Multi-Bitwidth Quantization for LLMs Using Additive Codebooks
The article introduces "Drop-by-Drop," a multi-bitwidth post-training quantization framework for large language models (LLMs) that lets precision be adjusted at inference time without retraining. Using information-theoretic principles and Matryoshka-style supervision, the method lets a single quantized model produce accurate reconstructions at several bitwidths, reducing storage and memory requirements while keeping perplexity and accuracy competitive across architectures such as Qwen, LLaMA, Gemma, and Mistral. For practitioners, this makes it easier to deploy LLMs on heterogeneous hardware with varying resource constraints.
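To make the idea of additive codebooks with nested bitwidths concrete, here is a minimal NumPy sketch. It is not the Drop-by-Drop algorithm: it substitutes plain greedy residual k-means for the article's information-theoretic and Matryoshka-style training, and every name in it (`kmeans`, `fit_additive_codebooks`, `reconstruct`, the group size of 4, the codebook size of 16) is a hypothetical choice made for illustration. It only shows the structural property the summary describes: one set of stored codes, decoded with more or fewer codebooks depending on the precision wanted.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, nearest-centroid index per row)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)


def fit_additive_codebooks(weights, group=4, num_codebooks=3, codebook_size=16):
    """Greedy residual fit (an assumption, not the article's training procedure).

    Each codebook quantizes what the previous codebooks left unexplained,
    so any prefix of the codebooks is itself a usable approximation.
    """
    residual = weights.reshape(-1, group).copy()
    codebooks, codes = [], []
    for m in range(num_codebooks):
        centroids, idx = kmeans(residual, codebook_size, seed=m)
        codebooks.append(centroids)
        codes.append(idx)
        residual = residual - centroids[idx]
    return codebooks, codes


def reconstruct(codebooks, codes, shape, use):
    """Decode with only the first `use` codebooks (use >= 1)."""
    vecs = sum(codebooks[m][codes[m]] for m in range(use))
    return vecs.reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))  # stand-in for a weight matrix
    codebooks, codes = fit_additive_codebooks(w)
    for use in (1, 2, 3):
        mse = ((w - reconstruct(codebooks, codes, w.shape, use)) ** 2).mean()
        print(f"codebooks used: {use}  mse: {mse:.4f}")
```

With these assumed settings, each codebook adds a 4-bit index per group of 4 weights, so decoding with 1, 2, or 3 codebooks corresponds to roughly 1, 2, or 3 bits per weight (ignoring the storage for the codebooks themselves). Dropping trailing codebooks lowers the bitwidth without refitting anything, which is the behaviour the summary attributes to the method.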
quantization, llm, efficiency