Training
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
The article introduces RaBiT, a quantization framework that improves the efficiency of large language models (LLMs) through residual-aware binarization training. It addresses inter-path adaptation during quantization-aware training (QAT) by enforcing a residual hierarchy in which the binary paths are derived from a shared full-precision weight. The result is improved accuracy and a $4.49\times$ inference speed-up over full-precision models on an RTX 4090. For practitioners, this pushes the accuracy-efficiency trade-off of low-bit quantization further, rivaling hardware-intensive approaches such as Vector Quantization.
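To make the idea of deriving several binary paths from one shared full-precision weight more concrete, here is a minimal sketch of generic greedy residual binarization on a small weight vector. It is not RaBiT's training procedure, which the summary does not spell out. The function names `residual_binarize` and `reconstruct`, the mean-absolute-value scale, and the example weights are all assumptions made for illustration.

```python
def sign(x):
    # Map to {-1, +1}; zero goes to +1 so every entry stays binary.
    return 1.0 if x >= 0 else -1.0


def residual_binarize(weights, num_paths=2):
    """Greedily derive `num_paths` binary paths from one full-precision vector.

    Illustrative assumption: each path binarizes the residual left by the
    previous paths, scaled by the mean absolute value of that residual.
    """
    residual = list(weights)
    paths = []
    for _ in range(num_paths):
        scale = sum(abs(r) for r in residual) / len(residual)
        binary = [sign(r) for r in residual]
        paths.append((scale, binary))
        # The next path only sees what the earlier paths failed to capture.
        residual = [r - scale * b for r, b in zip(residual, binary)]
    return paths


def reconstruct(paths):
    # Sum the scaled binary paths back into an approximate weight vector.
    n = len(paths[0][1])
    return [sum(scale * binary[i] for scale, binary in paths) for i in range(n)]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


weights = [0.9, -0.4, 0.1, -0.7]  # hypothetical full-precision weights
paths = residual_binarize(weights, num_paths=2)
err_one = squared_error(weights, reconstruct(paths[:1]))
err_two = squared_error(weights, reconstruct(paths))
print(f"1 path error: {err_one:.4f}, 2 path error: {err_two:.4f}")
```

In this sketch the second path is computed from the residual of the first, so each added path reduces the reconstruction error instead of duplicating what an earlier path already encodes.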
quantization LLM efficiency