Inference
SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
SPEAR is a system for improving low-bit serving of large language models (LLMs) through post-quantization error-adaptive recovery. Because quantization error depends on the input, SPEAR adds lightweight Error Compensators (ECs), each modulated by a per-token gate, at the layers most sensitive to that error. It recovers 56-75% of the perplexity gap between 4-bit quantized models and FP16 while keeping memory overhead low and latency close to that of existing 4-bit serving solutions.
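The summary does not specify the form of the compensators or gates, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: the compensator is taken to be a low-rank correction added to a quantized linear layer, the gate a scalar sigmoid per token, and the compensator is fitted from an SVD of the weight error (a data-free stand-in, not SPEAR's own procedure). All function and parameter names (`quantize_weights`, `fit_low_rank`, `gated_compensated_linear`, `gap_recovered`) are hypothetical. `gap_recovered` shows one natural reading of "fraction of the perplexity gap recovered".

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def quantize_weights(w, bits=4):
    """Symmetric round-to-nearest quantization with one scale per output row.

    Illustrative only; assumes no row of `w` is entirely zero.
    Returns the dequantized weights so the error can be inspected directly.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


def fit_low_rank(weight_error, rank):
    """Assumed fitting step: truncated SVD of the transposed weight error.

    weight_error has shape (d_out, d_in). Returns a (d_in, rank) and
    b (rank, d_out) such that a @ b approximates weight_error.T.
    """
    u, s, vt = np.linalg.svd(weight_error.T, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def gated_compensated_linear(x, w_q, a, b, gate_w, gate_b):
    """Quantized linear layer plus a gated low-rank correction.

    x: (tokens, d_in), w_q: (d_out, d_in), gate_w: (d_in,), gate_b: scalar.
    Each token gets its own gate value in (0, 1) scaling the correction.
    """
    base = x @ w_q.T                               # low-bit path
    gate = sigmoid(x @ gate_w + gate_b)[:, None]   # (tokens, 1)
    correction = (x @ a) @ b                       # low-rank compensator
    return base + gate * correction


def gap_recovered(ppl_fp16, ppl_quantized, ppl_compensated):
    """Fraction of the quantized-to-FP16 perplexity gap that is closed."""
    return (ppl_quantized - ppl_compensated) / (ppl_quantized - ppl_fp16)
```

In this sketch a gate near 0 leaves the plain 4-bit output untouched, while a gate near 1 applies the full correction, so the correction strength can vary per token. In a real system the gate parameters would be learned rather than set by hand.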
llm quantization serving