Training
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
The article introduces TWLA, a post-training quantization (PTQ) framework for large language models (LLMs) that compresses weights to ternary values (about 1.58 bits per weight) and quantizes activations to 4 bits while preserving accuracy. TWLA combines three components: E2M-ATQ, which minimizes layer-output error during weight ternarization; KOTMS, which reshapes weights into ternary-friendly distributions; and ILA-AMP, which optimizes activation quantization across layers. Together, these reduce the memory and compute cost of deploying LLMs.
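To make the bit widths concrete, here is a minimal sketch of the two generic operations the summary refers to: mapping weights to {-1, 0, +1} and mapping activations to 4-bit integers. It is not TWLA's method. The weight rule is a plain absmean rounding baseline, the activation rule is a simple symmetric per-tensor scheme, and the function names `ternarize` and `quantize_int4` are hypothetical. E2M-ATQ, KOTMS, and ILA-AMP are not reproduced here.

```python
import math


def ternarize(weights):
    """Baseline ternarization (an assumption, not E2M-ATQ).

    Scales by the mean absolute weight, rounds, and clips to {-1, 0, +1}.
    Returns the ternary codes and the scale used to dequantize them.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_int4(activations):
    """Symmetric per-tensor 4-bit quantization (an assumption, not ILA-AMP).

    Maps the largest magnitude to 7 and clips codes to the signed
    4-bit range [-8, 7].
    """
    max_abs = max(abs(a) for a in activations)
    if max_abs == 0.0:
        return [0] * len(activations), 0.0
    scale = max_abs / 7
    codes = [max(-8, min(7, round(a / scale))) for a in activations]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from integer codes."""
    return [c * scale for c in codes]


# A ternary symbol carries log2(3) bits of information, which is
# where the "1.58-bit" figure comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

if __name__ == "__main__":
    w_codes, w_scale = ternarize([0.9, -0.05, -1.2, 0.3])
    a_codes, a_scale = quantize_int4([7.0, -3.0, 1.2, 0.0])
    print(w_codes, w_scale)
    print(a_codes, a_scale)
    print(round(BITS_PER_TERNARY_WEIGHT, 3))
```

A naive baseline like this rounds each weight independently and ignores the effect on the layer output. That is the kind of error the summary says E2M-ATQ is designed to minimize.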
quantization, llm