Inference
Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models
The article presents a grouped quantization-aware training (QAT) approach for State Space Models (SSMs), applied to Mamba-2. It compresses the 1.3B-parameter model by 3.61x, to 744 MB, while keeping performance competitive: 48.1% zero-shot accuracy after training on only 102M tokens. The method combines grouped QAT with knowledge distillation from a frozen FP16 teacher, which significantly reduces the training data and time required. The article also introduces zero-ratio collapse, a failure mode it describes as unique to learnable quantization scales in SSMs. For practitioners, the result matters because it allows large models to be deployed efficiently on edge devices without extensive training from scratch.
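To make the idea of grouped ternary weights concrete, here is a minimal sketch in plain Python. It is not the article's implementation: the absolute-mean scale per group, the group size of 4, and the helper names `ternary_quantize`, `quantize_grouped`, and `zero_ratio` are all assumptions chosen for illustration. "W1.58" refers to weights restricted to three values, since log2(3) is roughly 1.58 bits. The last lines show one plausible reading of zero-ratio collapse, where an oversized scale rounds every weight in a group to zero.

```python
def ternary_quantize(group, scale):
    """Map each weight to -1, 0, or +1 by rounding w / scale and clipping."""
    return [max(-1, min(1, round(w / scale))) for w in group]


def quantize_grouped(weights, group_size=4, eps=1e-8):
    """Quantize a flat list of floats with one scale per group.

    Assumption: the scale is the mean absolute value of the group.
    In QAT the scale could instead be a learnable parameter.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(sum(abs(w) for w in group) / len(group), eps)
        scales.append(scale)
        codes.extend(ternary_quantize(group, scale))
    return codes, scales


def zero_ratio(codes):
    """Fraction of quantized weights that are exactly zero."""
    return codes.count(0) / len(codes)


weights = [0.9, -1.1, 0.05, 1.0, 0.02, -0.01, 0.03, -0.5]
codes, scales = quantize_grouped(weights)
print(codes)              # [1, -1, 0, 1, 0, 0, 0, -1]
print(zero_ratio(codes))  # 0.5

# Hypothetical illustration: if a scale drifts to 10x the group's
# absolute mean, every weight in the first group rounds to zero.
collapsed = ternary_quantize(weights[:4], scale=10 * scales[0])
print(zero_ratio(collapsed))  # 1.0
```

The per-group scale is what distinguishes grouped quantization from a single per-tensor scale: the second group above contains one outlier (-0.5), and its own scale keeps that outlier from affecting how the first group is quantized.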
quantization, state-space-models, resource-constraints