Training
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
This paper analyzes why low-precision transformer training fails, focusing on the catastrophic loss explosions that occur when flash attention is used. It traces the failure to two causes: the emergence of similar low-rank representations, and rounding errors in low-precision arithmetic that are biased and therefore compound instead of canceling out. The authors propose a modification to flash attention that mitigates this rounding bias and stabilizes training, giving practitioners a practical fix for instability in low-precision settings.
transformers, low-precision, training-instabilities
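
The summary's point about compounding rounding bias can be illustrated with a small, generic sketch. It is not the paper's analysis or its flash attention modification. It only assumes a bfloat16-like format (8 significant bits, round to nearest even) and shows how a running sum drifts when every rounding error has the same sign. The helper names `round_to_bf16` and `accumulate` are hypothetical and chosen for this illustration.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a Python float.

    Illustrative helper: handles finite values only (no NaN/inf special cases).
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the upper 16 bits; round the discarded lower 16 bits
    # to nearest, breaking ties toward an even kept bit.
    lsb = (bits >> 16) & 1
    rounded = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


def accumulate(values, start: float, low_precision: bool) -> float:
    """Running sum; optionally round the accumulator to bfloat16 after every add."""
    acc = start
    for v in values:
        acc = acc + v
        if low_precision:
            acc = round_to_bf16(acc)
    return acc


# Between 256 and 512 the spacing of bfloat16 values is 2, so adding 1.5
# always lands closer to the next value up: every step rounds upward by 0.5.
steps = [1.5] * 50
exact = accumulate(steps, 256.0, low_precision=False)
rounded = accumulate(steps, 256.0, low_precision=True)
drift = rounded - exact

print(f"exact={exact} bf16={rounded} drift={drift}")
```

Because every step overshoots by the same 0.5, the errors add up instead of averaging out, and the drift grows linearly with the number of steps (25 after 50 steps here). Errors with random signs would tend to cancel; it is the shared sign that makes the bias compound.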