Training · arXiv cs.AI · 10 d ago

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

This paper analyzes why low-precision training of transformer models fails, focusing on the catastrophic loss explosions that occur with flash attention. It traces the failure to two issues: the emergence of similar low-rank representations, and rounding errors in low-precision arithmetic whose bias compounds. The authors propose a modification to flash attention that mitigates this rounding bias and stabilizes training, giving practitioners a practical fix for instability in low-precision settings.
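As a generic illustration of how rounding errors can compound when they all push in the same direction, the sketch below accumulates a small constant in half precision and compares it with a double-precision reference. It is not the paper's analysis or its flash attention modification: the helper names `to_fp16` and `accumulate`, the step size, and the use of IEEE 754 half precision (via Python's `struct` format `"e"`) are all assumptions chosen for a self-contained demo.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Hypothetical helper for this demo: it round-trips through struct's
    "e" (binary16) format, which needs only the standard library.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(values, round_each_step: bool) -> float:
    """Sum values, optionally rounding the running total to fp16 each step."""
    total = 0.0
    for v in values:
        total = total + v
        if round_each_step:
            # Emulate a low-precision accumulator: every partial sum is
            # rounded, so a rounding error is committed at every step.
            total = to_fp16(total)
    return total


# Illustrative workload (an assumption, not taken from the paper):
# add the same small fp16-representable step many times.
step = to_fp16(0.001)
values = [step] * 10_000

low_precision_sum = accumulate(values, round_each_step=True)
reference_sum = accumulate(values, round_each_step=False)

print(f"fp16 accumulator:      {low_precision_sum}")
print(f"double-precision ref.: {reference_sum}")
print(f"relative error:        {abs(low_precision_sum - reference_sum) / reference_sum:.2%}")
```

Once the running total is large enough that the step is smaller than half the spacing between neighbouring half-precision values, every addition rounds back down to the same number, so the error grows with each step instead of averaging out. Zero-mean rounding errors would tend to cancel; errors with a consistent sign do not, which is the general sense in which a rounding bias can compound.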

transformers · low-precision · training-instabilities
