Training · arXiv cs.CL · 2 d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

The paper identifies "Attention Amnesia" in hybrid linear-attention models: chain-of-thought (CoT) supervised fine-tuning (SFT) degrades long-context recall, as observed in models such as HypeNet and Jet-Nemotron. On the Needle-In-A-Haystack (NIAH) benchmark, CoT-SFT sharply reduces retrieval accuracy; HypeNet-9B drops from 67.2% to 9.4% on NIAH-S2@256K. The authors propose QK-Restore, which selectively restores query-key projections from the pre-SFT checkpoint and improves long-context performance without additional training (e.g., HypeNet-5B on S3@256K rises from 65.4% to 76.4%), giving practitioners a practical remedy for this kind of degradation.
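The core idea of restoring query-key projections from a pre-SFT checkpoint can be sketched as a simple checkpoint merge. This is a minimal illustration, not the authors' implementation: the summary does not say which layers are restored or how parameters are named, so the function name `qk_restore`, the `q_proj`/`k_proj` name patterns, and the toy checkpoints below are all assumptions.

```python
import re
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # tensor type; plain lists stand in for tensors in this sketch

# Hypothetical name patterns. Real checkpoints may name these weights differently.
DEFAULT_QK_PATTERNS = (r"\.q_proj\.", r"\.k_proj\.")


def qk_restore(
    pre_sft: Dict[str, T],
    post_sft: Dict[str, T],
    patterns: Sequence[str] = DEFAULT_QK_PATTERNS,
) -> Tuple[Dict[str, T], List[str]]:
    """Return a copy of post_sft with matching parameters taken from pre_sft.

    Every parameter whose name matches one of `patterns` is replaced by its
    pre-SFT value; all other parameters keep their fine-tuned values.
    """
    compiled = [re.compile(p) for p in patterns]
    merged = dict(post_sft)  # shallow copy, the input dict is not mutated
    restored: List[str] = []
    for name in post_sft:
        if any(c.search(name) for c in compiled):
            if name not in pre_sft:
                raise KeyError(f"{name} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]
            restored.append(name)
    return merged, restored


# Toy checkpoints with assumed parameter names (one attention block, one MLP).
pre = {
    "layers.0.attn.q_proj.weight": [1.0],
    "layers.0.attn.k_proj.weight": [2.0],
    "layers.0.attn.v_proj.weight": [3.0],
    "layers.0.mlp.up_proj.weight": [4.0],
}
post = {
    "layers.0.attn.q_proj.weight": [10.0],
    "layers.0.attn.k_proj.weight": [20.0],
    "layers.0.attn.v_proj.weight": [30.0],
    "layers.0.mlp.up_proj.weight": [40.0],
}

merged, restored = qk_restore(pre, post)
print(restored)
```

With PyTorch, the same function would apply to the dictionaries returned by `model.state_dict()`, with the merged result passed to `model.load_state_dict(...)`. Restricting the patterns to a subset of layers would be one way to make the restore "selective", though the summary does not specify the selection rule the paper uses.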

fine-tuning · attention · recall
