Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
This study investigates catastrophic forgetting in Large Language Models (LLMs) during continual fine-tuning across twenty leading models, including Claude Fable 5 and GPT-5.5 High. Using weight-space trajectory tracking and Centered Kernel Alignment (CKA), it identifies the neural circuits most vulnerable to forgetting and finds that early-layer attention heads and mid-to-deep feed-forward networks are particularly affected. To mitigate this, the study proposes Low-Rank Circuit Projection (LRCP), which reduces forgetting by up to 94.2% while maintaining adaptation speed comparable to standard Parameter-Efficient Fine-Tuning (PEFT) methods. These findings give practitioners guidance on preserving model stability during sequential task learning.
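
For readers unfamiliar with Centered Kernel Alignment, the following is a minimal sketch of the linear variant. It assumes that activations for the same inputs are collected as (examples x features) matrices before and after a fine-tuning stage. The function name `linear_cka` and the synthetic data are illustrative assumptions, not the study's implementation, and the summary above does not state which CKA variant the study uses.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_examples, n_features).

    Both matrices must come from the same n_examples inputs; the feature
    dimensions may differ.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Cross-covariance strength, normalized by each matrix's self-similarity.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (assumption: 256 probe inputs, 64 features).
rng = np.random.default_rng(0)
before = rng.normal(size=(256, 64))

# An orthogonal rotation of the features: same representation, different basis.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
rotated = before @ q

# Independent activations: a representation that shares nothing with `before`.
unrelated = rng.normal(size=(256, 64))

same_score = linear_cka(before, rotated)
diff_score = linear_cka(before, unrelated)
print(f"rotated:   {same_score:.4f}")
print(f"unrelated: {diff_score:.4f}")
```

Under this sketch, a score close to 1 for a given layer suggests its representation survived fine-tuning up to rotation and scaling, while a markedly lower score flags a layer whose representation has drifted. Note that with few examples relative to features, even unrelated activations score above zero, so scores are best compared across layers rather than read in absolute terms.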