Research
Demystifying Variance in Circuit Discovery of LLMs
The paper introduces CEAP, a novel circuit discovery method that builds on the existing EAP-IG technique and significantly reduces resampling variance, which affects the stability of discovered circuits. It also finds that rephrasing variance arises because different prompt templates activate distinct model circuits, which makes it hard to build a single unified circuit that represents model behavior. The study concludes that sparsity does not mitigate these issues, and that sample-wise variance is often benign, stemming more from how unfaithfulness is defined than from the circuits themselves. Together, these findings highlight challenges for the interpretability and control of LLMs.
circuit discovery, mechanistic interpretability, variability