Research
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability
The paper introduces Transformer Field Theory, a response-theoretic framework that treats the residual stream of a Transformer as a field over layer depth and token position. It uses localized source insertion, first-order sensitivity fields, and Green functions to predict the effects of interventions in GPT-2-style autoregressive Transformers. The framework gives a structured way to interpret patching experiments and supports inference and response transfer across scales.
mechanistic interpretability, transformers, activation patching
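The first-order idea in the summary can be illustrated on a toy model. The sketch below is not the paper's method or code. It assumes a tiny residual network whose blocks are `h + tanh(W h)` (a stand-in for attention and MLP blocks), inserts a small localized source at one layer, and compares the measured change at the output with the prediction from a Green function, taken here to be the product of block Jacobians along the unperturbed run. All names (`block`, `green_function`, `source_layer`, and so on) are hypothetical.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def block(W, h):
    # Toy residual update (assumption): h + tanh(W h)
    return [x + math.tanh(z) for x, z in zip(h, matvec(W, h))]

def block_jacobian(W, h):
    # d(block)/dh = I + diag(1 - tanh(W h)^2) W
    z = matvec(W, h)
    d = len(h)
    return [[(1.0 if i == j else 0.0) + (1 - math.tanh(z[i]) ** 2) * W[i][j]
             for j in range(d)] for i in range(d)]

def forward(Ws, h0, source_layer=None, source=None):
    # Returns the residual state entering each block, followed by the final state.
    # If source_layer is given, `source` is added to the state entering that block.
    states, h = [], list(h0)
    for layer, W in enumerate(Ws):
        if layer == source_layer:
            h = [x + s for x, s in zip(h, source)]
        states.append(h)
        h = block(W, h)
    states.append(h)
    return states

def green_function(Ws, states, source_layer):
    # Product of block Jacobians from source_layer to the output,
    # evaluated on the unperturbed states.
    d = len(states[0])
    G = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for layer in range(source_layer, len(Ws)):
        G = matmul(block_jacobian(Ws[layer], states[layer]), G)
    return G

random.seed(0)
d, n_layers = 3, 4
Ws = [[[random.uniform(-0.3, 0.3) for _ in range(d)] for _ in range(d)]
      for _ in range(n_layers)]
h0 = [0.3, -0.2, 0.1]
source = [1e-3, 0.0, 0.0]   # small localized source
source_layer = 1

base = forward(Ws, h0)
perturbed = forward(Ws, h0, source_layer, source)
actual = [p - b for p, b in zip(perturbed[-1], base[-1])]
predicted = matvec(green_function(Ws, base, source_layer), source)
max_error = max(abs(a - p) for a, p in zip(actual, predicted))
print("actual:   ", actual)
print("predicted:", predicted)
print("max abs error:", max_error)
```

For a small source, the gap between `actual` and `predicted` should shrink roughly quadratically with the source magnitude, which is what makes a linear-response prediction useful for small interventions. In a real Transformer the Jacobians would come from automatic differentiation rather than a hand-written formula.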