Research · arXiv cs.AI · 7 d ago

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

The paper introduces Transformer Field Theory, a response-theoretic framework that treats the residual stream of a Transformer as a field over layer depth and token position. It uses localized source insertion, first-order sensitivity fields, and Green functions to predict the effects of interventions in GPT-2-style autoregressive Transformers. The framework gives a structured account of patching experiments and supports inference and response transfer across scales.
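To make the response-theoretic idea concrete, here is a minimal sketch on a toy residual stack. It is not the paper's implementation. The layer form `h + tanh(W h)`, the sizes, and every function name are assumptions chosen for illustration. A small source is added to the stream at one layer, and the measured change at the output is compared with a first-order prediction obtained by chaining layer Jacobians from the clean run, which plays the role of a Green function from the source layer to the output.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer(W, h):
    """Toy residual update (an assumption, not a real Transformer block)."""
    return [hi + math.tanh(zi) for hi, zi in zip(h, matvec(W, h))]


def layer_jacobian(W, h):
    """Jacobian of `layer` at h: I + diag(1 - tanh(W h)^2) W."""
    z = matvec(W, h)
    d = len(h)
    return [
        [(1.0 if i == j else 0.0) + (1.0 - math.tanh(z[i]) ** 2) * W[i][j]
         for j in range(d)]
        for i in range(d)
    ]


def forward(weights, h0, source_layer=None, source=None):
    """Run the stack and return the stream entering each layer plus the output.

    If source_layer is given, `source` is added to the stream entering it.
    """
    h = list(h0)
    states = []
    for l, W in enumerate(weights):
        if l == source_layer:
            h = [hi + si for hi, si in zip(h, source)]
        states.append(h)
        h = layer(W, h)
    states.append(h)
    return states


def linear_response(weights, clean_states, source_layer, source):
    """First-order output change: clean-run Jacobians applied to the source."""
    delta = list(source)
    for l in range(source_layer, len(weights)):
        delta = matvec(layer_jacobian(weights[l], clean_states[l]), delta)
    return delta


rng = random.Random(0)
d, L = 4, 3  # hypothetical stream width and depth
weights = [[[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
           for _ in range(L)]
h0 = [rng.gauss(0, 1) for _ in range(d)]

eps = 1e-5
source = [eps, 0.0, 0.0, 0.0]  # localized source on one stream coordinate

clean = forward(weights, h0)
perturbed = forward(weights, h0, source_layer=1, source=source)

measured = [p - c for p, c in zip(perturbed[-1], clean[-1])]
predicted = linear_response(weights, clean, 1, source)
error = max(abs(m - p) for m, p in zip(measured, predicted))
```

For a small source the two responses should agree up to second-order terms in `eps`. Increasing `eps` is a simple way to see where the linear prediction starts to break down.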

mechanistic interpretability · transformers · activation patching
