Research
Language Model Circuits Are Sparse in the Neuron Basis
The paper shows that multi-layer perceptron (MLP) neurons can serve as a sparse feature basis comparable to the one learned by sparse autoencoders (SAEs). It introduces an end-to-end gradient-based attribution pipeline for circuit tracing in the MLP neuron basis and finds that a circuit of approximately 100 neurons is enough to control model behavior on tasks such as subject-verb agreement and multi-hop reasoning. Because the neuron basis is already part of the model, this enables automated interpretability of language models without additional training cost, which matters for practitioners who want to make models more transparent.
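To make the idea of attributing behavior to individual neurons concrete, here is a minimal sketch of gradient-times-activation attribution on a toy one-hidden-layer ReLU MLP. It is not the paper's pipeline: the weights, the function names, and the top-k selection rule are all assumptions chosen for illustration, and the gradient is written out by hand because the readout is linear.

```python
# Toy illustration only: a hand-written one-hidden-layer ReLU MLP with a scalar output.
# All weights and names are hypothetical; this is not the paper's attribution pipeline.

def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(x, w_in, b_in):
    """Post-ReLU activation of each hidden neuron for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_in, b_in)]

def output(acts, w_out, b_out):
    """Linear readout of the hidden activations."""
    return sum(w * a for w, a in zip(w_out, acts)) + b_out

def neuron_attributions(acts, w_out):
    """Gradient-times-activation score per neuron.

    With a linear readout, d(output)/d(act_i) equals w_out[i],
    so the score for neuron i is act_i * w_out[i].
    """
    return [a * w for a, w in zip(acts, w_out)]

def top_k_circuit(attrs, k):
    """Indices of the k neurons with the largest absolute attribution."""
    return sorted(range(len(attrs)), key=lambda i: abs(attrs[i]), reverse=True)[:k]

def ablate(acts, neurons):
    """Zero out the activations of the chosen neurons."""
    drop = set(neurons)
    return [0.0 if i in drop else a for i, a in enumerate(acts)]

# Made-up parameters: 2 inputs, 4 hidden neurons, 1 output.
W_IN = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, 0.5]]
B_IN = [0.0, 0.0, 0.0, -2.0]
W_OUT = [2.0, -0.25, 0.5, 1.0]
B_OUT = 0.0

x = [3.0, 1.0]
acts = hidden_activations(x, W_IN, B_IN)
attrs = neuron_attributions(acts, W_OUT)
circuit = top_k_circuit(attrs, k=1)

baseline = output(acts, W_OUT, B_OUT)
ablated = output(ablate(acts, circuit), W_OUT, B_OUT)
print("attributions:", attrs)
print("circuit:", circuit)
print("output before/after ablating the circuit:", baseline, ablated)
```

In this toy case only two of the four neurons are active, and a single neuron carries almost all of the output, so ablating it changes the result sharply. A real language model would need automatic differentiation through many layers instead of the closed-form gradient used here.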
interpretability, neural networks, mlp