Research
Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models
The paper introduces a control-window framework for single-neuron steering in aligned language models. Its central claim is that coherent control is achieved when the behavior trigger stays below a defined collapse ceiling. The framework quantifies the relationship between the residual stream and neuron writes, and it predicts coherent control with a mean absolute error of 0.14 across neurons. For AI practitioners, the work offers a theoretical foundation for steering model behavior through targeted neuron interventions and improves understanding of controllability in language models.
neuron, control, language models
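The summary above states the control-window condition only in words, so a minimal sketch may help make it concrete. It is not the paper's formulation. It assumes the trigger and the collapse ceiling are scalars in the same unspecified steering units, and it reads "trigger" as the smallest intervention strength at which the behavior appears. `NeuronWindow`, `is_controllable`, `width`, and `mean_absolute_error` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NeuronWindow:
    """Hypothetical per-neuron summary; both fields share the same (assumed) units."""
    trigger: float  # assumed: smallest steering strength at which the behavior appears
    ceiling: float  # assumed: steering strength at which output coherence collapses

    def is_controllable(self) -> bool:
        # Coherent control requires the trigger to sit below the collapse ceiling.
        return self.trigger < self.ceiling

    def width(self) -> float:
        # Size of the usable steering range; zero when no window exists.
        return max(0.0, self.ceiling - self.trigger)


def mean_absolute_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Average absolute gap between predicted and observed values."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)
```

Under these assumptions, a neuron whose trigger exceeds its ceiling has no usable window: the behavior would appear only after coherence has already collapsed. The error function is the standard mean absolute error, the metric the summary reports as 0.14. The summary does not say which quantity that figure is measured on.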