Research · arXiv cs.AI · 10 d ago

Scalable Circuit Learning for Interpreting Large Language Models

The paper introduces CircuitLasso, a scalable circuit-learning technique that uses sparse linear regression to interpret large language models (LLMs). It targets the high dimensionality of sparse autoencoder (SAE) features and recovers circuits whose structural accuracy is comparable to that of existing intervention-based methods, at significantly lower computational cost. The learned circuits expose how SAE features relate to one another and how they influence model predictions, and the method achieves competitive performance on domain-generalization tasks.
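To make the core idea concrete, here is a minimal sketch of sparse linear regression used to pick out which upstream features drive a downstream one. It is not the paper's implementation: the data is synthetic, the "features" are random Gaussian columns standing in for SAE activations, and the names `lasso_coordinate_descent`, `soft_threshold`, `lam`, and `edges` are hypothetical. The solver is plain cyclic coordinate descent on the standard Lasso objective, chosen only to keep the example dependency-free.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal operator)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # always equals y - Xw; w starts at zero
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of feature j with the residual, with j's own
            # current contribution added back in.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_w
    return w


# Synthetic stand-in for SAE activations (an assumption, not real model data):
# 8 upstream features, of which only features 0 and 3 drive the downstream one.
random.seed(0)
n_samples, n_upstream = 200, 8
X = [[random.gauss(0.0, 1.0) for _ in range(n_upstream)] for _ in range(n_samples)]
y = [2.0 * row[0] - 1.5 * row[3] + random.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.1)

# Non-zero coefficients are read as candidate circuit edges into the
# downstream feature; everything else is pruned by the L1 penalty.
edges = {j: w for j, w in enumerate(weights) if abs(w) > 1e-8}
print(edges)
```

In this toy setting the L1 penalty zeroes out the irrelevant upstream features, so the surviving coefficients form a sparse set of edges without running any interventions on the model. How CircuitLasso actually selects features, layers, and regularization strength is not specified in this summary.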

circuit learning · large language models · interpretability
