Research · arXiv cs.AI · 4 d ago

ICA Lens: Interpreting Language Models Without Training Another Dictionary

The paper introduces ICALens, a workflow that applies independent component analysis (ICA) to interpret language model representations without training an additional dictionary such as a sparse autoencoder (SAE). ICALens uses a GPU-parallel FastICA pipeline adapted to LLMs and performs competitively with SAEs on tasks including sparse probing and targeted probe perturbation, on models such as GPT-2 Small and Gemma 2 2B. The authors position ICA as an efficient interpretability tool: a stable, auditable method for layer-wise analysis that can speed up exploratory work on model internals.
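For readers unfamiliar with FastICA, here is a minimal sketch of the textbook symmetric ("parallel") FastICA with a log-cosh contrast, written in NumPy. It is not the authors' ICALens pipeline. The function names `whiten`, `sym_decorrelate` and `fast_ica` are hypothetical, and the synthetic Laplace "features" stand in for a real activation matrix of shape `(n_tokens, d_model)`. On real models one would likely reduce dimension with PCA first, and the same matrix operations could in principle run on a GPU by swapping NumPy for an array library such as CuPy or PyTorch. How ICALens actually parallelizes the computation is not described here, so treat that mapping as an assumption.

```python
import numpy as np


def whiten(X):
    """Center X (n_samples, d) and PCA-whiten it so its covariance is the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the eigenvectors, then rescale each axis to unit variance.
    W_white = eigvecs / np.sqrt(eigvals)
    return Xc @ W_white, W_white


def sym_decorrelate(W):
    """Return (W W^T)^{-1/2} W, which makes the rows of W orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W


def fast_ica(X, n_iter=500, tol=1e-6, seed=0):
    """Symmetric FastICA (log-cosh contrast). Returns components S and unmixing U,
    with S = (X - mean) @ U and one independent component per column of S."""
    Z, W_white = whiten(X)
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        WZ = Z @ W.T                 # projections onto all current directions at once
        g = np.tanh(WZ)              # derivative of log cosh
        g_prime = 1.0 - g ** 2
        # Fixed-point step for every row w: E[z g(w^T z)] - E[g'(w^T z)] w
        W_new = (g.T @ Z) / n - g_prime.mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged once each new direction is (anti)parallel to its old one.
        lim = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if lim < tol:
            break
    U = W_white @ W.T
    return Z @ W.T, U


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 5000, 4
    sources = rng.laplace(size=(n, d))          # sparse, non-Gaussian "features"
    X = sources @ rng.standard_normal((d, d)).T  # stand-in for layer activations
    S, U = fast_ica(X)
    corr = np.abs(np.corrcoef(S.T, sources.T)[:d, d:])
    print("best |corr| per recovered component:", corr.max(axis=1).round(3))
```

Updating all directions together and re-orthonormalizing them with `sym_decorrelate` avoids the sequential deflation variant, which is why the symmetric form is the natural candidate for batched or GPU execution. ICA recovers components only up to permutation, sign and scale, so a component-to-concept mapping still has to be established afterwards, for example by probing.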

interpretability · language models · ica
