Research · arXiv cs.CL · 14 d ago

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

The paper presents a framework for certifying sparse autoencoders (SAEs) used to extract features from language models (LMs), focusing on when those features can be treated as faithful representations of the underlying model. The framework upper-bounds the expected risk of the base model in terms of quantities such as the proxy risk and the reconstruction gap, and is validated empirically on models such as GPT-2 Small, Gemma-2B, and Llama-3-8B. For practitioners, it offers a way to assess how reliable SAE-based explanations are before relying on them in practical applications.

interpretability · explainability · sae
