Research · arXiv cs.AI · 7 d ago

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

The paper introduces "Bag of Dims", a training-free framework for mechanistic interpretability that works directly in the standard basis of transformer hidden states, evaluated on Qwen (3.5-4B), Gemma (3-4B), and Mistral (7B). It reports that individual hidden-state dimensions encode semantic features and predictive content: sign patterns alone reach up to 93% top-5 next-token accuracy, and 175 semantic categories are identified with a mean AUC of 0.80, all without any training. For practitioners, this means features can be extracted and inspected without additional training or extensive computational resources.
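To make the "mean AUC per semantic category" idea concrete, here is a minimal sketch of how a single hidden-state dimension could be scored as a training-free detector for one binary category. It is not the paper's procedure: the synthetic `hidden` matrix, the planted dimension, and the helper `dimension_auc` are all assumptions made for illustration, and real hidden states would come from a model's forward pass.

```python
import numpy as np

def dimension_auc(activations, labels):
    """AUC of every hidden dimension, used as a raw score for a binary label.

    Uses the rank-sum (Mann-Whitney) form of AUC. Ties are not handled,
    which is acceptable here because the activations are continuous.
    """
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # 1-based rank of each token within each column (dimension).
    ranks = activations.argsort(axis=0).argsort(axis=0) + 1
    rank_sum = ranks[pos].sum(axis=0)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic stand-in for hidden states (assumption: 400 tokens, 16 dims).
rng = np.random.default_rng(0)
n_tokens, n_dims, planted_dim = 400, 16, 3
labels = np.repeat([1, 0], n_tokens // 2)      # 1 = token belongs to the category
hidden = rng.standard_normal((n_tokens, n_dims))
hidden[labels == 1, planted_dim] += 2.0        # plant a category signal in one dim

auc = dimension_auc(hidden, labels)
# A dimension is informative in either polarity, so fold AUC around 0.5.
strength = np.maximum(auc, 1 - auc)
best_dim = int(strength.argmax())
print(f"most informative dimension: {best_dim}")
```

Folding the AUC around 0.5 treats a dimension whose negative values mark the category as equally useful, which is in the spirit of reading sign patterns rather than magnitudes.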

interpretability · mechanistic · transformers
