Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Sparse Autoencoders, SAEs, LLMs, interpretability
Abstract:

Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue for routine randomized baselines and targeted measures of feature 'abstractness'.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper tests whether standard SAE quality metrics can distinguish trained transformers from randomly initialized ones, finding that many metrics yield similar scores in both settings. It resides in the 'Baseline Comparisons and Metric Reliability' leaf, which contains only three papers total. This leaf focuses specifically on validating SAE metrics through controlled baselines rather than proposing new architectures or applications. The small population suggests this critical evaluation angle remains relatively underexplored compared to the broader SAE literature, where architectural innovations and domain applications dominate.

The taxonomy reveals that most SAE research concentrates on architectural improvements, domain-specific applications, and feature interpretation pipelines. The paper's parent branch, 'SAE Quality Metrics and Validation Methods', sits alongside five other major branches covering architectural innovations, interpretability techniques, circuit analysis, steering applications, and domain extensions. Its sibling papers examine feature alignment and metric reliability, but the broader field shows limited emphasis on randomized controls. Neighboring leaves like 'Scaling Laws and Hyperparameter Tuning' address performance variation without questioning whether the metrics themselves are meaningful, highlighting the gap this work addresses.

Among the twenty-one candidates examined, the sanity-check contribution was compared against a single prior work, which was judged refutable, while the token entropy and toy-model contributions were each compared against ten candidates, yielding one and zero refutations respectively. The limited search scope means these statistics reflect top-ranked semantic matches rather than exhaustive coverage. The sanity-check finding appears to have some precedent in the examined literature, whereas the entropy-based abstractness measure shows no clear overlap among the candidates reviewed. The toy model analysis similarly lacks strong prior work within this sample, though nine candidates remain unclear or non-refutable.

Based on the top-twenty-one semantic matches, the work addresses a methodological gap in SAE validation that the taxonomy structure confirms is sparsely populated. The sanity-check contribution has at least one overlapping prior result, suggesting the core insight may not be entirely unprecedented. However, the entropy measure and toy models appear more novel within this limited search window. The analysis cannot rule out relevant work outside the examined candidates, particularly in adjacent fields like general neural network interpretability or representation learning beyond SAEs.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating sparse autoencoder interpretability metrics on transformer activations. The field has organized itself around several complementary directions. SAE Quality Metrics and Validation Methods focuses on establishing reliable benchmarks and baseline comparisons to assess whether learned features are genuinely interpretable or merely artifacts of the training process, as explored in Random Transformers SAE[2] and Feature-Aligned SAE[21]. SAE Architectural Innovations investigates novel designs such as gated variants (Gated Sparse Autoencoders[28]) and adaptive sparsity mechanisms (AdaptiveK SAE[41]) to improve feature disentanglement. Feature Interpretation and Validation Techniques develops systematic approaches to verify that discovered features correspond to meaningful semantic or syntactic properties, exemplified by SAE Interpretable Features[3] and Testable Vision Features[8]. Circuit Analysis and Mechanistic Interpretability leverages SAE-derived features to trace information flow within transformers (Transcoders Feature Circuits[10]), while Model Steering and Control Applications applies these features to guide model behavior (Latent Feature Steering[27]). Domain-Specific SAE Applications extends the methodology to specialized areas including radiology (Radiology SAE[13]), protein modeling (Protein Language SAE[15]), and scientific foundation models (Scientific Foundation SAE[17]), alongside branches addressing downstream task utilization, multilingual studies (Italian Language SAE[42]), and theoretical foundations.

A particularly active line of work examines the reliability of interpretability claims by testing SAEs on controlled or random baselines, questioning whether high reconstruction fidelity alone guarantees meaningful feature discovery. Automated Interpretability Metrics[0] sits squarely within this critical evaluation strand, proposing systematic metrics to distinguish genuinely interpretable features from spurious patterns.
Its emphasis on metric reliability aligns closely with Random Transformers SAE[2], which tests SAE behavior on unstructured activations, and Feature-Aligned SAE[21], which explores alignment between learned features and ground-truth concepts. These works collectively address a central tension: while many studies demonstrate impressive reconstruction or downstream performance (SAE Classification Transferability[4]), fewer rigorously validate that the features themselves are interpretable in a human-meaningful sense. By focusing on automated evaluation protocols, Automated Interpretability Metrics[0] contributes methodological rigor to a field where qualitative inspection has often dominated, helping bridge the gap between architectural innovation and trustworthy interpretability.

Claimed Contributions

Sanity check showing SAE metrics fail to distinguish trained from random transformers

The authors demonstrate that commonly used SAE quality metrics and auto-interpretability scores produce similar results when applied to both trained transformers and randomly initialized ones, revealing a fundamental limitation in current evaluation methods for mechanistic interpretability.

1 retrieved paper
Can Refute
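As an illustration of what such a sanity check compares, the sketch below evaluates two generic SAE quality metrics, fraction of variance unexplained (FVU) and L0 sparsity, on two activation sets. All weights and activations here are synthetic stand-ins (random numpy arrays), not the paper's actual models or metrics pipeline; the point is only the shape of the comparison, applying one identical metric harness to activations from a trained checkpoint and from a randomly re-initialized one.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: reconstruction error
    normalized by the variance of the activations (0 = perfect)."""
    return np.sum((x - x_hat) ** 2) / np.sum((x - x.mean(0)) ** 2)

def l0(latents, eps=1e-8):
    """Mean number of active SAE latents per token."""
    return float((np.abs(latents) > eps).sum(1).mean())

def sae_metrics(acts, W_enc, b_enc, W_dec, b_dec):
    """Forward pass of a ReLU SAE, returning the two metrics."""
    z = np.maximum(acts @ W_enc + b_enc, 0.0)  # encoder + ReLU
    recon = z @ W_dec + b_dec                  # decoder
    return {"fvu": fvu(acts, recon), "l0": l0(z)}

# Hypothetical usage: the same harness applied to both activation sources.
rng = np.random.default_rng(0)
d, k, n = 16, 64, 512
acts_trained = rng.normal(size=(n, d))  # stand-in for trained-model activations
acts_random = rng.normal(size=(n, d))   # stand-in for random-init activations
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(k), np.zeros(d)
for name, acts in [("trained", acts_trained), ("random-init", acts_random)]:
    print(name, sae_metrics(acts, W_enc, b_enc, W_dec, b_dec))
```

The failure mode the paper reports is that, with real SAEs trained on each activation source, the two printed metric dictionaries come out similar, so neither FVU nor sparsity alone certifies that learned features were recovered.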
Token distribution entropy as a measure of feature abstractness

The authors propose using the entropy of latent activation distributions over token IDs as a proxy for feature complexity or abstractness, showing that this metric can distinguish between simple token-specific features and more abstract learned features where aggregate auto-interpretability scores fail.

10 retrieved papers
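The proposed entropy measure can be made concrete with a minimal sketch. The function below is a plausible reading of the description above, not the authors' implementation: for one latent, accumulate its activation mass per token ID and take the Shannon entropy of the resulting distribution, so a latent that fires on a single token ID scores 0 while a latent spreading mass over many token IDs scores high.

```python
import numpy as np
from collections import defaultdict

def token_entropy(token_ids, activations):
    """Entropy (in bits) of one latent's activation mass over token IDs.

    token_ids:   token ID at each corpus position
    activations: the latent's non-negative activation at each position
    """
    mass = defaultdict(float)
    for tok, act in zip(token_ids, activations):
        if act > 0:
            mass[int(tok)] += float(act)
    total = sum(mass.values())
    if total == 0:
        return 0.0  # latent never fires
    p = np.array([m / total for m in mass.values()])  # strictly positive
    return float(-(p * np.log2(p)).sum() + 0.0)

# A token-specific latent: all activation mass on token ID 42.
print(token_entropy(np.array([42, 42, 7, 42, 42]),
                    np.array([1.0, 0.5, 0.0, 2.0, 1.5])))  # → 0.0
# A more 'abstract' latent: mass spread over five distinct token IDs.
print(token_entropy(np.array([3, 17, 42, 99, 256]),
                    np.ones(5)))  # → log2(5) ≈ 2.32
```

Under this reading, low entropy flags the simple token-detector latents that inflate aggregate auto-interpretability scores, while high entropy is a necessary (not sufficient) signal of abstractness.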
Toy models demonstrating superposition preservation and amplification in random networks

The authors develop toy models showing that randomly initialized neural networks can preserve or even amplify superposition present in input data, providing a mechanistic explanation for why SAEs trained on random transformers can produce seemingly interpretable features.

10 retrieved papers
Can Refute
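The linear-algebra core of this claim can be sketched in a few lines of numpy. This is an assumption-laden illustration, not the paper's toy model: it treats "features" as unit directions placed in superposition in a low-dimensional space and checks that a randomly initialized linear layer keeps their mutual interference low, so feature-like directions remain recoverable downstream.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats, d_in, d_out = 100, 20, 20

# 100 feature directions in superposition in a 20-dim space:
# random unit vectors are nearly, but not exactly, orthogonal.
D_in = rng.normal(size=(n_feats, d_in))
D_in /= np.linalg.norm(D_in, axis=1, keepdims=True)

# A randomly initialized layer: an untrained Gaussian weight matrix.
W = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)

# The same feature directions as seen after the random layer.
D_out = D_in @ W
D_out /= np.linalg.norm(D_out, axis=1, keepdims=True)

def mean_interference(D):
    """Mean |cosine similarity| between distinct feature directions;
    low values mean the features stay individually recoverable."""
    G = D @ D.T
    off_diag = G[~np.eye(len(D), dtype=bool)]
    return float(np.abs(off_diag).mean())

print("input interference: ", mean_interference(D_in))
print("output interference:", mean_interference(D_out))
# Both values are small and of similar magnitude: the random map keeps
# the features in superposition rather than destroying them.
```

This is why an SAE trained on such random-network activations can still surface directions that look like interpretable features, even though the network learned nothing.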

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Sanity check showing SAE metrics fail to distinguish trained from random transformers

The authors demonstrate that commonly used SAE quality metrics and auto-interpretability scores produce similar results when applied to both trained transformers and randomly initialized ones, revealing a fundamental limitation in current evaluation methods for mechanistic interpretability.

Contribution

Token distribution entropy as a measure of feature abstractness

The authors propose using the entropy of latent activation distributions over token IDs as a proxy for feature complexity or abstractness, showing that this metric can distinguish between simple token-specific features and more abstract learned features where aggregate auto-interpretability scores fail.

Contribution

Toy models demonstrating superposition preservation and amplification in random networks

The authors develop toy models showing that randomly initialized neural networks can preserve or even amplify superposition present in input data, providing a mechanistic explanation for why SAEs trained on random transformers can produce seemingly interpretable features.