Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Sparse Autoencoders, SAEs, LLMs, interpretability
Abstract:

Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue for routine randomized baselines and targeted measures of feature 'abstractness'.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper tests whether standard SAE quality metrics can distinguish trained transformers from randomly initialized ones, finding that many metrics yield similar scores in both settings. It resides in the 'Baseline Comparisons and Metric Reliability' leaf, which contains only three papers total. This leaf focuses specifically on validating SAE metrics through controlled baselines rather than proposing new architectures or applications. The small population suggests this critical evaluation angle remains relatively underexplored compared to the broader SAE literature, where architectural innovations and domain applications dominate.

The taxonomy reveals that most SAE research concentrates on architectural improvements, domain-specific applications, and feature interpretation pipelines. The paper's parent branch, 'SAE Quality Metrics and Validation Methods', sits alongside five other major branches covering architectural innovations, interpretability techniques, circuit analysis, steering applications, and domain extensions. Its sibling papers examine feature alignment and metric reliability, but the broader field shows limited emphasis on randomized controls. Neighboring leaves like 'Scaling Laws and Hyperparameter Tuning' address performance variation without questioning whether the metrics themselves are meaningful, highlighting the gap this work addresses.

Among the twenty-one candidates examined, the sanity-check contribution was compared against a single prior work, which was judged refutable, while the token entropy and toy-model contributions were each compared against ten candidates, yielding one and zero refutations respectively. The limited search scope means these statistics reflect top-ranked semantic matches rather than exhaustive coverage. The sanity-check finding appears to have some precedent in the examined literature, whereas the entropy-based abstractness measure shows no clear overlap among the candidates reviewed. The toy model analysis similarly lacks strong prior work within this sample, though nine candidates remain unclear or non-refutable.

Based on the top-twenty-one semantic matches, the work addresses a methodological gap in SAE validation that the taxonomy structure confirms is sparsely populated. The sanity-check contribution has at least one overlapping prior result, suggesting the core insight may not be entirely unprecedented. However, the entropy measure and toy models appear more novel within this limited search window. The analysis cannot rule out relevant work outside the examined candidates, particularly in adjacent fields like general neural network interpretability or representation learning beyond SAEs.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating sparse autoencoder interpretability metrics on transformer activations. The field has organized itself around several complementary directions. SAE Quality Metrics and Validation Methods focuses on establishing reliable benchmarks and baseline comparisons to assess whether learned features are genuinely interpretable or merely artifacts of the training process, as explored in Random Transformers SAE[2] and Feature-Aligned SAE[21]. SAE Architectural Innovations investigates novel designs such as gated variants (Gated Sparse Autoencoders[28]) and adaptive sparsity mechanisms (AdaptiveK SAE[41]) to improve feature disentanglement. Feature Interpretation and Validation Techniques develops systematic approaches to verify that discovered features correspond to meaningful semantic or syntactic properties, exemplified by SAE Interpretable Features[3] and Testable Vision Features[8]. Circuit Analysis and Mechanistic Interpretability leverages SAE-derived features to trace information flow within transformers (Transcoders Feature Circuits[10]), while Model Steering and Control Applications applies these features to guide model behavior (Latent Feature Steering[27]). Domain-Specific SAE Applications extends the methodology to specialized areas including radiology (Radiology SAE[13]), protein modeling (Protein Language SAE[15]), and scientific foundation models (Scientific Foundation SAE[17]), alongside branches addressing downstream task utilization, multilingual studies (Italian Language SAE[42]), and theoretical foundations.

A particularly active line of work examines the reliability of interpretability claims by testing SAEs on controlled or random baselines, questioning whether high reconstruction fidelity alone guarantees meaningful feature discovery. Automated Interpretability Metrics[0] sits squarely within this critical evaluation strand, proposing systematic metrics to distinguish genuinely interpretable features from spurious patterns.
Its emphasis on metric reliability aligns closely with Random Transformers SAE[2], which tests SAE behavior on unstructured activations, and Feature-Aligned SAE[21], which explores alignment between learned features and ground-truth concepts. These works collectively address a central tension: while many studies demonstrate impressive reconstruction or downstream performance (SAE Classification Transferability[4]), fewer rigorously validate that the features themselves are interpretable in a human-meaningful sense. By focusing on automated evaluation protocols, Automated Interpretability Metrics[0] contributes methodological rigor to a field where qualitative inspection has often dominated, helping bridge the gap between architectural innovation and trustworthy interpretability.

Claimed Contributions

Sanity check showing SAE metrics fail to distinguish trained from random transformers

The authors demonstrate that commonly used SAE quality metrics and auto-interpretability scores produce similar results when applied to both trained transformers and randomly initialized ones, revealing a fundamental limitation in current evaluation methods for mechanistic interpretability.

1 retrieved paper
Can Refute
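As an illustration of what such a sanity check compares, the sketch below evaluates two generic SAE quality metrics, fraction of variance unexplained (FVU) and L0 sparsity, on two activation sets. All weights and activations here are synthetic stand-ins (random numpy arrays), not the paper's actual models or metrics pipeline; the point is only the shape of the comparison, applying one identical metric harness to activations from a trained checkpoint and from a randomly re-initialized one.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: reconstruction error
    normalized by the variance of the activations (0 = perfect)."""
    return np.sum((x - x_hat) ** 2) / np.sum((x - x.mean(0)) ** 2)

def l0(latents, eps=1e-8):
    """Mean number of active SAE latents per token."""
    return float((np.abs(latents) > eps).sum(1).mean())

def sae_metrics(acts, W_enc, b_enc, W_dec, b_dec):
    """Forward pass of a ReLU SAE, returning the two metrics."""
    z = np.maximum(acts @ W_enc + b_enc, 0.0)  # encoder + ReLU
    recon = z @ W_dec + b_dec                  # decoder
    return {"fvu": fvu(acts, recon), "l0": l0(z)}

# Hypothetical usage: the same harness applied to both activation sources.
rng = np.random.default_rng(0)
d, k, n = 16, 64, 512
acts_trained = rng.normal(size=(n, d))  # stand-in for trained-model activations
acts_random = rng.normal(size=(n, d))   # stand-in for random-init activations
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(k), np.zeros(d)
for name, acts in [("trained", acts_trained), ("random-init", acts_random)]:
    print(name, sae_metrics(acts, W_enc, b_enc, W_dec, b_dec))
```

The failure mode the paper reports is that, with real SAEs trained on each activation source, the two printed metric dictionaries come out similar, so neither FVU nor sparsity alone certifies that learned features were recovered.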
Token distribution entropy as a measure of feature abstractness

The authors propose using the entropy of latent activation distributions over token IDs as a proxy for feature complexity or abstractness, showing that this metric can distinguish between simple token-specific features and more abstract learned features where aggregate auto-interpretability scores fail.

10 retrieved papers
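The proposed entropy measure can be made concrete with a minimal sketch. The function below is a plausible reading of the description above, not the authors' implementation: for one latent, accumulate its activation mass per token ID and take the Shannon entropy of the resulting distribution, so a latent that fires on a single token ID scores 0 while a latent spreading mass over many token IDs scores high.

```python
import numpy as np
from collections import defaultdict

def token_entropy(token_ids, activations):
    """Entropy (in bits) of one latent's activation mass over token IDs.

    token_ids:   token ID at each corpus position
    activations: the latent's non-negative activation at each position
    """
    mass = defaultdict(float)
    for tok, act in zip(token_ids, activations):
        if act > 0:
            mass[int(tok)] += float(act)
    total = sum(mass.values())
    if total == 0:
        return 0.0  # latent never fires
    p = np.array([m / total for m in mass.values()])  # strictly positive
    return float(-(p * np.log2(p)).sum() + 0.0)

# A token-specific latent: all activation mass on token ID 42.
print(token_entropy(np.array([42, 42, 7, 42, 42]),
                    np.array([1.0, 0.5, 0.0, 2.0, 1.5])))  # → 0.0
# A more 'abstract' latent: mass spread over five distinct token IDs.
print(token_entropy(np.array([3, 17, 42, 99, 256]),
                    np.ones(5)))  # → log2(5) ≈ 2.32
```

Under this reading, low entropy flags the simple token-detector latents that inflate aggregate auto-interpretability scores, while high entropy is a necessary (not sufficient) signal of abstractness.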
Toy models demonstrating superposition preservation and amplification in random networks

The authors develop toy models showing that randomly initialized neural networks can preserve or even amplify superposition present in input data, providing a mechanistic explanation for why SAEs trained on random transformers can produce seemingly interpretable features.

10 retrieved papers
Can Refute
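The linear-algebra core of this claim can be sketched in a few lines of numpy. This is an assumption-laden illustration, not the paper's toy model: it treats "features" as unit directions placed in superposition in a low-dimensional space and checks that a randomly initialized linear layer keeps their mutual interference low, so feature-like directions remain recoverable downstream.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats, d_in, d_out = 100, 20, 20

# 100 feature directions in superposition in a 20-dim space:
# random unit vectors are nearly, but not exactly, orthogonal.
D_in = rng.normal(size=(n_feats, d_in))
D_in /= np.linalg.norm(D_in, axis=1, keepdims=True)

# A randomly initialized layer: an untrained Gaussian weight matrix.
W = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)

# The same feature directions as seen after the random layer.
D_out = D_in @ W
D_out /= np.linalg.norm(D_out, axis=1, keepdims=True)

def mean_interference(D):
    """Mean |cosine similarity| between distinct feature directions;
    low values mean the features stay individually recoverable."""
    G = D @ D.T
    off_diag = G[~np.eye(len(D), dtype=bool)]
    return float(np.abs(off_diag).mean())

print("input interference: ", mean_interference(D_in))
print("output interference:", mean_interference(D_out))
# Both values are small and of similar magnitude: the random map keeps
# the features in superposition rather than destroying them.
```

This is why an SAE trained on such random-network activations can still surface directions that look like interpretable features, even though the network learned nothing.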

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Sanity check showing SAE metrics fail to distinguish trained from random transformers

The authors demonstrate that commonly used SAE quality metrics and auto-interpretability scores produce similar results when applied to both trained transformers and randomly initialized ones, revealing a fundamental limitation in current evaluation methods for mechanistic interpretability.

Contribution

Token distribution entropy as a measure of feature abstractness

The authors propose using the entropy of latent activation distributions over token IDs as a proxy for feature complexity or abstractness, showing that this metric can distinguish between simple token-specific features and more abstract learned features where aggregate auto-interpretability scores fail.

Contribution

Toy models demonstrating superposition preservation and amplification in random networks

The authors develop toy models showing that randomly initialized neural networks can preserve or even amplify superposition present in input data, providing a mechanistic explanation for why SAEs trained on random transformers can produce seemingly interpretable features.