Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers
Overview
Overall Novelty Assessment
The paper tests whether standard SAE quality metrics can distinguish trained transformers from randomly initialized ones, finding that many metrics yield similar scores in both settings. It resides in the 'Baseline Comparisons and Metric Reliability' leaf, which contains only three papers total. This leaf focuses specifically on validating SAE metrics through controlled baselines rather than proposing new architectures or applications. The small population suggests this critical evaluation angle remains relatively underexplored compared to the broader SAE literature, where architectural innovations and domain applications dominate.
The taxonomy reveals that most SAE research concentrates on architectural improvements, domain-specific applications, and feature-interpretation pipelines. The paper's parent branch, 'SAE Quality Metrics and Validation Methods', sits alongside five other major branches covering architectural innovations, interpretability techniques, circuit analysis, steering applications, and domain extensions. Its sibling papers examine feature alignment and metric reliability, but the broader field places little emphasis on randomized controls. Neighboring leaves such as 'Scaling Laws and Hyperparameter Tuning' address performance variation without questioning whether the metrics themselves are meaningful, highlighting the gap this work addresses.
Of the twenty-one candidates examined in total, the sanity-check contribution was matched against one prior work, which refutes it, while the token-entropy and toy-model contributions were each matched against ten candidates, yielding no clear refutations. The limited search scope means these statistics reflect top-ranked semantic matches rather than exhaustive coverage. The sanity-check finding thus has precedent in the examined literature, whereas the entropy-based abstractness measure shows no clear overlap among the candidates reviewed. The toy-model analysis likewise lacks strong prior work within this sample, with the remaining candidates unclear or non-refutable.
Based on the top-twenty-one semantic matches, the work addresses a methodological gap in SAE validation that the taxonomy structure confirms is sparsely populated. The sanity-check contribution has at least one overlapping prior result, suggesting the core insight may not be entirely unprecedented. However, the entropy measure and toy models appear more novel within this limited search window. The analysis cannot rule out relevant work outside the examined candidates, particularly in adjacent fields like general neural network interpretability or representation learning beyond SAEs.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors demonstrate that commonly used SAE quality metrics and auto-interpretability scores produce similar results when applied to both trained transformers and randomly initialized ones, revealing a fundamental limitation in current evaluation methods for mechanistic interpretability.
The authors propose using the entropy of latent activation distributions over token IDs as a proxy for feature complexity or abstractness, showing that this metric can distinguish between simple token-specific features and more abstract learned features where aggregate auto-interpretability scores fail.
The authors develop toy models showing that randomly initialized neural networks can preserve or even amplify superposition present in input data, providing a mechanistic explanation for why SAEs trained on random transformers can produce seemingly interpretable features.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Sparse Autoencoders Can Interpret Randomly Initialized Transformers PDF
[21] Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Sanity check showing SAE metrics fail to distinguish trained from random transformers
The authors demonstrate that commonly used SAE quality metrics and auto-interpretability scores produce similar results when applied to both trained transformers and randomly initialized ones, revealing a fundamental limitation in current evaluation methods for mechanistic interpretability.
[2] Sparse Autoencoders Can Interpret Randomly Initialized Transformers PDF
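The failure mode can be reproduced in miniature. The sketch below is entirely synthetic: the toy data generators, the bias-free ReLU SAE, and all hyperparameters are illustrative assumptions, not the paper's setup. It trains a small SAE on activations from a structured "trained-model" stand-in and from a randomly initialized MLP layer, then scores both with standard-style reconstruction and sparsity metrics; the scores alone do not reveal which underlying model was trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def train_sae(acts, n_latents=32, lr=0.02, steps=3000, l1=1e-3):
    """Bias-free ReLU SAE fit by full-batch gradient descent.
    A toy stand-in for a real SAE training pipeline (illustrative only)."""
    n, d = acts.shape
    W_enc = rng.normal(size=(d, n_latents)) * 0.1
    W_dec = rng.normal(size=(n_latents, d)) * 0.1
    for _ in range(steps):
        z = relu(acts @ W_enc)                 # latent codes
        err = z @ W_dec - acts                 # reconstruction error
        dz = (err @ W_dec.T + l1) * (z > 0)    # grad through ReLU + L1 penalty
        W_dec -= lr * z.T @ err / n
        W_enc -= lr * acts.T @ dz / n
    return W_enc, W_dec

def sae_metrics(acts, W_enc, W_dec):
    """Standard-style scores: fraction of variance unexplained and mean L0."""
    z = relu(acts @ W_enc)
    fvu = np.square(z @ W_dec - acts).sum() / np.square(acts - acts.mean(0)).sum()
    return fvu, (z > 0).sum(axis=1).mean()

# "Trained"-model stand-in: activations are sparse combinations of a dictionary.
S = relu(rng.normal(size=(1024, 32)) - 1.0)       # sparse non-negative codes
acts_structured = S @ rng.normal(size=(32, 8))
# Random-model stand-in: one layer of a randomly initialized MLP.
acts_random = relu(rng.normal(size=(1024, 8)) @ rng.normal(size=(8, 8)))

results = {}
for name, acts in [("structured", acts_structured), ("random", acts_random)]:
    results[name] = sae_metrics(acts, *train_sae(acts))
    print(name, results[name])   # comparable FVU/L0 scores in both settings
```

Both runs produce respectable reconstruction and sparsity numbers, so nothing in the metric values themselves flags the random stand-in, which is the shape of the paper's argument rather than a reproduction of its experiments.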
Token distribution entropy as a measure of feature abstractness
The authors propose using the entropy of latent activation distributions over token IDs as a proxy for feature complexity or abstractness, showing that this metric can distinguish between simple token-specific features and more abstract learned features where aggregate auto-interpretability scores fail.
[51] Filter pruning by quantifying feature similarity and entropy of feature maps PDF
[52] RAD-BNN: Regulating activation distribution for accurate binary neural network PDF
[53] The geometry of concepts: Sparse autoencoder feature structure PDF
[54] Contextual lattice probing for large language models: A study of interleaved multi-space activation patterns PDF
[55] Latent cascade synthesis: Investigating iterative pseudo-contextual scaffold formation in contemporary large language models PDF
[56] Universal neurons in gpt2 language models PDF
[57] An Entropy- and Attention-Based Feature Extraction and Selection Network for Multi-Target Coupling Scenarios PDF
[58] Confidence regulation neurons in language models PDF
[59] Structural recomposition in large language models through lexico-semantic vector fusion: A computational study PDF
[60] Forest Fire Detection via Feature Entropy Guided Neural Network PDF
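A minimal sketch of how such an entropy measure could be computed is given below; the function name, the synthetic data, and the exact weighting by activation mass are assumptions, and the paper's estimator may differ in detail.

```python
import numpy as np

def token_entropy(latent_acts, token_ids, vocab_size):
    """Shannon entropy (bits) of one latent's activation mass over token IDs.
    Near-zero entropy: the latent fires on a single token ID (token-specific).
    High entropy: its activation spreads over many token IDs (more abstract)."""
    mass = np.zeros(vocab_size)
    np.add.at(mass, token_ids, latent_acts)   # total activation per token ID
    total = mass.sum()
    if total == 0.0:
        return 0.0                            # dead latent: define entropy as 0
    p = mass[mass > 0] / total
    ent = -(p * np.log2(p)).sum()
    return float(ent) if ent > 0 else 0.0

rng = np.random.default_rng(0)
tokens = rng.integers(0, 100, size=1000)      # token ID at each position
specific = (tokens == 7).astype(float)        # latent firing only on token ID 7
broad = rng.random(1000)                      # latent firing across all tokens
print(token_entropy(specific, tokens, 100))   # 0.0 bits: token-specific
print(token_entropy(broad, tokens, 100))      # near log2(100) ~ 6.6 bits: abstract
```

The two extremes illustrate the intended separation: an aggregate auto-interpretability score can rate both latents as "interpretable", while the entropy cleanly distinguishes the token detector from the broadly firing feature.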
Toy models demonstrating superposition preservation and amplification in random networks
The authors develop toy models showing that randomly initialized neural networks can preserve or even amplify superposition present in input data, providing a mechanistic explanation for why SAEs trained on random transformers can produce seemingly interpretable features.
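A linear caricature of this effect can be written in a few lines; this is not the paper's actual toy model, and the dimensions, the absence of a nonlinearity, and the mean-absolute-cosine interference statistic are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_in, d_out = 64, 16, 8   # more features than dimensions: superposition

# Ground-truth feature directions, packed into d_in dimensions in superposition.
W = rng.normal(size=(d_in, n_feat)) / np.sqrt(d_in)
# A randomly initialized linear layer; no training happens anywhere.
R = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)

def mean_interference(dirs):
    """Mean |cosine similarity| between distinct feature directions:
    0 would mean orthogonal features; larger values mean more superposition."""
    U = dirs / np.linalg.norm(dirs, axis=0, keepdims=True)
    G = np.abs(U.T @ U)
    np.fill_diagonal(G, 0.0)
    k = dirs.shape[1]
    return G.sum() / (k * (k - 1))

before = mean_interference(W)      # interference already present in the input
after = mean_interference(R @ W)   # interference after the random layer
print(before, after)
```

With `d_out == d_in` the random map roughly preserves the interference structure, and squeezing into fewer dimensions (`d_out < d_in`, as above) increases it, which matches the preserve-or-amplify claim: feature directions survive a random layer in superposed form, so an SAE fit downstream can still recover seemingly interpretable features.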