OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: sparse autoencoder, mechanistic interpretability, language model, representation learning, feature disentanglement, regularization
Abstract:

Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed at mitigating these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance on other downstream tasks compared to traditional SAEs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Orthogonal SAE (OrtSAE), which enforces orthogonality between learned features via cosine similarity penalties during training. This work occupies a dedicated leaf node ('Orthogonality-Constrained SAEs') within the Core Sparse Autoencoder Architectures and Training branch, with no sibling papers in the same leaf. The taxonomy reveals this is a relatively sparse research direction compared to neighboring leaves like 'Standard and Gated SAE Variants' (2 papers) or 'Random Baseline and Interpretability Validation' (1 paper), suggesting orthogonality constraints represent an emerging but not yet crowded approach to SAE design.

The taxonomy positions OrtSAE within a broader ecosystem of SAE architectural innovations. Neighboring leaves explore alternative constraints: gated mechanisms separate feature detection from magnitude estimation, while random baseline studies validate whether interpretability arises from training versus architecture alone. The parent category ('Core Sparse Autoencoder Architectures and Training') excludes evaluation frameworks and application-specific SAEs, focusing purely on foundational training procedures. Related branches address decomposition quality analysis ('Dark Matter and Reconstruction Error Analysis') and theoretical foundations ('Superposition Theory'), indicating OrtSAE contributes to architectural design rather than theoretical understanding or empirical validation of existing methods.

Among 27 candidates examined across three contributions, no clearly refuting prior work was identified. The 'Orthogonal SAE training approach' examined 7 candidates with 0 refutable matches, while 'improved feature atomicity' and 'spurious correlation removal' each examined 10 candidates with 0 refutations. This suggests that within the limited search scope, orthogonality constraints as a training mechanism appear novel, though the analysis does not claim exhaustive coverage. The statistics indicate moderate-scale literature examination rather than comprehensive field survey, leaving open the possibility of relevant work outside the top-K semantic matches and citation expansion performed.

Based on the limited search scope of 27 candidates, OrtSAE appears to occupy a distinct position within SAE architecture design, addressing feature disentanglement through geometric constraints not prominently represented in examined prior work. The taxonomy structure confirms this sits in a sparse research direction, though the analysis acknowledges it cannot rule out relevant orthogonality-based methods beyond the examined candidate set. The contribution-level statistics suggest novelty in mechanism rather than incremental refinement of established approaches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: sparse decomposition of neural network activations into interpretable features. The field has organized itself around several complementary strategies for extracting human-understandable structure from opaque neural representations. Sparse Autoencoder Methods for Activation Decomposition form the largest branch, encompassing core architectures like Gated Sparse Autoencoders[2] and variants that impose different structural constraints during training. Alternative Sparse Decomposition and Feature Extraction Methods explore non-autoencoder approaches such as dictionary learning and matching pursuit techniques. Interpretable Neural Architectures with Built-In Sparsity design models with inherent interpretability, while Sparse Features for Mechanistic Interpretability and Control leverage discovered features to understand and steer model behavior, as seen in Steering Vector SAE[27]. Domain-Specific Sparse Interpretability Applications adapt these methods to specialized contexts like medical imaging or music recognition, and Clustering and Unsupervised Sparse Methods pursue feature discovery without supervised signals. Finally, SAE Feature Quality Metrics and Interpretability Scoring address the critical challenge of evaluating whether extracted features are genuinely meaningful, with works like Principled SAE Evaluation[17] proposing systematic assessment frameworks.

Recent activity has concentrated on refining autoencoder architectures and addressing feature quality concerns. Within the core SAE branch, researchers have explored various architectural constraints to improve feature disentanglement and interpretability. OrtSAE[0] sits squarely in this lineage, proposing orthogonality constraints on learned features to reduce redundancy and improve decomposition quality.
This contrasts with approaches like Gated Sparse Autoencoders[2], which achieve sparsity through gating mechanisms, or Dark Matter SAE[5], which focuses on capturing previously undetected activation patterns. A key tension across these methods involves balancing reconstruction fidelity against feature interpretability: stricter constraints like orthogonality may yield cleaner decompositions but risk missing complex feature interactions that methods like Interpretable Feature Interaction[1] aim to preserve. The positioning of OrtSAE[0] reflects ongoing efforts to impose geometric structure on feature spaces, complementing evaluation work that seeks to validate whether such constraints genuinely enhance human understanding of neural computations.

Claimed Contributions

Orthogonal SAE (OrtSAE) training approach

The authors introduce a new training procedure for sparse autoencoders that enforces orthogonality between learned features through a chunk-wise penalty on pairwise cosine similarity. This approach scales linearly with SAE size and aims to mitigate feature absorption and composition while maintaining computational efficiency.

7 retrieved papers
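The chunk-wise penalty described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact chunking scheme, parameter names (`decoder_weights`, `chunk_size`), and the squared-cosine loss form are assumptions. The key idea it demonstrates is the linear scaling: features are randomly partitioned into chunks, and only within-chunk pairs are penalized, so cost grows with `n_features / chunk_size` chunks of O(chunk_size²) each rather than a full O(n_features²) similarity matrix.

```python
import numpy as np

def orthogonality_penalty(decoder_weights, chunk_size=1024, rng=None):
    """Chunk-wise pairwise cosine-similarity penalty (illustrative sketch).

    decoder_weights: (n_features, d_model) array of dictionary directions.
    Features are randomly partitioned into chunks; only within-chunk pairs
    contribute, so the cost scales linearly with n_features.
    """
    rng = np.random.default_rng() if rng is None else rng
    W = np.asarray(decoder_weights, dtype=float)
    n = W.shape[0]
    dirs = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit directions
    perm = rng.permutation(n)                            # random chunk assignment
    penalty, n_pairs = 0.0, 0
    for start in range(0, n, chunk_size):
        chunk = dirs[perm[start:start + chunk_size]]
        sims = chunk @ chunk.T                # within-chunk cosine similarities
        m = chunk.shape[0]
        off_diag = sims - np.eye(m)           # ignore self-similarity
        penalty += (off_diag ** 2).sum()
        n_pairs += m * (m - 1)
    return penalty / max(n_pairs, 1)          # mean squared off-diagonal cosine
```

An orthonormal dictionary incurs zero penalty, while fully collinear features incur the maximum value of 1, which matches the intuition that the regularizer pushes feature directions apart.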

Demonstration of improved feature atomicity

The authors demonstrate through quantitative experiments that OrtSAE discovers 9% more distinct features, reduces feature absorption by 65%, and reduces feature composition by 15% compared to traditional SAEs, showing improved disentanglement of representations.

10 retrieved papers
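One simple way to operationalize "distinct features" is to count decoder directions that are not near-duplicates of any other direction. The sketch below is an illustrative proxy only; the paper's actual absorption and composition metrics are not reproduced in this report, and the function name and threshold are assumptions.

```python
import numpy as np

def count_distinct_features(decoder_weights, threshold=0.7):
    """Illustrative proxy for feature distinctness (not the paper's metric):
    a feature counts as distinct if its decoder direction has absolute
    cosine similarity below `threshold` with every other feature."""
    W = np.asarray(decoder_weights, dtype=float)
    dirs = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = np.abs(dirs @ dirs.T)       # pairwise |cosine| between directions
    np.fill_diagonal(sims, 0.0)        # exclude self-similarity
    return int((sims.max(axis=1) < threshold).sum())
```

Under a proxy like this, reducing pairwise cosine similarity during training directly increases the count of distinct features, which is consistent with the reported 9% gain.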

Improved performance on spurious correlation removal

The authors show that OrtSAE achieves comparable performance to existing SAE methods on most downstream tasks while providing a 6% improvement on spurious correlation removal, demonstrating practical benefits of orthogonal features for certain applications.

10 retrieved papers
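Spurious correlation removal with an SAE typically works by identifying features that encode the unwanted concept and ablating them from the activation. The sketch below shows this general mechanism under assumed names and a standard ReLU SAE form; it is not the paper's code or evaluation protocol.

```python
import numpy as np

def remove_spurious_features(x, W_enc, b_enc, W_dec, ablate_idx):
    """Sketch of SAE feature ablation for spurious-correlation removal.
    Parameter names and the ReLU encoder form are assumptions.

    x: (d,) model activation; W_enc: (n, d); b_enc: (n,); W_dec: (n, d);
    ablate_idx: indices of features judged to carry the spurious concept.
    """
    codes = np.maximum(W_enc @ x + b_enc, 0.0)  # sparse feature activations
    codes[list(ablate_idx)] = 0.0               # zero the spurious-concept features
    return W_dec.T @ codes                      # reconstruct without them
```

The practical benefit of more orthogonal features here is that ablating one feature's direction disturbs the others less, so concept removal is more surgical.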

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Orthogonal SAE (OrtSAE) training approach (7 candidates examined, 0 refutable matches)

Contribution: Demonstration of improved feature atomicity (10 candidates examined, 0 refutable matches)

Contribution: Improved performance on spurious correlation removal (10 candidates examined, 0 refutable matches)
