Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: next-token prediction, transformers, interpretability
Abstract:

Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates why transformers trained on next-token prediction develop features that seem redundant for immediate prediction, proposing a gradient decomposition framework to trace feature origins. It resides in the Feature Emergence and Development Dynamics leaf, which contains four papers total, making this a moderately populated research direction within the broader Mechanistic Interpretability branch. The taxonomy shows this leaf focuses specifically on training dynamics and feature development processes, distinguishing it from static representation analysis or circuit-level reverse engineering covered in sibling leaves.

The paper's leaf sits within Mechanistic Interpretability and Feature Analysis, adjacent to leaves examining learned representation geometry and circuit-level mechanisms. Neighboring branches include Theoretical Foundations, which addresses why next-token prediction enables learning through expressiveness proofs and information-theoretic principles, and Training Objectives, which explores modifications to standard prediction targets. The taxonomy structure reveals that while mechanistic interpretability is well-represented, work specifically tracing gradient influence on feature development occupies a relatively focused niche compared to broader representation analysis or theoretical studies.

Among the twenty-one candidates examined, two of the three claimed contributions show potential overlap with prior work. The gradient decomposition into direct learning, pre-caching, and circuit sharing was refuted by one of its ten retrieved candidates, as was the method for estimating the influence of gradient components on features. The framework connecting interventions to gradient influence ratios appears more novel: its single retrieved candidate did not refute it. These statistics suggest that while the core gradient-analysis concepts have some precedent within the limited search scope, the specific application to feature development dynamics may offer incremental advances over existing mechanistic interpretability methods.

Based on the top-twenty-one semantic matches examined, the work appears to build on established gradient analysis techniques while applying them to the specific puzzle of redundant feature emergence. The limited search scope means potentially relevant work in optimization theory or feature learning outside the immediate next-token prediction context may not be captured. The taxonomy positioning suggests the paper addresses a recognized gap in understanding training dynamics, though the extent of novelty depends on how substantially the proposed framework advances beyond existing gradient attribution methods.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: emergence of features in next-token prediction trained transformers. The field explores how transformers develop internal representations and capabilities through the simple objective of predicting the next token. The taxonomy organizes this landscape into four main branches.

Mechanistic Interpretability and Feature Analysis examines what features actually emerge and how they can be understood, including work on feature development dynamics like Useless Features[9] and Outlier Features[21]. Theoretical Foundations and Learning Principles investigates the mathematical underpinnings of why certain structures arise, with studies ranging from geometric perspectives like Geometry of Semantics[3] to learning dynamics such as Next-Token Law[24]. Training Objectives and Architectural Variations explores how modifications to the standard setup affect feature emergence, including alternative prediction targets like Next-Latent Prediction[35] and architectural choices. Applications Beyond Natural Language extends these insights to non-linguistic domains such as vision, code, and scientific data, exemplified by work like Sequential Vision Modeling[14] and Jet Foundation Models[22].

A particularly active line of investigation concerns the developmental trajectory of features during training, with researchers documenting stagewise patterns and the surprising emergence of seemingly redundant or task-irrelevant structures. Useless Features Emergence[0] sits squarely within this feature development cluster, examining why transformers sometimes learn representations that appear unnecessary for the prediction objective. This work closely relates to Useless Features[9] and Outlier Features[21], which similarly investigate anomalous or unexpected feature patterns that arise during training.
While Stagewise Development[49] focuses on the temporal progression of capability acquisition, Useless Features Emergence[0] emphasizes the puzzle of why certain features appear at all, raising questions about the implicit biases of next-token prediction and whether such features might serve latent functions or simply reflect optimization artifacts.

Claimed Contributions

Decomposition of gradient signal into direct learning, pre-caching, and circuit sharing

The authors provide a theoretically grounded decomposition of the next-token prediction gradient into three components that explain how Transformers learn features beyond immediate next-token prediction. This framework identifies which gradient paths contribute to feature emergence during training.

10 retrieved papers
Can Refute
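The additive nature of such a decomposition can be illustrated with a minimal toy sketch. This is not the paper's implementation: all values and names below are hypothetical. A single scalar feature h, computed at position 1, is read both by the position-1 output head (a "direct learning" path) and by the position-2 head (a "pre-caching" path), so the training gradient on its weight splits into two per-loss components.

```python
# Hedged toy sketch (hypothetical values, not the paper's method):
# the feature h = w * x feeds two output heads, so the gradient on w
# decomposes additively into a direct and a pre-caching component.

x, w = 2.0, 0.5          # input and feature weight
a, b = 1.0, -0.3         # readout weights at positions 1 and 2
y1, y2 = 1.0, 0.0        # next-token targets at the two positions

h = w * x                        # the learned feature
# Per-loss gradients on w, by the chain rule through h:
g_direct = (a * h - y1) * a * x      # dL1/dw: "direct learning" path
g_precache = (b * h - y2) * b * x    # dL2/dw: "pre-caching" path
g_total = g_direct + g_precache      # equals d(L1 + L2)/dw

# Finite-difference check that the two components sum to the full gradient.
def total_loss(w_):
    h_ = w_ * x
    return 0.5 * (a * h_ - y1) ** 2 + 0.5 * (b * h_ - y2) ** 2

eps = 1e-6
g_numeric = (total_loss(w + eps) - total_loss(w - eps)) / (2 * eps)
print(abs(g_total - g_numeric) < 1e-6)  # True
```

In this toy setting the position-1 loss is already zero, so the feature's remaining gradient comes entirely from the pre-caching path, the kind of attribution the claimed decomposition is meant to support.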
Method to estimate influence of gradient components on feature development

The authors introduce an experimental approach to quantify how much each gradient component (direct, pre-cached, shared) contributes to the development of specific features during training. This enables attribution of learned features to their underlying gradient sources.

10 retrieved papers
Can Refute
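The influence-attribution idea can be sketched as follows; the component vectors, probe direction, and numbers here are hypothetical stand-ins, not the paper's data or API. Each gradient component is projected onto a probe direction for the feature of interest, and its share of the feature's total movement under one descent step gives an influence ratio.

```python
import numpy as np

# Hedged sketch (all values hypothetical): attribute a feature's
# development to gradient components by measuring how far each component
# alone moves the feature along a probe direction.

v = np.array([1.0, 0.0, 0.0, 0.0])   # probe direction for the feature

# Stand-ins for the three gradient components of the decomposition.
components = {
    "direct":          np.array([-0.8,  0.1,  0.0,  0.2]),
    "pre-caching":     np.array([-0.3, -0.5,  0.4,  0.0]),
    "circuit sharing": np.array([ 0.1,  0.2, -0.1,  0.3]),
}

# Signed movement of the feature along the probe after one
# gradient-descent step of size lr under each component alone.
lr = 0.1
moves = {name: float(-lr * g @ v) for name, g in components.items()}
total = sum(moves.values())

# Influence ratio: each component's share of the total feature movement.
ratios = {name: m / total for name, m in moves.items()}
print(round(ratios["direct"], 6))  # 0.8: the direct path dominates here
```

Note that the ratios are signed and sum to one, so a component can have negative influence, i.e., its gradient alone would push the feature away from the probe direction.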
Framework connecting interventions to gradient influence ratios

The authors establish a connection between causal interventions on trained models and the ratio of gradient components, enabling analysis of feature emergence in large language models without requiring full retraining or access to training trajectories.

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Decomposition of gradient signal into direct learning, pre-caching, and circuit sharing

The authors provide a theoretically grounded decomposition of the next-token prediction gradient into three components that explain how Transformers learn features beyond immediate next-token prediction. This framework identifies which gradient paths contribute to feature emergence during training.

Contribution

Method to estimate influence of gradient components on feature development

The authors introduce an experimental approach to quantify how much each gradient component (direct, pre-cached, shared) contributes to the development of specific features during training. This enables attribution of learned features to their underlying gradient sources.

Contribution

Framework connecting interventions to gradient influence ratios

The authors establish a connection between causal interventions on trained models and the ratio of gradient components, enabling analysis of feature emergence in large language models without requiring full retraining or access to training trajectories.