Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: next-token prediction, transformers, interpretability
Abstract:

Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates why transformers trained on next-token prediction develop features that seem redundant for immediate prediction, proposing a gradient decomposition framework to trace feature origins. It resides in the Feature Emergence and Development Dynamics leaf, which contains four papers total, making this a moderately populated research direction within the broader Mechanistic Interpretability branch. The taxonomy shows this leaf focuses specifically on training dynamics and feature development processes, distinguishing it from static representation analysis or circuit-level reverse engineering covered in sibling leaves.

The paper's leaf sits within Mechanistic Interpretability and Feature Analysis, adjacent to leaves examining learned representation geometry and circuit-level mechanisms. Neighboring branches include Theoretical Foundations, which addresses why next-token prediction enables learning through expressiveness proofs and information-theoretic principles, and Training Objectives, which explores modifications to standard prediction targets. The taxonomy structure reveals that while mechanistic interpretability is well-represented, work specifically tracing gradient influence on feature development occupies a relatively focused niche compared to broader representation analysis or theoretical studies.

Among the twenty-one candidates examined, two of the three claimed contributions show potential overlap with prior work. The gradient decomposition into direct learning, pre-caching, and circuit sharing was refuted by one of its ten retrieved candidates, as was the method for estimating the influence of gradient components on features. The framework connecting interventions to gradient influence ratios appears more novel: its single retrieved candidate did not refute it. These statistics suggest that while the core gradient-analysis concepts have some precedent within the limited search scope, the specific application to feature development dynamics may offer incremental advances over existing mechanistic interpretability methods.

Based on the top-twenty-one semantic matches examined, the work appears to build on established gradient analysis techniques while applying them to the specific puzzle of redundant feature emergence. The limited search scope means potentially relevant work in optimization theory or feature learning outside the immediate next-token prediction context may not be captured. The taxonomy positioning suggests the paper addresses a recognized gap in understanding training dynamics, though the extent of novelty depends on how substantially the proposed framework advances beyond existing gradient attribution methods.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: emergence of features in next-token prediction trained transformers. The field explores how transformers develop internal representations and capabilities through the simple objective of predicting the next token. The taxonomy organizes this landscape into four main branches.

Mechanistic Interpretability and Feature Analysis examines what features actually emerge and how they can be understood, including work on feature development dynamics like Useless Features[9] and Outlier Features[21]. Theoretical Foundations and Learning Principles investigates the mathematical underpinnings of why certain structures arise, with studies ranging from geometric perspectives like Geometry of Semantics[3] to learning dynamics such as Next-Token Law[24]. Training Objectives and Architectural Variations explores how modifications to the standard setup affect feature emergence, including alternative prediction targets like Next-Latent Prediction[35] and architectural choices. Applications Beyond Natural Language extends these insights to non-linguistic domains such as vision, code, and scientific data, exemplified by work like Sequential Vision Modeling[14] and Jet Foundation Models[22].

A particularly active line of investigation concerns the developmental trajectory of features during training, with researchers documenting stagewise patterns and the surprising emergence of seemingly redundant or task-irrelevant structures. Useless Features Emergence[0] sits squarely within this feature development cluster, examining why transformers sometimes learn representations that appear unnecessary for the prediction objective. This work closely relates to Useless Features[9] and Outlier Features[21], which similarly investigate anomalous or unexpected feature patterns that arise during training.
While Stagewise Development[49] focuses on the temporal progression of capability acquisition, Useless Features Emergence[0] emphasizes the puzzle of why certain features appear at all, raising questions about the implicit biases of next-token prediction and whether such features might serve latent functions or simply reflect optimization artifacts.

Claimed Contributions

Decomposition of gradient signal into direct learning, pre-caching, and circuit sharing

The authors provide a theoretically grounded decomposition of the next-token prediction gradient into three components that explain how Transformers learn features beyond immediate next-token prediction. This framework identifies which gradient paths contribute to feature emergence during training.

10 retrieved papers
Can Refute
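The additive nature of such a decomposition can be illustrated with a minimal toy sketch. This is not the paper's implementation: all values and names below are hypothetical. A single scalar feature h, computed at position 1, is read both by the position-1 output head (a "direct learning" path) and by the position-2 head (a "pre-caching" path), so the training gradient on its weight splits into two per-loss components.

```python
# Hedged toy sketch (hypothetical values, not the paper's method):
# the feature h = w * x feeds two output heads, so the gradient on w
# decomposes additively into a direct and a pre-caching component.

x, w = 2.0, 0.5          # input and feature weight
a, b = 1.0, -0.3         # readout weights at positions 1 and 2
y1, y2 = 1.0, 0.0        # next-token targets at the two positions

h = w * x                        # the learned feature
# Per-loss gradients on w, by the chain rule through h:
g_direct = (a * h - y1) * a * x      # dL1/dw: "direct learning" path
g_precache = (b * h - y2) * b * x    # dL2/dw: "pre-caching" path
g_total = g_direct + g_precache      # equals d(L1 + L2)/dw

# Finite-difference check that the two components sum to the full gradient.
def total_loss(w_):
    h_ = w_ * x
    return 0.5 * (a * h_ - y1) ** 2 + 0.5 * (b * h_ - y2) ** 2

eps = 1e-6
g_numeric = (total_loss(w + eps) - total_loss(w - eps)) / (2 * eps)
print(abs(g_total - g_numeric) < 1e-6)  # True
```

In this toy setting the position-1 loss is already zero, so the feature's remaining gradient comes entirely from the pre-caching path, the kind of attribution the claimed decomposition is meant to support.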
Method to estimate influence of gradient components on feature development

The authors introduce an experimental approach to quantify how much each gradient component (direct, pre-cached, shared) contributes to the development of specific features during training. This enables attribution of learned features to their underlying gradient sources.

10 retrieved papers
Can Refute
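The influence-attribution idea can be sketched as follows; the component vectors, probe direction, and numbers here are hypothetical stand-ins, not the paper's data or API. Each gradient component is projected onto a probe direction for the feature of interest, and its share of the feature's total movement under one descent step gives an influence ratio.

```python
import numpy as np

# Hedged sketch (all values hypothetical): attribute a feature's
# development to gradient components by measuring how far each component
# alone moves the feature along a probe direction.

v = np.array([1.0, 0.0, 0.0, 0.0])   # probe direction for the feature

# Stand-ins for the three gradient components of the decomposition.
components = {
    "direct":          np.array([-0.8,  0.1,  0.0,  0.2]),
    "pre-caching":     np.array([-0.3, -0.5,  0.4,  0.0]),
    "circuit sharing": np.array([ 0.1,  0.2, -0.1,  0.3]),
}

# Signed movement of the feature along the probe after one
# gradient-descent step of size lr under each component alone.
lr = 0.1
moves = {name: float(-lr * g @ v) for name, g in components.items()}
total = sum(moves.values())

# Influence ratio: each component's share of the total feature movement.
ratios = {name: m / total for name, m in moves.items()}
print(round(ratios["direct"], 6))  # 0.8: the direct path dominates here
```

Note that the ratios are signed and sum to one, so a component can have negative influence, i.e., its gradient alone would push the feature away from the probe direction.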
Framework connecting interventions to gradient influence ratios

The authors establish a connection between causal interventions on trained models and the ratio of gradient components, enabling analysis of feature emergence in large language models without requiring full retraining or access to training trajectories.

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Decomposition of gradient signal into direct learning, pre-caching, and circuit sharing

The authors provide a theoretically grounded decomposition of the next-token prediction gradient into three components that explain how Transformers learn features beyond immediate next-token prediction. This framework identifies which gradient paths contribute to feature emergence during training.

Contribution

Method to estimate influence of gradient components on feature development

The authors introduce an experimental approach to quantify how much each gradient component (direct, pre-cached, shared) contributes to the development of specific features during training. This enables attribution of learned features to their underlying gradient sources.

Contribution

Framework connecting interventions to gradient influence ratios

The authors establish a connection between causal interventions on trained models and the ratio of gradient components, enabling analysis of feature emergence in large language models without requiring full retraining or access to training trajectories.