FACT: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: feature learning, deep learning, neural feature ansatz, convergence, theory
Abstract:

Understanding how neural networks learn representations is a central challenge in deep learning. A leading approach is the Neural Feature Ansatz (NFA) (Radhakrishnan et al., 2024), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it is an educated guess without a theoretical basis, so it is unclear when it might fail and how it could be improved. In this paper, we take a first-principles approach to understanding why this observation holds and when it does not. We use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), an alternative to the NFA that (a) obtains greater agreement with learned features at convergence, (b) explains why the NFA holds in most settings, and (c) captures essential feature-learning phenomena in neural networks, such as grokking in modular arithmetic and phase transitions in learning sparse parities, similarly to the NFA. Thus, our results unify theoretical first-order optimality analyses of neural networks with the empirically driven NFA literature, and provide a principled alternative that provably and empirically holds at convergence.
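For reference, the NFA at the center of this abstract can be stated schematically. The following is a sketch in our own notation (not copied from the paper): Radhakrishnan et al. (2024) conjecture that, after training, each layer's weight Gram matrix is proportional, up to a matrix power $\alpha$ fit empirically, to the average gradient outer product (AGOP) of the network with respect to that layer's input.

```latex
% Neural Feature Ansatz (schematic sketch; notation ours).
% W_l: weights of layer l;  h_l(x): input to layer l;  f: the trained network.
W_\ell^\top W_\ell \;\propto\;
\left( \frac{1}{n} \sum_{i=1}^{n}
  \nabla_{h_\ell} f(x_i)\, \nabla_{h_\ell} f(x_i)^\top \right)^{\!\alpha}
```

FACT replaces this conjectured proportionality with a relation derived from first-order optimality conditions at convergence.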

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes the Features at Convergence Theorem (FACT) as a first-principles alternative to the empirically-driven Neural Feature Ansatz (NFA), deriving feature learning mechanisms from optimization theory and convergence conditions. It resides in the 'First-Principles and Optimization-Based Analysis' leaf, which contains only two papers total within the broader theoretical foundations branch. This represents a relatively sparse research direction within the taxonomy, suggesting that rigorous optimization-theoretic approaches to feature learning remain underexplored compared to empirical or application-driven methods. The sibling paper in this leaf appears to focus on related mechanistic questions, indicating a small but coherent cluster of work examining fundamental learning dynamics.

The taxonomy reveals that theoretical feature learning research is organized into three main directions: first-principles analysis (where this work sits), dynamics and evolution of representations, and transferability studies. Neighboring branches include unsupervised learning methods and interpretability techniques, which analyze learned features post-hoc rather than modeling their formation. The scope note for this leaf explicitly excludes 'empirical observations without theoretical derivation,' positioning FACT as complementary to the larger body of empirical NFA literature. The work bridges optimization theory with the empirically-validated NFA framework, potentially connecting formal convergence analysis to observed learning phenomena like grokking and phase transitions.

Among the twenty-eight candidates examined through semantic search and citation expansion, none were identified as clearly refuting any of the three main contributions. The FACT theorem itself was evaluated against ten candidates with zero refutable matches, as was the FACT-based Recursive Feature Machine algorithm. The theoretical explanation connecting NFA to first-order optimality examined eight candidates, also with no refutations found. This limited search scope suggests that within the top semantic matches, no prior work explicitly derives convergence-based feature learning mechanisms from first-order optimality conditions in the manner proposed. However, the modest search scale means potentially relevant optimization-theoretic analyses outside this candidate set remain unexamined.

Based on the available signals, the work appears to occupy a relatively novel position within the limited scope examined, particularly in formalizing the empirical NFA through optimization theory. The sparse population of the first-principles analysis leaf and absence of refuting candidates among twenty-eight examined papers suggest limited direct prior work on this specific approach. However, the analysis covers only top semantic matches and does not constitute an exhaustive survey of optimization theory applied to neural network feature learning, leaving open the possibility of related theoretical frameworks in adjacent mathematical or machine learning literature.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 28
- Refutable Papers: 0

Research Landscape Overview

Core task: understanding how neural networks learn representations through feature learning. The field encompasses a broad spectrum of research directions, organized into several major branches. Theoretical Foundations and Mechanisms of Feature Learning investigates the underlying principles and optimization dynamics that govern how networks discover useful features, often through first-principles analysis and mathematical frameworks. Unsupervised and Self-Supervised Representation Learning explores methods like Contrastive Predictive Coding[18] and Contrastive Self-Distillation[5] that extract structure from unlabeled data. Graph and Network Representation Learning, exemplified by Network Representation Learning[2] and Graph Representation Learning[11], focuses on encoding relational structures. Interpretability and Analysis branches, including Feature Visualization Survey[9], aim to decode what networks have learned. Architectural Innovations introduce novel designs such as Neural Discrete Representation[3] and Matryoshka Learning[7], while Multi-View and Multi-Modal approaches integrate diverse data sources. Application-Driven and Specialized Learning Paradigms address domain-specific challenges and alternative training objectives.

Within the theoretical landscape, a particularly active line of work examines optimization-based and mechanistic explanations of feature emergence, contrasting gradient-driven dynamics with structural inductive biases. FACT[0] situates itself in this first-principles analysis cluster, closely aligned with Feature Learning Mechanism[17], which similarly investigates the fundamental processes by which networks construct representations during training. While Feature Learning Mechanism[17] may emphasize empirical observations of learning trajectories, FACT[0] appears to adopt a more analytical stance, potentially leveraging optimization theory to explain when and why certain features emerge. This contrasts with interpretability-focused efforts like Feature Visualization Survey[9] or representation manipulation studies such as Representation Erasure[1], which analyze learned features post-hoc rather than modeling their formation. The central tension across these branches involves balancing mathematical rigor with empirical relevance, and understanding whether feature learning can be predicted from architectural and data properties alone.

Claimed Contributions

Features at Convergence Theorem (FACT)

The authors derive a first-principles relation based on first-order optimality conditions that neural networks must satisfy at convergence. This provides a theoretically grounded alternative to the empirically-observed Neural Feature Ansatz for understanding how networks learn representations.

10 retrieved papers

FACT-based Recursive Feature Machine algorithm

The authors develop a learning algorithm powered by FACT instead of NFA that reproduces key feature learning behaviors such as phase transitions in sparse parity learning and grokking in modular arithmetic, while achieving state-of-the-art performance on tabular data.

10 retrieved papers
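The RFM-style loop this contribution describes can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: it assumes a Laplacian-kernel ridge regressor whose feature matrix `M` is updated each round with the predictor's average gradient outer product (AGOP), with gradients estimated by finite differences. Function names and hyperparameters are ours.

```python
import numpy as np

def laplace_kernel(X, Z, M, bandwidth=2.0):
    """Laplacian kernel exp(-||x - z||_M / h) under the metric induced by PSD M."""
    XM = X @ M
    d2 = (np.sum(XM * X, axis=1)[:, None]
          + np.sum((Z @ M) * Z, axis=1)[None, :]
          - 2.0 * XM @ Z.T)
    return np.exp(-np.sqrt(np.clip(d2, 0.0, None)) / bandwidth)

def rfm(X, y, n_iters=3, reg=1e-3, bandwidth=2.0, eps=1e-4):
    """Sketch of a Recursive Feature Machine: alternate a kernel ridge fit
    using feature matrix M with an AGOP update of M."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(n_iters):
        K = laplace_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # kernel ridge fit
        G = np.zeros((d, d))
        for x in X:
            grad = np.zeros(d)
            for j in range(d):
                e = np.zeros(d)
                e[j] = eps
                # Central finite difference of the fitted predictor f(x) = K(x, X) @ alpha.
                f_plus = laplace_kernel((x + e)[None, :], X, M, bandwidth) @ alpha
                f_minus = laplace_kernel((x - e)[None, :], X, M, bandwidth) @ alpha
                grad[j] = (f_plus - f_minus).item() / (2.0 * eps)
            G += np.outer(grad, grad)  # accumulate gradient outer products
        M = G / n  # the AGOP becomes the new feature matrix
    return M, alpha
```

On a toy problem where the target depends on only one input coordinate, the learned `M` concentrates its mass on that coordinate's diagonal entry, which is the qualitative "feature selection" behavior that the sparse-parity and grokking experiments probe.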
Theoretical explanation connecting NFA to first-order optimality

The authors algebraically expand the FACT relation to show it is qualitatively similar to the NFA conjecture, providing theoretical foundation for why NFA typically holds by connecting it to provable first-order optimality conditions.

8 retrieved papers
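To illustrate the kind of first-order reasoning this contribution involves, here is a generic sketch of stationarity algebra in our own notation (not the paper's actual FACT relation): for a network with a linear layer $h_{\ell+1} = W_\ell h_\ell(x)$, any stationary point of the empirical loss satisfies

```latex
% Generic first-order optimality condition (sketch; notation ours).
% L(W) = (1/n) sum_i loss(f(x_i), y_i);  h_{l+1} = W_l h_l(x).
\nabla_{W_\ell} L
  \;=\; \frac{1}{n} \sum_{i=1}^{n} g_i \, h_\ell(x_i)^\top
  \;=\; 0,
\qquad
g_i \;:=\; \nabla_{h_{\ell+1}} \operatorname{loss}\big(f(x_i), y_i\big).
```

Contracting such identities with the weights yields relations between $W_\ell^\top W_\ell$ and averages of gradient and input outer products, which is the flavor of connection between first-order optimality and the NFA's gradient-outer-product structure that this contribution formalizes.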

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
