FACT: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations
Overview
Overall Novelty Assessment
The paper proposes the Features at Convergence Theorem (FACT) as a first-principles alternative to the empirically driven Neural Feature Ansatz (NFA), deriving feature-learning mechanisms from optimization theory and convergence conditions. It sits in the 'First-Principles and Optimization-Based Analysis' leaf, which contains only two papers within the broader theoretical-foundations branch. This sparsity suggests that rigorous optimization-theoretic approaches to feature learning remain underexplored relative to empirical or application-driven methods. The sibling paper in this leaf addresses related mechanistic questions, indicating a small but coherent cluster of work on fundamental learning dynamics.
The taxonomy organizes theoretical feature-learning research into three main directions: first-principles analysis (where this work sits), dynamics and evolution of representations, and transferability studies. Neighboring branches cover unsupervised learning methods and interpretability techniques, which analyze learned features post hoc rather than modeling their formation. The scope note for this leaf explicitly excludes 'empirical observations without theoretical derivation', positioning FACT as complementary to the larger body of empirical NFA literature. The work bridges optimization theory and the empirically validated NFA framework, potentially connecting formal convergence analysis to observed learning phenomena such as grokking and phase transitions.
Among the twenty-eight candidates examined through semantic search and citation expansion, none clearly refuted any of the three main contributions. The FACT theorem itself was checked against ten candidates with zero refuting matches, as was the FACT-based Recursive Feature Machine algorithm; the theoretical explanation connecting the NFA to first-order optimality was checked against eight candidates, likewise without refutation. Within the top semantic matches, then, no prior work explicitly derives convergence-based feature-learning mechanisms from first-order optimality conditions in the manner proposed. However, the modest search scale means that potentially relevant optimization-theoretic analyses outside this candidate set remain unexamined.
Based on the available signals, the work appears to occupy a relatively novel position within the limited scope examined, particularly in formalizing the empirical NFA through optimization theory. The sparse population of the first-principles analysis leaf and absence of refuting candidates among twenty-eight examined papers suggest limited direct prior work on this specific approach. However, the analysis covers only top semantic matches and does not constitute an exhaustive survey of optimization theory applied to neural network feature learning, leaving open the possibility of related theoretical frameworks in adjacent mathematical or machine learning literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors derive a first-principles relation, based on first-order optimality conditions, that neural networks must satisfy at convergence. This provides a theoretically grounded alternative to the empirically observed Neural Feature Ansatz for understanding how networks learn representations.
The authors develop a learning algorithm driven by FACT rather than the NFA. It reproduces key feature-learning behaviors, such as phase transitions in sparse parity learning and grokking in modular arithmetic, while achieving state-of-the-art performance on tabular data.
The authors algebraically expand the FACT relation to show that it is qualitatively similar to the NFA conjecture, providing a theoretical foundation for why the NFA typically holds by connecting it to provable first-order optimality conditions.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[17] Mechanism for feature learning in neural networks and backpropagation-free machine learning models
Contribution Analysis
Detailed comparisons for each claimed contribution
Features at Convergence Theorem (FACT)
The authors derive a first-principles relation, based on first-order optimality conditions, that neural networks must satisfy at convergence. This provides a theoretically grounded alternative to the empirically observed Neural Feature Ansatz for understanding how networks learn representations.
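As a hedged sketch of the kind of derivation involved (the notation below is assumed for illustration and may differ from the paper's exact FACT statement), first-order stationarity of a weight-decayed objective already pins down a layer's Gram matrix in terms of gradients:

```latex
% Sketch of a stationarity-based relation (assumed notation; the paper's
% exact FACT statement may differ in form and constants).
% Objective: data loss plus weight decay on a layer matrix W.
\[
  \mathcal{J}(W) \;=\; \mathcal{L}(W) + \tfrac{\lambda}{2}\,\|W\|_F^2 .
\]
% First-order optimality at convergence:
\[
  \nabla_W \mathcal{L}(W) + \lambda W = 0
  \quad\Longrightarrow\quad
  W = -\tfrac{1}{\lambda}\,\nabla_W \mathcal{L}(W),
\]
% hence the layer's Gram matrix is determined by loss gradients alone:
\[
  W^\top W \;=\; \tfrac{1}{\lambda^2}\,
  \nabla_W \mathcal{L}(W)^\top\, \nabla_W \mathcal{L}(W).
\]
% Compare the NFA, which instead posits proportionality to the average
% gradient outer product (AGOP) with respect to the inputs:
\[
  W^\top W \;\propto\; \frac{1}{n}\sum_{i=1}^{n}
  \nabla_x f(x_i)\,\nabla_x f(x_i)^\top .
\]
```

The contrast motivates the "first-principles" framing: the first relation follows from optimality conditions alone, while the NFA's AGOP proportionality is an empirical observation.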
[51] Neural Tangent Kernel: Convergence and Generalization in Neural Networks
[52] A convergence theory for deep learning via over-parameterization
[53] The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations
[54] On the global convergence of gradient descent for over-parameterized models using optimal transport
[55] Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group
[56] Convex multi-task feature learning
[57] A comparative analysis of gradient descent-based optimization algorithms on convolutional neural networks
[58] Theoretical properties of the global optimizer of two layer neural network
[59] Globally optimal gradient descent for a convnet with gaussian inputs
[60] No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths
FACT-based Recursive Feature Machine algorithm
The authors develop a learning algorithm driven by FACT rather than the NFA. It reproduces key feature-learning behaviors, such as phase transitions in sparse parity learning and grokking in modular arithmetic, while achieving state-of-the-art performance on tabular data.
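The paper's FACT-based variant is not reproduced here; as a minimal sketch of the algorithm family it belongs to, the following implements a classical AGOP-driven Recursive Feature Machine loop. A FACT-based version would replace the `M = G.T @ G / n` update with a FACT-derived one. All function names, hyperparameters, and the toy task are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gauss_kernel(X, Z, M, bw):
    """Mahalanobis Gaussian kernel: K[i, j] = exp(-(x_i - z_j)^T M (x_i - z_j) / (2*bw))."""
    XM, ZM = X @ M, Z @ M
    d = (XM * X).sum(1)[:, None] + (ZM * Z).sum(1)[None, :] - 2.0 * XM @ Z.T
    return np.exp(-np.clip(d, 0.0, None) / (2.0 * bw))

def rfm(X, y, steps=3, reg=1e-3, bw=None):
    """Generic recursive-feature-machine loop: alternate a kernel ridge fit
    with an AGOP update of the feature matrix M (classical, NFA-style update)."""
    n, d = X.shape
    bw = float(d) if bw is None else bw
    M = np.eye(d)
    alpha = np.zeros(n)
    for _ in range(steps):
        K = gauss_kernel(X, X, M, bw)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # kernel ridge weights
        # Gradient of f(x) = sum_j alpha_j K(x, x_j) at each training point:
        #   grad f(x_i) = -(1/bw) * M @ (sum_j w_j x_i - sum_j w_j x_j), with w_j = alpha_j K[i, j]
        G = np.zeros((n, d))
        for i in range(n):
            w = alpha * K[i]
            G[i] = -(M @ (w.sum() * X[i] - w @ X)) / bw
        M = G.T @ G / n              # AGOP of the fitted predictor
        M /= np.trace(M) + 1e-12     # normalize scale for stability
    return M, alpha

# Toy sparse target: y depends only on coordinate 0, so the learned
# feature matrix M should concentrate its mass on that coordinate.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 0])
M, alpha = rfm(X, y)
print(np.argmax(np.diag(M)))
```

On this toy target the diagonal of `M` should concentrate on the single relevant coordinate, mirroring the low-dimensional feature recovery that sparse-parity experiments in this literature illustrate.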
[53] The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations
[68] A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks
[69] Bridging lottery ticket and grokking: Understanding grokking from inner structure of networks
[70] Combinatorial Tasks as Model Systems of Deep Learning
[71] The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods
[72] Feature Learning Dynamics under Grokking in a Sparse Parity Task
[73] Grokking in Neural Networks: A Review
[74] Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
[75] Beyond Memorization: Exploring the Dynamics of Grokking in Sparse Neural Networks
[76] A simple and interpretable model of grokking modular arithmetic tasks
Theoretical explanation connecting NFA to first-order optimality
The authors algebraically expand the FACT relation to show that it is qualitatively similar to the NFA conjecture, providing a theoretical foundation for why the NFA typically holds by connecting it to provable first-order optimality conditions.
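The qualitative link between the two relations can be seen through a standard chain-rule identity (the setup and notation below are assumed for illustration, not taken from the paper): gradients with respect to inputs and with respect to first-layer weights share a common factor.

```latex
% Chain rule through the first layer z = W x (assumed setup):
\[
  \nabla_x \ell = W^\top \nabla_z \ell,
  \qquad
  \nabla_W \ell = (\nabla_z \ell)\, x^\top .
\]
% Averaging outer products over the data, the input-space AGOP is the
% preactivation-gradient second moment conjugated by W:
\[
  \mathbb{E}\!\left[\nabla_x \ell\, \nabla_x \ell^\top\right]
  \;=\;
  W^\top\, \mathbb{E}\!\left[\nabla_z \ell\, \nabla_z \ell^\top\right] W ,
\]
% while the weight-gradient terms entering the stationarity relation
% involve the same factor \(\mathbb{E}[\nabla_z \ell\, \nabla_z \ell^\top]\),
% now paired with data second moments; hence the qualitative similarity
% between an optimality-derived relation and the NFA's AGOP.
\]
```

Both sides are thus built from the same preactivation-gradient second moment, which is one plausible reading of why an expansion of the FACT relation ends up qualitatively resembling the NFA.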