Energy-Based Transformers are Scalable Learners and Thinkers

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Energy-Based Models, System 2 Thinking, Reasoning, Verification, Scaling, Transformers, Generative Modeling
Abstract:

Inference-time computation, analogous to human System 2 Thinking, has recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” We find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs)---a new class of Energy-Based Models (EBMs)---to assign an energy value to every input and candidate-prediction, enabling predictions through energy minimization until convergence. To support this approach, we introduce several key techniques for stable and parallelizable training, which enable the emergence of strong System 2 Thinking capabilities and scalable EBMs. Across discrete and continuous modalities, we find EBTs outperform the Transformer++ approach, scaling up to 35% faster during pretraining, and improving inference-time performance by up to 29%. EBTs also surpass Diffusion Transformers on image denoising while requiring 99% fewer forward passes. Moreover, System 2 Thinking with EBTs yields larger performance gains on data that is farther out-of-distribution, and EBTs achieve better results than existing models on most downstream tasks despite achieving the same or worse pretraining performance, enabling EBTs to generalize better than existing approaches. Consequently, EBTs are a flexible and exciting new approach for scaling both the learning and thinking capabilities of models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Energy-Based Transformers (EBTs) as a new class of models that learn to assign energy values to input-prediction pairs, enabling inference through iterative energy minimization. Within the taxonomy, this work resides in the 'Energy-Based Transformers and Scalable Learning' leaf under 'Energy-Based Model Architectures and Training'. This leaf contains only two papers, indicating a relatively sparse research direction. The sibling paper focuses on concept learning with energy-based models, suggesting that architectural innovations combining transformers with energy formulations remain an emerging area rather than a crowded subfield.

The taxonomy reveals that neighboring leaves address generative energy models (VAEs, diffusion models) and discriminative energy models (classification, structured prediction), while sibling branches explore inference-time adaptation and test-time optimization. The paper's position bridges architectural design with inference-time reasoning: unlike test-time adaptation methods that adjust pretrained models to distribution shifts, EBTs embed energy-based optimization directly into the architecture. This distinguishes the work from purely application-focused approaches in 'Inference-Time Reasoning and Iterative Optimization' and from domain-specific implementations in NLP or vision, positioning it as a foundational contribution to scalable energy-based architectures.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the proposed work. For 'Energy-Based Transformers (EBTs)', ten candidates were reviewed with zero refutable overlaps; similarly, 'Scalable training techniques for EBMs' and 'System 2 Thinking framework via optimization' each examined ten candidates without finding prior work that directly anticipates these contributions. This suggests that within the limited search scope, the combination of transformer architectures, energy-based formulations, and unsupervised learning for inference-time optimization appears relatively unexplored. However, the analysis is constrained by the top-thirty semantic matches and does not claim exhaustive coverage of all related literature.

Based on the limited literature search, the work appears to occupy a novel position at the intersection of energy-based modeling and transformer architectures for inference-time reasoning. The sparse population of its taxonomy leaf and absence of refuting candidates among thirty examined papers suggest meaningful differentiation from existing approaches. Nonetheless, this assessment reflects the scope of the semantic search and citation expansion, not a comprehensive survey of all potentially relevant prior work in energy-based models or inference-time computation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: inference-time computation through energy-based optimization. This field explores how energy-based models (EBMs) can be leveraged not only during training but also at inference time to refine predictions, adapt to new data, or optimize task-specific objectives. The taxonomy organizes research into four main branches:

- Energy-Based Model Architectures and Training examines foundational EBM designs and scalable learning methods, including works like Energy-Based Transformers[0] and Concept Learning EBM[25].
- Inference-Time Optimization and Adaptation focuses on test-time refinement strategies such as Test-time Energy Adaptation[4] and Energy-based Test Adaptation[36].
- Application Domains and Task-Specific Implementations covers diverse use cases, from vision tasks like Plausibility Verification 3D[5] to policy learning in EBT-Policy[30].
- Energy-Efficient Systems and Hardware Optimization addresses computational cost and deployment concerns, exemplified by ML ENERGY Benchmark[1] and EdgeBERT[22].

Together, these branches reflect a shift from viewing energy functions purely as training objectives to treating them as flexible inference-time tools for reasoning and adaptation. Recent work reveals contrasting philosophies around when and how to apply energy-based reasoning: some studies emphasize architectural innovations that embed energy landscapes directly into transformer-like models for scalable learning, while others prioritize lightweight test-time adaptation mechanisms that adjust pretrained models to distribution shifts or task-specific constraints without retraining. Energy-Based Transformers[0] sits within the architectural branch, proposing scalable training of energy-based transformer models that can support iterative refinement at inference. This contrasts with approaches like Inference-time Alignment[3] or Energy-guided Test Adaptation[39], which focus on post-hoc optimization strategies applied to existing models.
A key open question is balancing the expressiveness of learned energy functions against the computational overhead of inference-time optimization, especially in resource-constrained settings. Energy-Based Transformers[0] addresses this by integrating energy-based principles into the architecture itself, aiming for a middle ground between fully iterative inference methods and standard feedforward prediction.

Claimed Contributions

Energy-Based Transformers (EBTs)

The authors introduce Energy-Based Transformers, a novel architecture that combines Transformers with Energy-Based Models to enable System 2 Thinking capabilities. EBTs learn to verify input-prediction compatibility through energy assignment and generate predictions via optimization, supporting dynamic computation allocation and prediction verification across modalities.

10 retrieved papers
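In outline, this verify-then-optimize scheme amounts to descending a learned energy landscape with respect to the candidate prediction. A minimal sketch, using a toy quadratic energy in place of the paper's learned Transformer verifier (the matrix `W`, step count, and learning rate below are illustrative assumptions, not the paper's values):

```python
import numpy as np

# Toy energy: low when the candidate prediction y is compatible with input x.
# E(x, y) = ||y - W x||^2 stands in for a learned verifier network.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))

def energy(x, y):
    r = y - W @ x
    return float(r @ r)

def grad_energy_y(x, y):
    # Analytic gradient of the energy w.r.t. the prediction y.
    return 2.0 * (y - W @ x)

def predict(x, steps=100, lr=0.1):
    y = np.zeros_like(x)  # start from an uninformative candidate
    for _ in range(steps):
        y -= lr * grad_energy_y(x, y)  # descend the energy landscape
    return y

x = rng.normal(size=4)
y_hat = predict(x)
print(energy(x, y_hat))  # near zero once minimization has converged
```

In the actual architecture the gradient would come from backpropagating through the Transformer's scalar energy output rather than from a closed form; the loop structure is the point of the sketch.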
Scalable training techniques for EBMs

The authors develop practical training improvements including energy landscape regularization techniques (replay buffer, Langevin Dynamics variant, randomized optimization paths) that address historical scalability challenges in Energy-Based Models, enabling stable and efficient training at scale.

10 retrieved papers
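The Langevin-style ingredient can be illustrated as gradient descent perturbed by Gaussian noise at each step. A rough sketch under invented settings (the noise scale, step count, and quadratic objective are assumptions for illustration; the replay buffer and randomized-path details are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)

def langevin_refine(grad_fn, y0, steps=20, lr=0.05, noise_scale=0.01):
    # Gradient descent on the energy, perturbed by Gaussian noise at each
    # step (a Langevin-dynamics-style update). Randomizing steps and lr
    # across examples is one way to vary the optimization path during
    # training so the model does not overfit a single fixed trajectory.
    y = y0.copy()
    for _ in range(steps):
        y -= lr * grad_fn(y) + noise_scale * rng.normal(size=y.shape)
    return y

# Toy quadratic energy E(y) = ||y - target||^2 with analytic gradient.
target = np.array([1.0, -2.0, 0.5])
grad = lambda y: 2.0 * (y - target)

y0 = np.array([5.0, 5.0, 5.0])
y = langevin_refine(grad, y0)
print(np.sum((y - target) ** 2))  # far smaller than at y0
```

The injected noise keeps refinement stochastic, which is one plausible way such regularizers smooth the energy landscape seen during training.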
System 2 Thinking framework via optimization

The authors formalize System 2 Thinking as an optimization process over a learned energy landscape, where models iteratively refine predictions through gradient descent until convergence. This framework enables dynamic computation allocation and prediction verification to emerge from unsupervised learning alone, generalizing across modalities and problem types.

10 retrieved papers
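The "iterate until convergence" framing implies that computation is allocated per example: refinement stops when the energy stops improving, and the final energy doubles as a verification score. A minimal sketch with a toy energy (the tolerance, learning rate, and starting points are illustrative assumptions):

```python
import numpy as np

def think(energy_fn, grad_fn, y0, lr=0.1, tol=1e-6, max_steps=500):
    # Descend the energy until successive values stop changing. The final
    # energy acts as a verification score; the step count is the
    # dynamically allocated "thinking" budget for this example.
    y, prev_e = y0.copy(), energy_fn(y0)
    for step in range(1, max_steps + 1):
        y -= lr * grad_fn(y)
        e = energy_fn(y)
        if abs(prev_e - e) < tol:  # converged: stop thinking
            return y, e, step
        prev_e = e
    return y, prev_e, max_steps

# Toy energy: candidates starting farther from a compatible (low-energy)
# state should need more refinement steps before converging.
target = np.zeros(3)
E = lambda y: float(np.sum((y - target) ** 2))
dE = lambda y: 2.0 * (y - target)

_, e_easy, steps_easy = think(E, dE, np.full(3, 0.1))
_, e_hard, steps_hard = think(E, dE, np.full(3, 10.0))
print(steps_easy, steps_hard)  # the harder start uses more steps
```

This mirrors the claimed behavior that harder (e.g., more out-of-distribution) inputs benefit from longer System 2 Thinking, with no supervision beyond the energy itself.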

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
