Reasoning without Training: Your Base Model is Smarter Than You Think

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLMs, reasoning, MCMC, sampling, inference-time compute
Abstract:

Frontier reasoning models have exhibited remarkable capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). Despite the widespread success of this paradigm, much of the literature has been devoted to disentangling which behaviors truly emerge during RL and are absent from the base models. In our work, we approach this question from a different angle, asking instead whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm that leverages the base model's own likelihoods. Across different base models, we show that our algorithm delivers substantial boosts in reasoning performance that nearly match, and in some cases exceed, those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in sample diversity over multiple samples that is characteristic of RL post-training. Crucially, our method requires no training, curated datasets, or verifier, suggesting broad applicability beyond easily verifiable domains.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an MCMC-inspired iterative sampling algorithm that uses base models' own likelihoods to sample from sharpened distributions, aiming to elicit reasoning capabilities without training. It resides in the 'Pure Sampling-Based Methods' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers across 22 leaf nodes. This leaf explicitly excludes methods using verifiers or tree search, focusing instead on repeated sampling and likelihood-based selection—precisely the approach this work adopts.

The taxonomy reveals neighboring leaves with distinct strategies: 'Verification-Guided Sampling' employs external verifiers or reward models to select among candidates, while 'Structured Search and Tree-Based Exploration' uses systematic tree search methods. The paper's approach diverges from these by relying solely on the base model's likelihood without external verification or structured exploration. The broader 'Inference-Time Sampling and Search Strategies' branch contains 16 papers, suggesting moderate activity in inference-time methods overall, though the pure sampling subcategory remains less crowded than verification-guided or structured search alternatives.

Among 26 candidates examined, the contribution-level analysis shows mixed novelty signals. The power distribution sampling target (Contribution A) examined 6 candidates with 1 refutable match, suggesting some prior exploration of sharpened distributions. The MCMC-based algorithm (Contribution B) examined 10 candidates with none refutable, indicating stronger technical novelty in the specific algorithmic approach. The empirical claim of matching RL-posttraining (Contribution C) examined 10 candidates with 1 refutable, suggesting that demonstrating parity with training-based methods has been explored before, though perhaps not with this exact sampling technique.

Given the limited search scope of 26 candidates from semantic search, this assessment captures the most directly relevant prior work but cannot claim exhaustive coverage. The paper appears to occupy a moderately novel position within a sparse subcategory, with its core algorithmic contribution (MCMC-based power sampling) showing stronger novelty signals than its conceptual framing or empirical claims about matching RL performance.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: eliciting reasoning capabilities from base language models through inference-time sampling.

The field has organized itself around several complementary branches. Inference-Time Sampling and Search Strategies explore pure sampling-based methods that generate multiple candidate solutions and select among them, often using techniques like majority voting or reward-guided selection. Training-Aware and Hybrid Approaches combine inference-time computation with model fine-tuning or reinforcement learning, bridging the gap between static model capabilities and dynamic reasoning. Efficiency and Acceleration Techniques address the computational cost of extended inference, proposing methods to reduce latency while preserving reasoning quality. Domain-Specific Applications tailor these strategies to specialized areas such as medicine, law, and web navigation, while Theoretical Foundations and Analysis provide formal understanding of scaling laws and optimality conditions. Auxiliary Techniques and Mechanisms encompass supporting tools like process reward models, critique mechanisms, and adaptive decoding strategies that enhance the core sampling paradigm.

Recent work has concentrated on understanding how test-time compute scales with performance, as surveyed in Slow Thinking Survey[3] and Test-Time Compute Survey[33], revealing trade-offs between sample diversity, verification accuracy, and computational budget. Within the pure sampling branch, Reasoning Without Training[0] emphasizes extracting reasoning purely at inference time without additional model updates, positioning itself alongside works like Reasoning with Sampling[14] and FIRE Sampling[18] that similarly rely on generating and filtering multiple reasoning paths. This contrasts with hybrid methods such as RL of Thoughts[4] or Inference-Aware Fine-Tuning[8], which interleave sampling with learning signals.
A key open question is whether pure sampling can match the performance of training-augmented approaches when both are given comparable computational resources, and how to best allocate that budget across breadth of exploration versus depth of verification.

Claimed Contributions

Power distribution as a sampling target for reasoning tasks

The authors propose using the power distribution (p raised to power α) as an explicit target for sampling from base language models to enhance reasoning capabilities. This distribution sharpens the base model distribution by upweighting high-likelihood sequences without requiring any training.

Retrieved papers: 6 (Can Refute)
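To make the sharpening effect concrete, the sketch below applies the power transform p^α to a toy categorical distribution with made-up probabilities (the values are assumptions, not from the paper; the paper applies the power to full-sequence likelihoods, which is not equivalent to per-token temperature scaling, so this is only an illustration of the upweighting behavior):

```python
import numpy as np

# Toy distribution standing in for a base model's likelihoods (assumed values).
p = np.array([0.5, 0.3, 0.15, 0.05])

def power_distribution(p, alpha):
    """Sharpen p by raising each probability to the power alpha and renormalizing.

    For alpha > 1 this upweights high-likelihood outcomes relative to p.
    """
    q = p ** alpha
    return q / q.sum()

for alpha in (1.0, 2.0, 4.0):
    print(alpha, np.round(power_distribution(p, alpha), 3))
```

As α grows, probability mass concentrates on the highest-likelihood outcome while the distribution remains properly normalized.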
MCMC-based power sampling algorithm for autoregressive models

The authors develop a training-free sampling algorithm (Algorithm 1) that uses Metropolis-Hastings MCMC with random resampling proposals to approximately sample from the power distribution. The algorithm progressively samples from intermediate distributions in blocks to avoid exponential mixing time issues.

Retrieved papers: 10
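A minimal sketch of the Metropolis-Hastings idea described above, on a toy discrete distribution rather than an autoregressive LLM (the toy "base model", step count, and α value are assumptions; the paper's Algorithm 1 additionally anneals through intermediate distributions in blocks, which is omitted here). With an independent proposal that resamples from the base distribution p and a target proportional to p^α, the acceptance ratio reduces to (p_y / p_x)^(α-1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "base model": a fixed distribution over six candidate sequences (assumed).
seqs = np.arange(6)
base_p = np.array([0.30, 0.25, 0.20, 0.12, 0.08, 0.05])

def mh_power_sampling(alpha, n_steps=5000):
    """Metropolis-Hastings targeting base_p**alpha, with independent
    resampling from the base model as the proposal distribution."""
    x = rng.choice(seqs, p=base_p)
    samples = []
    for _ in range(n_steps):
        y = rng.choice(seqs, p=base_p)
        # Proposal q(y) = base_p[y]; target pi(y) proportional to base_p[y]**alpha.
        # Acceptance ratio: pi(y) q(x) / (pi(x) q(y)) = (p_y / p_x)**(alpha - 1).
        ratio = (base_p[y] / base_p[x]) ** (alpha - 1)
        if rng.random() < min(1.0, ratio):
            x = y
        samples.append(x)
    return np.array(samples)

samples = mh_power_sampling(alpha=4.0)
# The highest-likelihood sequence should appear far more often than base_p[0]=0.30.
print((samples == 0).mean())
```

The chain's stationary distribution is the sharpened p^α, so the empirical frequency of the most likely sequence rises well above its base probability.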
Empirical demonstration matching RL post-training performance without training

The authors show that their training-free power sampling algorithm achieves single-shot reasoning performance comparable to or exceeding GRPO (a state-of-the-art RL method) across multiple base models and reasoning benchmarks, while maintaining better sample diversity and pass@k performance.

Retrieved papers: 9
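The pass@k metric mentioned in this contribution is commonly computed with the standard unbiased estimator of Chen et al. (whether the paper uses exactly this estimator is an assumption): given n generations of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        # Fewer than k incorrect generations exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 correct: pass@1 equals the raw success rate c/n.
print(round(pass_at_k(10, 3, 1), 3))
```

Maintaining sample diversity matters for this metric: a collapsed sampler may score well at pass@1 yet gain little as k grows, whereas a diverse one keeps improving.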

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Power distribution as a sampling target for reasoning tasks

The authors propose using the power distribution (p raised to power α) as an explicit target for sampling from base language models to enhance reasoning capabilities. This distribution sharpens the base model distribution by upweighting high-likelihood sequences without requiring any training.

Contribution

MCMC-based power sampling algorithm for autoregressive models

The authors develop a training-free sampling algorithm (Algorithm 1) that uses Metropolis-Hastings MCMC with random resampling proposals to approximately sample from the power distribution. The algorithm progressively samples from intermediate distributions in blocks to avoid exponential mixing time issues.

Contribution

Empirical demonstration matching RL post-training performance without training

The authors show that their training-free power sampling algorithm achieves single-shot reasoning performance comparable to or exceeding GRPO (a state-of-the-art RL method) across multiple base models and reasoning benchmarks, while maintaining better sample diversity and pass@k performance.