Reasoning without Training: Your Base Model is Smarter Than You Think

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLMs, reasoning, MCMC, sampling, inference-time compute
Abstract:

Frontier reasoning models have exhibited remarkable capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). Despite the widespread success of this paradigm, much of the literature has been devoted to disentangling which behaviors truly emerge during RL and are absent from the base models. In our work, we approach this question from a different angle, asking instead whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm that leverages the base model's own likelihoods. Across different base models, we show that our algorithm delivers substantial boosts in reasoning performance that nearly match, and in some cases exceed, those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in sample diversity over multiple samples that is characteristic of RL post-training. Crucially, our method requires no training, curated datasets, or verifier, suggesting broad applicability beyond easily verifiable domains.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an MCMC-inspired iterative sampling algorithm that uses base models' own likelihoods to sample from sharpened distributions, aiming to elicit reasoning capabilities without training. It resides in the 'Pure Sampling-Based Methods' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers across 22 leaf nodes. This leaf explicitly excludes methods using verifiers or tree search, focusing instead on repeated sampling and likelihood-based selection—precisely the approach this work adopts.

The taxonomy reveals neighboring leaves with distinct strategies: 'Verification-Guided Sampling' employs external verifiers or reward models to select among candidates, while 'Structured Search and Tree-Based Exploration' uses systematic tree search methods. The paper's approach diverges from these by relying solely on the base model's likelihood without external verification or structured exploration. The broader 'Inference-Time Sampling and Search Strategies' branch contains 16 papers, suggesting moderate activity in inference-time methods overall, though the pure sampling subcategory remains less crowded than verification-guided or structured search alternatives.

Among 26 candidates examined, the contribution-level analysis shows mixed novelty signals. The power distribution sampling target (Contribution A) examined 6 candidates with 1 refutable match, suggesting some prior exploration of sharpened distributions. The MCMC-based algorithm (Contribution B) examined 10 candidates with none refutable, indicating stronger technical novelty in the specific algorithmic approach. The empirical claim of matching RL-posttraining (Contribution C) examined 10 candidates with 1 refutable, suggesting that demonstrating parity with training-based methods has been explored before, though perhaps not with this exact sampling technique.

Given the limited search scope of 26 candidates from semantic search, this assessment captures the most directly relevant prior work but cannot claim exhaustive coverage. The paper appears to occupy a moderately novel position within a sparse subcategory, with its core algorithmic contribution (MCMC-based power sampling) showing stronger novelty signals than its conceptual framing or empirical claims about matching RL performance.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: eliciting reasoning capabilities from base language models through inference-time sampling.

The field has organized itself around several complementary branches. Inference-Time Sampling and Search Strategies explore pure sampling-based methods that generate multiple candidate solutions and select among them, often using techniques like majority voting or reward-guided selection. Training-Aware and Hybrid Approaches combine inference-time computation with model fine-tuning or reinforcement learning, bridging the gap between static model capabilities and dynamic reasoning. Efficiency and Acceleration Techniques address the computational cost of extended inference, proposing methods to reduce latency while preserving reasoning quality. Domain-Specific Applications tailor these strategies to specialized areas such as medicine, law, and web navigation, while Theoretical Foundations and Analysis provide formal understanding of scaling laws and optimality conditions. Auxiliary Techniques and Mechanisms encompass supporting tools like process reward models, critique mechanisms, and adaptive decoding strategies that enhance the core sampling paradigm.

Recent work has concentrated on understanding how test-time compute scales with performance, as surveyed in Slow Thinking Survey[3] and Test-Time Compute Survey[33], revealing trade-offs between sample diversity, verification accuracy, and computational budget. Within the pure sampling branch, Reasoning Without Training[0] emphasizes extracting reasoning purely at inference time without additional model updates, positioning itself alongside works like Reasoning with Sampling[14] and FIRE Sampling[18] that similarly rely on generating and filtering multiple reasoning paths. This contrasts with hybrid methods such as RL of Thoughts[4] or Inference-Aware Fine-Tuning[8], which interleave sampling with learning signals.
A key open question is whether pure sampling can match the performance of training-augmented approaches when both are given comparable computational resources, and how to best allocate that budget across breadth of exploration versus depth of verification.

Claimed Contributions

Power distribution as a sampling target for reasoning tasks

The authors propose using the power distribution (p raised to power α) as an explicit target for sampling from base language models to enhance reasoning capabilities. This distribution sharpens the base model distribution by upweighting high-likelihood sequences without requiring any training.

Retrieved papers: 6 (Can Refute)
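To make the sharpening effect concrete, the sketch below applies the power transform p^α to a toy categorical distribution with made-up probabilities (the values are assumptions, not from the paper; the paper applies the power to full-sequence likelihoods, which is not equivalent to per-token temperature scaling, so this is only an illustration of the upweighting behavior):

```python
import numpy as np

# Toy distribution standing in for a base model's likelihoods (assumed values).
p = np.array([0.5, 0.3, 0.15, 0.05])

def power_distribution(p, alpha):
    """Sharpen p by raising each probability to the power alpha and renormalizing.

    For alpha > 1 this upweights high-likelihood outcomes relative to p.
    """
    q = p ** alpha
    return q / q.sum()

for alpha in (1.0, 2.0, 4.0):
    print(alpha, np.round(power_distribution(p, alpha), 3))
```

As α grows, probability mass concentrates on the highest-likelihood outcome while the distribution remains properly normalized.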
MCMC-based power sampling algorithm for autoregressive models

The authors develop a training-free sampling algorithm (Algorithm 1) that uses Metropolis-Hastings MCMC with random resampling proposals to approximately sample from the power distribution. The algorithm progressively samples from intermediate distributions in blocks to avoid exponential mixing time issues.

Retrieved papers: 10
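A minimal sketch of the Metropolis-Hastings idea described above, on a toy discrete distribution rather than an autoregressive LLM (the toy "base model", step count, and α value are assumptions; the paper's Algorithm 1 additionally anneals through intermediate distributions in blocks, which is omitted here). With an independent proposal that resamples from the base distribution p and a target proportional to p^α, the acceptance ratio reduces to (p_y / p_x)^(α-1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "base model": a fixed distribution over six candidate sequences (assumed).
seqs = np.arange(6)
base_p = np.array([0.30, 0.25, 0.20, 0.12, 0.08, 0.05])

def mh_power_sampling(alpha, n_steps=5000):
    """Metropolis-Hastings targeting base_p**alpha, with independent
    resampling from the base model as the proposal distribution."""
    x = rng.choice(seqs, p=base_p)
    samples = []
    for _ in range(n_steps):
        y = rng.choice(seqs, p=base_p)
        # Proposal q(y) = base_p[y]; target pi(y) proportional to base_p[y]**alpha.
        # Acceptance ratio: pi(y) q(x) / (pi(x) q(y)) = (p_y / p_x)**(alpha - 1).
        ratio = (base_p[y] / base_p[x]) ** (alpha - 1)
        if rng.random() < min(1.0, ratio):
            x = y
        samples.append(x)
    return np.array(samples)

samples = mh_power_sampling(alpha=4.0)
# The highest-likelihood sequence should appear far more often than base_p[0]=0.30.
print((samples == 0).mean())
```

The chain's stationary distribution is the sharpened p^α, so the empirical frequency of the most likely sequence rises well above its base probability.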
Empirical demonstration matching RL post-training performance without training

The authors show that their training-free power sampling algorithm achieves single-shot reasoning performance comparable to or exceeding GRPO (a state-of-the-art RL method) across multiple base models and reasoning benchmarks, while maintaining better sample diversity and pass@k performance.

Retrieved papers: 9
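The pass@k metric mentioned in this contribution is commonly computed with the standard unbiased estimator of Chen et al. (whether the paper uses exactly this estimator is an assumption): given n generations of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        # Fewer than k incorrect generations exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 correct: pass@1 equals the raw success rate c/n.
print(round(pass_at_k(10, 3, 1), 3))
```

Maintaining sample diversity matters for this metric: a collapsed sampler may score well at pass@1 yet gain little as k grows, whereas a diverse one keeps improving.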

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Power distribution as a sampling target for reasoning tasks

The authors propose using the power distribution (p raised to power α) as an explicit target for sampling from base language models to enhance reasoning capabilities. This distribution sharpens the base model distribution by upweighting high-likelihood sequences without requiring any training.

Contribution

MCMC-based power sampling algorithm for autoregressive models

The authors develop a training-free sampling algorithm (Algorithm 1) that uses Metropolis-Hastings MCMC with random resampling proposals to approximately sample from the power distribution. The algorithm progressively samples from intermediate distributions in blocks to avoid exponential mixing time issues.

Contribution

Empirical demonstration matching RL post-training performance without training

The authors show that their training-free power sampling algorithm achieves single-shot reasoning performance comparable to or exceeding GRPO (a state-of-the-art RL method) across multiple base models and reasoning benchmarks, while maintaining better sample diversity and pass@k performance.