RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, reasoning, reinforcement learning
Abstract:

Despite rapid advancements in large language models (LLMs), their token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques such as Chain/Tree/Graph-of-Thought(s) improve performance cost-effectively by guiding reasoning through external logical structures without modifying LLMs' parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks and therefore lack adaptability. To address this, we propose RL-of-Thoughts (RLoT), in which we train a lightweight navigator model with reinforcement learning (RL) to generate task-adaptive logical structures at inference time, enhancing LLM reasoning. Specifically, we design five basic logic blocks inspired by human cognition. During the reasoning process, the trained RL navigator dynamically selects suitable logic blocks and combines them into task-specific logical structures according to the problem's characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) show that RLoT outperforms established inference-time techniques in most cases, with gains of up to 13.4% in challenging settings. Remarkably, with fewer than 3,000 parameters, our RL navigator makes sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at https://anonymous.4open.science/r/RL-LLM-Reasoning-1A30.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes RL-of-Thoughts (RLoT), a framework that trains a lightweight navigator model using reinforcement learning to dynamically generate task-adaptive logical structures at inference time. It resides in the 'Adaptive Thought Structure Generation' leaf of the taxonomy, which contains four papers total including this one. This leaf sits within the broader 'Neural Adaptive Reasoning Mechanisms' branch, distinguishing itself from symbolic integration approaches by focusing on purely neural methods that learn to adjust reasoning structures. The leaf appears moderately populated, suggesting an active but not overcrowded research direction focused on flexible, non-linear thought construction.

The taxonomy reveals several neighboring research directions that contextualize this work. The sibling leaf 'Dynamic Strategy Selection and Routing' contains four papers focused on selecting among predefined strategies rather than generating novel structures. Nearby branches include 'Symbolic Logic Integration and Hybrid Reasoning', which uses explicit formal systems, and 'Chain-of-Thought Enhancement and Optimization', which refines fixed prompting patterns. RLoT diverges from these by learning to compose basic logic blocks into task-specific structures rather than relying on manual templates or symbolic solvers, positioning it at the intersection of adaptive neural methods and structured reasoning.

Among the twenty-nine candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core RLoT framework, ten candidates were examined and one appears to provide overlapping prior work, suggesting some precedent for RL-based adaptive structure generation exists within this limited search scope. For the five logic blocks, nine candidates were examined and one is a potential refutation, indicating similar decomposition strategies may have been explored. For the transferability and parameter-efficiency claim, ten candidates were examined with no clear refutations, making it appear more distinctive within the examined literature. These statistics reflect a targeted semantic search, not an exhaustive field survey.

Based on the limited search scope of twenty-nine semantically similar papers, the work appears to occupy a moderately explored niche within adaptive reasoning. The taxonomy structure suggests the field is actively developing multiple complementary approaches to inference-time reasoning enhancement. While some overlap exists with prior RL-based and adaptive structure methods among the examined candidates, the specific combination of learned logic block composition and task-adaptive generation may offer incremental advances. A broader literature review would be needed to assess whether similar frameworks exist beyond the top-K semantic matches analyzed here.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 29
Refutable papers: 2

Research Landscape Overview

Core task: Enhancing large language model reasoning with adaptive logical structures at inference time.

The field organizes around several complementary directions. Symbolic Logic Integration and Hybrid Reasoning explores how explicit logical formalisms can be combined with neural models, as seen in works like Logic-LM[3] and Reasoning-as-logic-units[1]. Neural Adaptive Reasoning Mechanisms focuses on learning-based approaches that dynamically adjust reasoning pathways, including frameworks such as Adaptive-solver framework for dynamic[2] and Reasoning in Flux[5]. Chain-of-Thought Enhancement and Optimization refines prompting strategies to improve step-by-step reasoning quality. Test-Time Compute Scaling and Efficiency investigates trade-offs between inference-time computation and performance, while Domain-Specific Adaptive Reasoning Applications tailors these techniques to specialized contexts like medicine or law.

Recent work highlights tensions between structured symbolic guidance and flexible neural adaptation. Some studies emphasize explicit logical scaffolding to ensure correctness, while others prioritize end-to-end learning of adaptive structures that respond to problem complexity.

RL of Thoughts[0] sits within the Neural Adaptive Reasoning Mechanisms branch, specifically under Adaptive Thought Structure Generation, where it learns to construct reasoning graphs via reinforcement learning rather than relying on fixed templates. This contrasts with nearby approaches like Agentic Reasoning[26], which employs agent-based frameworks for dynamic strategy selection, and Ratt[37], which adapts reasoning traces through iterative refinement. The central challenge across these lines is balancing the interpretability and guarantees of symbolic methods against the flexibility and scalability of learned adaptive structures, with ongoing questions of how much inference-time compute to allocate and whether adaptation should be rule-driven or data-driven.

Claimed Contributions

RL-of-Thoughts (RLoT) framework for adaptive logical structure generation

The authors introduce RLoT, a framework that trains a lightweight navigator model using reinforcement learning to dynamically select and combine basic logic blocks during inference. This generates task-adaptive logical structures that guide LLM reasoning without modifying the LLM's parameters.

10 retrieved papers (Can Refute)
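The selection loop this contribution describes can be sketched as follows. This is a minimal illustration, not the authors' code: the state representation, the random stand-in policy, and the string-based state update are all hypothetical placeholders for the trained RL navigator and the LLM calls.

```python
import random

# The five basic logic blocks named in the paper; executing a block would
# correspond to an LLM prompting routine in the real framework.
LOGIC_BLOCKS = ["Reason one step", "Decompose", "Debate", "Refine", "Terminate"]

def navigator_policy(state, rng):
    """Hypothetical navigator: maps a reasoning state to a logic block.
    A trained RL policy would replace this random stand-in."""
    return rng.choice(LOGIC_BLOCKS)

def rlot_inference(problem, max_steps=8, seed=0):
    """Compose logic blocks into a task-specific structure at inference time."""
    rng = random.Random(seed)
    state, structure = problem, []
    for _ in range(max_steps):
        block = navigator_policy(state, rng)
        structure.append(block)
        if block == "Terminate":
            break
        # In the real framework, the block would be executed by prompting the
        # frozen LLM and the state updated with its output; here we just
        # append the block name so the loop is self-contained.
        state = f"{state} | {block}"
    return structure

print(rlot_inference("Solve: 2x + 3 = 7"))
```

The key point the sketch captures is that the logical structure is not fixed in advance: it is assembled block by block, conditioned on the evolving state, and the LLM's own weights are never touched.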
Five human cognition-inspired basic logic blocks as action space

The authors design five fundamental logic blocks (Reason one step, Decompose, Debate, Refine, Terminate) inspired by human cognitive strategies. These blocks serve as the action space in the MDP formulation and can be flexibly combined to construct task-specific reasoning pathways.

9 retrieved papers (Can Refute)
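One way to picture the five blocks as an MDP action space is to map each action to a prompt that steers the LLM. The template wording below is invented for illustration and is not taken from the paper.

```python
# Hypothetical prompt templates for the five logic blocks; the wording is
# illustrative only, not the authors' prompts.
BLOCK_TEMPLATES = {
    "Reason one step": "Continue the reasoning with exactly one more step:\n{context}",
    "Decompose": "Break the problem into simpler sub-problems:\n{context}",
    "Debate": "Propose two competing solutions and argue which is correct:\n{context}",
    "Refine": "Review the reasoning so far and fix any mistakes:\n{context}",
    "Terminate": "State the final answer based on the reasoning so far:\n{context}",
}

def render_action(block: str, context: str) -> str:
    """Turn a selected logic block (an MDP action) into a concrete LLM prompt."""
    return BLOCK_TEMPLATES[block].format(context=context)

print(render_action("Decompose", "Find all primes p with p^2 < 50."))
```

Because the action space has only five discrete elements, the navigator's output layer stays tiny, which is what makes the sub-3,000-parameter policy in the next contribution plausible.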
Demonstration of transferability and parameter efficiency of the navigator model

The authors show that their navigator model, containing fewer than 3,000 parameters, can generalize across different LLMs and reasoning tasks without fine-tuning. This lightweight model enables sub-10B LLMs to achieve performance comparable to models with 10 times more parameters.

10 retrieved papers
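The sub-3,000-parameter claim is easy to sanity-check arithmetically: a fully connected policy head over five actions needs very few weights. The layer sizes below are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical navigator sizing: 10 state features -> 32 hidden units ->
# 5 logic-block actions. These sizes are assumptions for illustration.
def mlp_param_count(layer_sizes):
    """Parameters of a fully connected net: weights plus biases per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

sizes = [10, 32, 5]
total = mlp_param_count(sizes)
print(total)  # 10*32 + 32 + 32*5 + 5 = 517, comfortably below 3,000
assert total < 3000
```

A policy this small is cheap to run alongside any LLM, which is consistent with the transferability claim: the navigator conditions only on compact state features, not on the LLM's internals.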

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
