RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: logical reasoning, rule-based reasoning, reinforcement learning, language models
Abstract:

Rule-based reasoning is widely acknowledged as a fundamental problem in reasoning. While recent studies show that large reasoning models (LRMs) possess remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning built on a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static mix-training schedules engineered by humans. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% on eight ID tasks and Δ10.4% on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency than prior methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RuleReasoner, a training framework for rule-based reasoning in large language models using reinforcement learning with domain-aware dynamic sampling. It sits within the 'Text-Domain Reasoning with Rule-Based RL' leaf, which contains only three papers total including this one. This is a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting the specific combination of rule-based reasoning and RL for text-domain LLMs remains an emerging area rather than a crowded subfield.

The taxonomy reveals neighboring work in multimodal reasoning (MM-Eureka, Visual Aha Moment) and symbolic integration branches (Neural Logic RL, SymDQN), but the text-domain leaf is notably isolated. The sibling papers Logic-RL and CPGD both address rule extraction and compositional generalization respectively, while RuleReasoner focuses on dynamic sampling for domain balance during RL training. The taxonomy's scope notes clarify that this leaf excludes multimodal and symbolic integration approaches, positioning RuleReasoner squarely in pure text-based rule reasoning rather than cross-modal or formal logic synthesis.

Among the 30 candidates examined, the domain-aware dynamic sampling contribution has one refutable candidate among its 10, while the RuleCollection-32K dataset appears more novel, with zero refutations among its 10. The RLVR training-regularization framework likewise has one refutable candidate among its 10. These statistics suggest moderate prior-work overlap for the training-methodology components, though the limited search scope means substantial related work may exist beyond the top-30 semantic matches. The dataset contribution appears least contested within this bounded search.

Based on the limited literature search covering 30 candidates, the work appears to occupy a relatively sparse research direction with some methodological overlap in training approaches but potentially novel dataset contributions. The analysis cannot claim exhaustiveness—only that among semantically similar recent papers, the core training innovations show modest prior work while the curated task collection shows less direct precedent.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: rule-based reasoning with reinforcement learning. The field integrates symbolic reasoning structures (temporal logic, fuzzy rules, automata) with RL to improve interpretability, safety, and generalization. The taxonomy reveals several major branches:

- Large language and multimodal models that leverage rule-based guidance for text and vision tasks (e.g., MM-Eureka[3], Visual Aha Moment[4]).
- Interactive and human-guided approaches, where domain experts or iterative feedback shape rule discovery (e.g., Persistent Rule Interactive[2], Iterative Rule Guided[5]).
- Temporal logic integration, providing formal specifications for safe control (e.g., Temporal Logic Safe[8], Temporal Logic Reward[18]).
- Symbolic and neuro-symbolic RL, merging neural networks with logic-based representations (e.g., Neural Logic RL[10], SymDQN[19]).
- Fuzzy logic-based methods that handle uncertainty in continuous domains (e.g., Fuzzy Fractal Control[11]).
- Domain-specific applications targeting robotics, autonomous driving, and energy systems (e.g., Lane Change Constraints[12], Eco-Driving Bus[14]).
- Inference-time and meta-level reasoning, exploring how models decide when to apply deliberate reasoning (e.g., Think or Not[17], RL of Thoughts[29]).
- Automata-based learning, using finite-state machines or logic programs to structure policies (e.g., GALOIS[39], Tsetlin Machine[47]).

Recent work highlights a tension between end-to-end neural flexibility and the interpretability of explicit rules. Many studies in the symbolic and neuro-symbolic branch pursue hybrid architectures that balance differentiable learning with logical constraints (e.g., Off-Policy Differentiable Logic[44], Deep ILP RL[42]), while temporal logic integration prioritizes provable safety guarantees in safety-critical domains (e.g., Temporal Logic Goals[34], Safe Highway Driving[43]).
Within the text-domain reasoning cluster, RuleReasoner[0] sits alongside Logic-RL[1] and CPGD[9], all exploring how to extract or enforce logical rules during language-based reasoning tasks. Compared to Logic-RL[1], which emphasizes formal logic extraction, RuleReasoner[0] appears to focus more tightly on rule-based inference strategies tailored to textual problem-solving, while CPGD[9] investigates compositional generalization through policy decomposition. These neighboring works collectively illustrate ongoing efforts to make RL agents reason more transparently and reliably in structured symbolic environments.

Claimed Contributions

RuleReasoner training framework with domain-aware dynamic sampling

The authors propose RuleReasoner, a training framework that enhances rule-based reasoning through reinforcement learning. It introduces a domain-aware dynamic sampling (DADS) method that resamples training batches by updating domain weights based on historical rewards, facilitating domain balance and active learning schedules without static human-engineered mixing.

10 retrieved papers compared; verdict: Can Refute
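The sampling mechanism described above (update domain weights from historical rewards, then resample each batch) can be sketched as follows. This is a hypothetical reconstruction, not the authors' implementation: the class name, the softmax-over-difficulty update, and the default score of 0.5 for unseen domains are all assumptions made for illustration, under the premise that lower-reward (harder) domains are upweighted.

```python
import math
import random
from collections import defaultdict

class DomainAwareSampler:
    """Minimal sketch of domain-aware dynamic sampling (DADS).

    Assumption: domains with lower historical reward (harder,
    under-learned domains) get higher sampling weight via a
    softmax over a difficulty score.
    """

    def __init__(self, domains, temperature=1.0):
        self.domains = list(domains)
        self.temperature = temperature
        self.history = defaultdict(list)  # domain -> past rollout rewards

    def record(self, domain, reward):
        """Log the reward observed for a rollout from `domain`."""
        self.history[domain].append(reward)

    def weights(self):
        """Softmax over (1 - mean reward); unseen domains default to 0.5."""
        scores = []
        for d in self.domains:
            h = self.history[d]
            mean_r = sum(h) / len(h) if h else 0.5
            scores.append((1.0 - mean_r) / self.temperature)
        z = sum(math.exp(s) for s in scores)
        return [math.exp(s) / z for s in scores]

    def sample_batch(self, pool, batch_size):
        """Resample a training batch according to the current weights."""
        w = self.weights()
        picked = random.choices(self.domains, weights=w, k=batch_size)
        return [random.choice(pool[d]) for d in picked]
```

In this sketch, a domain whose rollouts keep earning high verifiable rewards is gradually sampled less, freeing batch capacity for domains the policy still fails on, which is one plausible way to realize the "active learning schedule" the contribution describes.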
RuleCollection-32K dataset for rule-based reasoning

The authors curate and release RuleCollection-32K, a dataset containing 32K examples across eight rule-based reasoning tasks. The dataset features varying rule formats, reasoning forms, and complexity levels, designed to enable training and evaluation of generalizable rule application rather than memorization.

10 retrieved papers compared; verdict: no refutation found
RLVR framework with training regularization for rule-based reasoning

The authors design a Reinforcement Learning with Verifiable Rewards (RLVR) framework incorporating training regularization techniques such as disabling entropy bonus, discarding KL divergence, and rule order shuffling. This framework achieves stable training dynamics for complex rule-based reasoning tasks and improves generalization to unseen rules.

10 retrieved papers compared; verdict: Can Refute
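The regularization choices named in this contribution (entropy bonus disabled, KL term discarded, rule order shuffled) can be made concrete with a small sketch. The config keys, function names, and exact-match reward below are illustrative assumptions, not the authors' API; they only show what each knob would look like in an RLVR-style training loop.

```python
import random

# Illustrative configuration mirroring the reported regularization
# choices; these field names are assumptions, not the authors' API.
RLVR_CONFIG = {
    "entropy_bonus_coef": 0.0,  # entropy bonus disabled
    "use_kl_penalty": False,    # KL divergence term discarded
    "shuffle_rules": True,      # rule order shuffling augmentation
}

def verifiable_reward(prediction: str, gold: str) -> float:
    """Binary verifiable reward: exact match on the final answer.
    (One common choice for RLVR; the paper's checker may differ.)"""
    return 1.0 if prediction.strip() == gold.strip() else 0.0

def shuffle_rule_order(rules, rng=None):
    """Present the prompt's rules in random order so the policy cannot
    exploit positional cues (the rule-order-shuffling regularizer)."""
    rng = rng or random.Random()
    shuffled = list(rules)
    rng.shuffle(shuffled)
    return shuffled
```

A reward that can be checked programmatically is what makes the KL and entropy terms safe to drop in this setting: the binary signal alone is assumed to anchor the policy, while shuffling guards against memorizing rule positions rather than applying the rules.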

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: RuleReasoner training framework with domain-aware dynamic sampling

Contribution: RuleCollection-32K dataset for rule-based reasoning

Contribution: RLVR framework with training regularization for rule-based reasoning