Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Knowledge Distillation
Abstract:

Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (ExGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with only 10-25% of the training data) and strong generalization to out-of-distribution tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a distillation framework combining Explanatory Inversion (EI) and Explanatory GRPO (ExGRPO) to transfer reasoning capabilities from large language models to smaller student models. It resides in the 'Explanatory Probing and Reinforcement Distillation' leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader reinforcement learning-based distillation branch, suggesting the specific combination of explanatory probing with reinforcement learning for reasoning distillation remains underexplored compared to more populated areas like standard chain-of-thought distillation (five papers) or mathematical reasoning distillation (four papers).

The taxonomy reveals neighboring work in adjacent leaves: RLAIF and reward-based distillation (one paper) and implicit multi-branch structure distillation (one paper). These sibling categories share the reinforcement learning foundation but differ in mechanism—RLAIF emphasizes AI feedback signals, while implicit structure methods focus on meta-reasoning processes. The broader chain-of-thought distillation branch (ten papers across three leaves) represents an alternative paradigm relying on supervised learning rather than reinforcement signals. The paper's position bridges explanatory mechanisms (common in structured distillation approaches) with policy optimization (characteristic of RL-based methods), occupying a methodological intersection less densely populated than either pure CoT or pure RL approaches.

Among the twenty-nine candidates examined across the three contributions, none were identified as clearly refuting the proposed methods. For Explanatory Inversion, nine candidates were examined with zero refutable overlaps; for ExGRPO and for the combined framework, ten candidates each were examined, likewise with zero refutations. This suggests that within the limited search scope, the specific techniques (generating explanatory probes to prevent pattern memorization, and using a dialogue structure utility bonus in reinforcement learning) appear distinct from the examined prior work. However, the modest search scale (twenty-nine candidates, not hundreds) means the analysis captures immediate semantic neighbors rather than exhaustive field coverage.

The analysis indicates the work occupies a sparsely populated methodological niche combining explanatory probing with reinforcement learning for reasoning distillation. The limited search scope and absence of refutable candidates among examined papers suggest novelty within the sampled literature, though broader field coverage might reveal additional related work. The taxonomy structure shows this approach sits at an intersection of explanatory mechanisms and RL-based optimization that has received less attention than either pure supervised CoT distillation or domain-specific reasoning transfer.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Distilling reasoning capabilities from large language models into smaller student models. The field has organized itself into several major branches reflecting different methodological emphases. Chain-of-thought distillation methods focus on transferring step-by-step reasoning traces, often using rationales generated by teacher models to guide student learning. Structured and multi-teacher approaches leverage multiple sources or hierarchical knowledge representations to enrich the distillation process. Reinforcement learning-based distillation employs reward signals and policy optimization to refine student reasoning, while knowledge-augmented and retrieval-enhanced methods integrate external information sources. Domain-specific branches target particular reasoning tasks such as mathematical problem-solving or visual reasoning, and optimization strategies explore training dynamics and curriculum design. Multi-agent and interaction-based distillation examines collaborative reasoning scenarios, and application-specific work addresses specialized deployment contexts. Methodological foundations provide surveys and theoretical grounding, while emerging applications explore cross-domain transfer.

Within reinforcement learning-based distillation, a particularly active line of work explores how feedback and iterative refinement can improve student model reasoning. Some studies emphasize direct policy learning from teacher demonstrations, while others incorporate explanatory signals to guide the learning process. Probing to Refine[0] sits within the explanatory probing and reinforcement distillation cluster, sharing thematic connections with Document Reranking Distillation[23], which also leverages probing mechanisms to enhance reasoning quality.
Compared to broader RL-based approaches like Feedback-Driven Math Distillation[6] that focus on domain-specific feedback loops, Probing to Refine[0] emphasizes using probing techniques to identify and correct reasoning errors during distillation. This contrasts with purely imitation-based methods such as Teaching Small Models[3] or Distilling Step-by-Step[10], which transfer reasoning without explicit reinforcement signals. The work highlights an ongoing tension between sample efficiency and reasoning fidelity in distillation pipelines.

Claimed Contributions

Explanatory Inversion (EI) technique for reasoning augmentation

A cognitively-inspired data augmentation method that systematically generates diverse explanatory probes using N distinct transformation rules to challenge student models beyond superficial pattern memorization, promoting deeper conceptual understanding of reasoning tasks.

9 retrieved papers
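The probe-generation step of EI can be sketched as follows. This is a minimal illustration only: the report does not enumerate the actual N transformation rules, so the rule names and templates below (justify, counterfactual, abstraction) are hypothetical placeholders.

```python
# Hypothetical sketch of Explanatory Inversion (EI). The transformation
# rules below are illustrative assumptions, not the paper's actual rules.

PROBE_RULES = {
    "justify": "Explain step by step why '{answer}' is the correct answer to: {question}",
    "counterfactual": "If '{answer}' were not the answer to '{question}', which premise would have to change?",
    "abstraction": "State the general principle that links '{question}' to its answer '{answer}'.",
}

def generate_probes(question: str, answer: str) -> list[dict]:
    """Turn one (question, answer) pair into a set of explanatory probes,
    one per transformation rule, for the student to answer."""
    return [
        {"rule": rule, "probe": template.format(question=question, answer=answer)}
        for rule, template in PROBE_RULES.items()
    ]
```

In the full method, these probes would be posed to the student as follow-up turns that require articulating the underlying logic, rather than serving as additional memorization targets.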
Explanatory GRPO (ExGRPO) reinforcement learning algorithm

A novel RL-based distillation algorithm that adapts Group Relative Policy Optimization with a Dialogue Structure Utility Bonus to reward students for coherent reasoning across multi-turn explanatory dialogues, moving beyond simple outcome-based rewards to promote internalization of complex reasoning structures.

10 retrieved papers
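A minimal sketch of how such a shaped reward could plug into GRPO's group-relative advantage computation. The bonus form shown here (a weighted fraction of probe turns answered coherently) and all function names are assumptions for illustration; the report does not specify the exact Dialogue Structure Utility Bonus.

```python
# Hypothetical sketch of ExGRPO reward shaping. GRPO normalizes each
# sampled completion's reward against its group; here an assumed
# Dialogue Structure Utility Bonus is added before normalization.
from statistics import mean, pstdev

def dialogue_structure_bonus(turns_coherent: list[bool], weight: float = 0.5) -> float:
    """Fraction of explanatory probe turns answered coherently, scaled by weight."""
    if not turns_coherent:
        return 0.0
    return weight * sum(turns_coherent) / len(turns_coherent)

def grpo_advantages(outcome_rewards: list[float],
                    turn_records: list[list[bool]],
                    eps: float = 1e-8) -> list[float]:
    """Group-relative advantages over shaped rewards (outcome + structure bonus)."""
    shaped = [r + dialogue_structure_bonus(t)
              for r, t in zip(outcome_rewards, turn_records)]
    mu, sigma = mean(shaped), pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]
```

The design point this illustrates: two rollouts with the same final-answer reward receive different advantages if one maintains coherence across the probe turns and the other does not.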
Reinforcement distillation framework combining EI and ExGRPO

An integrated two-stage framework that first uses Explanatory Inversion to generate targeted probes, then applies ExGRPO with dialogue-based rewards to distill robust reasoning capabilities from large teacher models into smaller student models, addressing generalization limitations amplified in distilled LLMs.

10 retrieved papers
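The two-stage control flow might look like the sketch below. Every component (the probe generator, the toy sampler, the shaped reward) is a stand-in invented for illustration; only the ordering (EI probe generation first, then ExGRPO over a group of rollouts) follows the framework description above.

```python
# Hypothetical end-to-end sketch of the two-stage pipeline.
import random
from statistics import mean, pstdev

def ei_probes(question: str, answer: str) -> list[str]:
    # Stage 1 stand-in: one probe per (assumed) transformation rule.
    return [f"Explain step by step why '{answer}' answers: {question}",
            f"What general principle justifies '{answer}' here?"]

def shaped_reward(outcome: float, probe_scores: list[float], weight: float = 0.5) -> float:
    # Outcome reward plus an assumed Dialogue Structure Utility Bonus.
    return outcome + weight * mean(probe_scores)

def exgrpo_step(rollouts: list[tuple[float, list[float]]]) -> list[float]:
    # Group-relative advantages over the shaped rewards.
    rewards = [shaped_reward(o, ps) for o, ps in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def distill_example(question: str, answer: str, group_size: int = 4, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    probes = ei_probes(question, answer)                 # Stage 1: EI
    # Toy sampler: random outcome rewards and per-probe coherence scores
    # stand in for scoring real student dialogues.
    rollouts = [(rng.random(), [rng.random() for _ in probes])
                for _ in range(group_size)]
    return exgrpo_step(rollouts)                         # Stage 2: ExGRPO
```

The returned advantages are what would drive the policy update on the student model in a real training loop.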

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Explanatory Inversion (EI) technique for reasoning augmentation


Contribution

Explanatory GRPO (ExGRPO) reinforcement learning algorithm


Contribution

Reinforcement distillation framework combining EI and ExGRPO
