Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Knowledge Distillation
Abstract:

Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (ExGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with only 10-25% of the training data) and strong generalization to out-of-distribution tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a distillation framework combining Explanatory Inversion (EI) and Explanatory GRPO (ExGRPO) to transfer reasoning capabilities from large language models to smaller student models. It resides in the 'Explanatory Probing and Reinforcement Distillation' leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader reinforcement learning-based distillation branch, suggesting the specific combination of explanatory probing with reinforcement learning for reasoning distillation remains underexplored compared to more populated areas like standard chain-of-thought distillation (five papers) or mathematical reasoning distillation (four papers).

The taxonomy reveals neighboring work in adjacent leaves: RLAIF and reward-based distillation (one paper) and implicit multi-branch structure distillation (one paper). These sibling categories share the reinforcement learning foundation but differ in mechanism—RLAIF emphasizes AI feedback signals, while implicit structure methods focus on meta-reasoning processes. The broader chain-of-thought distillation branch (ten papers across three leaves) represents an alternative paradigm relying on supervised learning rather than reinforcement signals. The paper's position bridges explanatory mechanisms (common in structured distillation approaches) with policy optimization (characteristic of RL-based methods), occupying a methodological intersection less densely populated than either pure CoT or pure RL approaches.

Among the twenty-nine candidates examined across the three contributions, none were identified as clearly refuting the proposed methods. For Explanatory Inversion, nine candidates were examined with zero refutable overlaps; for ExGRPO and for the combined framework, ten candidates each were examined, likewise with zero refutations. This suggests that within the limited search scope, the specific techniques (generating explanatory probes to prevent pattern memorization, and using a dialogue structure utility bonus in reinforcement learning) appear distinct from the examined prior work. However, the modest search scale (twenty-nine candidates, not hundreds) means the analysis captures immediate semantic neighbors rather than exhaustive field coverage.

The analysis indicates the work occupies a sparsely populated methodological niche combining explanatory probing with reinforcement learning for reasoning distillation. The limited search scope and absence of refutable candidates among examined papers suggest novelty within the sampled literature, though broader field coverage might reveal additional related work. The taxonomy structure shows this approach sits at an intersection of explanatory mechanisms and RL-based optimization that has received less attention than either pure supervised CoT distillation or domain-specific reasoning transfer.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Distilling reasoning capabilities from large language models into smaller student models. The field has organized itself into several major branches reflecting different methodological emphases. Chain-of-thought distillation methods focus on transferring step-by-step reasoning traces, often using rationales generated by teacher models to guide student learning. Structured and multi-teacher approaches leverage multiple sources or hierarchical knowledge representations to enrich the distillation process. Reinforcement learning-based distillation employs reward signals and policy optimization to refine student reasoning, while knowledge-augmented and retrieval-enhanced methods integrate external information sources. Domain-specific branches target particular reasoning tasks such as mathematical problem-solving or visual reasoning, and optimization strategies explore training dynamics and curriculum design. Multi-agent and interaction-based distillation examines collaborative reasoning scenarios, and application-specific work addresses specialized deployment contexts. Methodological foundations provide surveys and theoretical grounding, while emerging applications explore cross-domain transfer.

Within reinforcement learning-based distillation, a particularly active line of work explores how feedback and iterative refinement can improve student model reasoning. Some studies emphasize direct policy learning from teacher demonstrations, while others incorporate explanatory signals to guide the learning process. Probing to Refine[0] sits within the explanatory probing and reinforcement distillation cluster, sharing thematic connections with Document Reranking Distillation[23], which also leverages probing mechanisms to enhance reasoning quality.
Compared to broader RL-based approaches like Feedback-Driven Math Distillation[6] that focus on domain-specific feedback loops, Probing to Refine[0] emphasizes using probing techniques to identify and correct reasoning errors during distillation. This contrasts with purely imitation-based methods such as Teaching Small Models[3] or Distilling Step-by-Step[10], which transfer reasoning without explicit reinforcement signals. The work highlights an ongoing tension between sample efficiency and reasoning fidelity in distillation pipelines.

Claimed Contributions

Explanatory Inversion (EI) technique for reasoning augmentation

A cognitively-inspired data augmentation method that systematically generates diverse explanatory probes using N distinct transformation rules to challenge student models beyond superficial pattern memorization, promoting deeper conceptual understanding of reasoning tasks.

9 retrieved papers
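The probe-generation step of EI can be sketched as follows. This is a minimal illustration only: the report does not enumerate the actual N transformation rules, so the rule names and templates below (justify, counterfactual, abstraction) are hypothetical placeholders.

```python
# Hypothetical sketch of Explanatory Inversion (EI). The transformation
# rules below are illustrative assumptions, not the paper's actual rules.

PROBE_RULES = {
    "justify": "Explain step by step why '{answer}' is the correct answer to: {question}",
    "counterfactual": "If '{answer}' were not the answer to '{question}', which premise would have to change?",
    "abstraction": "State the general principle that links '{question}' to its answer '{answer}'.",
}

def generate_probes(question: str, answer: str) -> list[dict]:
    """Turn one (question, answer) pair into a set of explanatory probes,
    one per transformation rule, for the student to answer."""
    return [
        {"rule": rule, "probe": template.format(question=question, answer=answer)}
        for rule, template in PROBE_RULES.items()
    ]
```

In the full method, these probes would be posed to the student as follow-up turns that require articulating the underlying logic, rather than serving as additional memorization targets.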
Explanatory GRPO (ExGRPO) reinforcement learning algorithm

A novel RL-based distillation algorithm that adapts Group Relative Policy Optimization with a Dialogue Structure Utility Bonus to reward students for coherent reasoning across multi-turn explanatory dialogues, moving beyond simple outcome-based rewards to promote internalization of complex reasoning structures.

10 retrieved papers
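A minimal sketch of how such a shaped reward could plug into GRPO's group-relative advantage computation. The bonus form shown here (a weighted fraction of probe turns answered coherently) and all function names are assumptions for illustration; the report does not specify the exact Dialogue Structure Utility Bonus.

```python
# Hypothetical sketch of ExGRPO reward shaping. GRPO normalizes each
# sampled completion's reward against its group; here an assumed
# Dialogue Structure Utility Bonus is added before normalization.
from statistics import mean, pstdev

def dialogue_structure_bonus(turns_coherent: list[bool], weight: float = 0.5) -> float:
    """Fraction of explanatory probe turns answered coherently, scaled by weight."""
    if not turns_coherent:
        return 0.0
    return weight * sum(turns_coherent) / len(turns_coherent)

def grpo_advantages(outcome_rewards: list[float],
                    turn_records: list[list[bool]],
                    eps: float = 1e-8) -> list[float]:
    """Group-relative advantages over shaped rewards (outcome + structure bonus)."""
    shaped = [r + dialogue_structure_bonus(t)
              for r, t in zip(outcome_rewards, turn_records)]
    mu, sigma = mean(shaped), pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]
```

The design point this illustrates: two rollouts with the same final-answer reward receive different advantages if one maintains coherence across the probe turns and the other does not.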
Reinforcement distillation framework combining EI and ExGRPO

An integrated two-stage framework that first uses Explanatory Inversion to generate targeted probes, then applies ExGRPO with dialogue-based rewards to distill robust reasoning capabilities from large teacher models into smaller student models, addressing generalization limitations amplified in distilled LLMs.

10 retrieved papers
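The two-stage control flow might look like the sketch below. Every component (the probe generator, the toy sampler, the shaped reward) is a stand-in invented for illustration; only the ordering (EI probe generation first, then ExGRPO over a group of rollouts) follows the framework description above.

```python
# Hypothetical end-to-end sketch of the two-stage pipeline.
import random
from statistics import mean, pstdev

def ei_probes(question: str, answer: str) -> list[str]:
    # Stage 1 stand-in: one probe per (assumed) transformation rule.
    return [f"Explain step by step why '{answer}' answers: {question}",
            f"What general principle justifies '{answer}' here?"]

def shaped_reward(outcome: float, probe_scores: list[float], weight: float = 0.5) -> float:
    # Outcome reward plus an assumed Dialogue Structure Utility Bonus.
    return outcome + weight * mean(probe_scores)

def exgrpo_step(rollouts: list[tuple[float, list[float]]]) -> list[float]:
    # Group-relative advantages over the shaped rewards.
    rewards = [shaped_reward(o, ps) for o, ps in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def distill_example(question: str, answer: str, group_size: int = 4, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    probes = ei_probes(question, answer)                 # Stage 1: EI
    # Toy sampler: random outcome rewards and per-probe coherence scores
    # stand in for scoring real student dialogues.
    rollouts = [(rng.random(), [rng.random() for _ in probes])
                for _ in range(group_size)]
    return exgrpo_step(rollouts)                         # Stage 2: ExGRPO
```

The returned advantages are what would drive the policy update on the student model in a real training loop.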

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Explanatory Inversion (EI) technique for reasoning augmentation


Contribution

Explanatory GRPO (ExGRPO) reinforcement learning algorithm


Contribution

Reinforcement distillation framework combining EI and ExGRPO
