ATPO: ADAPTIVE TREE POLICY OPTIMIZATION FOR MULTI-TURN MEDICAL DIALOGUE

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement Learning (RL), Large Language Models (LLMs), Medical Dialogue, Tree Search
Abstract:

Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). Conventional Reinforcement Learning (RL) methods struggle in this setting: Group Relative Policy Optimization (GRPO) with long-horizon credit assignment, and Proximal Policy Optimization (PPO) with unstable value estimation. We therefore propose an uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm that adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism that minimizes the number of rollouts, and an asynchronous search architecture that leverages KV-cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks show that ATPO significantly outperforms several strong baselines, with the Qwen3-8B model surpassing the much larger GPT-4o (+0.92% accuracy).

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Adaptive Tree Policy Optimization (ATPO) for aligning LLMs in multi-turn medical dialogues, formulating the problem as a Hierarchical Markov Decision Process. It resides in the 'Tree-Based and Hierarchical RL for Multi-Turn Dialogue' leaf, which contains only four papers total including this work. This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific combination of tree-based methods and hierarchical MDPs for medical dialogue remains an emerging area rather than a saturated subfield.

The taxonomy reveals neighboring work in adjacent leaves: 'Process Feedback and Preference Learning' (three papers on physician logic integration), 'Proactive and Goal-Directed RL Strategies' (four papers on strategic information-seeking), and 'Knowledge-Enhanced and Evidence-Based RL' (four papers incorporating medical knowledge graphs). The original paper's focus on uncertainty-aware tree search distinguishes it from these directions—it emphasizes computational efficiency and exploration strategy rather than external knowledge integration or process-level feedback. The scope note explicitly excludes flat single-level RL methods, positioning this work within hierarchical decomposition approaches.

Among twenty-two candidates examined across three contributions, the 'Uncertainty-guided pruning and asynchronous search architecture' contribution shows one refutable candidate from three examined, indicating some prior work on computational optimization in tree-based RL. The 'ATPO algorithm' contribution examined ten candidates with zero refutations, suggesting relative novelty in the specific uncertainty-aware adaptive allocation mechanism. The 'Hierarchical MDP formulation' examined nine candidates without clear refutation, though hierarchical structures appear in sibling papers. These statistics reflect a limited semantic search scope, not exhaustive coverage of all relevant prior work.

Based on the limited search of twenty-two candidates, the work appears to occupy a moderately novel position within a sparse research direction. The core algorithmic contribution shows fewer overlaps than the architectural optimizations, though the small candidate pool and focused taxonomy leaf prevent definitive claims about field-wide novelty. The analysis captures top semantic matches but cannot rule out relevant work outside this scope.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
22 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Aligning large language models for multi-turn medical dialogue through reinforcement learning. The field encompasses diverse approaches to building and refining LLMs that can engage in extended clinical conversations, spanning from foundational RL frameworks and training pipelines to specialized dialogue systems and benchmarking environments. The taxonomy reveals several main branches:

- Reinforcement Learning Frameworks for Medical Dialogue Alignment explores tree-based and hierarchical methods (e.g., Adaptive Tree Policy[0], Doctor-R1[3]) that structure multi-turn interactions;
- Medical LLM Development and Training Pipelines focuses on end-to-end model construction (e.g., Zhongjing[2], Huatuogpt[4]);
- Task-Specific Medical Dialogue Systems targets particular clinical scenarios such as disease screening or mental health support;
- Multi-Turn Interaction and Consistency Mechanisms addresses coherence across extended exchanges;
- Specialized Training Environments and Benchmarks provides simulation platforms like Medagentgym[5];
- Surveys and Theoretical Foundations offer broader perspectives on RL in NLP and medical agents.

A particularly active line of work centers on hierarchical and tree-based RL methods that decompose complex diagnostic dialogues into structured decision processes, balancing exploration of the symptom space with efficient convergence to accurate diagnoses. Adaptive Tree Policy[0] sits within this branch alongside Doctoragent-RL[20] and MA-HRL[40], emphasizing adaptive policy structures that manage multi-turn reasoning. In contrast, works like Doctor-R1[3] and DiaLLMs[6] integrate RL more tightly with clinical reasoning chains and proactive questioning strategies, while simulation-driven approaches such as Medagentgym[5] and Simulating Human Personas[1] focus on creating realistic training environments.
The original paper's emphasis on tree-based policy adaptation positions it closely with hierarchical RL methods, distinguishing itself from end-to-end training pipelines by explicitly modeling dialogue structure and from task-specific systems by maintaining generality across diagnostic scenarios. Open questions remain around scalability, interpretability of learned policies, and transferability across diverse patient populations.

Claimed Contributions

Adaptive Tree Policy Optimization (ATPO) algorithm

ATPO is a reinforcement learning algorithm that adaptively allocates rollout budgets to states with high uncertainty in multi-turn medical dialogues. It uses a composite metric of Bellman error and action-value variance to guide tree expansion, enabling more accurate value estimation and efficient exploration.

10 retrieved papers
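As a rough illustration of the allocation rule this contribution describes, the sketch below combines a one-step Bellman-error term and an action-value variance term into a single score and splits a rollout budget proportionally across tree nodes. The mixing weight `alpha`, the discount `gamma`, and the proportional split are assumptions for illustration; the paper only names the two ingredients of the composite metric.

```python
def node_uncertainty(q_values, v_parent, reward, gamma=0.99, alpha=0.5):
    """Composite uncertainty for one tree node: a weighted mix of the
    one-step Bellman error and the variance of sampled action values.
    The mixing weight `alpha` is a hypothetical choice."""
    mean_q = sum(q_values) / len(q_values)
    bellman_error = abs(reward + gamma * mean_q - v_parent)
    q_variance = sum((q - mean_q) ** 2 for q in q_values) / len(q_values)
    return alpha * bellman_error + (1 - alpha) * q_variance

def allocate_rollouts(nodes, total_budget):
    """Split the rollout budget across nodes in proportion to their
    uncertainty, keeping at least one rollout per node."""
    scores = [node_uncertainty(**n) for n in nodes]
    total = sum(scores) or 1.0
    return [max(1, round(total_budget * s / total)) for s in scores]
```

Under a scheme like this, a node whose sampled action values disagree (high variance) or whose parent value is inconsistent with its backup (high Bellman error) receives more of the budget, which is the adaptive-allocation behavior the contribution claims.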
Uncertainty-guided pruning and asynchronous search architecture

Two computational optimizations are introduced to reduce the cost of tree-based RL: an uncertainty-guided pruning mechanism that reduces the number of rollouts, and an asynchronous search architecture that reuses KV cache to improve inference throughput.

3 retrieved papers
Can Refute
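A toy sketch of the two optimizations, under stated assumptions: pruning is shown as a simple uncertainty threshold (a stand-in for whatever criterion the paper actually uses), and KV-cache reuse is mimicked by memoizing one shared "prefill" task per dialogue prefix so that concurrent sibling rollouts do not repeat it. A real system would share transformer KV tensors rather than a toy score; all names here are illustrative.

```python
import asyncio

def prune(children, threshold=0.1):
    """Uncertainty-guided pruning: drop branches whose uncertainty score
    falls below a threshold, so no rollouts are spent on them."""
    return {branch: u for branch, u in children.items() if u >= threshold}

class PrefixCache:
    """Toy analogue of KV-cache reuse: one shared prefill task per prefix."""
    def __init__(self):
        self._tasks = {}
        self.hits = 0

    def encode(self, prefix):
        # Concurrent rollouts that share a prefix await the same task,
        # so the (expensive) prefill runs only once per prefix.
        if prefix in self._tasks:
            self.hits += 1
        else:
            self._tasks[prefix] = asyncio.ensure_future(self._prefill(prefix))
        return self._tasks[prefix]

    async def _prefill(self, prefix):
        await asyncio.sleep(0)   # stands in for the expensive forward pass
        return len(prefix)       # toy "encoding" of the prefix

async def rollout(cache, prefix, action):
    state = await cache.encode(prefix)
    return state + len(action)   # toy continuation score

async def search(jobs):
    cache = PrefixCache()
    scores = await asyncio.gather(*(rollout(cache, p, a) for p, a in jobs))
    return list(scores), cache.hits
```

Launching sibling rollouts concurrently with `asyncio.gather` is what lets the shared prefill be amortized: the second rollout on the same prefix finds a pending task and awaits it instead of recomputing.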
Hierarchical MDP formulation for multi-turn medical dialogue

The authors formalize multi-turn medical dialogues as a Hierarchical MDP, where macro-actions correspond to full conversational turns and micro-actions correspond to individual tokens. This formulation addresses the uncertainty inherent in user-agent interactions.

9 retrieved papers
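The two-level structure this contribution describes can be written out as follows. The notation is illustrative only (the paper's exact symbols are not reproduced here): macro-steps are indexed by turns t, micro-steps by token positions k.

```latex
% Macro level: one decision step = one full conversational turn.
%   s_t : dialogue history after t turns;  a_t : the agent's next turn.
\mathcal{M}_{\text{macro}} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),
\qquad s_{t+1} \sim P(\,\cdot \mid s_t, a_t\,)

% Micro level: each macro-action is generated token by token over a
% vocabulary V, so the turn-level policy factorizes autoregressively:
a_t = (w_t^{1}, \ldots, w_t^{K}), \quad w_t^{k} \in \mathcal{V},
\qquad
\pi_\theta(a_t \mid s_t) = \prod_{k=1}^{K} \pi_\theta\!\bigl(w_t^{k} \mid s_t,\, w_t^{<k}\bigr)
```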

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Adaptive Tree Policy Optimization (ATPO) algorithm


Contribution

Uncertainty-guided pruning and asynchronous search architecture


Contribution

Hierarchical MDP formulation for multi-turn medical dialogue
