ATPO: ADAPTIVE TREE POLICY OPTIMIZATION FOR MULTI-TURN MEDICAL DIALOGUE

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement Learning (RL), Large Language Models (LLMs), Medical Dialogue, Tree Search
Abstract:

Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). Conventional Reinforcement Learning (RL) methods struggle in this setting: Group Relative Policy Optimization (GRPO) with long-horizon credit assignment, and Proximal Policy Optimization (PPO) with unstable value estimation. We therefore propose an uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm that adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism that minimizes the number of rollouts, and an asynchronous search architecture that leverages KV-cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks show that ATPO significantly outperforms several strong baselines, with the Qwen3-8B model surpassing the much larger GPT-4o (+0.92% accuracy).

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Adaptive Tree Policy Optimization (ATPO) for aligning LLMs in multi-turn medical dialogues, formulating the problem as a Hierarchical Markov Decision Process. It resides in the 'Tree-Based and Hierarchical RL for Multi-Turn Dialogue' leaf, which contains only four papers total including this work. This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific combination of tree-based methods and hierarchical MDPs for medical dialogue remains an emerging area rather than a saturated subfield.

The taxonomy reveals neighboring work in adjacent leaves: 'Process Feedback and Preference Learning' (three papers on physician logic integration), 'Proactive and Goal-Directed RL Strategies' (four papers on strategic information-seeking), and 'Knowledge-Enhanced and Evidence-Based RL' (four papers incorporating medical knowledge graphs). The original paper's focus on uncertainty-aware tree search distinguishes it from these directions—it emphasizes computational efficiency and exploration strategy rather than external knowledge integration or process-level feedback. The scope note explicitly excludes flat single-level RL methods, positioning this work within hierarchical decomposition approaches.

Among twenty-two candidates examined across three contributions, the 'Uncertainty-guided pruning and asynchronous search architecture' contribution shows one refutable candidate from three examined, indicating some prior work on computational optimization in tree-based RL. The 'ATPO algorithm' contribution examined ten candidates with zero refutations, suggesting relative novelty in the specific uncertainty-aware adaptive allocation mechanism. The 'Hierarchical MDP formulation' examined nine candidates without clear refutation, though hierarchical structures appear in sibling papers. These statistics reflect a limited semantic search scope, not exhaustive coverage of all relevant prior work.

Based on the limited search of twenty-two candidates, the work appears to occupy a moderately novel position within a sparse research direction. The core algorithmic contribution shows fewer overlaps than the architectural optimizations, though the small candidate pool and focused taxonomy leaf prevent definitive claims about field-wide novelty. The analysis captures top semantic matches but cannot rule out relevant work outside this scope.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
22 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Aligning large language models for multi-turn medical dialogue through reinforcement learning. The field encompasses diverse approaches to building and refining LLMs that can engage in extended clinical conversations, spanning from foundational RL frameworks and training pipelines to specialized dialogue systems and benchmarking environments. The taxonomy reveals several main branches:

- Reinforcement Learning Frameworks for Medical Dialogue Alignment explores tree-based and hierarchical methods (e.g., Adaptive Tree Policy[0], Doctor-R1[3]) that structure multi-turn interactions;
- Medical LLM Development and Training Pipelines focuses on end-to-end model construction (e.g., Zhongjing[2], Huatuogpt[4]);
- Task-Specific Medical Dialogue Systems targets particular clinical scenarios such as disease screening or mental health support;
- Multi-Turn Interaction and Consistency Mechanisms addresses coherence across extended exchanges;
- Specialized Training Environments and Benchmarks provides simulation platforms like Medagentgym[5];
- Surveys and Theoretical Foundations offer broader perspectives on RL in NLP and medical agents.

A particularly active line of work centers on hierarchical and tree-based RL methods that decompose complex diagnostic dialogues into structured decision processes, balancing exploration of the symptom space with efficient convergence to accurate diagnoses. Adaptive Tree Policy[0] sits within this branch alongside Doctoragent-RL[20] and MA-HRL[40], emphasizing adaptive policy structures that manage multi-turn reasoning. In contrast, works like Doctor-R1[3] and DiaLLMs[6] integrate RL more tightly with clinical reasoning chains and proactive questioning strategies, while simulation-driven approaches such as Medagentgym[5] and Simulating Human Personas[1] focus on creating realistic training environments.
The original paper's emphasis on tree-based policy adaptation positions it closely with hierarchical RL methods, distinguishing itself from end-to-end training pipelines by explicitly modeling dialogue structure and from task-specific systems by maintaining generality across diagnostic scenarios. Open questions remain around scalability, interpretability of learned policies, and transferability across diverse patient populations.

Claimed Contributions

Adaptive Tree Policy Optimization (ATPO) algorithm

ATPO is a reinforcement learning algorithm that adaptively allocates rollout budgets to states with high uncertainty in multi-turn medical dialogues. It uses a composite metric of Bellman error and action-value variance to guide tree expansion, enabling more accurate value estimation and efficient exploration.

10 retrieved papers
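As a rough illustration of the allocation rule this contribution describes, the sketch below combines a one-step Bellman-error term and an action-value variance term into a single score and splits a rollout budget proportionally across tree nodes. The mixing weight `alpha`, the discount `gamma`, and the proportional split are assumptions for illustration; the paper only names the two ingredients of the composite metric.

```python
def node_uncertainty(q_values, v_parent, reward, gamma=0.99, alpha=0.5):
    """Composite uncertainty for one tree node: a weighted mix of the
    one-step Bellman error and the variance of sampled action values.
    The mixing weight `alpha` is a hypothetical choice."""
    mean_q = sum(q_values) / len(q_values)
    bellman_error = abs(reward + gamma * mean_q - v_parent)
    q_variance = sum((q - mean_q) ** 2 for q in q_values) / len(q_values)
    return alpha * bellman_error + (1 - alpha) * q_variance

def allocate_rollouts(nodes, total_budget):
    """Split the rollout budget across nodes in proportion to their
    uncertainty, keeping at least one rollout per node."""
    scores = [node_uncertainty(**n) for n in nodes]
    total = sum(scores) or 1.0
    return [max(1, round(total_budget * s / total)) for s in scores]
```

Under a scheme like this, a node whose sampled action values disagree (high variance) or whose parent value is inconsistent with its backup (high Bellman error) receives more of the budget, which is the adaptive-allocation behavior the contribution claims.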
Uncertainty-guided pruning and asynchronous search architecture

Two computational optimizations are introduced to reduce the cost of tree-based RL: an uncertainty-guided pruning mechanism that reduces the number of rollouts, and an asynchronous search architecture that reuses KV cache to improve inference throughput.

3 retrieved papers
Can Refute
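A toy sketch of the two optimizations, under stated assumptions: pruning is shown as a simple uncertainty threshold (a stand-in for whatever criterion the paper actually uses), and KV-cache reuse is mimicked by memoizing one shared "prefill" task per dialogue prefix so that concurrent sibling rollouts do not repeat it. A real system would share transformer KV tensors rather than a toy score; all names here are illustrative.

```python
import asyncio

def prune(children, threshold=0.1):
    """Uncertainty-guided pruning: drop branches whose uncertainty score
    falls below a threshold, so no rollouts are spent on them."""
    return {branch: u for branch, u in children.items() if u >= threshold}

class PrefixCache:
    """Toy analogue of KV-cache reuse: one shared prefill task per prefix."""
    def __init__(self):
        self._tasks = {}
        self.hits = 0

    def encode(self, prefix):
        # Concurrent rollouts that share a prefix await the same task,
        # so the (expensive) prefill runs only once per prefix.
        if prefix in self._tasks:
            self.hits += 1
        else:
            self._tasks[prefix] = asyncio.ensure_future(self._prefill(prefix))
        return self._tasks[prefix]

    async def _prefill(self, prefix):
        await asyncio.sleep(0)   # stands in for the expensive forward pass
        return len(prefix)       # toy "encoding" of the prefix

async def rollout(cache, prefix, action):
    state = await cache.encode(prefix)
    return state + len(action)   # toy continuation score

async def search(jobs):
    cache = PrefixCache()
    scores = await asyncio.gather(*(rollout(cache, p, a) for p, a in jobs))
    return list(scores), cache.hits
```

Launching sibling rollouts concurrently with `asyncio.gather` is what lets the shared prefill be amortized: the second rollout on the same prefix finds a pending task and awaits it instead of recomputing.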
Hierarchical MDP formulation for multi-turn medical dialogue

The authors formalize multi-turn medical dialogues as a Hierarchical MDP, where macro-actions correspond to full conversational turns and micro-actions correspond to individual tokens. This formulation addresses the uncertainty inherent in user-agent interactions.

9 retrieved papers
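The two-level structure this contribution describes can be written out as follows. The notation is illustrative only (the paper's exact symbols are not reproduced here): macro-steps are indexed by turns t, micro-steps by token positions k.

```latex
% Macro level: one decision step = one full conversational turn.
%   s_t : dialogue history after t turns;  a_t : the agent's next turn.
\mathcal{M}_{\text{macro}} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),
\qquad s_{t+1} \sim P(\,\cdot \mid s_t, a_t\,)

% Micro level: each macro-action is generated token by token over a
% vocabulary V, so the turn-level policy factorizes autoregressively:
a_t = (w_t^{1}, \ldots, w_t^{K}), \quad w_t^{k} \in \mathcal{V},
\qquad
\pi_\theta(a_t \mid s_t) = \prod_{k=1}^{K} \pi_\theta\!\bigl(w_t^{k} \mid s_t,\, w_t^{<k}\bigr)
```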

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Adaptive Tree Policy Optimization (ATPO) algorithm


Contribution

Uncertainty-guided pruning and asynchronous search architecture


Contribution

Hierarchical MDP formulation for multi-turn medical dialogue
