Multimodal Policy Internalization for Conversational Agents

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Conversational AI, Multimodal models, Policy internalization, Reinforcement learning with verifiable rewards
Abstract:

Modern conversational agents such as ChatGPT and Alexa+ have become indispensable in everyday life. To handle diverse business requirements and enable agentic capabilities, these LLM-based systems often rely on predefined policies, which specify instructions such as model metadata, response styles, and tool-using rules. These policies, typically implemented as in-context prompts, are becoming increasingly complex and lengthy, posing challenges for models in faithfully following them. Moreover, they impose a large fixed computational cost regardless of the input query. As multimodal conversational agents emerge, complex policies that govern multimodal tasks and even involve visual instructions are becoming increasingly necessary, yet they have been rarely studied in previous work. In particular, prior work on prompt compression has focused solely on reducing the length of task templates and demonstrations, which require limited reasoning compared to policies. Meanwhile, related work on policy alignment has been limited to internalizing text-only safety instructions. To bridge this gap, we introduce Multimodal Policy Internalization (MPI), a new task that aims to internalize reasoning-intensive multimodal policies into the parameters of a large multimodal model, enabling stronger policy-following behavior without requiring the policy to be included in-context during inference. MPI presents unique challenges from both data and algorithmic perspectives. We construct two new datasets that cover complex decision-making and tool-using tasks across both synthetic and real-world visual inputs. We investigate diverse internalization strategies and propose a novel three-stage training framework, TriMPI, which enables stronger guidance from the original policy during internalization. Specifically, we first introduce a continual pretraining stage before supervised finetuning, which directly injects policy knowledge into the model. We then propose PolicyRollout, a simple yet effective extension to GRPO-style RL algorithms, which enables more grounded exploration by augmenting the rollout space with policy-aware responses. We show significant improvements of TriMPI over strong baselines in end-to-end performance, generalization capability, and robustness to catastrophic forgetting. As the first work on multimodal policy internalization, we aim to build a strong foundation for future research by providing datasets, training recipes, and comprehensive evaluations.
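As background for the PolicyRollout extension mentioned in the abstract: GRPO-style algorithms score each sampled response against its own rollout group rather than a learned value function. The formula below is standard GRPO background, not taken from the paper under review.

```latex
% Group-relative advantage in GRPO-style methods: each of the G
% sampled responses r_1..r_G is normalized against its own group.
A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}
           {\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)}
```

PolicyRollout, as described, enlarges this group with policy-aware responses before normalization, so policy-compliant behavior helps anchor the advantage baseline during exploration.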

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Multimodal Policy Internalization (MPI), a task aimed at embedding complex multimodal policies—including visual instructions and tool-using rules—directly into model parameters through multi-stage training. Within the taxonomy, it occupies the 'Multimodal Policy Internalization via Multi-Stage Training' leaf under 'Policy Internalization and Alignment Methods'. Notably, this leaf contains only the original paper itself, with no sibling papers identified, suggesting that this specific formulation of multi-stage multimodal policy internalization is a relatively sparse direction among the 15 papers surveyed for this field.

The taxonomy reveals neighboring work in adjacent leaves: 'Safety-Grounded Policy Alignment for Vision-Language Models' focuses on safety-specific alignment rather than general policy internalization, while 'Task Vector-Based In-Context Policy Adaptation' explores parameter-free adaptation mechanisms. The broader 'Policy Integration Architectures' branch contains hierarchical planning-control systems that maintain modular separation between reasoning and execution, contrasting with the paper's emphasis on unified parameter-level internalization. The 'Unified Multimodal Policy Learning' branch addresses cross-modal reasoning but without the explicit multi-stage internalization strategy proposed here, highlighting how this work bridges internalization methods with unified policy execution.

Among 30 candidates examined through semantic search, none were found to clearly refute any of the three main contributions: the MPI task formulation (10 candidates examined, 0 refutable), the ClevrPolicy and GTAPolicy datasets (10 candidates, 0 refutable), and the TriMPI training framework with PolicyRollout algorithm (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of multimodal policy internalization through multi-stage training with visual policy instructions appears relatively unexplored. However, the analysis is constrained by the top-30 semantic matches and does not constitute an exhaustive literature review.

Based on the limited search scope, the work appears to occupy a novel position by explicitly targeting multimodal policy internalization through parameter-level training, rather than relying on in-context prompting or modular architectures. The absence of sibling papers in its taxonomy leaf and the lack of refuting candidates among 30 examined works suggest potential novelty, though a broader literature search would be needed to confirm whether related approaches exist in adjacent research communities or under different terminology.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Internalizing multimodal policies into large multimodal model parameters. This field addresses how to embed decision-making capabilities directly within the weights of large multimodal models, enabling them to act as embodied agents that perceive and respond to complex environments. The taxonomy organizes research into four main branches: Policy Internalization and Alignment Methods focus on training strategies that fuse behavioral policies with pretrained representations, often through multi-stage pipelines or alignment objectives; Policy Integration Architectures for Embodied Agents explore architectural designs that combine vision-language backbones with action prediction modules; Unified Multimodal Policy Learning from Task Specifications investigates how models can generalize across diverse tasks by conditioning on natural language or other modalities; and Supporting Technologies for Multimodal Policy Systems cover auxiliary techniques such as data synthesis, safety mechanisms, and evaluation frameworks. Representative works like MultiGen[1] and RoboMP2[2] illustrate how pretraining and fine-tuning stages can be orchestrated to internalize control policies, while LMM Planners Skills[3] and Optimus-2[5] demonstrate different ways to integrate planning and low-level skills within unified architectures.

A particularly active line of work examines trade-offs between end-to-end internalization and modular decomposition: some approaches embed all reasoning and control within a single model, while others retain separate planning or skill modules that interact with a central multimodal backbone. Another recurring theme is the tension between generalization across tasks and specialization for specific embodiments, with methods like MSR-Align[4] and Robo-mutual[8] exploring alignment strategies that balance broad pretraining with domain-specific adaptation.
The original paper, Multimodal Policy Internalization[0], sits within the multi-stage training branch and emphasizes progressive internalization of policies through carefully designed training phases. Compared to nearby works such as LMM Planners Skills[3], which may retain explicit planning modules, and Optimus-2[5], which focuses on unified policy learning from task specifications, Multimodal Policy Internalization[0] appears to prioritize deeper integration of behavioral policies directly into model parameters, aiming for a more seamless fusion of perception, reasoning, and action generation within a single multimodal architecture.

Claimed Contributions

Multimodal Policy Internalization (MPI) task

The authors define a new task called Multimodal Policy Internalization (MPI), which aims to embed complex multimodal policies into model parameters so that models can generate policy-compliant responses without requiring the policy in-context during inference. This task extends prior work on text-only policy alignment to the multimodal domain.

10 retrieved papers
ClevrPolicy and GTAPolicy datasets

The authors introduce two new datasets: ClevrPolicy, which focuses on reasoning-intensive decision-making with synthetic images and controllable policy complexity, and GTAPolicy, which targets tool-usage instructions with real-world images in a low-data regime. These datasets support training and evaluation of multimodal policy internalization methods.

10 retrieved papers
TriMPI training framework with PolicyRollout algorithm

The authors propose TriMPI, a three-stage training framework consisting of visually-masked continual pretraining, chain-of-thought supervised finetuning, and reinforcement learning with PolicyRollout. PolicyRollout is a novel extension to GRPO-style RL algorithms that augments the rollout space with policy-aware responses to enable more grounded exploration during training.

10 retrieved papers
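The PolicyRollout mechanism claimed above can be illustrated with a minimal sketch of one rollout-group step. This is not the authors' implementation: `sample_fn`, `reward_fn`, the group sizes, and the simple concatenation of policy text onto the prompt are all assumptions made for illustration; only the core idea—augmenting a GRPO-style rollout group with policy-aware responses before computing group-relative advantages—comes from the report.

```python
import statistics

def policy_rollout_advantages(prompt, policy_text, sample_fn, reward_fn,
                              n_plain=6, n_policy=2):
    """Sketch of a PolicyRollout-style step for a GRPO-like trainer.

    Standard GRPO samples a group of responses to `prompt` and scores
    each one against the group statistics. The extension described here
    augments that group with responses generated while the policy text
    is still in context, grounding exploration in policy-compliant
    behavior.
    """
    # Rollouts from the bare prompt (the policy is being internalized,
    # so it is absent at inference time).
    group = [sample_fn(prompt) for _ in range(n_plain)]
    # Policy-aware rollouts: same query, but with the policy in context.
    group += [sample_fn(policy_text + "\n\n" + prompt)
              for _ in range(n_policy)]

    rewards = [reward_fn(resp) for resp in group]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    # Group-relative advantages, as in GRPO.
    return [(r - mean) / std for r in rewards]
```

In a real trainer, `sample_fn` would decode from the current policy model and `reward_fn` would be a verifiable reward (e.g. exact-match against the policy-compliant answer); the pooled advantages would then weight the usual clipped policy-gradient update.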

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated—a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Multimodal Policy Internalization (MPI) task

The authors define a new task called Multimodal Policy Internalization (MPI), which aims to embed complex multimodal policies into model parameters so that models can generate policy-compliant responses without requiring the policy in-context during inference. This task extends prior work on text-only policy alignment to the multimodal domain.

Contribution

ClevrPolicy and GTAPolicy datasets

The authors introduce two new datasets: ClevrPolicy, which focuses on reasoning-intensive decision-making with synthetic images and controllable policy complexity, and GTAPolicy, which targets tool-usage instructions with real-world images in a low-data regime. These datasets support training and evaluation of multimodal policy internalization methods.

Contribution

TriMPI training framework with PolicyRollout algorithm

The authors propose TriMPI, a three-stage training framework consisting of visually-masked continual pretraining, chain-of-thought supervised finetuning, and reinforcement learning with PolicyRollout. PolicyRollout is a novel extension to GRPO-style RL algorithms that augments the rollout space with policy-aware responses to enable more grounded exploration during training.