Multimodal Policy Internalization for Conversational Agents

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Conversational AI, Multimodal models, Policy internalization, Reinforcement learning with verifiable rewards
Abstract:

Modern conversational agents such as ChatGPT and Alexa+ have become indispensable in everyday life. To handle diverse business requirements and enable agentic capabilities, these LLM-based systems often rely on predefined policies, which specify instructions such as model metadata, response styles, and tool-using rules. These policies, typically implemented as in-context prompts, are becoming increasingly complex and lengthy, posing challenges for models in faithfully following them. Moreover, they impose a large fixed computational cost regardless of the input query. As multimodal conversational agents emerge, complex policies that govern multimodal tasks and even involve visual instructions are becoming increasingly necessary, yet they have been rarely studied in previous work. In particular, prior work on prompt compression has focused solely on reducing the length of task templates and demonstrations, which require limited reasoning compared to policies. Meanwhile, related work on policy alignment has been limited to internalizing text-only safety instructions. To bridge this gap, we introduce Multimodal Policy Internalization (MPI), a new task that aims to internalize reasoning-intensive multimodal policies into the parameters of a large multimodal model, enabling stronger policy-following behavior without requiring the policy to be included in-context during inference. MPI presents unique challenges from both data and algorithmic perspectives. We construct two new datasets that cover complex decision-making and tool-using tasks across both synthetic and real-world visual inputs. We investigate diverse internalization strategies and propose a novel three-stage training framework, TriMPI, which enables stronger guidance from the original policy during internalization. Specifically, we first introduce a continual pretraining stage before supervised finetuning, which directly injects policy knowledge into the model. We then propose PolicyRollout, a simple yet effective extension to GRPO-style RL algorithms, which enables more grounded exploration by augmenting the rollout space with policy-aware responses. We show significant improvements of TriMPI over strong baselines in end-to-end performance, generalization capability, and robustness to catastrophic forgetting. As the first work on multimodal policy internalization, we aim to build a strong foundation for future research by providing datasets, training recipes, and comprehensive evaluations.
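As background for the PolicyRollout extension mentioned in the abstract: GRPO-style algorithms score each sampled response against its own rollout group rather than a learned value function. The formula below is standard GRPO background, not taken from the paper under review.

```latex
% Group-relative advantage in GRPO-style methods: each of the G
% sampled responses r_1..r_G is normalized against its own group.
A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}
           {\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)}
```

PolicyRollout, as described, enlarges this group with policy-aware responses before normalization, so policy-compliant behavior helps anchor the advantage baseline during exploration.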

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Multimodal Policy Internalization (MPI), a task aimed at embedding complex multimodal policies—including visual instructions and tool-using rules—directly into model parameters through multi-stage training. Within the taxonomy, it occupies the 'Multimodal Policy Internalization via Multi-Stage Training' leaf under 'Policy Internalization and Alignment Methods'. Notably, this leaf contains only the original paper itself, with no sibling papers identified, suggesting that this specific formulation of multi-stage multimodal policy internalization is a relatively sparse direction among the 15 papers surveyed for this field.

The taxonomy reveals neighboring work in adjacent leaves: 'Safety-Grounded Policy Alignment for Vision-Language Models' focuses on safety-specific alignment rather than general policy internalization, while 'Task Vector-Based In-Context Policy Adaptation' explores parameter-free adaptation mechanisms. The broader 'Policy Integration Architectures' branch contains hierarchical planning-control systems that maintain modular separation between reasoning and execution, contrasting with the paper's emphasis on unified parameter-level internalization. The 'Unified Multimodal Policy Learning' branch addresses cross-modal reasoning but without the explicit multi-stage internalization strategy proposed here, highlighting how this work bridges internalization methods with unified policy execution.

Among 30 candidates examined through semantic search, none were found to clearly refute any of the three main contributions: the MPI task formulation (10 candidates examined, 0 refutable), the ClevrPolicy and GTAPolicy datasets (10 candidates, 0 refutable), and the TriMPI training framework with PolicyRollout algorithm (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of multimodal policy internalization through multi-stage training with visual policy instructions appears relatively unexplored. However, the analysis is constrained by the top-30 semantic matches and does not constitute an exhaustive literature review.

Based on the limited search scope, the work appears to occupy a novel position by explicitly targeting multimodal policy internalization through parameter-level training, rather than relying on in-context prompting or modular architectures. The absence of sibling papers in its taxonomy leaf and the lack of refuting candidates among 30 examined works suggest potential novelty, though a broader literature search would be needed to confirm whether related approaches exist in adjacent research communities or under different terminology.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Internalizing multimodal policies into large multimodal model parameters. This field addresses how to embed decision-making capabilities directly within the weights of large multimodal models, enabling them to act as embodied agents that perceive and respond to complex environments. The taxonomy organizes research into four main branches: Policy Internalization and Alignment Methods focus on training strategies that fuse behavioral policies with pretrained representations, often through multi-stage pipelines or alignment objectives; Policy Integration Architectures for Embodied Agents explore architectural designs that combine vision-language backbones with action prediction modules; Unified Multimodal Policy Learning from Task Specifications investigates how models can generalize across diverse tasks by conditioning on natural language or other modalities; and Supporting Technologies for Multimodal Policy Systems cover auxiliary techniques such as data synthesis, safety mechanisms, and evaluation frameworks. Representative works like MultiGen[1] and RoboMP2[2] illustrate how pretraining and fine-tuning stages can be orchestrated to internalize control policies, while LMM Planners Skills[3] and Optimus-2[5] demonstrate different ways to integrate planning and low-level skills within unified architectures.

A particularly active line of work examines trade-offs between end-to-end internalization and modular decomposition: some approaches embed all reasoning and control within a single model, while others retain separate planning or skill modules that interact with a central multimodal backbone. Another recurring theme is the tension between generalization across tasks and specialization for specific embodiments, with methods like MSR-Align[4] and Robo-mutual[8] exploring alignment strategies that balance broad pretraining with domain-specific adaptation.
The original paper, Multimodal Policy Internalization[0], sits within the multi-stage training branch and emphasizes progressive internalization of policies through carefully designed training phases. Compared to nearby works such as LMM Planners Skills[3], which may retain explicit planning modules, and Optimus-2[5], which focuses on unified policy learning from task specifications, Multimodal Policy Internalization[0] appears to prioritize deeper integration of behavioral policies directly into model parameters, aiming for a more seamless fusion of perception, reasoning, and action generation within a single multimodal architecture.

Claimed Contributions

Multimodal Policy Internalization (MPI) task

The authors define a new task called Multimodal Policy Internalization (MPI), which aims to embed complex multimodal policies into model parameters so that models can generate policy-compliant responses without requiring the policy in-context during inference. This task extends prior work on text-only policy alignment to the multimodal domain.

10 retrieved papers
ClevrPolicy and GTAPolicy datasets

The authors introduce two new datasets: ClevrPolicy, which focuses on reasoning-intensive decision-making with synthetic images and controllable policy complexity, and GTAPolicy, which targets tool-usage instructions with real-world images in a low-data regime. These datasets support training and evaluation of multimodal policy internalization methods.

10 retrieved papers
TriMPI training framework with PolicyRollout algorithm

The authors propose TriMPI, a three-stage training framework consisting of visually-masked continual pretraining, chain-of-thought supervised finetuning, and reinforcement learning with PolicyRollout. PolicyRollout is a novel extension to GRPO-style RL algorithms that augments the rollout space with policy-aware responses to enable more grounded exploration during training.

10 retrieved papers
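The PolicyRollout mechanism claimed above can be illustrated with a minimal sketch of one rollout-group step. This is not the authors' implementation: `sample_fn`, `reward_fn`, the group sizes, and the simple concatenation of policy text onto the prompt are all assumptions made for illustration; only the core idea—augmenting a GRPO-style rollout group with policy-aware responses before computing group-relative advantages—comes from the report.

```python
import statistics

def policy_rollout_advantages(prompt, policy_text, sample_fn, reward_fn,
                              n_plain=6, n_policy=2):
    """Sketch of a PolicyRollout-style step for a GRPO-like trainer.

    Standard GRPO samples a group of responses to `prompt` and scores
    each one against the group statistics. The extension described here
    augments that group with responses generated while the policy text
    is still in context, grounding exploration in policy-compliant
    behavior.
    """
    # Rollouts from the bare prompt (the policy is being internalized,
    # so it is absent at inference time).
    group = [sample_fn(prompt) for _ in range(n_plain)]
    # Policy-aware rollouts: same query, but with the policy in context.
    group += [sample_fn(policy_text + "\n\n" + prompt)
              for _ in range(n_policy)]

    rewards = [reward_fn(resp) for resp in group]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    # Group-relative advantages, as in GRPO.
    return [(r - mean) / std for r in rewards]
```

In a real trainer, `sample_fn` would decode from the current policy model and `reward_fn` would be a verifiable reward (e.g. exact-match against the policy-compliant answer); the pooled advantages would then weight the usual clipped policy-gradient update.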

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated—a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Multimodal Policy Internalization (MPI) task

The authors define a new task called Multimodal Policy Internalization (MPI), which aims to embed complex multimodal policies into model parameters so that models can generate policy-compliant responses without requiring the policy in-context during inference. This task extends prior work on text-only policy alignment to the multimodal domain.

Contribution

ClevrPolicy and GTAPolicy datasets

The authors introduce two new datasets: ClevrPolicy, which focuses on reasoning-intensive decision-making with synthetic images and controllable policy complexity, and GTAPolicy, which targets tool-usage instructions with real-world images in a low-data regime. These datasets support training and evaluation of multimodal policy internalization methods.

Contribution

TriMPI training framework with PolicyRollout algorithm

The authors propose TriMPI, a three-stage training framework consisting of visually-masked continual pretraining, chain-of-thought supervised finetuning, and reinforcement learning with PolicyRollout. PolicyRollout is a novel extension to GRPO-style RL algorithms that augments the rollout space with policy-aware responses to enable more grounded exploration during training.