Robust Fine-tuning of Vision-Language-Action Robot Policies via Parameter Merging
Overview
Overall Novelty Assessment
The paper proposes weight interpolation between pretrained and finetuned vision-language-action (VLA) models to preserve generalist capabilities while learning new tasks. It resides in the 'Weight Interpolation and Model Merging' leaf, which contains only two papers in total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Parameter-Efficient Adaptation and Catastrophic Forgetting Mitigation', one of three major branches addressing continual learning in VLA policies. The small sibling count suggests that this specific approach, direct parameter blending, is less explored than modular expert libraries or reinforcement learning-based adaptation methods.
The taxonomy reveals neighboring directions that tackle similar challenges through different mechanisms. The sibling branch 'Progressive Expert Libraries and Knowledge-Driven Continual Learning' uses modular components rather than weight merging, while the parallel 'Reinforcement Learning and Self-Improvement' branch emphasizes online learning and world models. The 'Explicit Reasoning and Visual Latent Planning' branch decouples high-level reasoning from action execution entirely. The paper's approach contrasts with these alternatives by operating directly on parameter space without requiring modular architectures, explicit reasoning systems, or environmental interaction during adaptation.
Among the thirty candidates examined across the three contributions, none was identified as clearly refuting the proposed methods. Ten candidates were examined for each contribution (the RETAIN parameter-merging method, the modality-specific merging strategy, and the continual skill acquisition framework), and no refuting match was found for any of them. This absence of overlapping prior work within the limited search scope suggests that the specific combination of weight interpolation for VLA policies, modality-aware merging, and sequential skill accumulation may represent a relatively unexplored configuration. Note, however, that the search examined only the top-K semantic matches and citations, not the entire literature.
The analysis indicates the paper occupies a sparsely populated research direction within a broader field that favors modular or reinforcement learning-based solutions. The limited search scope—thirty candidates across three contributions—provides evidence of novelty within examined papers but cannot confirm exhaustive originality. The taxonomy structure suggests the field is actively exploring diverse continual learning strategies, with weight interpolation representing one of several competing paradigms for balancing plasticity and stability in robot policy adaptation.
Claimed Contributions
The authors propose a method that interpolates the weights of pretrained and finetuned generalist robot policies in weight space. This simple weight merging strategy enables finetuned policies to generalize better to out-of-distribution variations of new tasks while retaining the broad competencies from pretraining.
The authors extend their merging approach to use different interpolation coefficients for the vision encoder, language model backbone, and action expert components of VLA policies. They demonstrate that merging only the language model parameters can be sufficient for effective policy adaptation.
The authors show that RETAIN can be applied iteratively to sequentially add new tasks into a pretrained checkpoint by merging finetuned weights into the base model. This enables lifelong learning where a single policy accumulates new capabilities without sacrificing previously learned generalist abilities.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Actions as language: Fine-tuning VLMs into VLAs without catastrophic forgetting PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
RETAIN method for robust robot policy finetuning via parameter merging
The authors propose a method that interpolates the weights of pretrained and finetuned generalist robot policies in weight space. This simple weight merging strategy enables finetuned policies to generalize better to out-of-distribution variations of new tasks while retaining the broad competencies from pretraining.
[8] Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards PDF
[9] Policy-space interpolation for physics-based characters PDF
[10] PoCo: Policy composition from and for heterogeneous robot learning PDF
[11] Waypoint-based imitation learning for robotic manipulation PDF
[12] A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions PDF
[13] Robust imitation of diverse behaviors PDF
[14] DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation PDF
[15] RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models PDF
[16] Robust Policy Generalization and Risk Quantification in Autonomous Robotics PDF
[17] Robust Fine-tuning for Pre-trained 3D Point Cloud Models PDF
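The weight-interpolation idea behind this contribution can be sketched in a few lines. The function name, the coefficient name `alpha`, and the toy float-valued "checkpoints" below are all illustrative assumptions, not the paper's implementation; in practice the dictionaries would hold parameter tensors from two checkpoints of the same architecture.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Linearly blend two checkpoints with the same parameter names.

    alpha = 0.0 keeps the pretrained generalist weights;
    alpha = 1.0 keeps the task-finetuned weights;
    intermediate values trade off the two.
    """
    assert pretrained.keys() == finetuned.keys(), "checkpoints must match"
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Toy example: plain floats stand in for parameter tensors.
pre = {"vision.w": 1.0, "lang.w": 2.0, "action.w": 3.0}
fin = {"vision.w": 3.0, "lang.w": 4.0, "action.w": 5.0}
merged = interpolate_weights(pre, fin, alpha=0.5)
print(merged)  # {'vision.w': 2.0, 'lang.w': 3.0, 'action.w': 4.0}
```

A single scalar `alpha` is the simplest possible merging rule, which is what makes the contribution's claim notable: no modular architecture or extra training is needed at merge time.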
Modality-specific parameter merging for vision-language-action policies
The authors extend their merging approach to use different interpolation coefficients for the vision encoder, language model backbone, and action expert components of VLA policies. They demonstrate that merging only the language model parameters can be sufficient for effective policy adaptation.
[18] Remedy: Recipe merging dynamics in large vision-language models PDF
[19] A dynamic weighted fusion model for multimodal sentiment analysis PDF
[20] Parameter efficient merging for multimodal large language models with complementary parameter adaptation PDF
[21] Visual grounding with multi-modal conditional adaptation PDF
[22] Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate PDF
[23] X-Fusion: Introducing New Modality to Frozen Large Language Models PDF
[24] Multi-modality expansion and retention for LLMs through parameter merging and decoupling PDF
[25] An Empirical Study of Multimodal Model Merging PDF
[26] Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models PDF
[27] MAKE: Vision-Language Pre-training based Product Retrieval in Taobao Search PDF
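Modality-specific merging can be sketched by assigning a separate coefficient to each component of the policy. Everything below is a hypothetical illustration: the prefix-matching scheme, the parameter names, and the choice to keep non-language components fully finetuned are assumptions, since the report only states that different coefficients are used per component and that merging only the language model can suffice.

```python
def merge_by_component(pretrained, finetuned, alphas, default=1.0):
    """Per-component interpolation of two checkpoints.

    alphas maps a parameter-name prefix (e.g. "lang.") to its own
    interpolation coefficient; parameters matching no prefix use
    `default` (1.0 = keep the finetuned value unchanged).
    """
    merged = {}
    for name in pretrained:
        alpha = default
        for prefix, a in alphas.items():
            if name.startswith(prefix):
                alpha = a
                break
        merged[name] = (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
    return merged

pre = {"vision.w": 1.0, "lang.w": 2.0, "action.w": 3.0}
fin = {"vision.w": 3.0, "lang.w": 4.0, "action.w": 5.0}
# One plausible reading of "merging only the language model": interpolate
# just the language backbone toward pretrained, keep the rest finetuned.
merged = merge_by_component(pre, fin, alphas={"lang.": 0.5})
print(merged)  # {'vision.w': 3.0, 'lang.w': 3.0, 'action.w': 5.0}
```

Separating the coefficients per component is what distinguishes this contribution from the uniform-alpha merging of the first one.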
Continual skill acquisition framework through sequential parameter merging
The authors show that RETAIN can be applied iteratively to sequentially add new tasks into a pretrained checkpoint by merging finetuned weights into the base model. This enables lifelong learning where a single policy accumulates new capabilities without sacrificing previously learned generalist abilities.
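The iterative variant described above can be sketched as repeated merging of finetuned checkpoints into a running base model. This is a minimal sketch of the merging arithmetic only: in the procedure the report describes, each new task's checkpoint would presumably be finetuned from the current merged model before being folded in, and the function and variable names here are illustrative assumptions.

```python
def accumulate_tasks(base, task_checkpoints, alpha=0.5):
    """Fold a sequence of task-finetuned checkpoints into one policy.

    After each task, the merged model becomes the new base, so earlier
    capabilities are retained (down-weighted geometrically by alpha)
    while new skills are mixed in.
    """
    merged = dict(base)
    for ckpt in task_checkpoints:
        merged = {
            name: (1.0 - alpha) * merged[name] + alpha * ckpt[name]
            for name in merged
        }
    return merged

# Toy example with a single scalar "parameter".
base = {"w": 0.0}
tasks = [{"w": 2.0}, {"w": 4.0}]
print(accumulate_tasks(base, tasks, alpha=0.5))  # {'w': 2.5}
```

The order of merging matters here: with a fixed alpha, later tasks receive more weight than earlier ones, which is one reason sequential merging is a distinct contribution rather than a trivial corollary of pairwise interpolation.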