Robust Fine-tuning of Vision-Language-Action Robot Policies via Parameter Merging

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: robust fine-tuning, generalist robot policy, model merging
Abstract:

Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations: they not only lose their prior ability to solve a wide variety of generalist tasks but also fail to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of the finetuned model with those of the pretrained model. Across extensive simulated and real-world experiments, we show that such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned models on out-of-distribution variations of the new task. Moreover, we show that model merging enables continual acquisition of new skills in a lifelong learning setting, without sacrificing previously learned generalist abilities.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, and human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes weight interpolation between pretrained and finetuned vision-language-action models to preserve generalist capabilities while learning new tasks. It resides in the 'Weight Interpolation and Model Merging' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Parameter-Efficient Adaptation and Catastrophic Forgetting Mitigation', one of three major branches addressing continual learning in VLA policies. The small sibling count suggests this specific approach—direct parameter blending—is less explored compared to modular expert libraries or reinforcement learning-based adaptation methods.

The taxonomy reveals neighboring directions that tackle similar challenges through different mechanisms. The sibling branch 'Progressive Expert Libraries and Knowledge-Driven Continual Learning' uses modular components rather than weight merging, while the parallel 'Reinforcement Learning and Self-Improvement' branch emphasizes online learning and world models. The 'Explicit Reasoning and Visual Latent Planning' branch decouples high-level reasoning from action execution entirely. The paper's approach contrasts with these alternatives by operating directly on parameter space without requiring modular architectures, explicit reasoning systems, or environmental interaction during adaptation.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the proposed methods. The RETAIN method for parameter merging examined ten candidates with zero refutable matches, as did the modality-specific merging strategy and the continual skill acquisition framework. This absence of overlapping prior work within the limited search scope suggests the specific combination of weight interpolation for VLA policies, modality-aware merging, and sequential skill accumulation may represent a relatively unexplored configuration. However, the search examined only top-K semantic matches and citations, not the entire literature.

The analysis indicates the paper occupies a sparsely populated research direction within a broader field that favors modular or reinforcement learning-based solutions. The limited search scope—thirty candidates across three contributions—provides evidence of novelty within examined papers but cannot confirm exhaustive originality. The taxonomy structure suggests the field is actively exploring diverse continual learning strategies, with weight interpolation representing one of several competing paradigms for balancing plasticity and stability in robot policy adaptation.

Taxonomy

Core-task Taxonomy Papers: 7
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: continual learning and robust generalization in vision-language-action (VLA) robot policy finetuning. The field addresses how to adapt large-scale VLA models to new robotic tasks without losing previously acquired capabilities or overfitting to narrow domains. The taxonomy reveals three main branches: Parameter-Efficient Adaptation and Catastrophic Forgetting Mitigation focuses on lightweight tuning strategies and techniques to preserve base model knowledge during specialization; Reinforcement Learning and Self-Improvement for VLA Adaptation explores online learning and iterative refinement methods that leverage environmental feedback; and Explicit Reasoning and Visual Latent Planning emphasizes structured decision-making and internal world models. These branches reflect complementary perspectives on the same challenge, balancing plasticity for new skills against stability of existing competencies, while differing in whether they prioritize efficient parameter updates, interactive learning loops, or interpretable reasoning pathways.

Several active lines of work illustrate key trade-offs in this landscape. Some approaches emphasize self-correction and reflection mechanisms (Self-Correcting VLA[6], Reflection Task Adaptation[4]) to improve generalization through iterative reasoning, while others focus on modular skill libraries or expert ensembles (Dynamic Expert Library[7], Continual Skill Knowledge[5]) to compartmentalize knowledge and reduce interference. Within the Parameter-Efficient Adaptation branch, Robust VLA Finetuning[0] sits alongside Actions as Language[3] in exploring weight interpolation and model merging strategies that blend task-specific adaptations with pretrained representations. Compared to WMPO[1], which integrates preference optimization into the adaptation process, Robust VLA Finetuning[0] appears more concerned with maintaining robustness across diverse deployment conditions through careful weight combination.

This positioning highlights an ongoing question: whether continual learning is best achieved by merging separately trained experts or by designing unified training objectives that inherently balance old and new knowledge.

Claimed Contributions

RETAIN method for robust robot policy finetuning via parameter merging

The authors propose a method that interpolates the weights of pretrained and finetuned generalist robot policies in weight space. This simple weight merging strategy enables finetuned policies to generalize better to out-of-distribution variations of new tasks while retaining the broad competencies from pretraining.

10 retrieved papers
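The merging step described above is a linear interpolation in weight space. A minimal sketch, assuming the policy's parameters are exposed as a name-to-tensor dictionary (as in a PyTorch state_dict); plain floats stand in for tensors here:

```python
def merge_weights(pretrained, finetuned, alpha):
    """Linearly interpolate two parameter dictionaries.

    alpha = 0.0 keeps the pretrained weights unchanged;
    alpha = 1.0 keeps the finetuned weights.
    """
    assert pretrained.keys() == finetuned.keys()
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Toy example with scalar "parameters"; real VLA checkpoints hold
# tensors, merged the same way entry by entry.
base = {"w": 1.0, "b": 0.0}
tuned = {"w": 3.0, "b": 2.0}
merged = merge_weights(base, tuned, alpha=0.5)  # {"w": 2.0, "b": 1.0}
```

A single scalar `alpha` trades off plasticity (fit to the new task) against stability (retention of pretrained competencies), which is what the paper's out-of-distribution evaluations probe.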
Modality-specific parameter merging for vision-language-action policies

The authors extend their merging approach to use different interpolation coefficients for the vision encoder, language model backbone, and action expert components of VLA policies. They demonstrate that merging only the language model parameters can be sufficient for effective policy adaptation.

10 retrieved papers
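One way to realize per-component coefficients is to pick the interpolation weight by parameter-name prefix. The prefixes below are hypothetical placeholders, not the actual parameter names of any specific VLA architecture; an alpha of 0 leaves a component at its pretrained weights, so giving a nonzero alpha only to the language model merges just that part:

```python
def merge_by_component(pretrained, finetuned, alphas, default=0.0):
    """Interpolate with a separate coefficient per component, chosen by
    the first matching parameter-name prefix in `alphas`."""
    merged = {}
    for name, base in pretrained.items():
        alpha = default
        for prefix, a in alphas.items():
            if name.startswith(prefix):
                alpha = a
                break
        merged[name] = (1.0 - alpha) * base + alpha * finetuned[name]
    return merged

# Hypothetical component prefixes for a VLA policy.
alphas = {"vision_encoder.": 0.0, "language_model.": 0.5, "action_expert.": 0.0}
pre = {"vision_encoder.w": 0.0, "language_model.w": 0.0, "action_expert.w": 0.0}
fin = {"vision_encoder.w": 2.0, "language_model.w": 2.0, "action_expert.w": 2.0}
merged = merge_by_component(pre, fin, alphas)
# Only language_model.w moves (0.0 -> 1.0); the other components stay pretrained.
```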
Continual skill acquisition framework through sequential parameter merging

The authors show that RETAIN can be applied iteratively to sequentially add new tasks into a pretrained checkpoint by merging finetuned weights into the base model. This enables lifelong learning where a single policy accumulates new capabilities without sacrificing previously learned generalist abilities.

10 retrieved papers
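Under one plausible reading of this framework, each new task is finetuned from the current merged checkpoint and then folded back in with the same interpolation. In the sketch below, `fake_finetune` is a stand-in for an actual training run, used only to make the loop executable:

```python
def continual_merge(base, tasks, finetune, alpha):
    """Sequentially fold task-specific finetuned weights into one policy."""
    policy = dict(base)
    for task in tasks:
        tuned = finetune(policy, task)  # finetune the current policy on the new task
        policy = {name: (1.0 - alpha) * policy[name] + alpha * tuned[name]
                  for name in policy}
    return policy

# Stub "finetuning" that just shifts every parameter, for illustration only.
def fake_finetune(weights, task):
    return {name: value + 1.0 for name, value in weights.items()}

final = continual_merge({"w": 0.0}, ["task_a", "task_b"], fake_finetune, alpha=0.5)
# Two rounds of finetune-then-merge: w goes 0.0 -> 0.5 -> 1.0.
```

Because each merge pulls the policy only part of the way toward the latest task-specific weights, earlier capabilities are diluted gradually rather than overwritten, which is the mechanism the lifelong-learning claim relies on.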

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
