Robust Fine-tuning of Vision-Language-Action Robot Policies via Parameter Merging
Overview
Overall Novelty Assessment
The paper proposes weight interpolation between pretrained and finetuned vision-language-action (VLA) models to preserve generalist capabilities while learning new tasks. It resides in the 'Weight Interpolation and Model Merging' leaf, which contains only two papers in total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Parameter-Efficient Adaptation and Catastrophic Forgetting Mitigation', one of three major branches addressing continual learning in VLA policies. The small sibling count suggests that this specific approach, direct parameter blending, is less explored than modular expert libraries or reinforcement learning-based adaptation methods.
The taxonomy reveals neighboring directions that tackle similar challenges through different mechanisms. The sibling branch 'Progressive Expert Libraries and Knowledge-Driven Continual Learning' uses modular components rather than weight merging, while the parallel 'Reinforcement Learning and Self-Improvement' branch emphasizes online learning and world models. The 'Explicit Reasoning and Visual Latent Planning' branch decouples high-level reasoning from action execution entirely. The paper's approach contrasts with these alternatives by operating directly on parameter space without requiring modular architectures, explicit reasoning systems, or environmental interaction during adaptation.
Among the thirty candidates examined across the three contributions, none was identified as clearly refuting the proposed methods. Ten candidates were examined for each contribution (the RETAIN parameter-merging method, the modality-specific merging strategy, and the continual skill acquisition framework), and no refuting match was found for any of them. This absence of overlapping prior work within the limited search scope suggests that the specific combination of weight interpolation for VLA policies, modality-aware merging, and sequential skill accumulation may represent a relatively unexplored configuration. Note, however, that the search examined only the top-K semantic matches and citations, not the entire literature.
The analysis indicates the paper occupies a sparsely populated research direction within a broader field that favors modular or reinforcement learning-based solutions. The limited search scope—thirty candidates across three contributions—provides evidence of novelty within examined papers but cannot confirm exhaustive originality. The taxonomy structure suggests the field is actively exploring diverse continual learning strategies, with weight interpolation representing one of several competing paradigms for balancing plasticity and stability in robot policy adaptation.
Claimed Contributions
The authors propose a method that interpolates the weights of pretrained and finetuned generalist robot policies in weight space. This simple weight merging strategy enables finetuned policies to generalize better to out-of-distribution variations of new tasks while retaining the broad competencies from pretraining.
The authors extend their merging approach to use different interpolation coefficients for the vision encoder, language model backbone, and action expert components of VLA policies. They demonstrate that merging only the language model parameters can be sufficient for effective policy adaptation.
The authors show that RETAIN can be applied iteratively to sequentially add new tasks into a pretrained checkpoint by merging finetuned weights into the base model. This enables lifelong learning where a single policy accumulates new capabilities without sacrificing previously learned generalist abilities.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Actions as language: Fine-tuning VLMs into VLAs without catastrophic forgetting PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
RETAIN method for robust robot policy finetuning via parameter merging
The authors propose a method that interpolates the weights of pretrained and finetuned generalist robot policies in weight space. This simple weight merging strategy enables finetuned policies to generalize better to out-of-distribution variations of new tasks while retaining the broad competencies from pretraining.
[8] Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards PDF
[9] Policy-space interpolation for physics-based characters PDF
[10] PoCo: Policy composition from and for heterogeneous robot learning PDF
[11] Waypoint-based imitation learning for robotic manipulation PDF
[12] A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions PDF
[13] Robust imitation of diverse behaviors PDF
[14] DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation PDF
[15] RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models PDF
[16] Robust Policy Generalization and Risk Quantification in Autonomous Robotics PDF
[17] Robust Fine-tuning for Pre-trained 3D Point Cloud Models PDF
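The weight-interpolation idea behind this contribution can be sketched in a few lines. The function name, the coefficient name `alpha`, and the toy float-valued "checkpoints" below are all illustrative assumptions, not the paper's implementation; in practice the dictionaries would hold parameter tensors from two checkpoints of the same architecture.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Linearly blend two checkpoints with the same parameter names.

    alpha = 0.0 keeps the pretrained generalist weights;
    alpha = 1.0 keeps the task-finetuned weights;
    intermediate values trade off the two.
    """
    assert pretrained.keys() == finetuned.keys(), "checkpoints must match"
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Toy example: plain floats stand in for parameter tensors.
pre = {"vision.w": 1.0, "lang.w": 2.0, "action.w": 3.0}
fin = {"vision.w": 3.0, "lang.w": 4.0, "action.w": 5.0}
merged = interpolate_weights(pre, fin, alpha=0.5)
print(merged)  # {'vision.w': 2.0, 'lang.w': 3.0, 'action.w': 4.0}
```

A single scalar `alpha` is the simplest possible merging rule, which is what makes the contribution's claim notable: no modular architecture or extra training is needed at merge time.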
Modality-specific parameter merging for vision-language-action policies
The authors extend their merging approach to use different interpolation coefficients for the vision encoder, language model backbone, and action expert components of VLA policies. They demonstrate that merging only the language model parameters can be sufficient for effective policy adaptation.
[18] Remedy: Recipe merging dynamics in large vision-language models PDF
[19] A dynamic weighted fusion model for multimodal sentiment analysis PDF
[20] Parameter efficient merging for multimodal large language models with complementary parameter adaptation PDF
[21] Visual grounding with multi-modal conditional adaptation PDF
[22] Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate PDF
[23] X-Fusion: Introducing New Modality to Frozen Large Language Models PDF
[24] Multi-modality expansion and retention for LLMs through parameter merging and decoupling PDF
[25] An Empirical Study of Multimodal Model Merging PDF
[26] Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models PDF
[27] MAKE: Vision-Language Pre-training based Product Retrieval in Taobao Search PDF
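Modality-specific merging can be sketched by assigning a separate coefficient to each component of the policy. Everything below is a hypothetical illustration: the prefix-matching scheme, the parameter names, and the choice to keep non-language components fully finetuned are assumptions, since the report only states that different coefficients are used per component and that merging only the language model can suffice.

```python
def merge_by_component(pretrained, finetuned, alphas, default=1.0):
    """Per-component interpolation of two checkpoints.

    alphas maps a parameter-name prefix (e.g. "lang.") to its own
    interpolation coefficient; parameters matching no prefix use
    `default` (1.0 = keep the finetuned value unchanged).
    """
    merged = {}
    for name in pretrained:
        alpha = default
        for prefix, a in alphas.items():
            if name.startswith(prefix):
                alpha = a
                break
        merged[name] = (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
    return merged

pre = {"vision.w": 1.0, "lang.w": 2.0, "action.w": 3.0}
fin = {"vision.w": 3.0, "lang.w": 4.0, "action.w": 5.0}
# One plausible reading of "merging only the language model": interpolate
# just the language backbone toward pretrained, keep the rest finetuned.
merged = merge_by_component(pre, fin, alphas={"lang.": 0.5})
print(merged)  # {'vision.w': 3.0, 'lang.w': 3.0, 'action.w': 5.0}
```

Separating the coefficients per component is what distinguishes this contribution from the uniform-alpha merging of the first one.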
Continual skill acquisition framework through sequential parameter merging
The authors show that RETAIN can be applied iteratively to sequentially add new tasks into a pretrained checkpoint by merging finetuned weights into the base model. This enables lifelong learning where a single policy accumulates new capabilities without sacrificing previously learned generalist abilities.
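The iterative variant described above can be sketched as repeated merging of finetuned checkpoints into a running base model. This is a minimal sketch of the merging arithmetic only: in the procedure the report describes, each new task's checkpoint would presumably be finetuned from the current merged model before being folded in, and the function and variable names here are illustrative assumptions.

```python
def accumulate_tasks(base, task_checkpoints, alpha=0.5):
    """Fold a sequence of task-finetuned checkpoints into one policy.

    After each task, the merged model becomes the new base, so earlier
    capabilities are retained (down-weighted geometrically by alpha)
    while new skills are mixed in.
    """
    merged = dict(base)
    for ckpt in task_checkpoints:
        merged = {
            name: (1.0 - alpha) * merged[name] + alpha * ckpt[name]
            for name in merged
        }
    return merged

# Toy example with a single scalar "parameter".
base = {"w": 0.0}
tasks = [{"w": 2.0}, {"w": 4.0}]
print(accumulate_tasks(base, tasks, alpha=0.5))  # {'w': 2.5}
```

The order of merging matters here: with a fixed alpha, later tasks receive more weight than earlier ones, which is one reason sequential merging is a distinct contribution rather than a trivial corollary of pairwise interpolation.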