MetaVLA: Unified Meta Co-Training for Efficient Embodied Adaptation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision–Language–Action models, Efficient Robot Reasoning, Generalization
Abstract:

Vision–Language–Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists—they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism—derived from Attentive Neural Processes—to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by 76%. These results show that scalable, low-resource post-training is achievable—paving the way toward general-purpose embodied agents. Code will be available.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MetaVLA proposes a unified post-training framework that combines meta-learning with multi-task co-training to enable efficient adaptation of vision-language-action models. The paper sits in the Meta-Learning and Multi-Task Co-Training leaf, which currently contains this work as its sole member. This sparse positioning suggests that combining meta-learning mechanisms with multi-task auxiliary training for VLA adaptation is a relatively unexplored direction within the broader post-training optimization landscape, which includes more populated branches such as reinforcement learning fine-tuning and supervised strategies.

The taxonomy reveals neighboring research directions that address efficient adaptation through alternative mechanisms. The Reinforcement Learning-Based Post-Training branch contains multiple subcategories exploring online RL fine-tuning, offline trajectory optimization, and world model-based methods. The Supervised and Hybrid Fine-Tuning Strategies branch includes work on data generation and self-improvement. The Adaptation Paradigms branch explores parameter-efficient methods like adapters and LoRA, as well as in-context learning approaches. MetaVLA diverges from these by integrating meta-learning derived from Attentive Neural Processes specifically for rapid cross-task generalization, rather than relying on extensive online interaction, large-scale data augmentation, or architectural parameter isolation.

Among the twenty-six candidates examined across three contributions, no clearly refuting prior work was identified: the MetaVLA unified framework was checked against six candidates, the Context-Aware Meta Co-Training mechanism against ten, and the Action-ANP module against ten, with zero refutations in each case. This suggests that, within the limited search scope, the specific integration of meta-learning with multi-task auxiliary training for VLA post-training appears relatively novel. The absence of refuting work may reflect both the sparse population of this research direction and the limited scale of the literature search conducted.

Based on the top-26 semantic matches examined, the work appears to occupy a distinct position combining meta-learning with multi-task co-training for VLA adaptation. The analysis does not cover exhaustive literature search across all meta-learning or multi-task learning methods in robotics, nor does it examine broader transfer learning frameworks outside the VLA context. The novelty assessment is therefore constrained by the search scope and the specific framing of contributions within the VLA post-training domain.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: Efficient post-training adaptation for vision-language-action models. The field has organized itself around several complementary directions. Post-Training Optimization Methods explore techniques such as meta-learning, multi-task co-training, reinforcement learning fine-tuning (e.g., VLA-R1[8], πRL[9]), and interactive or self-improving strategies (Interactive Post-Training for Vision-Language-Action[2], Self-Improving Vision-Language-Action Models[39]). Architectural and Representation Enhancements focus on novel backbones, spatial reasoning modules (Spatialvla[1], 3D CAVLA[18]), and adapter-based designs (VLA-ADAPTER[19], Vla-adapter[29]). Adaptation Paradigms and Transfer Mechanisms investigate how pretrained vision-language models can be efficiently specialized for robotic control, while Model Compression and Efficiency address deployment constraints through distillation and pruning (Tinyvla[4], TinyVLA[5]). Meanwhile, Pretraining Data and Foundation Models examine large-scale data curation and base model choices (OpenVLA[3], RT-2[25]), and Evaluation and Robustness Analysis assess generalization and failure modes across diverse tasks.

Several active lines of work reveal key trade-offs between sample efficiency, computational cost, and generalization. Meta-learning approaches like MetaVLA[0] aim to enable rapid adaptation across multiple tasks with minimal data, contrasting with methods that rely on extensive online RL fine-tuning (Online RL Fine-tuning[34]) or large-scale supervised pretraining (Scalable vision-language-action model pretraining[44]). MetaVLA[0] sits within the Post-Training Optimization Methods branch, specifically under Meta-Learning and Multi-Task Co-Training, emphasizing few-shot transfer and cross-task knowledge sharing.
Compared to interactive refinement strategies like Interactive Post-Training for Vision-Language-Action[2] or self-correcting frameworks (A Self-Correcting Vision-Language-Action Model[41]), MetaVLA[0] prioritizes learning reusable task priors rather than iterative policy improvement. This positioning highlights an ongoing question in the field: whether efficient adaptation is best achieved through better initialization via meta-learning, richer interaction during fine-tuning, or architectural innovations that reduce parameter overhead.

Claimed Contributions

MetaVLA unified post-training framework

The authors introduce MetaVLA, a framework designed for efficient post-training of Vision-Language-Action models that works across different backbone architectures and enables scalable alignment to new tasks without requiring task-specific fine-tuning.

6 retrieved papers
Context-Aware Meta Co-Training mechanism

The authors propose a training approach that jointly trains all target tasks in one unified stage while using diverse auxiliary tasks to enhance generalization. This mechanism integrates a lightweight meta-learning module derived from Attentive Neural Processes to enable rapid adaptation from diverse contexts.

10 retrieved papers
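The single-stage co-training idea described above can be sketched as a task-sampling schedule that mixes target and auxiliary tasks within one fine-tuning run. This is a hypothetical illustration only: the task names, the `aux_ratio` knob, and the uniform sampling scheme are assumptions for the sketch, not details taken from the paper.

```python
import random

def co_training_batches(target_tasks, auxiliary_tasks, steps, aux_ratio=0.3, seed=0):
    """Yield (task, split) pairs for one unified fine-tuning stage.

    Hypothetical sketch: aux_ratio and uniform task sampling are
    illustrative assumptions, not the paper's actual schedule.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < aux_ratio:
            yield rng.choice(auxiliary_tasks), "auxiliary"
        else:
            yield rng.choice(target_tasks), "target"

# Example: four LIBERO-style target suites co-trained with six auxiliary tasks
# (the auxiliary task names here are placeholders).
schedule = list(co_training_batches(
    ["libero_spatial", "libero_object", "libero_goal", "libero_long"],
    [f"aux_task_{i}" for i in range(6)],
    steps=10,
))
```

The point of the sketch is that both target and auxiliary tasks flow through the same optimization stage, rather than auxiliary pretraining followed by per-task fine-tuning.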
Action-ANP module for meta-learning

The authors develop Action-ANP, a compact module based on Attentive Neural Processes that aggregates contextual demonstrations through self-attention and cross-attention mechanisms. This module enables the model to leverage both in-domain and auxiliary task data for improved adaptation while adding minimal computational overhead.

10 retrieved papers
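The attentive aggregation described above can be sketched in a few lines of NumPy: self-attention over the set of context demonstrations, then cross-attention from the current observation to that set. All shapes, names, and the simple additive (observation, action) fusion are assumptions of this sketch; the actual Action-ANP architecture is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: (n_q, d) queries over (n_k, d) keys
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def anp_context_aggregation(context_obs, context_act, target_obs):
    """ANP-style context aggregation (hypothetical sketch).

    context_obs: (n_ctx, d) observation embeddings of demonstrations
    context_act: (n_ctx, d) paired action embeddings
    target_obs:  (n_tgt, d) embeddings of the current observations
    """
    pairs = context_obs + context_act        # fuse each (obs, act) pair (assumed)
    pairs = attention(pairs, pairs, pairs)   # self-attention over the context set
    # cross-attention: each target observation queries the context set
    return attention(target_obs, context_obs, pairs)

rng = np.random.default_rng(0)
d = 8
ctx_o, ctx_a = rng.normal(size=(5, d)), rng.normal(size=(5, d))
tgt = rng.normal(size=(3, d))
rep = anp_context_aggregation(ctx_o, ctx_a, tgt)
print(rep.shape)  # (3, 8): one context-conditioned representation per target
```

Because the context set is aggregated by attention rather than by extra trainable per-task heads, such a module can in principle stay lightweight at inference, consistent with the minimal-overhead claim.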

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MetaVLA unified post-training framework

The authors introduce MetaVLA, a framework designed for efficient post-training of Vision-Language-Action models that works across different backbone architectures and enables scalable alignment to new tasks without requiring task-specific fine-tuning.

Contribution

Context-Aware Meta Co-Training mechanism

The authors propose a training approach that jointly trains all target tasks in one unified stage while using diverse auxiliary tasks to enhance generalization. This mechanism integrates a lightweight meta-learning module derived from Attentive Neural Processes to enable rapid adaptation from diverse contexts.

Contribution

Action-ANP module for meta-learning

The authors develop Action-ANP, a compact module based on Attentive Neural Processes that aggregates contextual demonstrations through self-attention and cross-attention mechanisms. This module enables the model to leverage both in-domain and auxiliary task data for improved adaptation while adding minimal computational overhead.