MetaVLA: Unified Meta Co-Training for Efficient Embodied Adaptation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision–Language–Action models, Efficient Robot Reasoning, Generalization
Abstract:

Vision–Language–Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists—they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism—derived from Attentive Neural Processes—to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by 76%. These results show that scalable, low-resource post-training is achievable—paving the way toward general-purpose embodied agents. Code will be available.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MetaVLA proposes a unified post-training framework that combines meta-learning with multi-task co-training to enable efficient adaptation of vision-language-action models. The paper sits in the Meta-Learning and Multi-Task Co-Training leaf, which currently contains this work as its sole member. This sparse positioning suggests that combining meta-learning mechanisms with multi-task auxiliary training for VLA adaptation is a relatively unexplored direction within the broader post-training optimization landscape, which includes more populated branches such as reinforcement learning fine-tuning and supervised strategies.

The taxonomy reveals neighboring research directions that address efficient adaptation through alternative mechanisms. The Reinforcement Learning-Based Post-Training branch contains multiple subcategories exploring online RL fine-tuning, offline trajectory optimization, and world model-based methods. The Supervised and Hybrid Fine-Tuning Strategies branch includes work on data generation and self-improvement. The Adaptation Paradigms branch explores parameter-efficient methods like adapters and LoRA, as well as in-context learning approaches. MetaVLA diverges from these by integrating meta-learning derived from Attentive Neural Processes specifically for rapid cross-task generalization, rather than relying on extensive online interaction, large-scale data augmentation, or architectural parameter isolation.

Among the twenty-six candidates examined across three contributions, no clearly refuting prior work was identified: the MetaVLA unified framework was checked against six candidates, the Context-Aware Meta Co-Training mechanism against ten, and the Action-ANP module against ten, with zero refutations in each case. This suggests that, within the limited search scope, the specific integration of meta-learning with multi-task auxiliary training for VLA post-training appears relatively novel. The absence of refuting work may reflect both the sparse population of this research direction and the limited scale of the literature search conducted.

Based on the top-26 semantic matches examined, the work appears to occupy a distinct position combining meta-learning with multi-task co-training for VLA adaptation. The analysis does not cover exhaustive literature search across all meta-learning or multi-task learning methods in robotics, nor does it examine broader transfer learning frameworks outside the VLA context. The novelty assessment is therefore constrained by the search scope and the specific framing of contributions within the VLA post-training domain.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: Efficient post-training adaptation for vision-language-action models. The field has organized itself around several complementary directions. Post-Training Optimization Methods explore techniques such as meta-learning, multi-task co-training, reinforcement learning fine-tuning (e.g., VLA-R1[8], πRL[9]), and interactive or self-improving strategies (Interactive Post-Training for Vision-Language-Action[2], Self-Improving Vision-Language-Action Models[39]). Architectural and Representation Enhancements focus on novel backbones, spatial reasoning modules (Spatialvla[1], 3D CAVLA[18]), and adapter-based designs (VLA-ADAPTER[19], Vla-adapter[29]). Adaptation Paradigms and Transfer Mechanisms investigate how pretrained vision-language models can be efficiently specialized for robotic control, while Model Compression and Efficiency address deployment constraints through distillation and pruning (Tinyvla[4], TinyVLA[5]). Meanwhile, Pretraining Data and Foundation Models examine large-scale data curation and base model choices (OpenVLA[3], RT-2[25]), and Evaluation and Robustness Analysis assess generalization and failure modes across diverse tasks.

Several active lines of work reveal key trade-offs between sample efficiency, computational cost, and generalization. Meta-learning approaches like MetaVLA[0] aim to enable rapid adaptation across multiple tasks with minimal data, contrasting with methods that rely on extensive online RL fine-tuning (Online RL Fine-tuning[34]) or large-scale supervised pretraining (Scalable vision-language-action model pretraining[44]). MetaVLA[0] sits within the Post-Training Optimization Methods branch, specifically under Meta-Learning and Multi-Task Co-Training, emphasizing few-shot transfer and cross-task knowledge sharing.
Compared to interactive refinement strategies like Interactive Post-Training for Vision-Language-Action[2] or self-correcting frameworks (A Self-Correcting Vision-Language-Action Model[41]), MetaVLA[0] prioritizes learning reusable task priors rather than iterative policy improvement. This positioning highlights an ongoing question in the field: whether efficient adaptation is best achieved through better initialization via meta-learning, richer interaction during fine-tuning, or architectural innovations that reduce parameter overhead.

Claimed Contributions

MetaVLA unified post-training framework

The authors introduce MetaVLA, a framework designed for efficient post-training of Vision-Language-Action models that works across different backbone architectures and enables scalable alignment to new tasks without requiring task-specific fine-tuning.

6 retrieved papers
Context-Aware Meta Co-Training mechanism

The authors propose a training approach that jointly trains all target tasks in one unified stage while using diverse auxiliary tasks to enhance generalization. This mechanism integrates a lightweight meta-learning module derived from Attentive Neural Processes to enable rapid adaptation from diverse contexts.

10 retrieved papers
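The single-stage co-training idea described above can be sketched as a task-sampling schedule that mixes target and auxiliary tasks within one fine-tuning run. This is a hypothetical illustration only: the task names, the `aux_ratio` knob, and the uniform sampling scheme are assumptions for the sketch, not details taken from the paper.

```python
import random

def co_training_batches(target_tasks, auxiliary_tasks, steps, aux_ratio=0.3, seed=0):
    """Yield (task, split) pairs for one unified fine-tuning stage.

    Hypothetical sketch: aux_ratio and uniform task sampling are
    illustrative assumptions, not the paper's actual schedule.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < aux_ratio:
            yield rng.choice(auxiliary_tasks), "auxiliary"
        else:
            yield rng.choice(target_tasks), "target"

# Example: four LIBERO-style target suites co-trained with six auxiliary tasks
# (the auxiliary task names here are placeholders).
schedule = list(co_training_batches(
    ["libero_spatial", "libero_object", "libero_goal", "libero_long"],
    [f"aux_task_{i}" for i in range(6)],
    steps=10,
))
```

The point of the sketch is that both target and auxiliary tasks flow through the same optimization stage, rather than auxiliary pretraining followed by per-task fine-tuning.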
Action-ANP module for meta-learning

The authors develop Action-ANP, a compact module based on Attentive Neural Processes that aggregates contextual demonstrations through self-attention and cross-attention mechanisms. This module enables the model to leverage both in-domain and auxiliary task data for improved adaptation while adding minimal computational overhead.

10 retrieved papers
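The attentive aggregation described above can be sketched in a few lines of NumPy: self-attention over the set of context demonstrations, then cross-attention from the current observation to that set. All shapes, names, and the simple additive (observation, action) fusion are assumptions of this sketch; the actual Action-ANP architecture is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: (n_q, d) queries over (n_k, d) keys
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def anp_context_aggregation(context_obs, context_act, target_obs):
    """ANP-style context aggregation (hypothetical sketch).

    context_obs: (n_ctx, d) observation embeddings of demonstrations
    context_act: (n_ctx, d) paired action embeddings
    target_obs:  (n_tgt, d) embeddings of the current observations
    """
    pairs = context_obs + context_act        # fuse each (obs, act) pair (assumed)
    pairs = attention(pairs, pairs, pairs)   # self-attention over the context set
    # cross-attention: each target observation queries the context set
    return attention(target_obs, context_obs, pairs)

rng = np.random.default_rng(0)
d = 8
ctx_o, ctx_a = rng.normal(size=(5, d)), rng.normal(size=(5, d))
tgt = rng.normal(size=(3, d))
rep = anp_context_aggregation(ctx_o, ctx_a, tgt)
print(rep.shape)  # (3, 8): one context-conditioned representation per target
```

Because the context set is aggregated by attention rather than by extra trainable per-task heads, such a module can in principle stay lightweight at inference, consistent with the minimal-overhead claim.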

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MetaVLA unified post-training framework

The authors introduce MetaVLA, a framework designed for efficient post-training of Vision-Language-Action models that works across different backbone architectures and enables scalable alignment to new tasks without requiring task-specific fine-tuning.

Contribution

Context-Aware Meta Co-Training mechanism

The authors propose a training approach that jointly trains all target tasks in one unified stage while using diverse auxiliary tasks to enhance generalization. This mechanism integrates a lightweight meta-learning module derived from Attentive Neural Processes to enable rapid adaptation from diverse contexts.

Contribution

Action-ANP module for meta-learning

The authors develop Action-ANP, a compact module based on Attentive Neural Processes that aggregates contextual demonstrations through self-attention and cross-attention mechanisms. This module enables the model to leverage both in-domain and auxiliary task data for improved adaptation while adding minimal computational overhead.