Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Embodied Reasoning, Vision-Language Model, Embodied AI, Reinforcement Learning, Zero-shot Generalization
Abstract:

Generalization in embodied AI is hindered by the "seeing-to-doing gap", which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) designed specifically for embodied reasoning and pointing. Drawing on a wide range of embodied and general visual-reasoning datasets, we construct Embodied-Points-200K, a large-scale dataset that supports the key embodied pointing capabilities. We then train Embodied-R1 with a two-stage Reinforced Fine-tuning (RFT) curriculum that uses a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization, achieving a 56.2% success rate in SimplerEnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, a 62% improvement over strong baselines. The model also exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
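To make the abstract's central idea concrete: a predicted image point can act as an embodiment-agnostic interface because any robot with a calibrated camera can lift it into its own workspace. Below is a minimal sketch of one plausible lifting step using standard pinhole-camera deprojection; the function names, the use of a depth image, and the camera-to-base transform are illustrative assumptions, not Embodied-R1's actual pipeline.

```python
import numpy as np

def deproject_point(u, v, depth_m, fx, fy, cx, cy):
    """Standard pinhole deprojection: pixel (u, v) at depth z -> camera-frame XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def point_to_target(pixel, depth_image, intrinsics, cam_to_base):
    """Hypothetical helper: turn a VLM pointing output into a base-frame 3D target."""
    u, v = pixel
    z = float(depth_image[int(v), int(u)])          # metres at the pointed pixel
    p_cam = deproject_point(u, v, z, **intrinsics)  # camera frame
    p_hom = np.append(p_cam, 1.0)                   # homogeneous coordinates
    return (cam_to_base @ p_hom)[:3]                # robot base frame
```

Because only the final transform is robot-specific, the same pointing output can drive different embodiments, which is the sense in which such a representation is embodiment-agnostic.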

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces pointing as a unified intermediate representation for embodied AI, proposing four core abilities that bridge vision-language understanding and low-level control. It sits within the Open-Vocabulary Object Manipulation leaf of the Vision-Language Grounding for Manipulation branch, alongside two sibling papers. This leaf represents a relatively sparse research direction within the broader taxonomy of 18 papers across multiple branches, suggesting the work addresses a focused problem area rather than a densely populated subfield.

The taxonomy reveals neighboring research directions that contextualize this contribution. The Interactive Language-Guided Manipulation leaf explores dialogue-based disambiguation, while Pointing-Based Object Teaching and Gesture-Based Teleoperation systems under Human-Robot Interaction Modalities use pointing for different purposes: object teaching and direct control, respectively. The scope notes clarify that this work differs by targeting zero-shot manipulation without interactive disambiguation, distinguishing it from both dialogue-driven approaches and gesture-teleoperation paradigms that do not emphasize open-vocabulary generalization.

Among 29 candidates examined across the three contributions, none were identified as clearly refuting the proposed innovations. For the pointing representation, 9 candidates were examined with 0 refutations; for the Embodied-Points-200K dataset, 10 candidates with 0 refutations; and for the Embodied-R1 model, 10 candidates with 0 refutations. Within this limited search scope, none of the top semantic matches substantially overlaps with the specific combination of pointing as an embodiment-agnostic representation, the dataset construction approach, or the two-stage reinforced fine-tuning paradigm.

Based on the analysis of 29 semantically related candidates, the work appears to occupy a distinct position, combining vision-language grounding with pointing-based action primitives. The absence of refuting prior work within this search scope indicates potential novelty, though the limited candidate pool means the analysis does not constitute an exhaustive literature review. The sparse population of the Open-Vocabulary Object Manipulation leaf and the specific framing around embodied pointing suggest differentiation from existing approaches in neighboring research directions.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Bridging perception and action in robotic manipulation through embodied pointing. The field encompasses multiple interconnected branches that address how robots can understand and execute manipulation tasks by integrating visual perception, language understanding, and physical action. Vision-Language Grounding for Manipulation focuses on enabling robots to identify and manipulate objects based on natural language descriptions or open-vocabulary queries, with works like Open-world Manipulation[1] and Track2Act[2] demonstrating methods for handling diverse object categories beyond fixed training sets. Human-Robot Interaction Modalities explores how humans can communicate intent through gestures, pointing, and other natural interfaces, as seen in Pointing Gestures Digit[3], RGB-D Hand Gesture[4], and Gesture-based Assistance[6]. Visuomotor Integration and Learning addresses the fundamental challenge of connecting visual input to motor control, ranging from classical approaches such as Visuomotor Learning[12] and Visual-Tactile-Motor Integration[13] to modern learning-based methods. Robotic System Design and Implementation covers the hardware and control architectures that enable these capabilities, including specialized hands like ALPHA Hand[14] and integrated systems such as Mobile Manipulation Control[15].

A particularly active line of work centers on open-vocabulary manipulation, where systems must generalize to novel objects and attributes without exhaustive pre-training. This contrasts with gesture-based interaction approaches that prioritize intuitive human communication over linguistic flexibility.

Embodied-R1[0] sits at the intersection of these themes within the Vision-Language Grounding branch, specifically targeting open-vocabulary object manipulation through embodied pointing as a perceptual-action bridge. Compared to Open-world Manipulation[1], which emphasizes broad generalization across object categories, and Track2Act[2], which focuses on tracking-driven action policies, Embodied-R1[0] appears to leverage pointing itself as an intermediate grounding mechanism that tightly couples perception with manipulation intent. This approach echoes earlier gesture-based methods while incorporating modern vision-language capabilities, suggesting a synthesis of the naturalness of human-robot interaction with the flexibility of open-vocabulary systems.

Claimed Contributions

Pointing as unified embodiment-agnostic intermediate representation with four core abilities

The authors introduce pointing as a unified representation for robotic manipulation and systematically define four fundamental embodied pointing capabilities: Referring Expression Grounding, Region Referring Grounding, Object Functional Grounding, and Visual Trace Generation. These abilities bridge semantic understanding with physical action in an embodiment-agnostic manner; a sketch of how the four tasks could share one point-based output format follows below.

9 retrieved papers
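As a reading aid, here is a minimal sketch of the four abilities expressed against a single point-based output format. The enum and dataclass names are hypothetical: the paper defines the abilities, but this report does not have its exact schema.

```python
from dataclasses import dataclass
from enum import Enum

class PointingTask(Enum):
    REFERRING_EXPRESSION_GROUNDING = "reg"  # "point to the red mug"
    REGION_REFERRING_GROUNDING = "rrg"      # answer a prompt about a marked region
    OBJECT_FUNCTIONAL_GROUNDING = "ofg"     # point to the part that affords the action
    VISUAL_TRACE_GENERATION = "vtg"         # ordered 2D waypoints for a motion

@dataclass
class PointingQuery:
    task: PointingTask
    image: bytes        # RGB observation
    instruction: str    # natural-language prompt

@dataclass
class PointingAnswer:
    # Every ability resolves to image-space points: a single point for the
    # three grounding tasks, an ordered sequence for visual traces.
    points: list[tuple[float, float]]
```

The design point this illustrates is that a downstream controller only ever consumes points, regardless of which ability produced them.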
Embodied-Points-200K dataset for embodied pointing capabilities

The authors construct a large-scale dataset of approximately 200K high-quality samples structured as question-verification pairs. The dataset is curated from diverse sources to support the four core embodied pointing capabilities and addresses the multi-solution dilemma inherent in pointing tasks; a sketch of such a question-verification record follows below.

10 retrieved papers
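A minimal sketch of what a question-verification pair might look like, assuming the verifier is a segmentation mask; the field names and paths are hypothetical. The point of verification over a single gold coordinate is that any pixel inside the target region counts as correct, which is how such data can sidestep the multi-solution dilemma.

```python
import numpy as np

def verify_point(pred_xy, gt_mask):
    """A predicted point is correct if it lands anywhere inside the target mask."""
    x, y = int(pred_xy[0]), int(pred_xy[1])
    in_bounds = 0 <= y < gt_mask.shape[0] and 0 <= x < gt_mask.shape[1]
    return bool(in_bounds and gt_mask[y, x])

# Hypothetical record layout for one sample.
sample = {
    "question": "Point to the part of the kettle you would grasp to pour.",
    "image_path": "images/000123.png",
    "verifier": {"type": "mask", "mask_path": "masks/000123.png"},
}

# Usage: many distinct pixels verify as equally correct.
mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:340] = True                 # stand-in for a handle mask
assert verify_point((310.0, 215.0), mask)     # one valid answer
assert verify_point((335.0, 255.0), mask)     # a different, equally valid answer
```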
Embodied-R1 model trained with two-stage Reinforced Fine-tuning paradigm

The authors propose Embodied-R1, a 3B-parameter VLM trained with a two-stage RFT curriculum and a specialized multi-task reward design. This training paradigm enables flexible free-form reasoning beyond rigid CoT templates, resolves the multi-solution dilemma in embodied pointing, and achieves superior generalization compared to standard SFT approaches; a sketch of a verifiable multi-task reward follows below.

10 retrieved papers
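The report does not contain the paper's exact reward terms, so the following is a hedged sketch of a verifiable multi-task reward of the kind used in RFT/GRPO-style training: a small format term for emitting parseable points, plus a task-dependent accuracy term (point-in-mask for the grounding tasks, waypoint error for traces). The weights, the regex, and the 50-pixel error scale are assumptions.

```python
import re
import numpy as np

POINT_RE = re.compile(r"\((\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\)")

def parse_points(text):
    """Extract '(x, y)' pairs from a free-form model response."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(text)]

def reward(response, task, gt_mask=None, gt_trace=None):
    pts = parse_points(response)
    if not pts:
        return 0.0                                  # unparseable output earns nothing
    fmt = 0.1                                       # small bonus for valid format
    if task in {"reg", "rrg", "ofg"}:               # single-point grounding tasks
        x, y = int(pts[0][0]), int(pts[0][1])
        hit = (0 <= y < gt_mask.shape[0] and 0 <= x < gt_mask.shape[1]
               and bool(gt_mask[y, x]))
        return fmt + (1.0 if hit else 0.0)          # any in-mask point scores fully
    if task == "vtg":                               # visual trace generation
        pred = np.array(pts, dtype=float)
        gt = np.array(gt_trace, dtype=float)
        n = min(len(pred), len(gt))
        err = np.linalg.norm(pred[:n] - gt[:n], axis=1).mean()
        return fmt + float(np.exp(-err / 50.0))     # assumed 50 px error scale
    return fmt
```

Because the grounding term is verification-based rather than distance to a single gold point, many distinct correct answers earn the same full reward, which is one concrete way a reward design can resolve the multi-solution dilemma this contribution describes.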

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Pointing as unified embodiment-agnostic intermediate representation with four core abilities

Contribution: Embodied-Points-200K dataset for embodied pointing capabilities

Contribution: Embodied-R1 model trained with two-stage Reinforced Fine-tuning paradigm