Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Embodied Reasoning, Vision-Language Model, Embodied AI, Reinforcement Learning, Zero-shot Generalization
Abstract:

Generalization in embodied AI is hindered by the "seeing-to-doing gap", which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) designed specifically for embodied reasoning and pointing. Drawing on a wide range of embodied and general visual-reasoning datasets, we construct Embodied-Points-200K, a large-scale dataset that supports the key embodied pointing capabilities. We then train Embodied-R1 with a two-stage Reinforced Fine-tuning (RFT) curriculum that uses a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization, achieving a 56.2% success rate in SimplerEnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, a 62% improvement over strong baselines. The model also exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
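To make the abstract's central idea concrete: a predicted image point can act as an embodiment-agnostic interface because any robot with a calibrated camera can lift it into its own workspace. Below is a minimal sketch of one plausible lifting step using standard pinhole-camera deprojection; the function names, the use of a depth image, and the camera-to-base transform are illustrative assumptions, not Embodied-R1's actual pipeline.

```python
import numpy as np

def deproject_point(u, v, depth_m, fx, fy, cx, cy):
    """Standard pinhole deprojection: pixel (u, v) at depth z -> camera-frame XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def point_to_target(pixel, depth_image, intrinsics, cam_to_base):
    """Hypothetical helper: turn a VLM pointing output into a base-frame 3D target."""
    u, v = pixel
    z = float(depth_image[int(v), int(u)])          # metres at the pointed pixel
    p_cam = deproject_point(u, v, z, **intrinsics)  # camera frame
    p_hom = np.append(p_cam, 1.0)                   # homogeneous coordinates
    return (cam_to_base @ p_hom)[:3]                # robot base frame
```

Because only the final transform is robot-specific, the same pointing output can drive different embodiments, which is the sense in which such a representation is embodiment-agnostic.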

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces pointing as a unified intermediate representation for embodied AI, proposing four core abilities that bridge vision-language understanding and low-level control. It sits within the Open-Vocabulary Object Manipulation leaf of the Vision-Language Grounding for Manipulation branch, alongside two sibling papers. This leaf represents a relatively sparse research direction within the broader taxonomy of 18 papers across multiple branches, suggesting the work addresses a focused problem area rather than a densely populated subfield.

The taxonomy reveals neighboring research directions that contextualize this contribution. The Interactive Language-Guided Manipulation leaf explores dialogue-based disambiguation, while Pointing-Based Object Teaching and Gesture-Based Teleoperation systems under Human-Robot Interaction Modalities use pointing for different purposes: object teaching and direct control, respectively. The scope notes clarify that this work differs by targeting zero-shot manipulation without interactive disambiguation, distinguishing it from both dialogue-driven approaches and gesture-teleoperation paradigms that do not emphasize open-vocabulary generalization.

Among 29 candidates examined across the three contributions, none were identified as clearly refuting the proposed innovations. For the pointing representation, 9 candidates were examined with 0 refutations; for the Embodied-Points-200K dataset, 10 candidates with 0 refutations; and for the Embodied-R1 model, 10 candidates with 0 refutations. Within this limited search scope, none of the top semantic matches substantially overlaps with the specific combination of pointing as an embodiment-agnostic representation, the dataset construction approach, or the two-stage reinforced fine-tuning paradigm.

Based on the analysis of 29 semantically related candidates, the work appears to occupy a distinct position, combining vision-language grounding with pointing-based action primitives. The absence of refuting prior work within this search scope indicates potential novelty, though the limited candidate pool means the analysis does not constitute an exhaustive literature review. The sparse population of the Open-Vocabulary Object Manipulation leaf and the specific framing around embodied pointing suggest differentiation from existing approaches in neighboring research directions.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Bridging perception and action in robotic manipulation through embodied pointing. The field encompasses multiple interconnected branches that address how robots can understand and execute manipulation tasks by integrating visual perception, language understanding, and physical action. Vision-Language Grounding for Manipulation focuses on enabling robots to identify and manipulate objects based on natural language descriptions or open-vocabulary queries, with works like Open-world Manipulation[1] and Track2Act[2] demonstrating methods for handling diverse object categories beyond fixed training sets. Human-Robot Interaction Modalities explores how humans can communicate intent through gestures, pointing, and other natural interfaces, as seen in Pointing Gestures Digit[3], RGB-D Hand Gesture[4], and Gesture-based Assistance[6]. Visuomotor Integration and Learning addresses the fundamental challenge of connecting visual input to motor control, ranging from classical approaches such as Visuomotor Learning[12] and Visual-Tactile-Motor Integration[13] to modern learning-based methods. Robotic System Design and Implementation covers the hardware and control architectures that enable these capabilities, including specialized hands like ALPHA Hand[14] and integrated systems such as Mobile Manipulation Control[15].

A particularly active line of work centers on open-vocabulary manipulation, where systems must generalize to novel objects and attributes without exhaustive pre-training. This contrasts with gesture-based interaction approaches that prioritize intuitive human communication over linguistic flexibility.

Embodied-R1[0] sits at the intersection of these themes within the Vision-Language Grounding branch, specifically targeting open-vocabulary object manipulation through embodied pointing as a perceptual-action bridge. Compared to Open-world Manipulation[1], which emphasizes broad generalization across object categories, and Track2Act[2], which focuses on tracking-driven action policies, Embodied-R1[0] appears to leverage pointing itself as an intermediate grounding mechanism that tightly couples perception with manipulation intent. This approach echoes earlier gesture-based methods while incorporating modern vision-language capabilities, suggesting a synthesis of the naturalness of human-robot interaction with the flexibility of open-vocabulary systems.

Claimed Contributions

Pointing as unified embodiment-agnostic intermediate representation with four core abilities

The authors introduce pointing as a unified representation for robotic manipulation and systematically define four fundamental embodied pointing capabilities: Referring Expression Grounding, Region Referring Grounding, Object Functional Grounding, and Visual Trace Generation. These abilities bridge semantic understanding with physical action in an embodiment-agnostic manner; a sketch of how the four tasks could share one point-based output format follows below.

9 retrieved papers
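As a reading aid, here is a minimal sketch of the four abilities expressed against a single point-based output format. The enum and dataclass names are hypothetical: the paper defines the abilities, but this report does not have its exact schema.

```python
from dataclasses import dataclass
from enum import Enum

class PointingTask(Enum):
    REFERRING_EXPRESSION_GROUNDING = "reg"  # "point to the red mug"
    REGION_REFERRING_GROUNDING = "rrg"      # answer a prompt about a marked region
    OBJECT_FUNCTIONAL_GROUNDING = "ofg"     # point to the part that affords the action
    VISUAL_TRACE_GENERATION = "vtg"         # ordered 2D waypoints for a motion

@dataclass
class PointingQuery:
    task: PointingTask
    image: bytes        # RGB observation
    instruction: str    # natural-language prompt

@dataclass
class PointingAnswer:
    # Every ability resolves to image-space points: a single point for the
    # three grounding tasks, an ordered sequence for visual traces.
    points: list[tuple[float, float]]
```

The design point this illustrates is that a downstream controller only ever consumes points, regardless of which ability produced them.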
Embodied-Points-200K dataset for embodied pointing capabilities

The authors construct a large-scale dataset of approximately 200K high-quality samples structured as question-verification pairs. The dataset is curated from diverse sources to support the four core embodied pointing capabilities and addresses the multi-solution dilemma inherent in pointing tasks; a sketch of such a question-verification record follows below.

10 retrieved papers
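A minimal sketch of what a question-verification pair might look like, assuming the verifier is a segmentation mask; the field names and paths are hypothetical. The point of verification over a single gold coordinate is that any pixel inside the target region counts as correct, which is how such data can sidestep the multi-solution dilemma.

```python
import numpy as np

def verify_point(pred_xy, gt_mask):
    """A predicted point is correct if it lands anywhere inside the target mask."""
    x, y = int(pred_xy[0]), int(pred_xy[1])
    in_bounds = 0 <= y < gt_mask.shape[0] and 0 <= x < gt_mask.shape[1]
    return bool(in_bounds and gt_mask[y, x])

# Hypothetical record layout for one sample.
sample = {
    "question": "Point to the part of the kettle you would grasp to pour.",
    "image_path": "images/000123.png",
    "verifier": {"type": "mask", "mask_path": "masks/000123.png"},
}

# Usage: many distinct pixels verify as equally correct.
mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:340] = True                 # stand-in for a handle mask
assert verify_point((310.0, 215.0), mask)     # one valid answer
assert verify_point((335.0, 255.0), mask)     # a different, equally valid answer
```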
Embodied-R1 model trained with two-stage Reinforced Fine-tuning paradigm

The authors propose Embodied-R1, a 3B-parameter VLM trained with a two-stage RFT curriculum and a specialized multi-task reward design. This training paradigm enables flexible free-form reasoning beyond rigid CoT templates, resolves the multi-solution dilemma in embodied pointing, and achieves superior generalization compared to standard SFT approaches; a sketch of a verifiable multi-task reward follows below.

10 retrieved papers
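The report does not contain the paper's exact reward terms, so the following is a hedged sketch of a verifiable multi-task reward of the kind used in RFT/GRPO-style training: a small format term for emitting parseable points, plus a task-dependent accuracy term (point-in-mask for the grounding tasks, waypoint error for traces). The weights, the regex, and the 50-pixel error scale are assumptions.

```python
import re
import numpy as np

POINT_RE = re.compile(r"\((\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\)")

def parse_points(text):
    """Extract '(x, y)' pairs from a free-form model response."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(text)]

def reward(response, task, gt_mask=None, gt_trace=None):
    pts = parse_points(response)
    if not pts:
        return 0.0                                  # unparseable output earns nothing
    fmt = 0.1                                       # small bonus for valid format
    if task in {"reg", "rrg", "ofg"}:               # single-point grounding tasks
        x, y = int(pts[0][0]), int(pts[0][1])
        hit = (0 <= y < gt_mask.shape[0] and 0 <= x < gt_mask.shape[1]
               and bool(gt_mask[y, x]))
        return fmt + (1.0 if hit else 0.0)          # any in-mask point scores fully
    if task == "vtg":                               # visual trace generation
        pred = np.array(pts, dtype=float)
        gt = np.array(gt_trace, dtype=float)
        n = min(len(pred), len(gt))
        err = np.linalg.norm(pred[:n] - gt[:n], axis=1).mean()
        return fmt + float(np.exp(-err / 50.0))     # assumed 50 px error scale
    return fmt
```

Because the grounding term is verification-based rather than distance to a single gold point, many distinct correct answers earn the same full reward, which is one concrete way a reward design can resolve the multi-solution dilemma this contribution describes.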

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Pointing as unified embodiment-agnostic intermediate representation with four core abilities

Contribution: Embodied-Points-200K dataset for embodied pointing capabilities

Contribution: Embodied-R1 model trained with two-stage Reinforced Fine-tuning paradigm