DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Models, Multimodal Reasoning, Reinforcement Learning
Abstract:

Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning and without pre-collected reasoning data for supervised fine-tuning (SFT) as a cold start. Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvements in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe a distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://anonymous.4open.science/r/DeepEyes-97FE/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DeepEyes, a vision-language model trained end-to-end with reinforcement learning to perform multi-stage visual reasoning without supervised fine-tuning data. It resides in the 'Autonomous Multi-Stage Visual Reasoning' leaf of the taxonomy, which contains only two papers including the original work. This leaf sits within the broader 'Structured Reasoning and Chain-of-Thought Approaches' branch, indicating a relatively sparse but active research direction focused on models that independently decompose reasoning into sequential stages without external prompting or supervision.

The taxonomy reveals several neighboring research directions that contextualize DeepEyes. The sibling leaf 'Supervised Reasoning Decomposition with Visual Signals' explores similar multi-stage reasoning but relies on explicit supervision or reward mechanisms for intermediate steps. Nearby branches address 'Visual Chain-of-Thought and Sketching' (four papers generating visual artifacts as reasoning steps) and 'Long-Chain Visual Reasoning' (two papers handling extended reasoning sequences). The 'Training Paradigms and Model Architectures' branch, particularly 'Self-Improvement and Modality Alignment', shares conceptual overlap with DeepEyes' RL-based approach but focuses on self-generated data rather than active perception mechanisms.

Among 29 candidates examined across three contributions, the analysis reveals limited prior work overlap. The core 'end-to-end RL-based iMCoT' contribution examined 9 candidates with no clear refutations, suggesting novelty in the training paradigm. The 'active perception mechanism' contribution examined 10 candidates, also without refutation, indicating the native grounding capability may be distinctive. However, the 'data selection and reward strategy' contribution found 1 refutable candidate among 10 examined, suggesting some overlap with existing reward-shaping techniques. The limited search scope (29 papers, not exhaustive) means these findings reflect top-K semantic matches rather than comprehensive field coverage.

Given the sparse taxonomy leaf (two papers total) and limited refutations across most contributions, DeepEyes appears to occupy a relatively novel position within autonomous multi-stage visual reasoning. The single refutable candidate for reward strategy suggests incremental refinement in that component, while the core RL-based training and active perception mechanisms show stronger novelty signals. However, the analysis is constrained by examining only 29 candidates from semantic search, leaving open the possibility of relevant work outside this scope, particularly in adjacent RL-for-VLM or grounding literature.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 29
Refutable papers: 1

Research Landscape Overview

Core task: integrating visual information into vision-language model reasoning processes.

The field has evolved into a rich landscape organized around several complementary directions. Structured Reasoning and Chain-of-Thought Approaches focus on making explicit the intermediate steps by which models combine visual and linguistic cues, often through multi-stage pipelines or self-reflective mechanisms. Spatial and Geometric Reasoning emphasizes understanding positional relationships and geometric properties, as seen in works like SpatialRGPT[3] and SpatialVLM[4]. Visual Feature Representation and Integration explores how to encode and fuse visual signals, ranging from early convolutional architectures to modern transformer-based embeddings, while Training Paradigms and Model Architectures address foundational questions of how to build and optimize these systems, exemplified by InstructBLIP[5] and related instruction-tuning methods. Meanwhile, Context and Demonstration Learning investigates few-shot and in-context strategies, Domain-Specific Applications target specialized tasks such as medical imaging or navigation, and Evaluation, Benchmarking, and Analysis provide the empirical grounding needed to compare approaches. Enhanced Capabilities and Auxiliary Mechanisms introduce tools like external memory or iterative refinement, and Cross-Domain and Multimodal Extensions push beyond vision-language pairs into audio-visual or embodied settings.

A particularly active line of work centers on autonomous multi-stage visual reasoning, where models iteratively refine their understanding by generating intermediate reasoning traces or self-critiques. DeepEyes[0] exemplifies this direction by orchestrating multiple reasoning steps that dynamically integrate visual evidence, closely aligning with LLaVA-CoT[1], which also structures chain-of-thought processes for vision-language tasks.
These methods contrast with approaches that rely on fixed feature extractors or single-pass inference, trading computational cost for improved interpretability and accuracy on complex visual questions. Nearby efforts such as Self-rewarding VLM[2] explore self-improvement through reward-based learning, while works in spatial reasoning like SpatialRGPT[3] emphasize grounding in geometric relationships rather than general-purpose reasoning chains. DeepEyes[0] sits squarely within the autonomous multi-stage cluster, sharing with LLaVA-CoT[1] an emphasis on explicit intermediate steps, yet it distinguishes itself by deeper integration of visual cues at each reasoning stage, reflecting ongoing debates about how tightly vision and language should be coupled during inference.

Claimed Contributions

DeepEyes model with end-to-end RL-based iMCoT

The authors propose DeepEyes, a vision-language model that learns to integrate visual information into reasoning through end-to-end reinforcement learning. This approach eliminates the need for supervised fine-tuning with pre-collected reasoning data and enables interleaved multimodal chain-of-thought (iMCoT) reasoning.

Retrieved candidate papers: 9
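The training paradigm claimed here can be sketched as a toy rollout-and-reward loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `policy` callable, the action labels, and the trajectory format are hypothetical stand-ins for the actual VLM and RL machinery.

```python
# Toy sketch of an interleaved multimodal chain-of-thought (iMCoT)
# rollout trained with outcome-only RL. All names are illustrative.
TEXT, GROUND, ANSWER = "text", "ground", "answer"

def rollout(policy, question, max_steps=6):
    """Sample one trajectory: the policy sees the history so far and
    emits the next step, which may be free-form text, a grounding call
    (whose cropped region is appended to the context), or an answer."""
    traj = [("question", question)]
    for _ in range(max_steps):
        kind, payload = policy(traj)
        traj.append((kind, payload))
        if kind == ANSWER:
            break
    return traj

def outcome_reward(traj, gold):
    """End-to-end RL signal: only the final answer is scored, so no
    pre-collected reasoning traces (SFT cold start) are required."""
    kind, payload = traj[-1]
    return 1.0 if kind == ANSWER and payload == gold else 0.0
```

A real system would replace `policy` with the VLM's sampling loop and optimize it with a policy-gradient method; the specific RL algorithm is not restated here.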
Active perception mechanism with native grounding capability

The authors introduce an active perception mechanism that encapsulates the model's native visual grounding capability as an internal tool. This allows the model to strategically ground its reasoning in visual information without depending on external specialized models or APIs.

Retrieved candidate papers: 10
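The active-perception step described above can be illustrated with a small sketch: the model's own grounding output (a bounding box) is treated as an internal tool call whose result is fed back into the reasoning context. The data structures here are hypothetical; the real system operates on image pixels and model states.

```python
def crop(image, box):
    """Treat the model's own grounding output (a bounding box) as an
    internal tool call: return the zoomed-in region so it can be
    appended to the reasoning context. `image` is a toy 2-D grid here;
    no external detector or API is involved."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def perceive(context, image, box):
    """One active-perception step: ground, crop, and extend the
    context with the new observation (illustrative structures)."""
    region = crop(image, box)
    return context + [("observation", region)]
```

The key point the sketch captures is that grounding and cropping are native operations of the same model, not calls out to a separate specialized tool.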
Data selection and reward strategy for active perception

The authors design a data selection mechanism to choose training samples that encourage active perception behavior, along with a conditional reward strategy that assigns bonuses to trajectories successfully completing tasks through active perception. These components are crucial for optimizing the efficiency and accuracy of the model's visual reasoning.

Retrieved candidate papers: 10 (1 can refute)
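The conditional reward described above can be sketched as follows. The coefficients and the exact reward decomposition are placeholders, not the paper's values; only the conditional structure (bonus granted solely for correct, tool-using trajectories) reflects the claimed strategy.

```python
def conditional_reward(correct, used_active_perception, fmt_ok,
                       tool_bonus=0.5, fmt_weight=0.1):
    """Illustrative conditional reward: the tool-use bonus is granted
    only when the trajectory both invoked active perception AND
    answered correctly, discouraging gratuitous tool calls. The
    coefficient values are hypothetical."""
    r = (1.0 if correct else 0.0) + (fmt_weight if fmt_ok else 0.0)
    if correct and used_active_perception:
        r += tool_bonus
    return r
```

Conditioning the bonus on task success, rather than rewarding tool calls unconditionally, is what steers the policy from indiscriminate exploration toward efficient exploitation of active perception.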

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
