DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Models, Multimodal Reasoning, Reinforcement Learning
Abstract:

Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning and without pre-collected reasoning data for supervised fine-tuning (SFT) as a cold start. Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvements in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe a distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://anonymous.4open.science/r/DeepEyes-97FE/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DeepEyes, a vision-language model trained end-to-end with reinforcement learning to perform multi-stage visual reasoning without supervised fine-tuning data. It resides in the 'Autonomous Multi-Stage Visual Reasoning' leaf of the taxonomy, which contains only two papers including the original work. This leaf sits within the broader 'Structured Reasoning and Chain-of-Thought Approaches' branch, indicating a relatively sparse but active research direction focused on models that independently decompose reasoning into sequential stages without external prompting or supervision.

The taxonomy reveals several neighboring research directions that contextualize DeepEyes. The sibling leaf 'Supervised Reasoning Decomposition with Visual Signals' explores similar multi-stage reasoning but relies on explicit supervision or reward mechanisms for intermediate steps. Nearby branches address 'Visual Chain-of-Thought and Sketching' (four papers generating visual artifacts as reasoning steps) and 'Long-Chain Visual Reasoning' (two papers handling extended reasoning sequences). The 'Training Paradigms and Model Architectures' branch, particularly 'Self-Improvement and Modality Alignment', shares conceptual overlap with DeepEyes' RL-based approach but focuses on self-generated data rather than active perception mechanisms.

Among 29 candidates examined across three contributions, the analysis reveals limited prior work overlap. The core 'end-to-end RL-based iMCoT' contribution examined 9 candidates with no clear refutations, suggesting novelty in the training paradigm. The 'active perception mechanism' contribution examined 10 candidates, also without refutation, indicating the native grounding capability may be distinctive. However, the 'data selection and reward strategy' contribution found 1 refutable candidate among 10 examined, suggesting some overlap with existing reward-shaping techniques. The limited search scope (29 papers, not exhaustive) means these findings reflect top-K semantic matches rather than comprehensive field coverage.

Given the sparse taxonomy leaf (two papers total) and limited refutations across most contributions, DeepEyes appears to occupy a relatively novel position within autonomous multi-stage visual reasoning. The single refutable candidate for reward strategy suggests incremental refinement in that component, while the core RL-based training and active perception mechanisms show stronger novelty signals. However, the analysis is constrained by examining only 29 candidates from semantic search, leaving open the possibility of relevant work outside this scope, particularly in adjacent RL-for-VLM or grounding literature.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 29
Refutable papers: 1

Research Landscape Overview

Core task: integrating visual information into vision-language model reasoning processes.

The field has evolved into a rich landscape organized around several complementary directions. Structured Reasoning and Chain-of-Thought Approaches focus on making explicit the intermediate steps by which models combine visual and linguistic cues, often through multi-stage pipelines or self-reflective mechanisms. Spatial and Geometric Reasoning emphasizes understanding positional relationships and geometric properties, as seen in works like SpatialRGPT[3] and SpatialVLM[4]. Visual Feature Representation and Integration explores how to encode and fuse visual signals, ranging from early convolutional architectures to modern transformer-based embeddings, while Training Paradigms and Model Architectures address foundational questions of how to build and optimize these systems, exemplified by InstructBLIP[5] and related instruction-tuning methods. Meanwhile, Context and Demonstration Learning investigates few-shot and in-context strategies, Domain-Specific Applications target specialized tasks such as medical imaging or navigation, and Evaluation, Benchmarking, and Analysis provide the empirical grounding needed to compare approaches. Enhanced Capabilities and Auxiliary Mechanisms introduce tools like external memory or iterative refinement, and Cross-Domain and Multimodal Extensions push beyond vision-language pairs into audio-visual or embodied settings.

A particularly active line of work centers on autonomous multi-stage visual reasoning, where models iteratively refine their understanding by generating intermediate reasoning traces or self-critiques. DeepEyes[0] exemplifies this direction by orchestrating multiple reasoning steps that dynamically integrate visual evidence, closely aligning with LLaVA-CoT[1], which also structures chain-of-thought processes for vision-language tasks.
These methods contrast with approaches that rely on fixed feature extractors or single-pass inference, trading computational cost for improved interpretability and accuracy on complex visual questions. Nearby efforts such as Self-rewarding VLM[2] explore self-improvement through reward-based learning, while works in spatial reasoning like SpatialRGPT[3] emphasize grounding in geometric relationships rather than general-purpose reasoning chains. DeepEyes[0] sits squarely within the autonomous multi-stage cluster, sharing with LLaVA-CoT[1] an emphasis on explicit intermediate steps, yet it distinguishes itself by deeper integration of visual cues at each reasoning stage, reflecting ongoing debates about how tightly vision and language should be coupled during inference.

Claimed Contributions

DeepEyes model with end-to-end RL-based iMCoT

The authors propose DeepEyes, a vision-language model that learns to integrate visual information into reasoning through end-to-end reinforcement learning. This approach eliminates the need for supervised fine-tuning with pre-collected reasoning data and enables interleaved multimodal chain-of-thought (iMCoT) reasoning.

Retrieved candidate papers: 9
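The training paradigm claimed here can be sketched as a toy rollout-and-reward loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `policy` callable, the action labels, and the trajectory format are hypothetical stand-ins for the actual VLM and RL machinery.

```python
# Toy sketch of an interleaved multimodal chain-of-thought (iMCoT)
# rollout trained with outcome-only RL. All names are illustrative.
TEXT, GROUND, ANSWER = "text", "ground", "answer"

def rollout(policy, question, max_steps=6):
    """Sample one trajectory: the policy sees the history so far and
    emits the next step, which may be free-form text, a grounding call
    (whose cropped region is appended to the context), or an answer."""
    traj = [("question", question)]
    for _ in range(max_steps):
        kind, payload = policy(traj)
        traj.append((kind, payload))
        if kind == ANSWER:
            break
    return traj

def outcome_reward(traj, gold):
    """End-to-end RL signal: only the final answer is scored, so no
    pre-collected reasoning traces (SFT cold start) are required."""
    kind, payload = traj[-1]
    return 1.0 if kind == ANSWER and payload == gold else 0.0
```

A real system would replace `policy` with the VLM's sampling loop and optimize it with a policy-gradient method; the specific RL algorithm is not restated here.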
Active perception mechanism with native grounding capability

The authors introduce an active perception mechanism that encapsulates the model's native visual grounding capability as an internal tool. This allows the model to strategically ground its reasoning in visual information without depending on external specialized models or APIs.

Retrieved candidate papers: 10
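The active-perception step described above can be illustrated with a small sketch: the model's own grounding output (a bounding box) is treated as an internal tool call whose result is fed back into the reasoning context. The data structures here are hypothetical; the real system operates on image pixels and model states.

```python
def crop(image, box):
    """Treat the model's own grounding output (a bounding box) as an
    internal tool call: return the zoomed-in region so it can be
    appended to the reasoning context. `image` is a toy 2-D grid here;
    no external detector or API is involved."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def perceive(context, image, box):
    """One active-perception step: ground, crop, and extend the
    context with the new observation (illustrative structures)."""
    region = crop(image, box)
    return context + [("observation", region)]
```

The key point the sketch captures is that grounding and cropping are native operations of the same model, not calls out to a separate specialized tool.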
Data selection and reward strategy for active perception

The authors design a data selection mechanism to choose training samples that encourage active perception behavior, along with a conditional reward strategy that assigns bonuses to trajectories successfully completing tasks through active perception. These components are crucial for optimizing the efficiency and accuracy of the model's visual reasoning.

Retrieved candidate papers: 10 (1 can refute)
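The conditional reward described above can be sketched as follows. The coefficients and the exact reward decomposition are placeholders, not the paper's values; only the conditional structure (bonus granted solely for correct, tool-using trajectories) reflects the claimed strategy.

```python
def conditional_reward(correct, used_active_perception, fmt_ok,
                       tool_bonus=0.5, fmt_weight=0.1):
    """Illustrative conditional reward: the tool-use bonus is granted
    only when the trajectory both invoked active perception AND
    answered correctly, discouraging gratuitous tool calls. The
    coefficient values are hypothetical."""
    r = (1.0 if correct else 0.0) + (fmt_weight if fmt_ok else 0.0)
    if correct and used_active_perception:
        r += tool_bonus
    return r
```

Conditioning the bonus on task success, rather than rewarding tool calls unconditionally, is what steers the policy from indiscriminate exploration toward efficient exploitation of active perception.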

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
