DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Overview
Overall Novelty Assessment
The paper introduces DeepEyes, a vision-language model trained end-to-end with reinforcement learning to perform multi-stage visual reasoning without any supervised fine-tuning on pre-collected reasoning data. It resides in the 'Autonomous Multi-Stage Visual Reasoning' leaf of the taxonomy, which contains only two papers, including DeepEyes itself. This leaf sits within the broader 'Structured Reasoning and Chain-of-Thought Approaches' branch, indicating a relatively sparse but active research direction: models that independently decompose reasoning into sequential stages without external prompting or supervision.
The taxonomy reveals several neighboring research directions that contextualize DeepEyes. The sibling leaf 'Supervised Reasoning Decomposition with Visual Signals' explores similar multi-stage reasoning but relies on explicit supervision or reward mechanisms for intermediate steps. Nearby branches address 'Visual Chain-of-Thought and Sketching' (four papers generating visual artifacts as reasoning steps) and 'Long-Chain Visual Reasoning' (two papers handling extended reasoning sequences). The 'Training Paradigms and Model Architectures' branch, particularly 'Self-Improvement and Modality Alignment', shares conceptual overlap with DeepEyes' RL-based approach but focuses on self-generated data rather than active perception mechanisms.
Across the three claimed contributions, 29 candidate papers were examined, and the analysis reveals limited overlap with prior work. For the core 'end-to-end RL-based iMCoT' contribution, 9 candidates were examined with no clear refutations, suggesting novelty in the training paradigm. For the 'active perception mechanism' contribution, 10 candidates were examined, also without refutation, indicating the native grounding capability may be distinctive. For the 'data selection and reward strategy' contribution, however, 1 of the 10 candidates examined was judged refutable, suggesting some overlap with existing reward-shaping techniques. Because the search covered only 29 papers, these findings reflect top-K semantic matches rather than comprehensive coverage of the field.
Given the sparse taxonomy leaf (two papers total) and limited refutations across most contributions, DeepEyes appears to occupy a relatively novel position within autonomous multi-stage visual reasoning. The single refutable candidate for reward strategy suggests incremental refinement in that component, while the core RL-based training and active perception mechanisms show stronger novelty signals. However, the analysis is constrained by examining only 29 candidates from semantic search, leaving open the possibility of relevant work outside this scope, particularly in adjacent RL-for-VLM or grounding literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose DeepEyes, a vision-language model that learns to integrate visual information into reasoning through end-to-end reinforcement learning. This approach eliminates the need for supervised fine-tuning with pre-collected reasoning data and enables interleaved multimodal chain-of-thought (iMCoT) reasoning.
The authors introduce an active perception mechanism that encapsulates the model's native visual grounding capability as an internal tool. This allows the model to strategically ground its reasoning in visual information without depending on external specialized models or APIs.
The authors design a data selection mechanism that chooses training samples likely to elicit active perception behavior, along with a conditional reward strategy that grants a bonus only to trajectories that complete the task correctly while using active perception. The authors present both components as crucial to the efficiency and accuracy of the model's visual reasoning.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Contribution Analysis
Detailed comparisons for each claimed contribution
DeepEyes model with end-to-end RL-based iMCoT
The authors propose DeepEyes, a vision-language model that learns to integrate visual information into reasoning through end-to-end reinforcement learning. This approach eliminates the need for supervised fine-tuning with pre-collected reasoning data and enables interleaved multimodal chain-of-thought (iMCoT) reasoning.
[51] LLM-I: LLMs are Naturally Interleaved Multimodal Creators
[52] Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
[53] VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
[54] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
[55] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving
[56] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
[57] Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
[58] UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
[59] Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving
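The training paradigm these candidates are compared against can be illustrated with a minimal sketch of an interleaved multimodal chain-of-thought rollout. Everything below is hypothetical scaffolding (a stub policy, a stub crop tool, a hard-coded answer), not the DeepEyes implementation; it only shows the trajectory shape in which a visual observation re-enters the chain mid-reasoning and only the final answer would be scored by an outcome-based RL objective, with no step-level supervision.

```python
def policy(messages):
    """Stub policy: emits a grounding action first, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "ground", "bbox": (10, 20, 60, 80)}  # decide to look closer
    return {"type": "answer", "text": "a red traffic light"}

def crop(image, bbox):
    """Stub visual tool: would return the image region inside `bbox`."""
    return {"image": image, "region": bbox}

def rollout(question, image, max_turns=4):
    """Run one interleaved trajectory; return the final answer and the trace."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = policy(messages)
        if step["type"] == "answer":
            messages.append({"role": "assistant", "content": step["text"]})
            return step["text"], messages
        # The grounding action is executed and its result is fed back as a new
        # observation, interleaving vision with the text reasoning chain.
        messages.append({"role": "assistant", "content": str(step)})
        messages.append({"role": "tool", "content": crop(image, step["bbox"])})
    return None, messages

answer, trace = rollout("What color is the light?", image="img.png")
```

Under outcome-based RL, only `answer` would be compared against the ground truth; the intermediate grounding turn is learned, not supervised.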
Active perception mechanism with native grounding capability
The authors introduce an active perception mechanism that encapsulates the model's native visual grounding capability as an internal tool. This allows the model to strategically ground its reasoning in visual information without depending on external specialized models or APIs.
[34] Visual In-Context Learning for Large Vision-Language Models
[70] Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
[71] CogVLM: Visual Expert for Pretrained Language Models
[72] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
[73] ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
[74] Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments
[75] Learning Visual Grounding from Generative Vision and Language Model
[76] VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
[77] Direct Visual Grounding by Directing Attention of Visual Tokens
[78] GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
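As a concrete illustration of what "native grounding as an internal tool" means, the sketch below parses a bounding box that the model itself emits in text and clamps it to the image, with no external detector or API in the loop. The `<grounding>` tag and JSON payload are assumed formats for illustration only, not DeepEyes' actual serialization.

```python
import json
import re

# Hypothetical tag format for the model's self-emitted grounding call.
GROUND_RE = re.compile(r"<grounding>(.*?)</grounding>", re.DOTALL)

def extract_grounding(model_text):
    """Return the bbox payload from the model's own grounding emission, or None."""
    m = GROUND_RE.search(model_text)
    return json.loads(m.group(1)) if m else None

def crop_region(image_size, payload):
    """Clamp the model-predicted box to the image bounds and return crop coords."""
    w, h = image_size
    x1, y1, x2, y2 = payload["bbox"]
    return (max(0, x1), max(0, y1), min(w, x2), min(h, y2))

text = 'Let me zoom in. <grounding>{"bbox": [30, 40, 500, 700]}</grounding>'
box = extract_grounding(text)
region = crop_region((448, 448), box)  # crop fed back as a new visual observation
```

The key point the contribution claims is that the box originates from the model's own pretrained grounding capability, so no specialized external model is consulted at any step.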
Data selection and reward strategy for active perception
The authors design a data selection mechanism that chooses training samples likely to elicit active perception behavior, along with a conditional reward strategy that grants a bonus only to trajectories that complete the task correctly while using active perception. The authors present both components as crucial to the efficiency and accuracy of the model's visual reasoning.
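The conditional reward described above can be sketched as a small function: the tool-use bonus is granted only when a trajectory both invokes active perception and ends in a correct answer, so grounding calls that do not help the task are not reinforced. All reward weights below are illustrative placeholders, not the paper's values, and the paired data-selection step is not shown.

```python
def trajectory_reward(correct, well_formatted, used_active_perception,
                      tool_bonus=0.5):
    """Illustrative conditional reward: bonus only on (correct AND tool-using)."""
    reward = 0.0
    if correct:
        reward += 1.0   # outcome (accuracy) reward
    if well_formatted:
        reward += 0.5   # format reward (placeholder weight)
    if correct and used_active_perception:
        reward += tool_bonus  # conditional bonus: never paid for a wrong answer
    return reward
```

The conditioning on correctness is the point of the design: an unconditional tool bonus would reward gratuitous grounding calls, whereas this shaping only credits active perception when it coincides with task success.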