VGR: Visual Grounded Reasoning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: VLM, MultiModal, CoT
Abstract:

In multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in a purely linguistic space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand a comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) that can replay visual memory during thinking, much as humans do. Unlike traditional MLLMs, VGR first analyzes the question and detects relevant image regions that may help solve the problem; visual memory from these critical areas is then extracted to assist reasoning. To achieve this, we curate a large-scale SFT dataset, VGR-SFT, containing reasoning data that mixes vision grounding with language deduction. This teaches VGR to think and actively select grounding areas for key information before answering, and we propose a dynamic visual memory replay stage that integrates the corresponding information into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, its coverage is NOT exhaustive and its judgments are approximate. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes VGR, a framework that integrates visual grounding into multimodal chain-of-thought reasoning through dynamic visual memory replay. It resides in the 'Two-Stage Rationale Generation and Answer Inference' leaf alongside two sibling papers: Multimodal Chain-of-Thought and Duty-Distinct Chain-of-Thought. This leaf represents a foundational but relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting the two-stage paradigm is established but not overcrowded. The approach emphasizes active region detection before reasoning, distinguishing it from siblings that focus on role separation or general two-stage decomposition.

The taxonomy reveals neighboring leaves addressing unified reasoning mechanisms and compositional prompting strategies within the same parent branch of Core Multimodal CoT Frameworks. Adjacent branches include Visual Grounding and Spatial Reasoning Mechanisms, which explicitly anchor reasoning to image regions through post-hoc verification or active referencing. VGR's dynamic memory replay appears to bridge these areas, combining two-stage architecture with active visual referencing during reasoning generation. The taxonomy's scope notes clarify that two-stage methods exclude single-stage unified approaches, positioning VGR's explicit separation of detection and reasoning as a deliberate architectural choice within this design space.

Among twenty-eight candidates examined through limited semantic search, the VGR framework itself shows no clear refutation across ten candidates reviewed. However, the VGR-SFT dataset contribution faces potential overlap, with two of ten candidates appearing to provide similar large-scale reasoning datasets mixing vision grounding and language deduction. The expand-then-compress visual processing strategy examined eight candidates without refutation. These statistics suggest the core framework may occupy relatively novel ground within the examined scope, while the dataset construction approach encounters more substantial prior work among the limited candidates reviewed.

The analysis reflects a focused literature search rather than exhaustive coverage, examining top-K semantic matches and citation expansions. The two-stage architecture with dynamic memory replay appears positioned in a moderately explored area, while dataset construction faces clearer precedents within the examined scope. The taxonomy structure indicates active research across multiple grounding paradigms, suggesting VGR's specific combination of two-stage reasoning with active region selection may offer incremental refinement rather than paradigm shift, though definitive assessment would require broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: Multimodal chain-of-thought reasoning with visual grounding. The field centers on enabling models to perform step-by-step reasoning over both textual and visual inputs while explicitly anchoring intermediate steps to spatial or semantic regions in images.

The taxonomy reveals several major branches. Core Multimodal CoT Frameworks and Architectures establish foundational two-stage pipelines that separate rationale generation from answer inference, exemplified by Multimodal Chain-of-Thought[1] and Duty-Distinct Chain-of-Thought[13]. Visual Grounding and Spatial Reasoning Mechanisms focus on techniques that link reasoning steps to image regions, such as CoRGI[2] and Plug-and-play Grounding[10]. Visual Reasoning Enhancement Techniques explore methods like contrastive prompting (Contrastive Chain-of-Thought[15]) and compositional decomposition (Compositional Chain-of-Thought[5]). Training Methodologies and Optimization address learning strategies including reinforcement learning with grounding feedback (Point-RFT[14]), while Domain-Specific Applications adapt these frameworks to specialized tasks like geometry (GeoChain[7]) or chart understanding (ChartSketcher[32]). Analysis, Evaluation, and Robustness branches examine failure modes and dataset construction (Visual CoT Dataset[18]).

A particularly active line of work investigates how to effectively integrate visual grounding into reasoning chains, balancing explicit spatial anchoring with fluent multi-step inference. Visual Grounded Reasoning[0] sits within the two-stage rationale generation cluster, closely aligned with Multimodal Chain-of-Thought[1] and Duty-Distinct Chain-of-Thought[13], which similarly decompose reasoning into separate phases.
While Multimodal Chain-of-Thought[1] pioneered the two-stage approach and Duty-Distinct Chain-of-Thought[13] emphasized role separation between vision and language modules, Visual Grounded Reasoning[0] appears to emphasize tighter coupling between generated rationales and visual evidence through grounding mechanisms. Contrasting approaches like PEAR Chain-of-Thought[3] explore planning-execution-action-reflection cycles, and CoT-VLA[4] integrates vision-language-action modeling, highlighting ongoing debates about whether grounding should be post-hoc, interleaved during generation, or embedded in action-oriented architectures. Open questions persist around the trade-offs between interpretability, computational cost, and reasoning accuracy across diverse visual reasoning scenarios.

Claimed Contributions

VGR: Visual Grounded Reasoning framework with dynamic visual memory replay

The authors propose VGR, a new framework that enables MLLMs to dynamically retrieve and replay visual memory from specific image regions during reasoning. Unlike traditional MLLMs that reason purely in linguistic space, VGR allows the model to selectively attend to visual content on demand by generating replay signals and retrieving corresponding visual tokens from a visual memory pool.

10 retrieved papers
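The replay mechanism lends itself to a small sketch. The following is a minimal illustration under stated assumptions, not the authors' implementation: the memory-pool layout, the 24×24 patch grid, and the `replay_region` interface are all hypothetical, standing in for however VGR maps a grounded box to cached visual tokens.

```python
import numpy as np

def build_memory_pool(vision_features, grid=24):
    """Cache per-patch vision features as a (grid, grid, dim) memory pool."""
    dim = vision_features.shape[-1]
    return vision_features.reshape(grid, grid, dim)

def replay_region(pool, bbox):
    """Retrieve visual tokens for a normalized bbox (x0, y0, x1, y1) in [0, 1].

    When the model emits a replay signal carrying a grounded box, the patches
    overlapping that box are fetched from the pool and returned as a token
    sequence to be appended to the reasoning context.
    """
    grid = pool.shape[0]
    x0, y0, x1, y1 = bbox
    c0 = int(x0 * grid)
    r0 = int(y0 * grid)
    c1 = max(int(np.ceil(x1 * grid)), c0 + 1)  # at least one column
    r1 = max(int(np.ceil(y1 * grid)), r0 + 1)  # at least one row
    region = pool[r0:r1, c0:c1]                # patches under the box
    return region.reshape(-1, pool.shape[-1])  # flatten to a token sequence

# Toy example: a 24x24 grid of 8-dim patch features; replay the central region.
pool = build_memory_pool(np.random.rand(24 * 24, 8))
tokens = replay_region(pool, (0.25, 0.25, 0.5, 0.5))  # a 6x6 patch window
```

In the full model, the returned tokens would be interleaved into the LLM context at the point where the replay signal is generated, rather than returned to the caller.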
VGR-SFT: large-scale visual grounded reasoning dataset with mixed vision grounding and language deduction

The authors construct a new supervised fine-tuning dataset that integrates visual grounding signals (bounding boxes marking regions of interest) into reasoning chains. The dataset is built through a three-stage pipeline: cold-start annotation with existing MLLMs, rejection-sampling refinement, and scaling with a trained annotation model.

10 retrieved papers
Verdict: Can Refute
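The rejection-sampling stage is described only at a high level. The sketch below shows one plausible filter, assuming candidate annotations are kept when the final answer matches the ground truth and all grounding boxes are well-formed; the field names and acceptance criteria are assumptions for illustration, not the paper's pipeline.

```python
def valid_box(box):
    """A bounding box is usable if it is normalized and non-degenerate."""
    x0, y0, x1, y1 = box
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0

def reject_sample(candidates, ground_truth):
    """Keep cold-start annotations whose final answer matches the ground
    truth and whose grounding boxes are well-formed; reject the rest.
    Surviving samples would then train the scaling annotation model."""
    kept = []
    for cand in candidates:
        answer_ok = cand["answer"].strip().lower() == ground_truth.strip().lower()
        boxes_ok = all(valid_box(b) for b in cand["boxes"])
        if answer_ok and boxes_ok:
            kept.append(cand)
    return kept

# Toy example: two candidate rationales for the same question.
candidates = [
    {"answer": "A red bus", "boxes": [(0.1, 0.2, 0.6, 0.8)]},
    {"answer": "A blue car", "boxes": [(0.9, 0.9, 0.1, 0.1)]},  # wrong + bad box
]
kept = reject_sample(candidates, "a red bus")  # only the first survives
```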
Expand-then-compress strategy with pooling for efficient high-resolution visual processing

The authors introduce a strategy that expands the number of image crops to support higher resolutions while applying different pooling rates (2×2 for snapshots, 4×4 for high-resolution crops) to compress visual features. This design reduces token usage by 70% compared to the baseline while expanding supported resolutions by 5×.

8 retrieved papers
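The reported numbers admit a simple consistency check. Assuming a LLaVA-NeXT-style setup with 576 tokens per tile (a 24×24 patch grid from a 336px CLIP encoder), one snapshot plus four unpooled crops at the baseline, and one 2×2-pooled snapshot plus twenty 4×4-pooled crops for VGR, the budget works out to exactly 30% of the baseline token count while covering 5× as many crops. The tile counts are illustrative assumptions, not figures taken from the paper.

```python
TOKENS_PER_TILE = 24 * 24  # 576 tokens per 336px CLIP tile (24x24 patches)

def pooled_tokens(n_tiles, pool):
    """Token count for n_tiles tiles after pool x pool average pooling."""
    return n_tiles * TOKENS_PER_TILE // (pool * pool)

# Baseline: 1 snapshot + 4 high-resolution crops, no pooling.
baseline = pooled_tokens(1 + 4, 1)                # 2880 tokens

# Expand-then-compress: 1 snapshot at 2x2 pooling + 20 crops at 4x4 pooling.
vgr = pooled_tokens(1, 2) + pooled_tokens(20, 4)  # 144 + 720 = 864 tokens

ratio = vgr / baseline   # 0.30 -> "only 30% of the image token count"
crop_expansion = 20 / 4  # 5.0  -> "expanding supported resolutions by 5x"
```

Under these assumptions the 70% token reduction and the 5× resolution expansion are mutually consistent, which is why the sketch uses twenty crops.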

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VGR: Visual Grounded Reasoning framework with dynamic visual memory replay


Contribution

VGR-SFT: large-scale visual grounded reasoning dataset with mixed vision grounding and language deduction


Contribution

Expand-then-compress strategy with pooling for efficient high-resolution visual processing
