Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Reasoning, Vision-Language Models, Contrasting
Abstract:

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision-language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely than when given a single VQA sample. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset, and trained models will be released upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VC-STaR, a self-improving framework that uses visual contrast to reduce hallucinations in VLM-generated reasoning paths, and produces VisCoR-55K, a visual reasoning dataset for fine-tuning. According to the taxonomy, this work resides in the 'Contrastive Learning and Visual Contrast' leaf under 'Evaluation, Benchmarking, and Auxiliary Techniques'. Notably, this leaf contains only one paper (the original work itself), indicating a sparse research direction within the broader self-improving VLM landscape, which encompasses 50 papers across approximately 36 topics.

The taxonomy reveals that neighboring leaves focus on prompt optimization, reasoning evaluation, and architectural enhancements, while sibling branches address iterative refinement (e.g., self-correction mechanisms with 5 papers) and synthetic data generation (4 papers). The scope note for this leaf emphasizes 'leveraging visual contrast or contrastive pairs to enhance visual reasoning and mitigate hallucinations', explicitly excluding non-contrastive self-improvement methods. The taxonomy narrative mentions Contrast Lens as an exemplar, positioning contrastive approaches as diagnostic and interpretive tools complementary to end-to-end training loops found in denser branches like actor-critic frameworks or reward-based optimization.

Among the 30 candidates examined, the VC-STaR framework and the contrastive pair curation pipeline each showed no clear refutations across their 10 candidates, suggesting these contributions occupy relatively unexplored methodological territory. However, the VisCoR-55K dataset contribution encountered 1 refutable candidate among its 10, indicating some overlap with existing visual reasoning datasets. The limited search scope (30 candidates in total, not exhaustive) means these statistics reflect top-K semantic matches and citation expansion rather than comprehensive field coverage. The framework contributions therefore appear more distinctive than the dataset contribution within this bounded search.

Given the sparse taxonomy leaf (1 paper) and the absence of sibling papers, the contrastive self-improvement angle appears underexplored relative to denser branches like self-correction (5 papers) or reward-based optimization (5 papers). The analysis covers top-30 semantic matches, so conclusions about novelty are provisional. The framework's emphasis on visual contrast as a hallucination mitigation strategy distinguishes it from iterative refinement or synthetic data generation approaches, though the dataset contribution shows more overlap with prior work within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: self-improving visual reasoning in vision language models. The field organizes around several complementary strategies for enhancing VLM performance without extensive human annotation. One major branch focuses on iterative refinement and feedback mechanisms, where models learn to critique and revise their own outputs through self-play or internal verification loops (e.g., Calibrated Self-Rewarding[1], Self-Improving Teacher[10]). A second branch emphasizes external supervision or synthetic data generation, leveraging large-scale automated pipelines to produce training signals that guide model improvement (e.g., Self-Bootstrapped Knowledge[28], Self-Training Comprehension[21]). Modality alignment and perception enhancement address the core challenge of bridging vision and language representations, often through contrastive objectives or architectural innovations (e.g., Modality Alignment Enhancement[11], Perceiver-vl[40]). Task-specific applications demonstrate these principles in domains such as navigation, GUI interaction, and video understanding (e.g., EvolveNav[5], Active Perception GUI[22]). Finally, evaluation and auxiliary techniques provide the infrastructure for measuring progress and supporting self-improvement, including benchmarking frameworks, contrastive learning methods, and tool-augmented reasoning (e.g., Measuring Chain-of-Thought[4], Viper[14]).

Within the evaluation and auxiliary techniques branch, contrastive learning and visual contrast methods have emerged as a small but important cluster. These approaches use contrastive objectives to sharpen visual discrimination and improve reasoning by highlighting differences between similar inputs or outputs. Contrast Lens[0] exemplifies this direction by introducing mechanisms that explicitly leverage visual contrasts to enhance interpretability and reasoning quality.
This work shares thematic connections with broader evaluation efforts like Measuring Chain-of-Thought[4], which probes reasoning transparency, and with perception-focused methods such as Cropper[41], which refines visual attention. Compared to iterative refinement approaches like Calibrated Self-Rewarding[1] or task-specific systems like Spatial Reasoning Drawing[3], Contrast Lens[0] emphasizes diagnostic and interpretive tools rather than end-to-end training loops, positioning itself as a complementary technique for understanding and improving how VLMs process visual information.

Claimed Contributions

Visual Contrastive Self-Taught Reasoner (VC-STaR) framework

The authors introduce VC-STaR, a self-improving framework that uses contrastive VQA pairs (two visually similar images with synonymous questions) to help VLMs identify relevant visual cues more precisely and rectify visual hallucinations in reasoning paths. The framework includes three steps: generating a coarse rationale, performing contrastive analysis, and rethinking to refine the rationale.
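The three steps described above can be sketched as a single refinement loop. This is a minimal illustration under stated assumptions: `MockVLM` is a stand-in for a real vision-language model, and the prompts are paraphrases of the description here, not the authors' actual implementation.

```python
# Hypothetical sketch of the VC-STaR loop; MockVLM and the prompt wording
# are illustrative assumptions, not the authors' code.

class MockVLM:
    """Stand-in for a real VLM; returns canned text so the sketch runs."""
    def generate(self, prompt, image=None, images=None):
        return f"[model output for: {prompt[:40]}...]"

def vc_star(vlm, sample, partner):
    # Step 1: generate a coarse rationale for the target VQA sample.
    coarse = vlm.generate(
        prompt=f"Question: {sample['question']}\nThink step by step.",
        image=sample["image"],
    )
    # Step 2: contrastive analysis over the visually similar pair, to
    # surface the cues that actually distinguish the two images.
    contrast = vlm.generate(
        prompt=(f"Q1: {sample['question']}  Q2: {partner['question']}\n"
                "Describe the visual differences relevant to each question."),
        images=[sample["image"], partner["image"]],
    )
    # Step 3: rethinking -- revise the coarse rationale using the
    # contrastive notes, dropping visually ungrounded (hallucinated) claims.
    return vlm.generate(
        prompt=(f"Draft rationale: {coarse}\nContrastive notes: {contrast}\n"
                "Rewrite the rationale, keeping only grounded claims."),
        image=sample["image"],
    )

refined = vc_star(
    MockVLM(),
    {"image": "img_a.png", "question": "What is on the table?"},
    {"image": "img_b.png", "question": "What objects sit on the table?"},
)
```

The key design point, as described, is that the contrastive partner is consulted only at rationale-generation time; the refined rationale is then kept for single-image fine-tuning.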

Compared against 10 retrieved papers; no clear refutation found.

Task-agnostic contrastive VQA pair curation framework

The authors develop a flexible pipeline for curating contrastive VQA pairs across diverse VQA tasks including reasoning, math, chart, and OCR. The pipeline involves data collection from 21 datasets, similarity-based pair hunting using image and question embeddings, and difficulty-based sampling to select median-difficulty samples suitable for reasoning enhancement.
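The similarity-based pair-hunting step can be illustrated as follows. The embeddings below are random placeholders standing in for real image and question encoders (e.g., a CLIP-style model), and the equal weighting of the two modalities is an assumption, not the paper's stated formula.

```python
# Illustrative sketch of multi-modal similarity pair hunting; the random
# embeddings and the 0.5/0.5 modality weighting are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
img_emb = rng.normal(size=(n, d))  # image embeddings, one row per sample
txt_emb = rng.normal(size=(n, d))  # question embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img_emb, txt_emb = normalize(img_emb), normalize(txt_emb)

# Multi-modal similarity: average of image and question cosine similarities.
sim = 0.5 * (img_emb @ img_emb.T) + 0.5 * (txt_emb @ txt_emb.T)
np.fill_diagonal(sim, -np.inf)  # exclude trivial self-pairs

# For each sample, its contrastive partner is the most similar other sample.
partners = sim.argmax(axis=1)
```

In a full pipeline, the resulting pairs would then pass through the difficulty-based sampling stage to retain median-difficulty samples.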

Compared against 10 retrieved papers; no clear refutation found.

VisCoR-55K visual reasoning dataset

The authors create VisCoR-55K, a new dataset containing 55K high-quality visual reasoning samples with faithful rationales generated using VC-STaR. The dataset spans five categories (general VQA, reasoning, math, graph/chart, and OCR) and is used to improve VLM reasoning capabilities through supervised finetuning.
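For concreteness, one training sample in such a dataset might look like the record below. All field names and values are hypothetical, inferred from the description above; the released schema may differ.

```python
# Hypothetical layout of one VisCoR-55K sample; every field name and value
# here is an illustrative assumption, not the released schema.
record = {
    "image": "charts/bar_0412.png",  # path to the source image
    "question": "Between which two years does revenue grow the most?",
    "rationale": ("The bars increase steadily, with the largest jump "
                  "between the 2018 and 2019 bars."),  # VC-STaR-refined rationale
    "answer": "2018-2019",
    "category": "graph/chart",  # one of the five task categories
}

CATEGORIES = {"general VQA", "reasoning", "math", "graph/chart", "OCR"}
```

Records of this shape would be flattened into (image, question, rationale, answer) tuples for supervised finetuning.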

Compared against 10 retrieved papers; 1 can refute.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual Contrastive Self-Taught Reasoner (VC-STaR) framework

The authors introduce VC-STaR, a self-improving framework that uses contrastive VQA pairs (two visually similar images with synonymous questions) to help VLMs identify relevant visual cues more precisely and rectify visual hallucinations in reasoning paths. The framework includes three steps: generating a coarse rationale, performing contrastive analysis, and rethinking to refine the rationale.

Contribution

Task-agnostic contrastive VQA pair curation framework

The authors develop a flexible pipeline for curating contrastive VQA pairs across diverse VQA tasks including reasoning, math, chart, and OCR. The pipeline involves data collection from 21 datasets, similarity-based pair hunting using image and question embeddings, and difficulty-based sampling to select median-difficulty samples suitable for reasoning enhancement.

Contribution

VisCoR-55K visual reasoning dataset

The authors create VisCoR-55K, a new dataset containing 55K high-quality visual reasoning samples with faithful rationales generated using VC-STaR. The dataset spans five categories (general VQA, reasoning, math, graph/chart, and OCR) and is used to improve VLM reasoning capabilities through supervised finetuning.