Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Reasoning, Vision-Language Models, Contrasting
Abstract:

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision-language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely than when given a single VQA sample. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset, and trained models will be released upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VC-STaR, a self-improving framework that uses visual contrast to reduce hallucinations in VLM-generated reasoning paths, and produces VisCoR-55K, a visual reasoning dataset for fine-tuning. According to the taxonomy, this work resides in the 'Contrastive Learning and Visual Contrast' leaf under 'Evaluation, Benchmarking, and Auxiliary Techniques'. Notably, this leaf contains only one paper (the original work itself), indicating a sparse research direction within the broader self-improving VLM landscape, which encompasses 50 papers across approximately 36 topics.

The taxonomy reveals that neighboring leaves focus on prompt optimization, reasoning evaluation, and architectural enhancements, while sibling branches address iterative refinement (e.g., self-correction mechanisms with 5 papers) and synthetic data generation (4 papers). The scope note for this leaf emphasizes 'leveraging visual contrast or contrastive pairs to enhance visual reasoning and mitigate hallucinations', explicitly excluding non-contrastive self-improvement methods. The taxonomy narrative mentions Contrast Lens as an exemplar, positioning contrastive approaches as diagnostic and interpretive tools complementary to end-to-end training loops found in denser branches like actor-critic frameworks or reward-based optimization.

Among the 30 candidates examined, the VC-STaR framework and the contrastive pair curation pipeline each showed no clear refutations across their 10 candidates, suggesting these contributions occupy relatively unexplored methodological territory. However, the VisCoR-55K dataset contribution encountered 1 refutable candidate among its 10, indicating some overlap with existing visual reasoning datasets. The limited search scope (30 candidates in total, not exhaustive) means these statistics reflect top-K semantic matches and citation expansion rather than comprehensive field coverage. The framework contributions therefore appear more distinctive than the dataset contribution within this bounded search.

Given the sparse taxonomy leaf (1 paper) and the absence of sibling papers, the contrastive self-improvement angle appears underexplored relative to denser branches like self-correction (5 papers) or reward-based optimization (5 papers). The analysis covers top-30 semantic matches, so conclusions about novelty are provisional. The framework's emphasis on visual contrast as a hallucination mitigation strategy distinguishes it from iterative refinement or synthetic data generation approaches, though the dataset contribution shows more overlap with prior work within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: self-improving visual reasoning in vision language models. The field organizes around several complementary strategies for enhancing VLM performance without extensive human annotation. One major branch focuses on iterative refinement and feedback mechanisms, where models learn to critique and revise their own outputs through self-play or internal verification loops (e.g., Calibrated Self-Rewarding[1], Self-Improving Teacher[10]). A second branch emphasizes external supervision or synthetic data generation, leveraging large-scale automated pipelines to produce training signals that guide model improvement (e.g., Self-Bootstrapped Knowledge[28], Self-Training Comprehension[21]). Modality alignment and perception enhancement address the core challenge of bridging vision and language representations, often through contrastive objectives or architectural innovations (e.g., Modality Alignment Enhancement[11], Perceiver-vl[40]). Task-specific applications demonstrate these principles in domains such as navigation, GUI interaction, and video understanding (e.g., EvolveNav[5], Active Perception GUI[22]). Finally, evaluation and auxiliary techniques provide the infrastructure for measuring progress and supporting self-improvement, including benchmarking frameworks, contrastive learning methods, and tool-augmented reasoning (e.g., Measuring Chain-of-Thought[4], Viper[14]).

Within the evaluation and auxiliary techniques branch, contrastive learning and visual contrast methods have emerged as a small but important cluster. These approaches use contrastive objectives to sharpen visual discrimination and improve reasoning by highlighting differences between similar inputs or outputs. Contrast Lens[0] exemplifies this direction by introducing mechanisms that explicitly leverage visual contrasts to enhance interpretability and reasoning quality.
This work shares thematic connections with broader evaluation efforts like Measuring Chain-of-Thought[4], which probes reasoning transparency, and with perception-focused methods such as Cropper[41], which refines visual attention. Compared to iterative refinement approaches like Calibrated Self-Rewarding[1] or task-specific systems like Spatial Reasoning Drawing[3], Contrast Lens[0] emphasizes diagnostic and interpretive tools rather than end-to-end training loops, positioning itself as a complementary technique for understanding and improving how VLMs process visual information.

Claimed Contributions

Visual Contrastive Self-Taught Reasoner (VC-STaR) framework

The authors introduce VC-STaR, a self-improving framework that uses contrastive VQA pairs (two visually similar images with synonymous questions) to help VLMs identify relevant visual cues more precisely and rectify visual hallucinations in reasoning paths. The framework includes three steps: generating a coarse rationale, performing contrastive analysis, and rethinking to refine the rationale.
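The three steps described above can be sketched as a single refinement loop. This is a minimal illustration under stated assumptions: `MockVLM` is a stand-in for a real vision-language model, and the prompts are paraphrases of the description here, not the authors' actual implementation.

```python
# Hypothetical sketch of the VC-STaR loop; MockVLM and the prompt wording
# are illustrative assumptions, not the authors' code.

class MockVLM:
    """Stand-in for a real VLM; returns canned text so the sketch runs."""
    def generate(self, prompt, image=None, images=None):
        return f"[model output for: {prompt[:40]}...]"

def vc_star(vlm, sample, partner):
    # Step 1: generate a coarse rationale for the target VQA sample.
    coarse = vlm.generate(
        prompt=f"Question: {sample['question']}\nThink step by step.",
        image=sample["image"],
    )
    # Step 2: contrastive analysis over the visually similar pair, to
    # surface the cues that actually distinguish the two images.
    contrast = vlm.generate(
        prompt=(f"Q1: {sample['question']}  Q2: {partner['question']}\n"
                "Describe the visual differences relevant to each question."),
        images=[sample["image"], partner["image"]],
    )
    # Step 3: rethinking -- revise the coarse rationale using the
    # contrastive notes, dropping visually ungrounded (hallucinated) claims.
    return vlm.generate(
        prompt=(f"Draft rationale: {coarse}\nContrastive notes: {contrast}\n"
                "Rewrite the rationale, keeping only grounded claims."),
        image=sample["image"],
    )

refined = vc_star(
    MockVLM(),
    {"image": "img_a.png", "question": "What is on the table?"},
    {"image": "img_b.png", "question": "What objects sit on the table?"},
)
```

The key design point, as described, is that the contrastive partner is consulted only at rationale-generation time; the refined rationale is then kept for single-image fine-tuning.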

Compared against 10 retrieved papers; no clear refutation found.

Task-agnostic contrastive VQA pair curation framework

The authors develop a flexible pipeline for curating contrastive VQA pairs across diverse VQA tasks including reasoning, math, chart, and OCR. The pipeline involves data collection from 21 datasets, similarity-based pair hunting using image and question embeddings, and difficulty-based sampling to select median-difficulty samples suitable for reasoning enhancement.
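The similarity-based pair-hunting step can be illustrated as follows. The embeddings below are random placeholders standing in for real image and question encoders (e.g., a CLIP-style model), and the equal weighting of the two modalities is an assumption, not the paper's stated formula.

```python
# Illustrative sketch of multi-modal similarity pair hunting; the random
# embeddings and the 0.5/0.5 modality weighting are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
img_emb = rng.normal(size=(n, d))  # image embeddings, one row per sample
txt_emb = rng.normal(size=(n, d))  # question embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img_emb, txt_emb = normalize(img_emb), normalize(txt_emb)

# Multi-modal similarity: average of image and question cosine similarities.
sim = 0.5 * (img_emb @ img_emb.T) + 0.5 * (txt_emb @ txt_emb.T)
np.fill_diagonal(sim, -np.inf)  # exclude trivial self-pairs

# For each sample, its contrastive partner is the most similar other sample.
partners = sim.argmax(axis=1)
```

In a full pipeline, the resulting pairs would then pass through the difficulty-based sampling stage to retain median-difficulty samples.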

Compared against 10 retrieved papers; no clear refutation found.

VisCoR-55K visual reasoning dataset

The authors create VisCoR-55K, a new dataset containing 55K high-quality visual reasoning samples with faithful rationales generated using VC-STaR. The dataset spans five categories (general VQA, reasoning, math, graph/chart, and OCR) and is used to improve VLM reasoning capabilities through supervised finetuning.
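For concreteness, one training sample in such a dataset might look like the record below. All field names and values are hypothetical, inferred from the description above; the released schema may differ.

```python
# Hypothetical layout of one VisCoR-55K sample; every field name and value
# here is an illustrative assumption, not the released schema.
record = {
    "image": "charts/bar_0412.png",  # path to the source image
    "question": "Between which two years does revenue grow the most?",
    "rationale": ("The bars increase steadily, with the largest jump "
                  "between the 2018 and 2019 bars."),  # VC-STaR-refined rationale
    "answer": "2018-2019",
    "category": "graph/chart",  # one of the five task categories
}

CATEGORIES = {"general VQA", "reasoning", "math", "graph/chart", "OCR"}
```

Records of this shape would be flattened into (image, question, rationale, answer) tuples for supervised finetuning.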

Compared against 10 retrieved papers; 1 can refute.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual Contrastive Self-Taught Reasoner (VC-STaR) framework

The authors introduce VC-STaR, a self-improving framework that uses contrastive VQA pairs (two visually similar images with synonymous questions) to help VLMs identify relevant visual cues more precisely and rectify visual hallucinations in reasoning paths. The framework includes three steps: generating a coarse rationale, performing contrastive analysis, and rethinking to refine the rationale.

Contribution

Task-agnostic contrastive VQA pair curation framework

The authors develop a flexible pipeline for curating contrastive VQA pairs across diverse VQA tasks including reasoning, math, chart, and OCR. The pipeline involves data collection from 21 datasets, similarity-based pair hunting using image and question embeddings, and difficulty-based sampling to select median-difficulty samples suitable for reasoning enhancement.

Contribution

VisCoR-55K visual reasoning dataset

The authors create VisCoR-55K, a new dataset containing 55K high-quality visual reasoning samples with faithful rationales generated using VC-STaR. The dataset spans five categories (general VQA, reasoning, math, graph/chart, and OCR) and is used to improve VLM reasoning capabilities through supervised finetuning.