VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multi-modal Large Language Models · Benchmark · Visual Reasoning
Abstract:

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and permit language-based reasoning shortcuts, so they fail to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems spanning six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). Together, these question types assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on the benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VisuLogic, a benchmark of 1,000 human-verified problems designed to evaluate visual reasoning in multimodal large language models. It resides in the General Reasoning and Comprehension Benchmarks leaf, which contains five papers including the original work. This leaf sits within the broader Evaluation and Benchmarking branch, indicating a moderately populated research direction focused on standardized assessment of MLLM capabilities. The benchmark targets a specific gap: measuring genuine vision-centric reasoning without language-based shortcuts, positioning it alongside other evaluation frameworks that probe diverse reasoning skills.

The taxonomy reveals neighboring leaves focused on Visual Perception and Pattern Recognition Benchmarks and Domain-Specific Benchmarks, suggesting the field has organized evaluation efforts along task-type boundaries. Sibling papers in the same leaf include SEED-Bench and MDK12-Bench, which offer broad multimodal understanding and domain-specific mathematical assessments respectively. VisuLogic differentiates itself by emphasizing structured logical inference across six reasoning categories, whereas neighbors tend toward either comprehensive coverage or specialized domains. This positioning reflects an ongoing field tension between general-purpose and capability-specific evaluation design.

Among the 30 candidates examined across the three contributions, none were identified as clearly refuting the work. Ten candidates were compared against the VisuLogic benchmark contribution with zero refutable overlaps, and the same held for the data curation pipeline and reasoning taxonomy contributions. This suggests that, within the limited search scope, no prior work directly anticipates the specific combination of human-verified visual reasoning problems organized into six categories with explicit controls against language shortcuts. The absence of refutable candidates across all contributions indicates potential novelty, though the limited search scale means undiscovered prior work remains possible.

Based on the limited literature search of 30 semantically similar papers, the work appears to occupy a distinct position within the evaluation landscape. The taxonomy structure shows a moderately populated leaf with clear boundaries separating general reasoning benchmarks from perception-focused and domain-specific alternatives. However, the analysis cannot rule out relevant prior work outside the top-30 semantic matches or in adjacent research communities not captured by the taxonomy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: visual reasoning in multimodal large language models. The field has evolved into a rich ecosystem organized around several major branches. Reasoning Paradigms and Methodologies explores how models perform step-by-step inference, often drawing on chain-of-thought techniques and visual prompting strategies (Visual Prompting Survey[4]). Cognitive Capabilities and Perception examines the perceptual foundations that enable models to understand spatial relations, temporal dynamics, and compositional structures (Visual Cognition[2]). Model Architectures and Training addresses the design choices and learning strategies that underpin these systems (Multimodal LLM Survey[1]). Evaluation and Benchmarking provides the testbeds and metrics needed to measure progress, while Application Domains and Specialized Tasks targets real-world scenarios such as robotics, mathematics, and embodied agents. Security and Adversarial Analysis investigates robustness, and Surveys and Overviews synthesize cross-cutting themes.

Within Evaluation and Benchmarking, a dense cluster of works has emerged around General Reasoning and Comprehension Benchmarks, reflecting the community's need for standardized assessments that probe diverse reasoning skills. Some benchmarks emphasize broad multimodal understanding (SEED-Bench[37], Benchmark Survey[6]), while others target domain-specific challenges like mathematical problem-solving (MDK12-Bench[24]) or temporal reasoning (Visual Temporal Understanding[8]).

VisuLogic[0] sits squarely in this evaluation-focused branch, contributing a benchmark designed to test logical reasoning over visual inputs. Compared to neighbors such as SEED-Bench[37], which offers a wide-ranging testbed, and MDK12-Bench[24], which zeroes in on educational content, VisuLogic[0] carves out a niche by emphasizing structured logical inference. This positioning highlights an ongoing tension in the field: whether to build comprehensive, general-purpose benchmarks or to design targeted evaluations that isolate specific reasoning capabilities, a question that continues to shape how researchers measure and improve multimodal models.

Claimed Contributions

VisuLogic benchmark for visual reasoning evaluation

The authors introduce VisuLogic, a new benchmark consisting of 1,000 human-verified visual reasoning problems organized into six categories. This benchmark is designed to evaluate genuine vision-centric reasoning in multimodal large language models without allowing language-based reasoning shortcuts.

10 retrieved papers
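
The benchmark's headline numbers (most models below 30% accuracy, a 25% random baseline, 51.4% human accuracy) imply a four-option multiple-choice protocol. As a rough illustration only, the sketch below shows how accuracy could be computed and compared against such a baseline; the item format, `model_answer` callable, and `random_baseline` guesser are hypothetical and not taken from the paper.

```python
import random
from typing import Callable

# Hypothetical item format: image path, question text, four options, gold index.
Item = dict  # {"image": str, "question": str, "options": list[str], "answer": int}

def evaluate(items: list[Item], model_answer: Callable[[Item], int]) -> float:
    """Return accuracy of `model_answer` over multiple-choice items."""
    correct = sum(1 for item in items if model_answer(item) == item["answer"])
    return correct / len(items)

def random_baseline(item: Item) -> int:
    """Uniform guessing over the four options (~25% expected accuracy)."""
    return random.randrange(len(item["options"]))

if __name__ == "__main__":
    # Toy items standing in for the 1,000 VisuLogic problems.
    items = [
        {"image": "ex1.png", "question": "Which figure continues the sequence?",
         "options": ["A", "B", "C", "D"], "answer": 2},
        {"image": "ex2.png", "question": "Which shape differs in style?",
         "options": ["A", "B", "C", "D"], "answer": 0},
    ]
    acc = evaluate(items, random_baseline)
    print(f"model accuracy: {acc:.1%}  (chance ~25.0%, reported human accuracy 51.4%)")
```
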
Data curation pipeline with quality control

The authors develop a three-stage automated data processing pipeline for collecting and structuring benchmark data, combined with a quality control procedure involving image verification, duplicate removal, and manual inspection to ensure dataset reliability.

10 retrieved papers
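
The report names the quality-control steps (image verification, duplicate removal, manual inspection) but not the three automated stages themselves. Below is a minimal sketch of how such a curation flow could be wired together, assuming Pillow is available for image checks; the function names and stage ordering are illustrative assumptions, not the authors' pipeline.

```python
import hashlib
from pathlib import Path
from PIL import Image  # Pillow; used only to check that files decode as images

def verify_image(path: Path) -> bool:
    """Quality-control step: keep only files that decode as valid images."""
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        return False

def remove_duplicates(paths: list[Path]) -> list[Path]:
    """Quality-control step: drop byte-identical images via content hashing."""
    seen, unique = set(), []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

def curate(raw_dir: str) -> list[Path]:
    """Hypothetical curation flow: verify -> deduplicate -> queue for manual inspection."""
    candidates = sorted(Path(raw_dir).glob("*.png"))
    valid = [p for p in candidates if verify_image(p)]
    unique = remove_duplicates(valid)
    # Manual inspection is a human step; here we only emit a review list.
    for path in unique:
        print(f"queued for manual inspection: {path}")
    return unique
```

Calling `curate("raw_images/")` on a folder of candidate images would print the review queue that a human annotator could then inspect by hand.
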
Taxonomy of visual reasoning categories

The authors establish a taxonomy that classifies visual reasoning questions into six primary categories (Quantitative, Spatial, Positional, Attribute, Stylistic, and Other) based on expert annotation of required reasoning competencies, providing a structured framework for evaluating different aspects of visual reasoning.

10 retrieved papers
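
Because the six categories are the unit at which the taxonomy organizes reasoning skills, per-category score breakdowns are the natural way to consume evaluation results. The sketch below encodes the six category names from the paper and aggregates accuracy by category; the result-record format is an assumption made for illustration.

```python
from collections import defaultdict
from enum import Enum

class Category(Enum):
    """The six primary reasoning categories named in the paper's taxonomy."""
    QUANTITATIVE = "Quantitative"
    SPATIAL = "Spatial"
    POSITIONAL = "Positional"
    ATTRIBUTE = "Attribute"
    STYLISTIC = "Stylistic"
    OTHER = "Other"

def accuracy_by_category(results: list[dict]) -> dict[Category, float]:
    """Aggregate per-category accuracy from (category, correct) result records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for record in results:  # each record: {"category": Category, "correct": bool}
        totals[record["category"]] += 1
        correct[record["category"]] += int(record["correct"])
    return {cat: correct[cat] / totals[cat] for cat in totals}

# Example usage with toy results.
results = [
    {"category": Category.SPATIAL, "correct": True},
    {"category": Category.SPATIAL, "correct": False},
    {"category": Category.QUANTITATIVE, "correct": False},
]
for cat, acc in accuracy_by_category(results).items():
    print(f"{cat.value:>12}: {acc:.1%}")
```
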

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: VisuLogic benchmark for visual reasoning evaluation

The authors introduce VisuLogic, a new benchmark consisting of 1,000 human-verified visual reasoning problems organized into six categories. This benchmark is designed to evaluate genuine vision-centric reasoning in multimodal large language models without allowing language-based reasoning shortcuts.

Contribution: Data curation pipeline with quality control

The authors develop a three-stage automated data processing pipeline for collecting and structuring benchmark data, combined with a quality control procedure involving image verification, duplicate removal, and manual inspection to ensure dataset reliability.

Contribution: Taxonomy of visual reasoning categories

The authors establish a taxonomy that classifies visual reasoning questions into six primary categories (Quantitative, Spatial, Positional, Attribute, Stylistic, and Other) based on expert annotation of required reasoning competencies, providing a structured framework for evaluating different aspects of visual reasoning.