VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multi-modal Large Language Models · Benchmark · Visual Reasoning
Abstract:

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and permit language-based reasoning shortcuts, so they fail to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems spanning six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). Together, these question types assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on the benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VisuLogic, a benchmark of 1,000 human-verified problems designed to evaluate visual reasoning in multimodal large language models. It resides in the General Reasoning and Comprehension Benchmarks leaf, which contains five papers including the original work. This leaf sits within the broader Evaluation and Benchmarking branch, indicating a moderately populated research direction focused on standardized assessment of MLLM capabilities. The benchmark targets a specific gap: measuring genuine vision-centric reasoning without language-based shortcuts, positioning it alongside other evaluation frameworks that probe diverse reasoning skills.

The taxonomy reveals neighboring leaves focused on Visual Perception and Pattern Recognition Benchmarks and Domain-Specific Benchmarks, suggesting the field has organized evaluation efforts along task-type boundaries. Sibling papers in the same leaf include SEED-Bench and MDK12-Bench, which offer broad multimodal understanding and domain-specific mathematical assessments respectively. VisuLogic differentiates itself by emphasizing structured logical inference across six reasoning categories, whereas neighbors tend toward either comprehensive coverage or specialized domains. This positioning reflects an ongoing field tension between general-purpose and capability-specific evaluation design.

Among the 30 candidates examined across the three contributions, none were identified as clearly refuting the work. Ten candidates were compared against the VisuLogic benchmark contribution with zero refutable overlaps, and the same held for the data curation pipeline and reasoning taxonomy contributions. This suggests that, within the limited search scope, no prior work directly anticipates the specific combination of human-verified visual reasoning problems organized into six categories with explicit controls against language shortcuts. The absence of refutable candidates across all contributions indicates potential novelty, though the limited search scale means undiscovered prior work remains possible.

Based on the limited literature search of 30 semantically similar papers, the work appears to occupy a distinct position within the evaluation landscape. The taxonomy structure shows a moderately populated leaf with clear boundaries separating general reasoning benchmarks from perception-focused and domain-specific alternatives. However, the analysis cannot rule out relevant prior work outside the top-30 semantic matches or in adjacent research communities not captured by the taxonomy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: visual reasoning in multimodal large language models. The field has evolved into a rich ecosystem organized around several major branches. Reasoning Paradigms and Methodologies explores how models perform step-by-step inference, often drawing on chain-of-thought techniques and visual prompting strategies (Visual Prompting Survey[4]). Cognitive Capabilities and Perception examines the perceptual foundations that enable models to understand spatial relations, temporal dynamics, and compositional structures (Visual Cognition[2]). Model Architectures and Training addresses the design choices and learning strategies that underpin these systems (Multimodal LLM Survey[1]). Evaluation and Benchmarking provides the testbeds and metrics needed to measure progress, while Application Domains and Specialized Tasks targets real-world scenarios such as robotics, mathematics, and embodied agents. Security and Adversarial Analysis investigates robustness, and Surveys and Overviews synthesize cross-cutting themes.

Within Evaluation and Benchmarking, a dense cluster of works has emerged around General Reasoning and Comprehension Benchmarks, reflecting the community's need for standardized assessments that probe diverse reasoning skills. Some benchmarks emphasize broad multimodal understanding (SEED-Bench[37], Benchmark Survey[6]), while others target domain-specific challenges like mathematical problem-solving (MDK12-Bench[24]) or temporal reasoning (Visual Temporal Understanding[8]).

VisuLogic[0] sits squarely in this evaluation-focused branch, contributing a benchmark designed to test logical reasoning over visual inputs. Compared to neighbors such as SEED-Bench[37], which offers a wide-ranging testbed, and MDK12-Bench[24], which zeroes in on educational content, VisuLogic[0] carves out a niche by emphasizing structured logical inference. This positioning highlights an ongoing tension in the field: whether to build comprehensive, general-purpose benchmarks or to design targeted evaluations that isolate specific reasoning capabilities, a question that continues to shape how researchers measure and improve multimodal models.

Claimed Contributions

VisuLogic benchmark for visual reasoning evaluation

The authors introduce VisuLogic, a new benchmark consisting of 1,000 human-verified visual reasoning problems organized into six categories. This benchmark is designed to evaluate genuine vision-centric reasoning in multimodal large language models without allowing language-based reasoning shortcuts.

10 retrieved papers
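
The benchmark's headline numbers (most models below 30% accuracy, a 25% random baseline, 51.4% human accuracy) imply a four-option multiple-choice protocol. As a rough illustration only, the sketch below shows how accuracy could be computed and compared against such a baseline; the item format, `model_answer` callable, and `random_baseline` guesser are hypothetical and not taken from the paper.

```python
import random
from typing import Callable

# Hypothetical item format: image path, question text, four options, gold index.
Item = dict  # {"image": str, "question": str, "options": list[str], "answer": int}

def evaluate(items: list[Item], model_answer: Callable[[Item], int]) -> float:
    """Return accuracy of `model_answer` over multiple-choice items."""
    correct = sum(1 for item in items if model_answer(item) == item["answer"])
    return correct / len(items)

def random_baseline(item: Item) -> int:
    """Uniform guessing over the four options (~25% expected accuracy)."""
    return random.randrange(len(item["options"]))

if __name__ == "__main__":
    # Toy items standing in for the 1,000 VisuLogic problems.
    items = [
        {"image": "ex1.png", "question": "Which figure continues the sequence?",
         "options": ["A", "B", "C", "D"], "answer": 2},
        {"image": "ex2.png", "question": "Which shape differs in style?",
         "options": ["A", "B", "C", "D"], "answer": 0},
    ]
    acc = evaluate(items, random_baseline)
    print(f"model accuracy: {acc:.1%}  (chance ~25.0%, reported human accuracy 51.4%)")
```
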
Data curation pipeline with quality control

The authors develop a three-stage automated data processing pipeline for collecting and structuring benchmark data, combined with a quality control procedure involving image verification, duplicate removal, and manual inspection to ensure dataset reliability.

10 retrieved papers
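
The report names the quality-control steps (image verification, duplicate removal, manual inspection) but not the three automated stages themselves. Below is a minimal sketch of how such a curation flow could be wired together, assuming Pillow is available for image checks; the function names and stage ordering are illustrative assumptions, not the authors' pipeline.

```python
import hashlib
from pathlib import Path
from PIL import Image  # Pillow; used only to check that files decode as images

def verify_image(path: Path) -> bool:
    """Quality-control step: keep only files that decode as valid images."""
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        return False

def remove_duplicates(paths: list[Path]) -> list[Path]:
    """Quality-control step: drop byte-identical images via content hashing."""
    seen, unique = set(), []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

def curate(raw_dir: str) -> list[Path]:
    """Hypothetical curation flow: verify -> deduplicate -> queue for manual inspection."""
    candidates = sorted(Path(raw_dir).glob("*.png"))
    valid = [p for p in candidates if verify_image(p)]
    unique = remove_duplicates(valid)
    # Manual inspection is a human step; here we only emit a review list.
    for path in unique:
        print(f"queued for manual inspection: {path}")
    return unique
```

Calling `curate("raw_images/")` on a folder of candidate images would print the review queue that a human annotator could then inspect by hand.
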
Taxonomy of visual reasoning categories

The authors establish a taxonomy that classifies visual reasoning questions into six primary categories (Quantitative, Spatial, Positional, Attribute, Stylistic, and Other) based on expert annotation of required reasoning competencies, providing a structured framework for evaluating different aspects of visual reasoning.

10 retrieved papers
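
Because the six categories are the unit at which the taxonomy organizes reasoning skills, per-category score breakdowns are the natural way to consume evaluation results. The sketch below encodes the six category names from the paper and aggregates accuracy by category; the result-record format is an assumption made for illustration.

```python
from collections import defaultdict
from enum import Enum

class Category(Enum):
    """The six primary reasoning categories named in the paper's taxonomy."""
    QUANTITATIVE = "Quantitative"
    SPATIAL = "Spatial"
    POSITIONAL = "Positional"
    ATTRIBUTE = "Attribute"
    STYLISTIC = "Stylistic"
    OTHER = "Other"

def accuracy_by_category(results: list[dict]) -> dict[Category, float]:
    """Aggregate per-category accuracy from (category, correct) result records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for record in results:  # each record: {"category": Category, "correct": bool}
        totals[record["category"]] += 1
        correct[record["category"]] += int(record["correct"])
    return {cat: correct[cat] / totals[cat] for cat in totals}

# Example usage with toy results.
results = [
    {"category": Category.SPATIAL, "correct": True},
    {"category": Category.SPATIAL, "correct": False},
    {"category": Category.QUANTITATIVE, "correct": False},
]
for cat, acc in accuracy_by_category(results).items():
    print(f"{cat.value:>12}: {acc:.1%}")
```
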

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: VisuLogic benchmark for visual reasoning evaluation

The authors introduce VisuLogic, a new benchmark consisting of 1,000 human-verified visual reasoning problems organized into six categories. This benchmark is designed to evaluate genuine vision-centric reasoning in multimodal large language models without allowing language-based reasoning shortcuts.

Contribution: Data curation pipeline with quality control

The authors develop a three-stage automated data processing pipeline for collecting and structuring benchmark data, combined with a quality control procedure involving image verification, duplicate removal, and manual inspection to ensure dataset reliability.

Contribution: Taxonomy of visual reasoning categories

The authors establish a taxonomy that classifies visual reasoning questions into six primary categories (Quantitative, Spatial, Positional, Attribute, Stylistic, and Other) based on expert annotation of required reasoning competencies, providing a structured framework for evaluating different aspects of visual reasoning.