VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Figure-based Mathematical Reasoning, Large Multimodal Models, Mathematical Benchmark
Abstract:

Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K–12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image–text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image–text integration.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VisioMath, a benchmark of 1,800 K–12 mathematics problems where all answer choices are visually similar diagrams, targeting fine-grained comparative reasoning. It resides in the Visual Perception and Dependency Evaluation leaf, which contains five papers total. This leaf focuses on isolating visual comprehension from end-to-end reasoning, making it a moderately populated but highly specialized research direction. The taxonomy shows this is one of five benchmark-focused leaves, indicating active but not overcrowded development in diagnostic evaluation tools for multimodal mathematical reasoning.

The Visual Perception and Dependency Evaluation leaf sits within the broader Benchmark Development and Evaluation branch, which also includes General Mathematical Reasoning Benchmarks (five papers) and Domain-Specific Benchmarks (four papers). Neighboring leaves emphasize breadth or domain specialization, whereas VisioMath's leaf explicitly targets visual dependency diagnosis. The scope note clarifies this leaf excludes end-to-end reasoning benchmarks, positioning VisioMath alongside works that probe whether models genuinely ground reasoning in visual inputs. This structural placement suggests the paper addresses a recognized gap in isolating perceptual failures from broader reasoning errors.

Across the 30 candidates examined (10 per contribution), the benchmark contribution has one refutable candidate, while the evaluation and alignment-strategy contributions show no clear refutations among their 10 candidates each. The single refutable candidate suggests that some prior work on evaluation with visually similar diagrams exists, though the limited search scope (top-30 semantic matches) means this does not constitute exhaustive coverage. Within the examined set, the evaluation and strategy contributions therefore appear more novel.

Based on the limited search scope, the work appears to occupy a recognized but not densely populated niche. The taxonomy structure and sibling papers indicate active interest in visual dependency diagnosis, yet the contribution-level statistics suggest the specific focus on visually similar diagrams and alignment-oriented interventions may offer incremental advances over existing diagnostic benchmarks. The analysis covers top-30 semantic matches and does not claim exhaustive field coverage.
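The candidate sets above come from top-30 semantic matching. As a generic illustration only (the report's actual retrieval pipeline is not disclosed), ranking candidate papers by cosine similarity of their embeddings against the query paper might look like:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, corpus, k=30):
    """Return the ids of the k papers whose embeddings are most
    similar to the query paper's embedding."""
    ranked = sorted(corpus, key=lambda pid: cosine(query_vec, corpus[pid]), reverse=True)
    return ranked[:k]

# Toy corpus of paper-id -> embedding; real embeddings would come
# from a text-embedding model over titles and abstracts.
corpus = {"paper1": [1.0, 0.0], "paper2": [0.0, 1.0], "paper3": [0.7, 0.7]}
print(top_k([1.0, 0.0], corpus, k=2))  # ['paper1', 'paper3']
```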

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: figure-based mathematical reasoning with visually similar diagrams. This field addresses the challenge of enabling models to solve mathematical problems where visual diagrams play a critical role, particularly when diagrams appear superficially similar but encode different mathematical relationships. The taxonomy reveals six main branches that collectively map the landscape:

- Benchmark Development and Evaluation: creating datasets and metrics to assess visual-mathematical capabilities, with works like MATH-Vision[1] and Mathverse[3] establishing standardized testbeds.
- Model Training and Alignment: techniques for improving multimodal model performance through specialized training regimes.
- Reasoning Frameworks and Methodologies: structured approaches such as visual programming and neuro-symbolic methods that bridge perception and symbolic reasoning.
- Cognitive and Educational Perspectives: how humans process mathematical diagrams, drawing on studies like Visual Heuristics Primary[2].
- Theoretical Foundations: formal grounding for diagrammatic reasoning.
- Failure Analysis: systematic weaknesses in current systems, as highlighted by Math Blind[49].

Recent activity concentrates on benchmark construction and evaluation methods that probe visual dependency: whether models genuinely rely on diagram content or exploit spurious textual cues. VisioMath[0] sits squarely within the Visual Perception and Dependency Evaluation cluster, emphasizing the challenge of visually similar diagrams that require fine-grained perceptual discrimination. This contrasts with broader geometric benchmarks like GeoPQA[20], which covers diverse problem types, and complements diagnostic tools such as Visual Dependency Benchmark[14] and VisAidMath[27], which systematically ablate visual information to expose model reliance patterns.
A key tension across these works involves balancing dataset scale with controlled diagnostic power: while large-scale benchmarks like We-Math[23] offer breadth, targeted evaluations reveal that many models still struggle with genuine visual grounding, motivating ongoing efforts to design benchmarks that isolate perceptual reasoning from pattern matching.

Claimed Contributions

VisioMath benchmark for figure-based mathematical reasoning

The authors introduce VisioMath, a novel benchmark dataset containing 1,800 K–12 mathematics problems where all answer options are presented as visually similar diagrams. This benchmark is designed to evaluate LMMs' capacity for fine-grained comparative reasoning and diagram understanding in educational contexts.

Retrieved papers: 10 (verdict: Can Refute)
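For concreteness, a hypothetical record shape for such an item, where the stem is text and every answer option is a diagram, together with an exact-match scorer (field names are illustrative, not the benchmark's released format):

```python
# Hypothetical VisioMath-style item: a textual stem, four diagram
# options (represented here by image paths), and a letter label.
item = {
    "question": "Which figure shows the graph of y = |x - 1|?",
    "options": {"A": "opt_a.png", "B": "opt_b.png",
                "C": "opt_c.png", "D": "opt_d.png"},
    "answer": "B",
}

def accuracy(predictions, dataset):
    """Exact-match accuracy over predicted option letters."""
    hits = sum(p == ex["answer"] for p, ex in zip(predictions, dataset))
    return hits / len(dataset)
```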
Comprehensive evaluation revealing image–text misalignment failures

The authors conduct a systematic evaluation of leading closed-source and open-source LMMs on VisioMath, demonstrating that models exhibit consistent accuracy decline as inter-image similarity increases. Their analysis identifies image–text misalignment as the dominant failure mode, where models resort to shallow positional heuristics rather than grounding reasoning in textual cues.

Retrieved papers: 10 (no clear refutations)
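The reported trend (accuracy falling as inter-image similarity rises) can be checked with a simple binning analysis. A minimal sketch, assuming each evaluated item carries a precomputed mean pairwise similarity of its option diagrams and a correctness flag; both field names and bin edges are hypothetical:

```python
from statistics import mean

def accuracy_by_similarity(records, edges=(0.0, 0.5, 0.8, 1.01)):
    """Bucket items by the mean pairwise similarity of their answer
    diagrams and report accuracy per bucket. The last edge sits just
    above 1 so a similarity of exactly 1.0 lands in the final bucket."""
    buckets = {}
    for lo, hi in zip(edges, edges[1:]):
        hits = [r["correct"] for r in records if lo <= r["similarity"] < hi]
        buckets[(lo, hi)] = mean(hits) if hits else None
    return buckets

records = [
    {"similarity": 0.40, "correct": 1},
    {"similarity": 0.60, "correct": 1},
    {"similarity": 0.85, "correct": 0},
    {"similarity": 0.90, "correct": 0},
]
```

A monotone decrease across buckets would reproduce the paper's headline observation.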
Alignment-oriented strategies for improving multi-image reasoning

The authors explore three complementary strategies to mitigate image–text misalignment: consolidating multiple images into a single layout, establishing explicit visual–textual anchors, and fine-tuning with an alignment-oriented multi-image chain-of-thought dataset. These methods achieve substantial accuracy gains, with limited CoT fine-tuning yielding a +12.6% improvement.

Retrieved papers: 10 (no clear refutations)
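Of the three strategies, explicit visual–textual anchoring can be illustrated as prompt construction: interleave each option's label with its diagram instead of appending a bare list of images. A sketch under assumed message-part conventions (the dict shapes are illustrative, not a specific model API):

```python
def build_anchored_prompt(question: str, option_images: list) -> list:
    """Interleave each option label with its diagram so the model
    receives an explicit textual anchor ("Option B:") immediately
    before the corresponding image."""
    parts = [{"type": "text", "text": question}]
    for label, image in zip("ABCD", option_images):
        parts.append({"type": "text", "text": f"Option {label}:"})
        parts.append({"type": "image", "image": image})
    return parts

prompt = build_anchored_prompt(
    "Which graph shows y = x**2?", ["imgA", "imgB", "imgC", "imgD"]
)
```

The aim is to give the model a reliable binding between the option letters in the question text and the individual diagrams, rather than leaving the correspondence to positional guesswork.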

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: VisioMath benchmark for figure-based mathematical reasoning

Contribution: Comprehensive evaluation revealing image–text misalignment failures

Contribution: Alignment-oriented strategies for improving multi-image reasoning
