VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Figure-based Mathematical Reasoning, Large Multimodal Models, Mathematical Benchmark
Abstract:

Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K–12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image–text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image–text integration.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VisioMath, a benchmark of 1,800 K–12 mathematics problems where all answer choices are visually similar diagrams, targeting fine-grained comparative reasoning. It resides in the Visual Perception and Dependency Evaluation leaf, which contains five papers total. This leaf focuses on isolating visual comprehension from end-to-end reasoning, making it a moderately populated but highly specialized research direction. The taxonomy shows this is one of five benchmark-focused leaves, indicating active but not overcrowded development in diagnostic evaluation tools for multimodal mathematical reasoning.

The Visual Perception and Dependency Evaluation leaf sits within the broader Benchmark Development and Evaluation branch, which also includes General Mathematical Reasoning Benchmarks (five papers) and Domain-Specific Benchmarks (four papers). Neighboring leaves emphasize breadth or domain specialization, whereas VisioMath's leaf explicitly targets visual dependency diagnosis. The scope note clarifies this leaf excludes end-to-end reasoning benchmarks, positioning VisioMath alongside works that probe whether models genuinely ground reasoning in visual inputs. This structural placement suggests the paper addresses a recognized gap in isolating perceptual failures from broader reasoning errors.

Across the 30 candidates examined (10 per contribution), the benchmark contribution has one refutable candidate, while the evaluation and alignment-strategy contributions show no clear refutations among their 10 candidates each. The single refutable candidate suggests that some prior work on evaluation with visually similar diagrams exists, though the limited search scope (top-30 semantic matches) means this does not constitute exhaustive coverage. Within the examined set, the evaluation and strategy contributions therefore appear more novel.

Based on the limited search scope, the work appears to occupy a recognized but not densely populated niche. The taxonomy structure and sibling papers indicate active interest in visual dependency diagnosis, yet the contribution-level statistics suggest the specific focus on visually similar diagrams and alignment-oriented interventions may offer incremental advances over existing diagnostic benchmarks. The analysis covers top-30 semantic matches and does not claim exhaustive field coverage.
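The candidate sets above come from top-30 semantic matching. As a generic illustration only (the report's actual retrieval pipeline is not disclosed), ranking candidate papers by cosine similarity of their embeddings against the query paper might look like:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, corpus, k=30):
    """Return the ids of the k papers whose embeddings are most
    similar to the query paper's embedding."""
    ranked = sorted(corpus, key=lambda pid: cosine(query_vec, corpus[pid]), reverse=True)
    return ranked[:k]

# Toy corpus of paper-id -> embedding; real embeddings would come
# from a text-embedding model over titles and abstracts.
corpus = {"paper1": [1.0, 0.0], "paper2": [0.0, 1.0], "paper3": [0.7, 0.7]}
print(top_k([1.0, 0.0], corpus, k=2))  # ['paper1', 'paper3']
```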

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: figure-based mathematical reasoning with visually similar diagrams. This field addresses the challenge of enabling models to solve mathematical problems where visual diagrams play a critical role, particularly when diagrams appear superficially similar but encode different mathematical relationships. The taxonomy reveals six main branches that collectively map the landscape:

- Benchmark Development and Evaluation: creating datasets and metrics to assess visual-mathematical capabilities, with works like MATH-Vision[1] and Mathverse[3] establishing standardized testbeds.
- Model Training and Alignment: techniques for improving multimodal model performance through specialized training regimes.
- Reasoning Frameworks and Methodologies: structured approaches such as visual programming and neuro-symbolic methods that bridge perception and symbolic reasoning.
- Cognitive and Educational Perspectives: how humans process mathematical diagrams, drawing on studies like Visual Heuristics Primary[2].
- Theoretical Foundations: formal grounding for diagrammatic reasoning.
- Failure Analysis: systematic weaknesses in current systems, as highlighted by Math Blind[49].

Recent activity concentrates on benchmark construction and evaluation methods that probe visual dependency: whether models genuinely rely on diagram content or exploit spurious textual cues. VisioMath[0] sits squarely within the Visual Perception and Dependency Evaluation cluster, emphasizing the challenge of visually similar diagrams that require fine-grained perceptual discrimination. This contrasts with broader geometric benchmarks like GeoPQA[20], which covers diverse problem types, and complements diagnostic tools such as Visual Dependency Benchmark[14] and VisAidMath[27], which systematically ablate visual information to expose model reliance patterns.
A key tension across these works involves balancing dataset scale with controlled diagnostic power: while large-scale benchmarks like We-Math[23] offer breadth, targeted evaluations reveal that many models still struggle with genuine visual grounding, motivating ongoing efforts to design benchmarks that isolate perceptual reasoning from pattern matching.

Claimed Contributions

VisioMath benchmark for figure-based mathematical reasoning

The authors introduce VisioMath, a novel benchmark dataset containing 1,800 K–12 mathematics problems where all answer options are presented as visually similar diagrams. This benchmark is designed to evaluate LMMs' capacity for fine-grained comparative reasoning and diagram understanding in educational contexts.

Retrieved papers: 10 (verdict: Can Refute)
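For concreteness, a hypothetical record shape for such an item, where the stem is text and every answer option is a diagram, together with an exact-match scorer (field names are illustrative, not the benchmark's released format):

```python
# Hypothetical VisioMath-style item: a textual stem, four diagram
# options (represented here by image paths), and a letter label.
item = {
    "question": "Which figure shows the graph of y = |x - 1|?",
    "options": {"A": "opt_a.png", "B": "opt_b.png",
                "C": "opt_c.png", "D": "opt_d.png"},
    "answer": "B",
}

def accuracy(predictions, dataset):
    """Exact-match accuracy over predicted option letters."""
    hits = sum(p == ex["answer"] for p, ex in zip(predictions, dataset))
    return hits / len(dataset)
```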
Comprehensive evaluation revealing image–text misalignment failures

The authors conduct a systematic evaluation of leading closed-source and open-source LMMs on VisioMath, demonstrating that models exhibit consistent accuracy decline as inter-image similarity increases. Their analysis identifies image–text misalignment as the dominant failure mode, where models resort to shallow positional heuristics rather than grounding reasoning in textual cues.

Retrieved papers: 10 (no clear refutations)
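The reported trend (accuracy falling as inter-image similarity rises) can be checked with a simple binning analysis. A minimal sketch, assuming each evaluated item carries a precomputed mean pairwise similarity of its option diagrams and a correctness flag; both field names and bin edges are hypothetical:

```python
from statistics import mean

def accuracy_by_similarity(records, edges=(0.0, 0.5, 0.8, 1.01)):
    """Bucket items by the mean pairwise similarity of their answer
    diagrams and report accuracy per bucket. The last edge sits just
    above 1 so a similarity of exactly 1.0 lands in the final bucket."""
    buckets = {}
    for lo, hi in zip(edges, edges[1:]):
        hits = [r["correct"] for r in records if lo <= r["similarity"] < hi]
        buckets[(lo, hi)] = mean(hits) if hits else None
    return buckets

records = [
    {"similarity": 0.40, "correct": 1},
    {"similarity": 0.60, "correct": 1},
    {"similarity": 0.85, "correct": 0},
    {"similarity": 0.90, "correct": 0},
]
```

A monotone decrease across buckets would reproduce the paper's headline observation.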
Alignment-oriented strategies for improving multi-image reasoning

The authors explore three complementary strategies to mitigate image–text misalignment: consolidating multiple images into a single layout, establishing explicit visual–textual anchors, and fine-tuning with an alignment-oriented multi-image chain-of-thought dataset. These methods achieve substantial accuracy gains, with limited CoT fine-tuning yielding a +12.6% improvement.

Retrieved papers: 10 (no clear refutations)
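Of the three strategies, explicit visual–textual anchoring can be illustrated as prompt construction: interleave each option's label with its diagram instead of appending a bare list of images. A sketch under assumed message-part conventions (the dict shapes are illustrative, not a specific model API):

```python
def build_anchored_prompt(question: str, option_images: list) -> list:
    """Interleave each option label with its diagram so the model
    receives an explicit textual anchor ("Option B:") immediately
    before the corresponding image."""
    parts = [{"type": "text", "text": question}]
    for label, image in zip("ABCD", option_images):
        parts.append({"type": "text", "text": f"Option {label}:"})
        parts.append({"type": "image", "image": image})
    return parts

prompt = build_anchored_prompt(
    "Which graph shows y = x**2?", ["imgA", "imgB", "imgC", "imgD"]
)
```

The aim is to give the model a reliable binding between the option letters in the question text and the individual diagrams, rather than leaving the correspondence to positional guesswork.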

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: VisioMath benchmark for figure-based mathematical reasoning

Contribution: Comprehensive evaluation revealing image–text misalignment failures

Contribution: Alignment-oriented strategies for improving multi-image reasoning
