We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mathematical Reasoning, Multimodal Large Language Models
Abstract:

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various tasks but still struggle with complex mathematical reasoning. Prior work has mainly focused on dataset construction and method optimization, while often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. We introduce WE-MATH 2.0, a unified system that integrates a structured mathematical knowledge hierarchy, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to enhance the mathematical reasoning abilities of MLLMs. Our contributions are fourfold: (1) MathBook Knowledge System: a five-level hierarchy covering 491 knowledge points and 1,819 fundamental principles; (2) MathBook-Standard and MathBook-Pro: datasets that ensure broad conceptual coverage and robust training through dual expansion, a three-dimensional difficulty space, and seven progressive variants per problem; (3) MathBook-RL: a two-stage RL framework including Cold-Start Fine-Tuning to align models with knowledge-oriented chain-of-thought reasoning, and Progressive Alignment RL leveraging average-reward learning with dynamic data scheduling for progressive difficulty alignment; (4) MathBookEval: a benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL achieves competitive performance on four widely used benchmarks and demonstrates strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WE-MATH 2.0, a unified system combining a five-level knowledge hierarchy (491 knowledge points, 1,819 principles), dual datasets with progressive difficulty variants, and a two-stage reinforcement learning framework. It resides in the 'Knowledge-Driven and Hierarchical Evaluation Frameworks' leaf alongside three sibling papers. This leaf represents a focused research direction within the broader Benchmark Development branch, emphasizing structured knowledge organization over flat evaluation protocols. The taxonomy contains 50 papers across multiple branches, indicating a moderately populated field with distinct methodological clusters.

The taxonomy reveals that the paper's leaf sits within Benchmark Development and Evaluation, adjacent to leaves covering comprehensive multi-domain benchmarks (five papers), specialized domain benchmarks (six papers), and multi-visual context benchmarks (two papers). Neighboring branches include Model Training and Optimization (with supervised fine-tuning, reinforcement learning, and multimodal pre-training subcategories) and Reasoning Enhancement Techniques (chain-of-thought, visual reasoning, modular architectures). The scope_note for the paper's leaf explicitly includes 'structured knowledge hierarchies' and 'multi-level difficulty modeling,' distinguishing it from flat benchmark construction. This placement suggests the work bridges evaluation and training concerns through its knowledge-driven design.

Of the 30 candidates examined (ten per contribution), the MathBook Knowledge System contribution yields one refutable candidate, indicating that some prior work on hierarchical knowledge structures exists within the limited search scope. The MathBook-Standard/Pro datasets and the MathBook-RL framework each yield zero refutations among their ten candidates, suggesting these contributions may occupy less crowded territory. These statistics reflect a targeted semantic search rather than exhaustive coverage, so the absence of refutations for two contributions does not guarantee novelty; it only indicates limited overlap within the examined candidate pool. The knowledge system's single refutation may indicate an incremental refinement of existing hierarchical approaches rather than a wholly new structure.

Based on the limited search scope of 30 semantically similar papers, the work appears to integrate multiple established research threads—knowledge hierarchies, dataset construction, and reinforcement learning—into a unified system. The contribution-level statistics suggest the training framework and dual-dataset design may be less directly anticipated by prior work than the knowledge hierarchy component. However, the analysis cannot rule out relevant work outside the top-30 semantic matches or in adjacent research communities not captured by the search strategy.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: visual mathematical reasoning in multimodal large language models. The field has organized itself around five major branches that reflect distinct research priorities. Benchmark Development and Evaluation focuses on creating datasets and evaluation protocols to measure model capabilities, often emphasizing knowledge-driven or hierarchical frameworks that test diverse mathematical skills across visual modalities; works like MathVista[5], MathVerse[6], and Math-Vision Dataset[2] exemplify this direction. Model Training and Optimization addresses how to build and refine multimodal architectures, including data curation strategies and training recipes, as seen in Math-LLaVA[3] and MultiMath[4]. Reasoning Enhancement Techniques explores methods to improve step-by-step problem solving, such as chain-of-thought prompting, visual intermediate representations in Visual Sketchpad[7], and reinforcement learning approaches like Vision-r1[1]. Analysis and Characterization Studies investigate what models learn and where they fail, probing error patterns and representational properties. Finally, Application and Domain-Specific Systems target specialized contexts like geometry or real-world scenarios, tailoring models to particular mathematical subdomains.

Across these branches, a central tension emerges between scaling general-purpose benchmarks and designing fine-grained diagnostic tools that isolate specific reasoning challenges. Many studies pursue broad coverage to stress-test models on diverse problem types, while others adopt hierarchical or knowledge-driven evaluation frameworks to systematically probe conceptual understanding and visual dependency. The WeMath System[0] sits within the Benchmark Development branch, specifically under Knowledge-Driven and Hierarchical Evaluation Frameworks, alongside neighbors like the Visual Dependency Benchmark[24] and CMMath[33]. Where MathVista[5] and MathVerse[6] provide wide-ranging testbeds, WeMath System[0] emphasizes structured assessment that maps problems to underlying knowledge components, offering a more granular lens on model strengths and weaknesses. This positions it as part of a growing effort to move beyond aggregate accuracy scores toward interpretable, knowledge-grounded evaluation that can guide targeted model improvements.

Claimed Contributions

MathBook Knowledge System

A structured five-level hierarchical framework that systematically organizes mathematical knowledge, covering 491 knowledge points and 1,819 fundamental principles. This system enables comprehensive and systematic mathematical knowledge supervision for training multimodal large language models.

10 retrieved papers · can refute
MathBook-Standard and MathBook-Pro datasets

Two novel datasets: MathBook-Standard provides comprehensive step-wise annotations with dual expansions (multi-images per question and multi-questions per image) for conceptual flexibility, while MathBook-Pro introduces a three-dimensional difficulty modeling framework (step complexity, visual complexity, contextual complexity) that generates seven progressive difficulty variants per problem for structured learning.

10 retrieved papers
MathBook-RL training framework

A two-stage reinforcement learning framework: Cold-Start Fine-Tuning first establishes knowledge-oriented chain-of-thought reasoning, then Progressive Alignment RL combines average-reward learning with dynamic data scheduling strategies (Knowledge Increment Scheduling and Modality Increment Scheduling) to align the model across difficulty levels step by step.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MathBook Knowledge System


Contribution

MathBook-Standard and MathBook-Pro datasets


Contribution

MathBook-RL training framework
