We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mathematical Reasoning, Multimodal Large Language Models
Abstract:

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various tasks but still struggle with complex mathematical reasoning. Prior work has mainly focused on dataset construction and method optimization, while often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. We introduce WE-MATH 2.0, a unified system that integrates a structured mathematical knowledge hierarchy, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to enhance the mathematical reasoning abilities of MLLMs. Our contributions are fourfold: (1) MathBook Knowledge System: a five-level hierarchy covering 491 knowledge points and 1,819 fundamental principles; (2) MathBook-Standard and MathBook-Pro: datasets that ensure broad conceptual coverage and robust training through dual expansion, a three-dimensional difficulty space, and seven progressive variants per problem; (3) MathBook-RL: a two-stage RL framework including Cold-Start Fine-Tuning to align models with knowledge-oriented chain-of-thought reasoning, and Progressive Alignment RL leveraging average-reward learning with dynamic data scheduling for progressive difficulty alignment; (4) MathBookEval: a benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL achieves competitive performance on four widely used benchmarks and demonstrates strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WE-MATH 2.0, a unified system combining a five-level knowledge hierarchy (491 knowledge points, 1,819 principles), dual datasets with progressive difficulty variants, and a two-stage reinforcement learning framework. It resides in the 'Knowledge-Driven and Hierarchical Evaluation Frameworks' leaf alongside three sibling papers. This leaf represents a focused research direction within the broader Benchmark Development branch, emphasizing structured knowledge organization over flat evaluation protocols. The taxonomy contains 50 papers across multiple branches, indicating a moderately populated field with distinct methodological clusters.

The taxonomy reveals that the paper's leaf sits within Benchmark Development and Evaluation, adjacent to leaves covering comprehensive multi-domain benchmarks (five papers), specialized domain benchmarks (six papers), and multi-visual context benchmarks (two papers). Neighboring branches include Model Training and Optimization (with supervised fine-tuning, reinforcement learning, and multimodal pre-training subcategories) and Reasoning Enhancement Techniques (chain-of-thought, visual reasoning, modular architectures). The scope_note for the paper's leaf explicitly includes 'structured knowledge hierarchies' and 'multi-level difficulty modeling,' distinguishing it from flat benchmark construction. This placement suggests the work bridges evaluation and training concerns through its knowledge-driven design.

Of the 30 candidates examined (ten per contribution), the MathBook Knowledge System contribution yields one refutable candidate, indicating that some prior work on hierarchical knowledge structures exists within the limited search scope. The MathBook-Standard/Pro datasets and the MathBook-RL framework each yield zero refutations among their ten candidates, suggesting these contributions may occupy less crowded territory. These statistics reflect a targeted semantic search rather than exhaustive coverage, so the absence of refutations for two contributions does not guarantee novelty; it only indicates limited overlap within the examined candidate pool. The knowledge system's single refutation may indicate an incremental refinement of existing hierarchical approaches rather than a wholly new structure.

Based on the limited search scope of 30 semantically similar papers, the work appears to integrate multiple established research threads—knowledge hierarchies, dataset construction, and reinforcement learning—into a unified system. The contribution-level statistics suggest the training framework and dual-dataset design may be less directly anticipated by prior work than the knowledge hierarchy component. However, the analysis cannot rule out relevant work outside the top-30 semantic matches or in adjacent research communities not captured by the search strategy.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: visual mathematical reasoning in multimodal large language models. The field has organized itself around five major branches that reflect distinct research priorities. Benchmark Development and Evaluation focuses on creating datasets and evaluation protocols to measure model capabilities, often emphasizing knowledge-driven or hierarchical frameworks that test diverse mathematical skills across visual modalities; works like MathVista[5], MathVerse[6], and Math-Vision Dataset[2] exemplify this direction. Model Training and Optimization addresses how to build and refine multimodal architectures, including data curation strategies and training recipes, as seen in Math-LLaVA[3] and MultiMath[4]. Reasoning Enhancement Techniques explores methods to improve step-by-step problem solving, such as chain-of-thought prompting, visual intermediate representations in Visual Sketchpad[7], and reinforcement learning approaches like Vision-r1[1]. Analysis and Characterization Studies investigate what models learn and where they fail, probing error patterns and representational properties. Finally, Application and Domain-Specific Systems target specialized contexts like geometry or real-world scenarios, tailoring models to particular mathematical subdomains.

Across these branches, a central tension emerges between scaling general-purpose benchmarks and designing fine-grained diagnostic tools that isolate specific reasoning challenges. Many studies pursue broad coverage to stress-test models on diverse problem types, while others adopt hierarchical or knowledge-driven evaluation frameworks to systematically probe conceptual understanding and visual dependency. The WeMath System[0] sits within the Benchmark Development branch, specifically under Knowledge-Driven and Hierarchical Evaluation Frameworks, alongside neighbors like the Visual Dependency Benchmark[24] and CMMath[33]. Where MathVista[5] and MathVerse[6] provide wide-ranging testbeds, WeMath System[0] emphasizes structured assessment that maps problems to underlying knowledge components, offering a more granular lens on model strengths and weaknesses. This positions it as part of a growing effort to move beyond aggregate accuracy scores toward interpretable, knowledge-grounded evaluation that can guide targeted model improvements.

Claimed Contributions

MathBook Knowledge System

A structured five-level hierarchical framework that systematically organizes mathematical knowledge, covering 491 knowledge points and 1,819 fundamental principles. This system enables comprehensive and systematic mathematical knowledge supervision for training multimodal large language models.

10 retrieved papers · can refute
MathBook-Standard and MathBook-Pro datasets

Two novel datasets: MathBook-Standard provides comprehensive step-wise annotations with dual expansions (multi-images per question and multi-questions per image) for conceptual flexibility, while MathBook-Pro introduces a three-dimensional difficulty modeling framework (step complexity, visual complexity, contextual complexity) that generates seven progressive difficulty variants per problem for structured learning.

10 retrieved papers
MathBook-RL training framework

A two-stage reinforcement learning framework: Cold-Start Fine-Tuning first establishes knowledge-oriented chain-of-thought reasoning, then Progressive Alignment RL combines average-reward learning with dynamic data scheduling strategies (Knowledge Increment Scheduling and Modality Increment Scheduling) to align the model across difficulty levels step by step.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MathBook Knowledge System


Contribution

MathBook-Standard and MathBook-Pro datasets


Contribution

MathBook-RL training framework
