Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mathematical Reasoning · Web-Scale Data Curation · LLM-Based Cleaning · Pretraining Datasets · Deduplication
Abstract:

Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collect two large, high-quality math corpora: Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5× more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields gains of +4.8 to +12.6 on MATH and +4.6 to +14.3 on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Nemotron-CC-Math, a large-scale mathematical corpus extracted from Common Crawl using a novel pipeline that emphasizes layout-aware rendering and LLM-based cleaning. It resides in the 'General Mathematical Pretraining Corpora from Web Sources' leaf, which contains five papers total including this one. This leaf sits within the broader 'Mathematical Dataset Construction and Curation' branch, indicating a moderately populated research direction focused on building web-scale mathematical datasets for language model pretraining.

The taxonomy reveals neighboring work in reinforcement learning datasets and evaluation benchmarks, both sibling categories under dataset construction. The paper's leaf is distinguished by its focus on pretraining corpora rather than problem-solution pairs or evaluation sets. Related extraction techniques appear in the 'Mathematical Content Extraction and Entity Recognition' branch, which addresses formula and concept extraction but excludes full dataset construction pipelines. The scope notes clarify that this work targets general pretraining rather than domain-specific scientific databases or formal proof systems.

Among 22 candidates examined across three contributions, no clearly refutable prior work was identified. The novel pipeline contribution examined 8 candidates with 0 refutations, the Lynx-based conversion approach examined 4 candidates with 0 refutations, and the dataset itself examined 10 candidates with 0 refutations. This limited search scope suggests that within the top-K semantic matches and citation expansion, no single prior work directly overlaps with the combination of layout-aware rendering, LLM-based standardization, and the resulting corpus scale.

Based on examination of 22 candidates, the work appears to occupy a distinct position combining pipeline methodology with corpus scale. The analysis covers semantically proximate papers but does not constitute exhaustive coverage of all web-scale mathematical extraction efforts. The absence of refutable candidates within this scope suggests differentiation from immediate neighbors, though broader field coverage would strengthen confidence in novelty claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: Extracting high-quality mathematical content from web-scale data. This field addresses the challenge of identifying, filtering, and curating mathematical material from the vast and noisy expanse of the internet. The taxonomy reveals a multifaceted landscape organized around several complementary branches. Mathematical Dataset Construction and Curation focuses on building large-scale pretraining corpora from web sources, exemplified by efforts like OpenWebMath[45] and Infimm-webmath[7], which aggregate mathematical text at scale. Mathematical Content Extraction and Entity Recognition targets the identification of formulas, symbols, and structured mathematical entities within documents. Mathematical Information Retrieval and Search Systems develops specialized engines for querying mathematical knowledge, while Mathematical Knowledge Representation and Formalization works on encoding this content in machine-readable formats. Mathematical Reasoning and Model Capabilities examines how models leverage these datasets, and Domain-Specific Data Mining explores applications in education, finance, and scientific domains. General Data Mining Theory and Infrastructure provides foundational techniques, and Auxiliary Methods brings in cross-domain tools for quality assessment and filtering.

A particularly active line of work centers on constructing general mathematical pretraining corpora from web sources, where researchers grapple with trade-offs between scale, quality, and domain coverage. Some efforts prioritize breadth by harvesting diverse web pages containing LaTeX or MathML, while others emphasize rigorous filtering to ensure correctness and pedagogical value. Nemotron-CC-Math[0] situates itself within this corpus-building cluster, sharing goals with neighbors like RedStone[3], which also targets web-scale mathematical data extraction, and OpenWebMath[45], an earlier large-scale effort.
Compared to these works, Nemotron-CC-Math[0] appears to emphasize refined curation strategies that balance volume with content quality, addressing the perennial challenge of separating high-value mathematical exposition from incidental or low-quality mentions. This positioning reflects broader tensions in the field: whether to cast a wide net and rely on downstream model robustness, or to invest heavily in upfront filtering and verification.

Claimed Contributions

Novel domain-agnostic pipeline for robust scientific text extraction from Common Crawl

The authors propose a modular extraction pipeline that combines layout-aware rendering using Lynx with LLM-based cleaning to preserve mathematical equations, code blocks, and structural integrity while removing boilerplate and standardizing notation into LaTeX format.

8 retrieved papers
First use of the Lynx text-based browser and LLM-based standardization for HTML-to-text conversion

The authors introduce a two-stage approach where Lynx renders HTML pages to preserve mathematical and code structure, followed by an LLM cleanup pass that standardizes diverse mathematical representations into consistent LaTeX format while removing non-essential content.

4 retrieved papers
Nemotron-CC-Math dataset: a 133B-token high-quality mathematical corpus

The authors construct and release Nemotron-CC-Math, comprising 133B tokens (Nemotron-CC-Math-3+) with a highest-quality subset of 52B tokens (Nemotron-CC-Math-4+), which is substantially larger than existing open math pretraining datasets and demonstrates measurable improvements in math, code, and general reasoning tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel domain-agnostic pipeline for robust scientific text extraction from Common Crawl

The authors propose a modular extraction pipeline that combines layout-aware rendering using Lynx with LLM-based cleaning to preserve mathematical equations, code blocks, and structural integrity while removing boilerplate and standardizing notation into LaTeX format.

Contribution

First use of the Lynx text-based browser and LLM-based standardization for HTML-to-text conversion

The authors introduce a two-stage approach where Lynx renders HTML pages to preserve mathematical and code structure, followed by an LLM cleanup pass that standardizes diverse mathematical representations into consistent LaTeX format while removing non-essential content.
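The two-stage flow described above can be illustrated with a minimal sketch. The report gives no implementation details, so everything here is an assumption: the `lynx -dump -nolist` invocation is one plausible way to do layout-preserving rendering, and the regex-based `standardize_math` helper is a toy stand-in for the actual LLM cleanup pass (which the paper performs with a language model, not regexes).

```python
import re
import shutil
import subprocess
import tempfile

def html_to_text(html: str) -> str:
    """Stage one (assumed invocation): render HTML to layout-preserving
    plain text with the lynx browser; fall back to a crude tag-strip
    when the lynx binary is not installed."""
    if shutil.which("lynx"):
        with tempfile.NamedTemporaryFile(
            "w", suffix=".html", delete=False
        ) as f:
            f.write(html)
            path = f.name
        result = subprocess.run(
            ["lynx", "-dump", "-nolist", path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    return re.sub(r"<[^>]+>", " ", html)

def standardize_math(text: str) -> str:
    r"""Toy stand-in for stage two (the LLM cleanup pass): map
    MathJax-style \( .. \) and \[ .. \] delimiters to the
    $ .. $ / $$ .. $$ forms for a consistent LaTeX representation."""
    text = re.sub(r"\\\((.+?)\\\)", r"$\1$", text, flags=re.S)
    text = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", text, flags=re.S)
    return text
```

The real pipeline would additionally remove boilerplate and repair inconsistent notation, tasks for which the LLM pass, rather than fixed rules, is the point of the design.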

Contribution

Nemotron-CC-Math dataset: a 133B-token high-quality mathematical corpus

The authors construct and release Nemotron-CC-Math, comprising 133B tokens (Nemotron-CC-Math-3+) with a highest-quality subset of 52B tokens (Nemotron-CC-Math-4+), which is substantially larger than existing open math pretraining datasets and demonstrates measurable improvements in math, code, and general reasoning tasks.
