Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mathematical Reasoning · Web-Scale Data Curation · LLM-Based Cleaning · Pretraining Datasets · Deduplication
Abstract:

Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collect two large, high-quality math corpora: Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5× more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields gains of +4.8 to +12.6 on MATH and +4.6 to +14.3 on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Nemotron-CC-Math, a large-scale mathematical corpus extracted from Common Crawl using a novel pipeline that emphasizes layout-aware rendering and LLM-based cleaning. It resides in the 'General Mathematical Pretraining Corpora from Web Sources' leaf, which contains five papers total including this one. This leaf sits within the broader 'Mathematical Dataset Construction and Curation' branch, indicating a moderately populated research direction focused on building web-scale mathematical datasets for language model pretraining.

The taxonomy reveals neighboring work in reinforcement learning datasets and evaluation benchmarks, both sibling categories under dataset construction. The paper's leaf is distinguished by its focus on pretraining corpora rather than problem-solution pairs or evaluation sets. Related extraction techniques appear in the 'Mathematical Content Extraction and Entity Recognition' branch, which addresses formula and concept extraction but excludes full dataset construction pipelines. The scope notes clarify that this work targets general pretraining rather than domain-specific scientific databases or formal proof systems.

Among 22 candidates examined across three contributions, no clearly refutable prior work was identified. The novel pipeline contribution examined 8 candidates with 0 refutations, the Lynx-based conversion approach examined 4 candidates with 0 refutations, and the dataset itself examined 10 candidates with 0 refutations. This limited search scope suggests that within the top-K semantic matches and citation expansion, no single prior work directly overlaps with the combination of layout-aware rendering, LLM-based standardization, and the resulting corpus scale.

Based on examination of 22 candidates, the work appears to occupy a distinct position combining pipeline methodology with corpus scale. The analysis covers semantically proximate papers but does not constitute exhaustive coverage of all web-scale mathematical extraction efforts. The absence of refutable candidates within this scope suggests differentiation from immediate neighbors, though broader field coverage would strengthen confidence in novelty claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: Extracting high-quality mathematical content from web-scale data. This field addresses the challenge of identifying, filtering, and curating mathematical material from the vast and noisy expanse of the internet. The taxonomy reveals a multifaceted landscape organized around several complementary branches. Mathematical Dataset Construction and Curation focuses on building large-scale pretraining corpora from web sources, exemplified by efforts like OpenWebMath[45] and Infimm-webmath[7], which aggregate mathematical text at scale. Mathematical Content Extraction and Entity Recognition targets the identification of formulas, symbols, and structured mathematical entities within documents. Mathematical Information Retrieval and Search Systems develops specialized engines for querying mathematical knowledge, while Mathematical Knowledge Representation and Formalization works on encoding this content in machine-readable formats. Mathematical Reasoning and Model Capabilities examines how models leverage these datasets, and Domain-Specific Data Mining explores applications in education, finance, and scientific domains. General Data Mining Theory and Infrastructure provides foundational techniques, and Auxiliary Methods brings in cross-domain tools for quality assessment and filtering.

A particularly active line of work centers on constructing general mathematical pretraining corpora from web sources, where researchers grapple with trade-offs between scale, quality, and domain coverage. Some efforts prioritize breadth by harvesting diverse web pages containing LaTeX or MathML, while others emphasize rigorous filtering to ensure correctness and pedagogical value. Nemotron-CC-Math[0] situates itself within this corpus-building cluster, sharing goals with neighbors like RedStone[3], which also targets web-scale mathematical data extraction, and OpenWebMath[45], an earlier large-scale effort.
Compared to these works, Nemotron-CC-Math[0] appears to emphasize refined curation strategies that balance volume with content quality, addressing the perennial challenge of separating high-value mathematical exposition from incidental or low-quality mentions. This positioning reflects broader tensions in the field: whether to cast a wide net and rely on downstream model robustness, or to invest heavily in upfront filtering and verification.

Claimed Contributions

Novel domain-agnostic pipeline for robust scientific text extraction from Common Crawl

The authors propose a modular extraction pipeline that combines layout-aware rendering using Lynx with LLM-based cleaning to preserve mathematical equations, code blocks, and structural integrity while removing boilerplate and standardizing notation into LaTeX format.

8 retrieved papers
First use of the Lynx text-based browser and LLM-based standardization for HTML-to-text conversion

The authors introduce a two-stage approach where Lynx renders HTML pages to preserve mathematical and code structure, followed by an LLM cleanup pass that standardizes diverse mathematical representations into consistent LaTeX format while removing non-essential content.

4 retrieved papers
Nemotron-CC-Math dataset: a 133B-token high-quality mathematical corpus

The authors construct and release Nemotron-CC-Math, comprising 133B tokens (Nemotron-CC-Math-3+) with a highest-quality subset of 52B tokens (Nemotron-CC-Math-4+), which is substantially larger than existing open math pretraining datasets and demonstrates measurable improvements in math, code, and general reasoning tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel domain-agnostic pipeline for robust scientific text extraction from Common Crawl

The authors propose a modular extraction pipeline that combines layout-aware rendering using Lynx with LLM-based cleaning to preserve mathematical equations, code blocks, and structural integrity while removing boilerplate and standardizing notation into LaTeX format.

Contribution

First use of the Lynx text-based browser and LLM-based standardization for HTML-to-text conversion

The authors introduce a two-stage approach where Lynx renders HTML pages to preserve mathematical and code structure, followed by an LLM cleanup pass that standardizes diverse mathematical representations into consistent LaTeX format while removing non-essential content.
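The two-stage flow described above can be illustrated with a minimal sketch. The report gives no implementation details, so everything here is an assumption: the `lynx -dump -nolist` invocation is one plausible way to do layout-preserving rendering, and the regex-based `standardize_math` helper is a toy stand-in for the actual LLM cleanup pass (which the paper performs with a language model, not regexes).

```python
import re
import shutil
import subprocess
import tempfile

def html_to_text(html: str) -> str:
    """Stage one (assumed invocation): render HTML to layout-preserving
    plain text with the lynx browser; fall back to a crude tag-strip
    when the lynx binary is not installed."""
    if shutil.which("lynx"):
        with tempfile.NamedTemporaryFile(
            "w", suffix=".html", delete=False
        ) as f:
            f.write(html)
            path = f.name
        result = subprocess.run(
            ["lynx", "-dump", "-nolist", path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    return re.sub(r"<[^>]+>", " ", html)

def standardize_math(text: str) -> str:
    r"""Toy stand-in for stage two (the LLM cleanup pass): map
    MathJax-style \( .. \) and \[ .. \] delimiters to the
    $ .. $ / $$ .. $$ forms for a consistent LaTeX representation."""
    text = re.sub(r"\\\((.+?)\\\)", r"$\1$", text, flags=re.S)
    text = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", text, flags=re.S)
    return text
```

The real pipeline would additionally remove boilerplate and repair inconsistent notation, tasks for which the LLM pass, rather than fixed rules, is the point of the design.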

Contribution

Nemotron-CC-Math dataset: a 133B-token high-quality mathematical corpus

The authors construct and release Nemotron-CC-Math, comprising 133B tokens (Nemotron-CC-Math-3+) with a highest-quality subset of 52B tokens (Nemotron-CC-Math-4+), which is substantially larger than existing open math pretraining datasets and demonstrates measurable improvements in math, code, and general reasoning tasks.
