Reformulation for Pretraining Data Augmentation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Data Augmentation, Synthetic Pretraining Data
Abstract:

Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we introduce the Massive Genre-Audience (MGA) reformulation method, a framework designed to augment corpora in a way that supports more effective scaling of model performance. Instead of relying on complex, predefined seed systems, MGA systematically reformulates existing corpora into diverse, contextually rich variations by adaptively generating genre-audience pairs. We present this framework and the resulting 770 billion token MGACorpus, created as a practical instantiation of our methodology. We experimentally validate MGA's core benefits by demonstrating superior scaling properties, in terms of both model size and data budget, against data repetition and upsampling baselines for models of up to 13B parameters. Furthermore, our comprehensive analysis investigates the role of synthesis principles in generation quality and reveals nuances in evaluating model capabilities with standard loss metrics. Our work shows that a systematic framework like MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Massive Genre-Audience (MGA) reformulation framework to augment pretraining corpora by generating diverse genre-audience variations, producing a 770 billion token MGACorpus. Within the taxonomy, it resides in the 'Genre-Audience and Multi-Document Reformulation' leaf under 'Systematic Reformulation Frameworks for Pretraining'. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The sibling paper focuses on multi-document paraphrasing, suggesting this area is still emerging rather than crowded.

The taxonomy reveals that neighboring leaves address related but distinct approaches: 'Synthetic Continued Pretraining' synthesizes domain-specific corpora from small documents, while 'Web Data Recycling and Quality Enhancement' focuses on filtering and rewriting web-crawled data. The broader 'Systematic Reformulation Frameworks' branch contrasts with 'Application-Driven Reformulation', which targets downstream tasks rather than pretraining. The MGA framework's emphasis on adaptive genre-audience generation distinguishes it from fixed paraphrasing schemes in the 'Paraphrasing Techniques' branch, positioning it as a structured, pretraining-focused methodology.

Among the 24 candidates examined across three contributions, the MGA reformulation framework (Contribution A) has one refutable candidate out of seven examined, suggesting some overlap with prior work within the limited search scope. For the MGACorpus dataset (Contribution B) and the Limited Consistency principle (Contribution C), 10 and 7 candidates were examined respectively, with zero refutable matches, indicating that these contributions appear more novel within the examined literature. These statistics reflect a focused semantic search rather than exhaustive coverage, so unexamined work may exist beyond the top-K matches.

Based on the limited search scope of 24 candidates, the framework's core novelty appears moderate given one overlapping prior work, while the dataset and synthesis principle show stronger novelty signals. The sparse taxonomy leaf (two papers) and absence of extensive prior work in genre-audience reformulation suggest this direction is relatively unexplored. However, the analysis covers top semantic matches and immediate citations, not the full field, so definitive novelty claims require broader literature review.

Taxonomy

47 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Pretraining data augmentation through text reformulation. The field explores how systematically rewriting or paraphrasing text can enrich pretraining corpora and improve downstream model performance. The taxonomy reveals several main branches:

- Systematic Reformulation Frameworks for Pretraining develop principled methods for generating diverse textual variants at scale, often targeting genre shifts or multi-document synthesis;
- Multimodal Reformulation extends these ideas to vision-language settings, where caption paraphrasing or image-text alignment benefits from textual diversity;
- Application-Driven Reformulation tailors augmentation strategies to specific downstream tasks such as question answering or sentiment analysis;
- Paraphrasing Techniques and Foundations provide the algorithmic and evaluation underpinnings, including neural paraphrase generation and identification;
- Meta-Learning and Optimization for Augmentation investigates how to learn augmentation policies themselves;
- Specialized Pretraining Contexts address domain-specific needs such as clinical text or low-resource languages.

Representative works such as Pretraining via Paraphrasing[5] and Synthetic Continued Pretraining[16] illustrate how reformulation can be integrated into the pretraining pipeline, while CLIP Language Rewrites[1] and Paraphrasing Image Captioning[9] show multimodal applications. A particularly active line of work leverages large language models to generate high-quality paraphrases or style-shifted variants for pretraining, as seen in LLM Text-Pair Augmentation[3] and LLM Style Transfer[21], which trade off generation cost against data diversity. A contrasting direction emphasizes lightweight, rule-based or model-driven paraphrasing that can scale to massive corpora without expensive LLM calls, exemplified by Recycling Web Pretraining[4] and DAIL Self-Paraphrase[2].
The original paper, Reformulation Pretraining Augmentation[0], sits within the Systematic Reformulation Frameworks branch, specifically targeting genre-audience and multi-document reformulation. Its emphasis on structured, multi-document synthesis aligns it closely with Pretraining via Paraphrasing[5], which also explores paraphrase-driven pretraining, but Reformulation Pretraining Augmentation[0] appears to push further into controlled genre and audience adaptation. This positions it as a bridge between foundational paraphrasing techniques and the emerging trend of using LLMs for targeted, high-fidelity text transformation in pretraining contexts.

Claimed Contributions

MGA reformulation framework for corpus augmentation

The authors propose a systematic two-stage framework that reformulates existing text corpora by adaptively generating diverse genre-audience pairs, avoiding complex seed systems and using lightweight models. This framework addresses data scarcity and repetition issues in LLM pretraining by creating contextually rich variations of source documents.

7 retrieved papers
Can Refute
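The two-stage process claimed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_pairs` and `reformulate` are hypothetical stand-ins for calls to a lightweight instruction-tuned model, and the genre and audience lists are invented examples.

```python
# Illustrative sketch of a two-stage genre-audience reformulation pipeline.
# In a real system both stages would call a small LLM; here they are
# deterministic stubs so the control flow is visible.

def generate_pairs(document: str, n_pairs: int = 5) -> list[tuple[str, str]]:
    """Stage 1: adaptively propose (genre, audience) pairs for a document.

    A real implementation would prompt a model with the document and ask
    for pairs suited to its content; this stub cycles fixed examples.
    """
    genres = ["tutorial", "dialogue", "news brief", "exam question", "story"]
    audiences = ["children", "undergraduates", "domain experts",
                 "casual readers", "practitioners"]
    return [(genres[i % len(genres)], audiences[i % len(audiences)])
            for i in range(n_pairs)]

def reformulate(document: str, genre: str, audience: str) -> str:
    """Stage 2: rewrite the document in the given genre for the audience.

    A real implementation would send a reformulation prompt to the model;
    this stub just tags the source text.
    """
    return f"[{genre} for {audience}]\n{document}"

def mga_augment(document: str, n_pairs: int = 5) -> list[str]:
    """Expand one source document into n_pairs reformulated variants."""
    return [reformulate(document, genre, audience)
            for genre, audience in generate_pairs(document, n_pairs)]

variants = mga_augment("Photosynthesis converts light energy into chemical energy.")
```

With five pairs per document, each source document yields five stylistic variants, which is how a reformulation scheme of this shape multiplies corpus size without new source data.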
MGACorpus dataset

The authors release a 770 billion token dataset generated by applying their MGA framework to reformulate the fineweb-edu-dedup corpus, achieving a 3.9x token expansion. This dataset serves as a concrete validation of their methodology and will be made publicly available.

10 retrieved papers
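A quick sanity check on the reported numbers: assuming the 770 billion tokens are the reformulated output and the 3.9x factor is output over source, the source corpus works out to roughly 197 billion tokens. Whether the reported size also counts the source tokens is not stated here, so this is only an implied estimate.

```python
# Back-of-the-envelope check of the reported expansion factor.
# Assumes 770B is the reformulated output measured against the source.

mga_tokens = 770e9          # reported MGACorpus size
expansion_factor = 3.9      # reported token expansion over the source

implied_source_tokens = mga_tokens / expansion_factor
print(f"Implied source corpus: ~{implied_source_tokens / 1e9:.0f}B tokens")
```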
Limited Consistency principle for data synthesis

The authors introduce a design principle that balances generating diverse stylistic variations (variance) while preserving factual accuracy (invariance) in reformulated text. This principle is implemented through careful prompt engineering and guides the entire reformulation process to avoid both excessive repetition and factual degradation.

7 retrieved papers
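One way to operationalize such a variance/invariance balance is sketched below. The proxies and thresholds are hypothetical, not the authors' implementation: stylistic variance is approximated by low word n-gram overlap with the source, and factual invariance by recall of numbers and capitalized terms.

```python
# Illustrative check for a variance/invariance balance: a reformulation
# should differ stylistically from the source (low n-gram overlap) while
# preserving key facts (here proxied by shared numbers and capitalized
# tokens). All proxies and thresholds are invented for illustration.
import re

def word_ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams, lowercased, for a crude surface-overlap measure."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def key_facts(text: str) -> set:
    """Crude fact invariants: numbers and capitalized tokens (names, places)."""
    return set(re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", text))

def limited_consistency_ok(source: str, variant: str,
                           max_overlap: float = 0.5,
                           min_fact_recall: float = 0.8) -> bool:
    """True if the variant is stylistically novel yet factually faithful."""
    src_ngrams, var_ngrams = word_ngrams(source), word_ngrams(variant)
    overlap = len(src_ngrams & var_ngrams) / max(len(var_ngrams), 1)
    facts = key_facts(source)
    recall = len(facts & key_facts(variant)) / max(len(facts), 1)
    return overlap <= max_overlap and recall >= min_fact_recall
```

Under these proxies, a rewrite that keeps the names and numbers but changes the phrasing passes, while an unchanged copy fails on the overlap side, matching the principle's stated goal of avoiding both factual degradation and excessive repetition.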

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: MGA reformulation framework for corpus augmentation

Contribution: MGACorpus dataset

Contribution: Limited Consistency principle for data synthesis

Descriptions of each contribution are given under Claimed Contributions above.