RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Pretraining, Synthetic Data
Abstract:

High-quality data is a cornerstone of large language model (LLM) pretraining, yet its growth has not kept pace with the needs of frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4× larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3×. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness organic pretraining data. Our anonymized code is available at https://anonymous.4open.science/r/RePro. We will open-source our rephraser and recycled data.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RePro, a reinforcement learning-based method for generating high-quality rephrasings of web pretraining data using a 4B parameter model. It resides in the 'Semantic Recycling via Rephrasing' leaf of the taxonomy, which contains only two papers total: RePro itself and one sibling work ('Rephrasing the Web'). This represents a sparse, emerging research direction within the broader data recycling landscape, suggesting the approach addresses a relatively underexplored strategy for augmenting pretraining corpora through controlled paraphrasing rather than collecting new data or applying static filtering heuristics.

The taxonomy reveals that semantic recycling sits alongside two other data reuse strategies: embedding/component recycling (which reuses learned representations) and cross-dataset aggregation (which combines existing corpora via deduplication). Neighboring branches include model-based quality scoring and curriculum scheduling under 'Data Selection and Filtering', which focus on identifying valuable subsets rather than generating new text. The scope note for semantic recycling explicitly excludes embedding reuse and continued pretraining from checkpoints, positioning RePro's text generation approach as distinct from methods that manipulate model components or training schedules without producing new linguistic variations.

Among the 24 candidates examined across three contributions, no clearly refuting prior work was identified. The core RePro method examined 4 candidates with no refutations; the reward design contribution examined 10 candidates with none refuting; and the data efficiency demonstration also examined 10 candidates without refutation. This suggests that within the limited search scope—focused on top semantic matches and citation expansion—the specific combination of RL-based rephraser training, multi-faceted reward design (quality plus three faithfulness rewards), and empirical validation of 2-3× data efficiency gains appears not to have direct precedent in the examined literature.

The analysis reflects a targeted search of 24 papers rather than exhaustive coverage of all paraphrasing or data augmentation research. The sparse taxonomy leaf (only one sibling paper) and absence of refuting candidates among examined works suggest RePro occupies a relatively novel position within the specific framing of web data recycling for LLM pretraining. However, the limited search scope means broader connections to general paraphrasing, data augmentation, or synthetic data generation outside this taxonomy may not be fully captured.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: web data recycling for language model pretraining. The field addresses how to maximize the utility of web-scale corpora when training large language models, particularly as high-quality data becomes scarce. The taxonomy reveals several complementary branches: Data Recycling and Reuse Methods explore techniques for revisiting existing data through rephrasing or multi-epoch strategies; Data Selection and Filtering focuses on identifying high-value subsets from noisy web crawls (e.g., DoReMi[2], FineWeb[7]); Web Corpus Construction and Curation addresses the engineering of large-scale datasets like RefinedWeb[11]; Pretraining Data Characteristics and Impact examines how data properties influence model behavior; Specialized Pretraining Paradigms adapt web data for domain-specific needs such as mathematics (Llemma[5]) or robotics (RT-2[16]); and Cross-Domain Applications, Environmental Considerations, and Peripheral topics round out the landscape. These branches collectively reflect a shift from simply scaling data volume to strategically curating and reusing what is already available.

Within this ecosystem, semantic recycling via rephrasing has emerged as a particularly active direction, seeking to generate synthetic variations of web text to effectively multiply training data without new crawls. RePro[0] sits squarely in this branch alongside Rephrasing the Web[22], both exploring how paraphrasing or reformulation can yield fresh training signal from existing corpora. This contrasts with approaches like Photon[3], which emphasizes curriculum-based reuse of the same data across training stages, or Embedding Recycling[1], which focuses on reusing learned representations rather than text itself. A key open question is whether semantic transformations genuinely provide new information or merely reinforce existing patterns. RePro[0] and similar works must balance the cost of generating rephrased data against potential gains in model robustness and generalization, situating them at the intersection of data efficiency and synthetic augmentation strategies.

Claimed Contributions

RePro web recycling method using RL-trained rephraser

The authors introduce RePro, a method that trains a small language model (4B parameters) using reinforcement learning to rephrase pretraining data. This approach recycles web data to increase the amount of high-quality pretraining data while maintaining semantic meaning and structural characteristics of the original content.

4 retrieved papers
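To make the contribution concrete, the following is a minimal, illustrative sketch of the kind of policy-gradient loop such a rephraser could be trained with. All names (`ToyRephraser`, `train_rephraser`, `reward_fn`) and the REINFORCE-with-baseline setup are assumptions for illustration; the paper's actual 4B-model training, RL algorithm, and batching are not specified here.

```python
import random

class ToyRephraser:
    """Minimal stand-in exposing the interface the loop assumes."""
    def sample(self, doc):
        # Return a (rephrasing, log-probability) pair; a real model
        # would decode from its policy here.
        return doc.upper(), -1.0

    def update(self, logprob, advantage, lr):
        # Placeholder: a real model would take a policy-gradient step
        # scaled by `advantage` here.
        pass

def train_rephraser(rephraser, reward_fn, organic_docs, steps=100, lr=0.01):
    """REINFORCE-style loop: sample a rephrasing of an organic document,
    score it with a composite reward, and reinforce above-baseline outputs."""
    baseline = 0.0  # running mean reward for variance reduction
    for _ in range(steps):
        doc = random.choice(organic_docs)
        rephrasing, logprob = rephraser.sample(doc)
        r = reward_fn(doc, rephrasing)            # quality + faithfulness score
        advantage = r - baseline
        rephraser.update(logprob, advantage, lr)
        baseline = 0.9 * baseline + 0.1 * r       # exponential moving average
    return rephraser
```

The baseline subtraction is a standard variance-reduction choice in policy-gradient methods, not something attributed to the paper.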
Quality and faithfulness reward design for rephraser optimization

The authors design a reward system consisting of one quality reward (DataMan score) and three faithfulness rewards (BERTScore for semantics, structure preservation, and length alignment). These rewards guide the rephraser to produce high-quality rephrasings while faithfully preserving the core characteristics of organic data.

10 retrieved papers
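A hedged sketch of how the described four-part reward could be combined. The weights, the word-count length proxy, and the paragraph-count structure proxy are illustrative assumptions; in the paper the quality score comes from DataMan and the semantic reward from BERTScore, which are taken here as plain numeric inputs.

```python
def length_alignment(original: str, rephrased: str) -> float:
    """Reward closeness in length, using word counts as a simple proxy."""
    lo, lr = len(original.split()), len(rephrased.split())
    return min(lo, lr) / max(lo, lr) if max(lo, lr) > 0 else 0.0

def structure_preservation(original: str, rephrased: str) -> float:
    """Toy proxy: compare paragraph counts as a stand-in for the
    paper's structure-preservation reward."""
    po = max(1, original.count("\n\n") + 1)
    pr = max(1, rephrased.count("\n\n") + 1)
    return min(po, pr) / max(po, pr)

def composite_reward(quality: float, semantic_sim: float,
                     original: str, rephrased: str,
                     weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of one quality reward and three faithfulness rewards.
    `quality` would come from a scorer such as DataMan and `semantic_sim`
    from BERTScore; both are passed in precomputed here."""
    wq, ws, wt, wl = weights
    return (wq * quality
            + ws * semantic_sim
            + wt * structure_preservation(original, rephrased)
            + wl * length_alignment(original, rephrased))
```

With inputs in [0, 1] and weights summing to 1, the composite reward stays in [0, 1], which keeps the scale stable for RL training.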
Demonstration of improved data efficiency and faithful preservation

The authors demonstrate that RePro achieves superior performance compared to existing methods while using a much smaller model, improving organic data efficiency by 2-3 times. Their analyses validate that the method preserves critical information and faithfully reflects the characteristics of organic data better than prompting-based approaches.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
