RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Pretraining, Synthetic Data
Abstract:

High-quality data is a cornerstone of large language model (LLM) pretraining, yet its growth has not kept pace with the needs of frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4× larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3×. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness organic pretraining data. Our anonymized code is available at https://anonymous.4open.science/r/RePro. We will open-source our rephraser and recycled data.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RePro, a reinforcement learning-based method for generating high-quality rephrasings of web pretraining data using a 4B parameter model. It resides in the 'Semantic Recycling via Rephrasing' leaf of the taxonomy, which contains only two papers total: RePro itself and one sibling work ('Rephrasing the Web'). This represents a sparse, emerging research direction within the broader data recycling landscape, suggesting the approach addresses a relatively underexplored strategy for augmenting pretraining corpora through controlled paraphrasing rather than collecting new data or applying static filtering heuristics.

The taxonomy reveals that semantic recycling sits alongside two other data reuse strategies: embedding/component recycling (which reuses learned representations) and cross-dataset aggregation (which combines existing corpora via deduplication). Neighboring branches include model-based quality scoring and curriculum scheduling under 'Data Selection and Filtering', which focus on identifying valuable subsets rather than generating new text. The scope note for semantic recycling explicitly excludes embedding reuse and continued pretraining from checkpoints, positioning RePro's text generation approach as distinct from methods that manipulate model components or training schedules without producing new linguistic variations.

Among the 24 candidates examined across three contributions, no clearly refuting prior work was identified. The core RePro method examined 4 candidates with no refutations; the reward design contribution examined 10 candidates with none refuting; and the data efficiency demonstration also examined 10 candidates without refutation. This suggests that within the limited search scope—focused on top semantic matches and citation expansion—the specific combination of RL-based rephraser training, multi-faceted reward design (quality plus three faithfulness rewards), and empirical validation of 2-3× data efficiency gains appears not to have direct precedent in the examined literature.

The analysis reflects a targeted search of 24 papers rather than exhaustive coverage of all paraphrasing or data augmentation research. The sparse taxonomy leaf (only one sibling paper) and absence of refuting candidates among examined works suggest RePro occupies a relatively novel position within the specific framing of web data recycling for LLM pretraining. However, the limited search scope means broader connections to general paraphrasing, data augmentation, or synthetic data generation outside this taxonomy may not be fully captured.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: web data recycling for language model pretraining. The field addresses how to maximize the utility of web-scale corpora when training large language models, particularly as high-quality data becomes scarce. The taxonomy reveals several complementary branches: Data Recycling and Reuse Methods explore techniques for revisiting existing data through rephrasing or multi-epoch strategies; Data Selection and Filtering focuses on identifying high-value subsets from noisy web crawls (e.g., DoReMi[2], FineWeb[7]); Web Corpus Construction and Curation addresses the engineering of large-scale datasets like RefinedWeb[11]; Pretraining Data Characteristics and Impact examines how data properties influence model behavior; Specialized Pretraining Paradigms adapt web data for domain-specific needs such as mathematics (Llemma[5]) or robotics (RT-2[16]); and Cross-Domain Applications, Environmental Considerations, and Peripheral topics round out the landscape. These branches collectively reflect a shift from simply scaling data volume to strategically curating and reusing what is already available.

Within this ecosystem, semantic recycling via rephrasing has emerged as a particularly active direction, seeking to generate synthetic variations of web text to effectively multiply training data without new crawls. RePro[0] sits squarely in this branch alongside Rephrasing the Web[22], both exploring how paraphrasing or reformulation can yield fresh training signal from existing corpora. This contrasts with approaches like Photon[3], which emphasizes curriculum-based reuse of the same data across training stages, or Embedding Recycling[1], which focuses on reusing learned representations rather than text itself. A key open question is whether semantic transformations genuinely provide new information or merely reinforce existing patterns. RePro[0] and similar works must balance the cost of generating rephrased data against potential gains in model robustness and generalization, situating them at the intersection of data efficiency and synthetic augmentation strategies.

Claimed Contributions

RePro web recycling method using RL-trained rephraser

The authors introduce RePro, a method that trains a small language model (4B parameters) using reinforcement learning to rephrase pretraining data. This approach recycles web data to increase the amount of high-quality pretraining data while maintaining semantic meaning and structural characteristics of the original content.

4 retrieved papers
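To make the contribution concrete, the following is a minimal, illustrative sketch of the kind of policy-gradient loop such a rephraser could be trained with. All names (`ToyRephraser`, `train_rephraser`, `reward_fn`) and the REINFORCE-with-baseline setup are assumptions for illustration; the paper's actual 4B-model training, RL algorithm, and batching are not specified here.

```python
import random

class ToyRephraser:
    """Minimal stand-in exposing the interface the loop assumes."""
    def sample(self, doc):
        # Return a (rephrasing, log-probability) pair; a real model
        # would decode from its policy here.
        return doc.upper(), -1.0

    def update(self, logprob, advantage, lr):
        # Placeholder: a real model would take a policy-gradient step
        # scaled by `advantage` here.
        pass

def train_rephraser(rephraser, reward_fn, organic_docs, steps=100, lr=0.01):
    """REINFORCE-style loop: sample a rephrasing of an organic document,
    score it with a composite reward, and reinforce above-baseline outputs."""
    baseline = 0.0  # running mean reward for variance reduction
    for _ in range(steps):
        doc = random.choice(organic_docs)
        rephrasing, logprob = rephraser.sample(doc)
        r = reward_fn(doc, rephrasing)            # quality + faithfulness score
        advantage = r - baseline
        rephraser.update(logprob, advantage, lr)
        baseline = 0.9 * baseline + 0.1 * r       # exponential moving average
    return rephraser
```

The baseline subtraction is a standard variance-reduction choice in policy-gradient methods, not something attributed to the paper.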
Quality and faithfulness reward design for rephraser optimization

The authors design a reward system consisting of one quality reward (DataMan score) and three faithfulness rewards (BERTScore for semantics, structure preservation, and length alignment). These rewards guide the rephraser to produce high-quality rephrasings while faithfully preserving the core characteristics of organic data.

10 retrieved papers
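A hedged sketch of how the described four-part reward could be combined. The weights, the word-count length proxy, and the paragraph-count structure proxy are illustrative assumptions; in the paper the quality score comes from DataMan and the semantic reward from BERTScore, which are taken here as plain numeric inputs.

```python
def length_alignment(original: str, rephrased: str) -> float:
    """Reward closeness in length, using word counts as a simple proxy."""
    lo, lr = len(original.split()), len(rephrased.split())
    return min(lo, lr) / max(lo, lr) if max(lo, lr) > 0 else 0.0

def structure_preservation(original: str, rephrased: str) -> float:
    """Toy proxy: compare paragraph counts as a stand-in for the
    paper's structure-preservation reward."""
    po = max(1, original.count("\n\n") + 1)
    pr = max(1, rephrased.count("\n\n") + 1)
    return min(po, pr) / max(po, pr)

def composite_reward(quality: float, semantic_sim: float,
                     original: str, rephrased: str,
                     weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of one quality reward and three faithfulness rewards.
    `quality` would come from a scorer such as DataMan and `semantic_sim`
    from BERTScore; both are passed in precomputed here."""
    wq, ws, wt, wl = weights
    return (wq * quality
            + ws * semantic_sim
            + wt * structure_preservation(original, rephrased)
            + wl * length_alignment(original, rephrased))
```

With inputs in [0, 1] and weights summing to 1, the composite reward stays in [0, 1], which keeps the scale stable for RL training.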
Demonstration of improved data efficiency and faithful preservation

The authors demonstrate that RePro achieves superior performance compared to existing methods while using a much smaller model, improving organic data efficiency by 2-3 times. Their analyses validate that the method preserves critical information and faithfully reflects the characteristics of organic data better than prompting-based approaches.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
