RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
Overview
Overall Novelty Assessment
The paper introduces RePro, a reinforcement learning (RL)-based method for generating high-quality rephrasings of web pretraining data using a 4B-parameter model. It resides in the 'Semantic Recycling via Rephrasing' leaf of the taxonomy, which contains only two papers: RePro itself and one sibling work ('Rephrasing the Web'). This sparse, emerging research direction within the broader data recycling landscape suggests the approach addresses a relatively underexplored strategy: augmenting pretraining corpora through controlled paraphrasing rather than collecting new data or applying static filtering heuristics.
The taxonomy reveals that semantic recycling sits alongside two other data reuse strategies: embedding/component recycling (which reuses learned representations) and cross-dataset aggregation (which combines existing corpora via deduplication). Neighboring branches include model-based quality scoring and curriculum scheduling under 'Data Selection and Filtering', which focus on identifying valuable subsets rather than generating new text. The scope note for semantic recycling explicitly excludes embedding reuse and continued pretraining from checkpoints, positioning RePro's text generation approach as distinct from methods that manipulate model components or training schedules without producing new linguistic variations.
Among the 24 candidates examined across three contributions, no clearly refuting prior work was identified. The core RePro method examined 4 candidates with no refutations; the reward design contribution examined 10 candidates with none refuting; and the data efficiency demonstration also examined 10 candidates without refutation. This suggests that within the limited search scope—focused on top semantic matches and citation expansion—the specific combination of RL-based rephraser training, multi-faceted reward design (quality plus three faithfulness rewards), and empirical validation of 2-3× data efficiency gains appears not to have direct precedent in the examined literature.
The analysis reflects a targeted search of 24 papers rather than exhaustive coverage of all paraphrasing or data augmentation research. The sparse taxonomy leaf (only one sibling paper) and absence of refuting candidates among examined works suggest RePro occupies a relatively novel position within the specific framing of web data recycling for LLM pretraining. However, the limited search scope means broader connections to general paraphrasing, data augmentation, or synthetic data generation outside this taxonomy may not be fully captured.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce RePro, a method that trains a small language model (4B parameters) using reinforcement learning to rephrase pretraining data. This approach recycles web data to increase the amount of high-quality pretraining data while maintaining semantic meaning and structural characteristics of the original content.
The authors design a reward system consisting of one quality reward (DataMan score) and three faithfulness rewards (BERTScore for semantics, structure preservation, and length alignment). These rewards guide the rephraser to produce high-quality rephrasings while faithfully preserving the core characteristics of organic data.
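The reward design described above can be illustrated as a weighted combination of one quality score and three faithfulness scores. The following is a minimal sketch, not the paper's implementation: the helper functions are simplified stand-ins (token-overlap F1 in place of BERTScore, line counts in place of a real structure comparison, a pre-computed quality score in place of DataMan), and the function names and weighting scheme are illustrative assumptions.

```python
def length_alignment(original: str, rephrased: str) -> float:
    """Reward closeness in word count (1.0 when lengths match)."""
    lo, lr = len(original.split()), len(rephrased.split())
    return min(lo, lr) / max(lo, lr, 1)

def structure_preservation(original: str, rephrased: str) -> float:
    """Crude structural proxy: compare line/paragraph counts."""
    po, pr = original.count("\n") + 1, rephrased.count("\n") + 1
    return min(po, pr) / max(po, pr)

def semantic_similarity(original: str, rephrased: str) -> float:
    """Stand-in for BERTScore: F1 over unique-token overlap."""
    so, sr = set(original.lower().split()), set(rephrased.lower().split())
    if not so or not sr:
        return 0.0
    overlap = len(so & sr)
    p, r = overlap / len(sr), overlap / len(so)
    return 2 * p * r / (p + r) if p + r else 0.0

def composite_reward(original: str, rephrased: str,
                     quality_score: float, w_quality: float = 0.5) -> float:
    """Combine one quality reward with the mean of three faithfulness rewards."""
    faithfulness = (semantic_similarity(original, rephrased)
                    + structure_preservation(original, rephrased)
                    + length_alignment(original, rephrased)) / 3
    return w_quality * quality_score + (1 - w_quality) * faithfulness
```

A scalar of this shape could serve as the return signal for an RL policy-gradient update of the rephraser: a rephrasing that scores well on quality but drifts semantically or structurally from the organic text is penalized through the faithfulness terms.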
The authors demonstrate that RePro achieves superior performance compared to existing methods while using a much smaller model, improving organic data efficiency by 2-3×. Their analyses validate that the method preserves critical information and faithfully reflects the characteristics of organic data better than prompting-based approaches.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[22] Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Contribution Analysis
Detailed comparisons for each claimed contribution
RePro web recycling method using RL-trained rephraser
The authors introduce RePro, a method that trains a small language model (4B parameters) using reinforcement learning to rephrase pretraining data. This approach recycles web data to increase the amount of high-quality pretraining data while maintaining semantic meaning and structural characteristics of the original content.
[69] RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting
[70] AugmenToxic: Leveraging Reinforcement Learning to Optimize LLM Instruction Fine-Tuning for Data Augmentation to Enhance Toxicity Detection
[71] Deep Learning with TensorFlow and Keras: Build and deploy supervised, unsupervised, deep, and reinforcement learning models
[72] Keyword-aware Abstractive Summarization by Extracting Set-level Intermediate Summaries
Quality and faithfulness reward design for rephraser optimization
The authors design a reward system consisting of one quality reward (DataMan score) and three faithfulness rewards (BERTScore for semantics, structure preservation, and length alignment). These rewards guide the rephraser to produce high-quality rephrasings while faithfully preserving the core characteristics of organic data.
[49] Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation
[50] Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
[51] Calibrated Self-Rewarding Vision Language Models
[52] ReNo: Enhancing One-Step Text-to-Image Models through Reward-Based Noise Optimization
[53] Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation
[54] Parrot: Pareto-Optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
[55] Symbolic Graphics Programming with Large Language Models
[56] RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
[57] Multi-Metric Preference Alignment for Generative Speech Restoration
[58] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Demonstration of improved data efficiency and faithful preservation
The authors demonstrate that RePro achieves superior performance compared to existing methods while using a much smaller model, improving organic data efficiency by 2-3×. Their analyses validate that the method preserves critical information and faithfully reflects the characteristics of organic data better than prompting-based approaches.