Squeeze the Soaked Sponge: Efficient Off-policy RFT for Large Language Model

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement Finetuning, Large Language Model, Reasoning
Abstract:

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs), yet most existing Reinforcement Finetuning (RFT) methods are inherently on-policy RL, failing to reuse historical data and thus preventing efficient scaling. In this work, we explore the potential of off-policy RL to leverage historical data for rollout-efficient RFT. Specifically, we propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), which enables on-policy RFT methods to leverage off-policy data. ReMix consists of three major components: (1) mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio, which uses data from both current and past policies for efficient training; (2) a KL-Convex policy constraint, which combines KL constraints on the base and precedent models to balance stability and flexibility; (3) policy reincarnation, which replaces the base model with the mix-policy RFT model midway through training and restarts on-policy training, achieving a seamless transition from early efficiency to steady convergence. In our experiments, we train a series of ReMix models based on PPO and GRPO, starting from 1.5B and 7B base models. On five math reasoning benchmarks (AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500), ReMix achieves an average Pass@1 accuracy of 52.10% (with 0.079M rollouts) and 64.39% (with 0.011M rollouts) on the 1.5B and 7B models, respectively. Compared with 15 recent advanced models, ReMix shows SOTA-level performance with a 30x to 450x reduction in training cost in terms of rollout data volume, demonstrating superior training efficiency. Additionally, our multifaceted analysis reveals insightful findings, including the implicit preference of off-policy RFT for shorter responses and the collapse mode of self-reflection under severe off-policyness.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ReMix, a method enabling on-policy reinforcement finetuning to leverage off-policy data through mix-policy proximal policy gradient, KL-convex constraints, and policy reincarnation. It resides in the 'Off-Policy Policy Gradient and Importance Sampling Methods' leaf, which contains eight papers including the original work. This leaf sits within the broader 'Off-Policy RL Algorithms and Optimization Methods' branch, indicating a moderately populated research direction focused on importance sampling and policy gradient techniques for LLM finetuning.

The taxonomy reveals neighboring leaves addressing hybrid on-/off-policy integration and EM-based trajectory optimization, suggesting the field explores multiple strategies for balancing data reuse and training stability. The sibling papers in the same leaf—such as Squeeze Soaked Sponge, Bapo, and Asymmetric REINFORCE—share the core challenge of variance control and safe exploitation of stale rollouts. ReMix diverges by introducing policy reincarnation and a convex KL constraint, aiming to transition from early efficiency to steady convergence, whereas siblings typically focus on refined importance weighting or gradient tapering alone.

Among the thirty candidates examined, one paper can refute the overall ReMix method, and two prior works overlap with the mix-policy proximal policy gradient component. The KL-Convex constraint appears more novel, with zero refutable candidates among the ten examined. These statistics suggest that while the core algorithmic building blocks have precedent within the limited search scope, the specific combination and the reincarnation mechanism may offer incremental differentiation. The search scale is modest, so a broader literature review may reveal additional overlaps or confirm the relative novelty.

Based on the top-thirty semantic matches and taxonomy structure, ReMix appears to occupy a crowded methodological niche where incremental refinements to off-policy policy gradient methods are common. The analysis does not cover exhaustive citation networks or domain-specific applications, leaving open the possibility that the practical impact or empirical gains distinguish this work beyond the algorithmic formulation captured here.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: efficient off-policy reinforcement finetuning for large language models. The field has grown into a rich landscape organized around several complementary themes. At the highest level, one finds branches dedicated to core algorithmic innovations: off-policy RL algorithms and optimization methods (including policy gradient and importance sampling techniques such as Soaked Sponge[0], Squeeze Soaked Sponge[3], and Bapo[5]), asynchronous and distributed training schemes (e.g., Asynchronous RLHF[9] and Faster Asynchronous RLHF[23]), and unified frameworks that aim to consolidate disparate methods into general-purpose systems. Parallel to these algorithmic threads, the taxonomy highlights offline RL and reward-weighted finetuning approaches, domain-specific applications ranging from agent systems to specialized tasks, and alternative alignment paradigms that explore non-RL routes to preference learning. Additional branches address sample efficiency and data-centric methods, training dynamics and theoretical analysis, diffusion-based or non-autoregressive models, and transfer learning across domains, reflecting the breadth of strategies researchers employ to make LLM finetuning both effective and scalable.

Within this landscape, a particularly active line of work centers on off-policy policy gradient and importance sampling methods, where the central challenge is to reuse previously collected data while controlling variance and bias. Soaked Sponge[0] sits squarely in this cluster, sharing methodological kinship with neighbors like Squeeze Soaked Sponge[3], which refines importance weighting strategies, and Bapo[5], which explores alternative off-policy corrections. Nearby efforts such as Tapered REINFORCE[8] and Asymmetric REINFORCE[22] investigate variance reduction through gradient tapering or asymmetric updates, while Trinity RFT[1] and Residual Off-Policy[2] propose hybrid or residual formulations to balance on-policy stability with off-policy sample reuse.

The main trade-offs revolve around computational overhead, variance control, and the degree to which stale rollouts can be exploited safely without destabilizing training. Soaked Sponge[0] emphasizes efficient reuse of off-policy samples through careful importance correction, positioning itself as a practical middle ground between purely on-policy methods and more aggressive offline schemes.

Claimed Contributions

Reincarnating Mix-policy Proximal Policy Gradient (ReMix) method

The authors introduce ReMix, a general approach that enables on-policy reinforcement finetuning methods like PPO and GRPO to utilize off-policy data for efficient training. The method consists of three synergistic components: mix-policy proximal policy gradient with increased Update-To-Data ratio, KL-Convex policy constraint, and policy reincarnation.

10 retrieved papers
Can Refute
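The two-stage schedule this contribution describes (an efficient mix-policy phase, then policy reincarnation, then standard on-policy training) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name, string labels, and step counts are all hypothetical.

```python
# Hypothetical sketch of the ReMix training schedule: an efficient
# mix-policy phase followed by policy reincarnation, after which
# standard on-policy RFT resumes. All names are illustrative.

def remix_schedule(total_steps, reincarnation_step):
    """Return the phase ('mix-policy' or 'on-policy') and the anchor
    model used at each training step under the two-stage scheme."""
    log = []
    base_model = "base"  # initial anchor for the KL constraint
    for step in range(total_steps):
        if step < reincarnation_step:
            # Phase 1: mix-policy proximal policy gradient reuses
            # off-policy rollouts (UTD ratio > 1) for early efficiency.
            log.append(("mix-policy", base_model))
        else:
            if step == reincarnation_step:
                # Policy reincarnation: the mix-policy RFT model
                # replaces the base model; on-policy training restarts.
                base_model = "reincarnated"
            # Phase 2: plain on-policy RFT for steady convergence.
            log.append(("on-policy", base_model))
    return log

schedule = remix_schedule(total_steps=6, reincarnation_step=3)
```

The sketch only tracks which phase and anchor are active per step; a real trainer would perform gradient updates inside each branch.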
Mix-policy proximal policy gradient with increased UTD ratio

The authors propose a mix-policy proximal policy gradient method that strategically leverages both off-policy and on-policy data within a unified objective function, combined with an increased Update-To-Data ratio to perform repeated gradient updates on sampled data batches, thereby reducing fresh environment interaction demands.

10 retrieved papers
Can Refute
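The idea of a unified objective over mixed data can be illustrated with a toy sketch: a PPO-style clipped surrogate evaluated on a batch that mixes fresh on-policy samples with stale off-policy samples, where each importance ratio is taken against whichever behavior policy generated that sample. This is an assumption-laden illustration, not the paper's objective; all probabilities, keys, and values are hypothetical.

```python
# Toy illustration (not the paper's implementation): a clipped surrogate
# over a mixed batch of on-policy and off-policy samples. Each sample
# carries the probability its own behavior policy assigned to the action,
# so the importance ratio is pi_current / pi_behavior per sample.

def clipped_surrogate(batch, eps=0.2):
    """Average clipped policy-gradient objective over a mixed batch.

    batch: list of dicts with keys
      'p_cur' - prob. of the action under the current policy
      'p_beh' - prob. under the behavior policy (current or stale)
      'adv'   - advantage estimate
    """
    total = 0.0
    for s in batch:
        ratio = s["p_cur"] / s["p_beh"]
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
        # PPO-style pessimistic bound: take the minimum of the
        # unclipped and clipped terms.
        total += min(ratio * s["adv"], clipped * s["adv"])
    return total / len(batch)

# A batch mixing one fresh on-policy sample (p_beh == p_cur) with one
# stale off-policy sample whose ratio exceeds the clip range.
mixed_batch = [
    {"p_cur": 0.5, "p_beh": 0.5, "adv": 1.0},  # on-policy, ratio = 1.0
    {"p_cur": 0.6, "p_beh": 0.4, "adv": 1.0},  # off-policy, ratio = 1.5
]
obj = clipped_surrogate(mixed_batch, eps=0.2)
```

An increased UTD ratio would, in a real trainer, simply take several optimizer steps on such a batch before collecting fresh rollouts, which is where the reduction in rollout volume comes from.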
KL-Convex policy constraint

The authors introduce a KL-Convex policy constraint that dynamically updates the anchor objective to a convex combination of the base model and the precedent model, enabling the policy to preserve foundational capabilities while facilitating iterative refinement and continuous improvement.

10 retrieved papers
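A convex combination of two KL anchors, as described above, can be sketched numerically with discrete distributions. This is a toy example under stated assumptions: the weight alpha, the three distributions, and all function names are illustrative, not the paper's formulation or values.

```python
import math

# Toy sketch of a KL-Convex anchor (names and values are illustrative):
# the penalty is a convex combination of the divergence from the base
# model and from the precedent (previous) model, so the anchor can shift
# smoothly from preserving the base to tracking recent progress.

def kl(p, q):
    """KL divergence between two discrete distributions (p_i > 0 terms)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_convex_penalty(pi_cur, pi_base, pi_prec, alpha):
    """alpha weights the base-model term; (1 - alpha) the precedent term."""
    return alpha * kl(pi_cur, pi_base) + (1.0 - alpha) * kl(pi_cur, pi_prec)

pi_cur  = [0.5, 0.3, 0.2]
pi_base = [0.4, 0.4, 0.2]
pi_prec = [0.5, 0.3, 0.2]  # current policy equals the precedent here

# alpha = 1 recovers the plain KL-to-base constraint; alpha = 0 anchors
# purely on the precedent model (zero penalty in this toy example).
full_base = kl_convex_penalty(pi_cur, pi_base, pi_prec, alpha=1.0)
full_prec = kl_convex_penalty(pi_cur, pi_base, pi_prec, alpha=0.0)
```

The endpoint behavior shows the intended balance: alpha near 1 preserves foundational capabilities by anchoring on the base model, while alpha near 0 permits iterative refinement against the most recent precedent.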

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Reincarnating Mix-policy Proximal Policy Gradient (ReMix) method

The authors introduce ReMix, a general approach that enables on-policy reinforcement finetuning methods like PPO and GRPO to utilize off-policy data for efficient training. The method consists of three synergistic components: mix-policy proximal policy gradient with increased Update-To-Data ratio, KL-Convex policy constraint, and policy reincarnation.

Contribution

Mix-policy proximal policy gradient with increased UTD ratio

The authors propose a mix-policy proximal policy gradient method that strategically leverages both off-policy and on-policy data within a unified objective function, combined with an increased Update-To-Data ratio to perform repeated gradient updates on sampled data batches, thereby reducing fresh environment interaction demands.

Contribution

KL-Convex policy constraint

The authors introduce a KL-Convex policy constraint that dynamically updates the anchor objective to a convex combination of the base model and the precedent model, enabling the policy to preserve foundational capabilities while facilitating iterative refinement and continuous improvement.