Squeeze the Soaked Sponge: Efficient Off-policy RFT for Large Language Model

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement Finetuning, Large Language Model, Reasoning
Abstract:

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs), yet most existing Reinforcement Finetuning (RFT) methods are inherently on-policy RL, failing to reuse historical data and thus preventing efficient scaling. In this work, we explore the potential of off-policy RL to leverage historical data for rollout-efficient RFT. Specifically, we propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), which enables on-policy RFT methods to leverage off-policy data. ReMix consists of three major components: (1) mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio, which uses data from both current and past policies for efficient training; (2) a KL-Convex policy constraint, which combines KL constraints on the base and precedent models to balance stability and flexibility; (3) policy reincarnation, which replaces the base model with the mix-policy RFT model midway through training and restarts on-policy training, achieving a seamless transition from early efficiency to steady convergence. In our experiments, we train a series of ReMix models based on PPO and GRPO, starting from 1.5B and 7B base models. On five math reasoning benchmarks (AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500), ReMix achieves an average Pass@1 accuracy of 52.10% (with 0.079M rollouts) and 64.39% (with 0.011M rollouts) on the 1.5B and 7B models, respectively. Compared with 15 recent advanced models, ReMix shows SOTA-level performance with a 30x to 450x reduction in training cost in terms of rollout data volume, demonstrating superior training efficiency. Additionally, our multifaceted analysis reveals insightful findings, including the implicit preference of off-policy RFT for shorter responses and the collapse mode of self-reflection under severe off-policyness.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ReMix, a method enabling on-policy reinforcement finetuning to leverage off-policy data through mix-policy proximal policy gradient, KL-convex constraints, and policy reincarnation. It resides in the 'Off-Policy Policy Gradient and Importance Sampling Methods' leaf, which contains eight papers including the original work. This leaf sits within the broader 'Off-Policy RL Algorithms and Optimization Methods' branch, indicating a moderately populated research direction focused on importance sampling and policy gradient techniques for LLM finetuning.

The taxonomy reveals neighboring leaves addressing hybrid on-/off-policy integration and EM-based trajectory optimization, suggesting the field explores multiple strategies for balancing data reuse and training stability. The sibling papers in the same leaf—such as Squeeze Soaked Sponge, Bapo, and Asymmetric REINFORCE—share the core challenge of variance control and safe exploitation of stale rollouts. ReMix diverges by introducing policy reincarnation and a convex KL constraint, aiming to transition from early efficiency to steady convergence, whereas siblings typically focus on refined importance weighting or gradient tapering alone.

Among the thirty candidates examined, one paper can refute the overall ReMix method, and two prior works overlap with the mix-policy proximal policy gradient component. The KL-Convex constraint appears more novel, with zero refutable candidates among the ten examined. These statistics suggest that while the core algorithmic building blocks have precedent within the limited search scope, the specific combination and the reincarnation mechanism may offer incremental differentiation. The search scale is modest, so a broader literature review may reveal additional overlaps or confirm the relative novelty.

Based on the top-thirty semantic matches and taxonomy structure, ReMix appears to occupy a crowded methodological niche where incremental refinements to off-policy policy gradient methods are common. The analysis does not cover exhaustive citation networks or domain-specific applications, leaving open the possibility that the practical impact or empirical gains distinguish this work beyond the algorithmic formulation captured here.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: efficient off-policy reinforcement finetuning for large language models. The field has grown into a rich landscape organized around several complementary themes. At the highest level, one finds branches dedicated to core algorithmic innovations: off-policy RL algorithms and optimization methods (including policy gradient and importance sampling techniques such as Soaked Sponge[0], Squeeze Soaked Sponge[3], and Bapo[5]), asynchronous and distributed training schemes (e.g., Asynchronous RLHF[9] and Faster Asynchronous RLHF[23]), and unified frameworks that aim to consolidate disparate methods into general-purpose systems. Parallel to these algorithmic threads, the taxonomy highlights offline RL and reward-weighted finetuning approaches, domain-specific applications ranging from agent systems to specialized tasks, and alternative alignment paradigms that explore non-RL routes to preference learning. Additional branches address sample efficiency and data-centric methods, training dynamics and theoretical analysis, diffusion-based or non-autoregressive models, and transfer learning across domains, reflecting the breadth of strategies researchers employ to make LLM finetuning both effective and scalable.

Within this landscape, a particularly active line of work centers on off-policy policy gradient and importance sampling methods, where the central challenge is to reuse previously collected data while controlling variance and bias. Soaked Sponge[0] sits squarely in this cluster, sharing methodological kinship with neighbors like Squeeze Soaked Sponge[3], which refines importance weighting strategies, and Bapo[5], which explores alternative off-policy corrections. Nearby efforts such as Tapered REINFORCE[8] and Asymmetric REINFORCE[22] investigate variance reduction through gradient tapering or asymmetric updates, while Trinity RFT[1] and Residual Off-Policy[2] propose hybrid or residual formulations to balance on-policy stability with off-policy sample reuse.

The main trade-offs revolve around computational overhead, variance control, and the degree to which stale rollouts can be exploited safely without destabilizing training. Soaked Sponge[0] emphasizes efficient reuse of off-policy samples through careful importance correction, positioning itself as a practical middle ground between purely on-policy methods and more aggressive offline schemes.

Claimed Contributions

Reincarnating Mix-policy Proximal Policy Gradient (ReMix) method

The authors introduce ReMix, a general approach that enables on-policy reinforcement finetuning methods like PPO and GRPO to utilize off-policy data for efficient training. The method consists of three synergistic components: mix-policy proximal policy gradient with increased Update-To-Data ratio, KL-Convex policy constraint, and policy reincarnation.

10 retrieved papers
Can Refute
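The two-stage schedule this contribution describes (an efficient mix-policy phase, then policy reincarnation, then standard on-policy training) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name, string labels, and step counts are all hypothetical.

```python
# Hypothetical sketch of the ReMix training schedule: an efficient
# mix-policy phase followed by policy reincarnation, after which
# standard on-policy RFT resumes. All names are illustrative.

def remix_schedule(total_steps, reincarnation_step):
    """Return the phase ('mix-policy' or 'on-policy') and the anchor
    model used at each training step under the two-stage scheme."""
    log = []
    base_model = "base"  # initial anchor for the KL constraint
    for step in range(total_steps):
        if step < reincarnation_step:
            # Phase 1: mix-policy proximal policy gradient reuses
            # off-policy rollouts (UTD ratio > 1) for early efficiency.
            log.append(("mix-policy", base_model))
        else:
            if step == reincarnation_step:
                # Policy reincarnation: the mix-policy RFT model
                # replaces the base model; on-policy training restarts.
                base_model = "reincarnated"
            # Phase 2: plain on-policy RFT for steady convergence.
            log.append(("on-policy", base_model))
    return log

schedule = remix_schedule(total_steps=6, reincarnation_step=3)
```

The sketch only tracks which phase and anchor are active per step; a real trainer would perform gradient updates inside each branch.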
Mix-policy proximal policy gradient with increased UTD ratio

The authors propose a mix-policy proximal policy gradient method that strategically leverages both off-policy and on-policy data within a unified objective function, combined with an increased Update-To-Data ratio to perform repeated gradient updates on sampled data batches, thereby reducing fresh environment interaction demands.

10 retrieved papers
Can Refute
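The idea of a unified objective over mixed data can be illustrated with a toy sketch: a PPO-style clipped surrogate evaluated on a batch that mixes fresh on-policy samples with stale off-policy samples, where each importance ratio is taken against whichever behavior policy generated that sample. This is an assumption-laden illustration, not the paper's objective; all probabilities, keys, and values are hypothetical.

```python
# Toy illustration (not the paper's implementation): a clipped surrogate
# over a mixed batch of on-policy and off-policy samples. Each sample
# carries the probability its own behavior policy assigned to the action,
# so the importance ratio is pi_current / pi_behavior per sample.

def clipped_surrogate(batch, eps=0.2):
    """Average clipped policy-gradient objective over a mixed batch.

    batch: list of dicts with keys
      'p_cur' - prob. of the action under the current policy
      'p_beh' - prob. under the behavior policy (current or stale)
      'adv'   - advantage estimate
    """
    total = 0.0
    for s in batch:
        ratio = s["p_cur"] / s["p_beh"]
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
        # PPO-style pessimistic bound: take the minimum of the
        # unclipped and clipped terms.
        total += min(ratio * s["adv"], clipped * s["adv"])
    return total / len(batch)

# A batch mixing one fresh on-policy sample (p_beh == p_cur) with one
# stale off-policy sample whose ratio exceeds the clip range.
mixed_batch = [
    {"p_cur": 0.5, "p_beh": 0.5, "adv": 1.0},  # on-policy, ratio = 1.0
    {"p_cur": 0.6, "p_beh": 0.4, "adv": 1.0},  # off-policy, ratio = 1.5
]
obj = clipped_surrogate(mixed_batch, eps=0.2)
```

An increased UTD ratio would, in a real trainer, simply take several optimizer steps on such a batch before collecting fresh rollouts, which is where the reduction in rollout volume comes from.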
KL-Convex policy constraint

The authors introduce a KL-Convex policy constraint that dynamically updates the anchor objective to a convex combination of the base model and the precedent model, enabling the policy to preserve foundational capabilities while facilitating iterative refinement and continuous improvement.

10 retrieved papers
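A convex combination of two KL anchors, as described above, can be sketched numerically with discrete distributions. This is a toy example under stated assumptions: the weight alpha, the three distributions, and all function names are illustrative, not the paper's formulation or values.

```python
import math

# Toy sketch of a KL-Convex anchor (names and values are illustrative):
# the penalty is a convex combination of the divergence from the base
# model and from the precedent (previous) model, so the anchor can shift
# smoothly from preserving the base to tracking recent progress.

def kl(p, q):
    """KL divergence between two discrete distributions (p_i > 0 terms)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_convex_penalty(pi_cur, pi_base, pi_prec, alpha):
    """alpha weights the base-model term; (1 - alpha) the precedent term."""
    return alpha * kl(pi_cur, pi_base) + (1.0 - alpha) * kl(pi_cur, pi_prec)

pi_cur  = [0.5, 0.3, 0.2]
pi_base = [0.4, 0.4, 0.2]
pi_prec = [0.5, 0.3, 0.2]  # current policy equals the precedent here

# alpha = 1 recovers the plain KL-to-base constraint; alpha = 0 anchors
# purely on the precedent model (zero penalty in this toy example).
full_base = kl_convex_penalty(pi_cur, pi_base, pi_prec, alpha=1.0)
full_prec = kl_convex_penalty(pi_cur, pi_base, pi_prec, alpha=0.0)
```

The endpoint behavior shows the intended balance: alpha near 1 preserves foundational capabilities by anchoring on the base model, while alpha near 0 permits iterative refinement against the most recent precedent.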

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Reincarnating Mix-policy Proximal Policy Gradient (ReMix) method

The authors introduce ReMix, a general approach that enables on-policy reinforcement finetuning methods like PPO and GRPO to utilize off-policy data for efficient training. The method consists of three synergistic components: mix-policy proximal policy gradient with increased Update-To-Data ratio, KL-Convex policy constraint, and policy reincarnation.

Contribution

Mix-policy proximal policy gradient with increased UTD ratio

The authors propose a mix-policy proximal policy gradient method that strategically leverages both off-policy and on-policy data within a unified objective function, combined with an increased Update-To-Data ratio to perform repeated gradient updates on sampled data batches, thereby reducing fresh environment interaction demands.

Contribution

KL-Convex policy constraint

The authors introduce a KL-Convex policy constraint that dynamically updates the anchor objective to a convex combination of the base model and the precedent model, enabling the policy to preserve foundational capabilities while facilitating iterative refinement and continuous improvement.