Abstract:

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high update-to-data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate in simulation that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for policy improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: efficient finetuning of humanoid locomotion policies using model-based reinforcement learning.

The field organizes around several complementary strategies for learning robust locomotion controllers. Model-Based Pretraining and Finetuning Frameworks explore how learned dynamics models can accelerate policy adaptation, often combining simulation-based pretraining with real-world deployment. World Model and Dynamics Learning focuses on building predictive representations of robot behavior, ranging from neural network approximations like Neural Network Dynamics[30] to more structured approaches such as Denoising World Model[3]. Model Predictive Control and Planning Integration emphasizes optimization-driven methods that leverage models for online trajectory generation, exemplified by works like Temporal Difference MPC[17] and Adaptive MPC Terrain[16]. End-to-End Learning Approaches and Hierarchical and Task Space Control represent alternative paradigms that either bypass explicit modeling or decompose control into layered abstractions. Specialized Learning Techniques and Representations address domain-specific challenges such as terrain adaptation and contact modeling, while Comparative Studies and Benchmarking provide empirical assessments across methods.

Recent activity highlights tensions between sample efficiency and generalization. Many studies pursue hybrid strategies that blend model-based planning with model-free policy learning, as seen in Fusing Dynamics RL[21] and PTRL Prior Transfer[20], seeking to exploit the strengths of both paradigms. Within this landscape, Bridging Pretraining Finetuning[0] sits among works that use learned dynamics to bootstrap policy training, closely related to Decoupled Backpropagation[4] and Neural Network Dynamics[30], which similarly emphasize gradient-based optimization through differentiable models.
Compared to PPF Preservative Finetuning[2], which focuses on retaining pretrained knowledge during adaptation, the original work appears to prioritize efficient transfer from model-based pretraining to downstream tasks. Open questions persist around how to balance model accuracy with computational overhead, and whether hierarchical decompositions or end-to-end learning better handle the complexity of real-world humanoid locomotion.

Claimed Contributions

Scalable JAX implementation of SAC for humanoid pretraining and zero-shot deployment

The authors develop a JAX-based SAC implementation that enables large-batch updates and high update-to-data ratios, achieving fast pretraining in parallel simulation and successful zero-shot transfer to real humanoid robots. This implementation serves as the policy module for subsequent model-based finetuning.

2 retrieved papers
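To make the claimed training recipe concrete, the following is a minimal sketch of a high-UTD, large-batch off-policy loop: for every environment step collected, the agent performs several gradient updates on large replay batches. This is plain Python/NumPy rather than the authors' JAX implementation, and the names (`ReplayBuffer`, `sac_update`) and constants (`UTD_RATIO = 8`, `BATCH_SIZE = 256`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayBuffer:
    """Minimal FIFO replay buffer storing (s, a, r, s') transitions."""
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.size, self.ptr = capacity, 0, 0
        self.obs = np.zeros((capacity, obs_dim))
        self.act = np.zeros((capacity, act_dim))
        self.rew = np.zeros(capacity)
        self.next_obs = np.zeros((capacity, obs_dim))

    def add(self, s, a, r, s2):
        i = self.ptr
        self.obs[i], self.act[i], self.rew[i], self.next_obs[i] = s, a, r, s2
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = rng.integers(0, self.size, size=batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx], self.next_obs[idx]

def sac_update(batch):
    """Placeholder for one SAC gradient step (critic, actor, temperature)."""
    obs, act, rew, next_obs = batch
    return float(rew.mean())  # stands in for a loss/metric

OBS_DIM, ACT_DIM = 4, 2
UTD_RATIO = 8        # gradient updates per environment step (high UTD)
BATCH_SIZE = 256     # large-batch updates
buf = ReplayBuffer(10_000, OBS_DIM, ACT_DIM)

updates_done = 0
for step in range(300):                     # environment interaction loop
    s = rng.normal(size=OBS_DIM)            # dummy transition, standing in
    a = rng.normal(size=ACT_DIM)            # for parallel-simulation rollouts
    buf.add(s, a, rng.normal(), rng.normal(size=OBS_DIM))
    if buf.size >= BATCH_SIZE:
        for _ in range(UTD_RATIO):          # many updates per collected step
            sac_update(buf.sample(BATCH_SIZE))
            updates_done += 1

print(updates_done)  # 45 post-warmup steps x 8 updates = 360
```

The key structural point is the inner `UTD_RATIO` loop: compared with a standard one-update-per-step SAC loop, it trades extra compute per transition for fewer environment interactions, which is what makes off-policy pretraining competitive with massively parallel on-policy training in wall-clock terms.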
Finetuning strategy with deterministic execution and physics-informed world model exploration

The authors propose a finetuning approach that separates deterministic policy execution in the real environment from stochastic exploration confined to a physics-informed world model. This design enhances safety during adaptation while maintaining exploratory coverage, enabling data-efficient in-distribution adaptation and stronger out-of-distribution generalization.

2 retrieved papers
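The separation described above can be sketched as follows: the policy's mean action is executed in the (new) real environment, while action noise is injected only during imagined rollouts inside the learned world model. Everything here is a toy stand-in under stated assumptions: the linear-Gaussian policy, the `real_env_step` and `world_model_step` dynamics, and all constants are hypothetical, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian policy: mean from a fixed linear map (stand-in for
# the pretrained SAC actor); STD controls exploration noise.
W = rng.normal(size=(2, 4)) * 0.1
STD = 0.3

def policy_mean(obs):
    return np.tanh(W @ obs)

def act(obs, deterministic):
    mean = policy_mean(obs)
    if deterministic:
        return mean                                      # real env: no noise
    return mean + STD * rng.normal(size=mean.shape)      # model rollouts only

def real_env_step(obs, action):
    """Stand-in for the new environment's (unknown) dynamics."""
    return 0.9 * obs + 0.05 * np.pad(action, (0, 2))

def world_model_step(obs, action):
    """Stand-in for a learned, physics-informed dynamics model."""
    return 0.9 * obs + 0.05 * np.pad(action, (0, 2)) + 0.01

# 1) Data collection in the new environment executes the deterministic policy.
obs = np.ones(4)
real_traj = []
for _ in range(5):
    a = act(obs, deterministic=True)
    obs = real_env_step(obs, a)
    real_traj.append(obs)

# 2) Stochastic exploration is confined to rollouts inside the world model.
obs = np.ones(4)
imagined = []
for _ in range(5):
    a = act(obs, deterministic=False)   # noisy actions, but only in imagination
    obs = world_model_step(obs, a)
    imagined.append(obs)

print(len(real_traj), len(imagined))
```

The design point the sketch illustrates: because the deterministic controller is what actually touches the hardware (or new simulation), adaptation never executes random actions on the robot, yet the imagined, noisy rollouts still give the learner the exploratory coverage needed for policy improvement.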
Open-source LIFT pipeline for humanoid control

The authors provide a complete open-source framework that integrates large-scale pretraining, zero-shot sim-to-real transfer, and efficient finetuning for humanoid robots, offering a practical baseline for the robotics research community.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Scalable JAX implementation of SAC for humanoid pretraining and zero-shot deployment

The authors develop a JAX-based SAC implementation that enables large-batch updates and high update-to-data ratios, achieving fast pretraining in parallel simulation and successful zero-shot transfer to real humanoid robots. This implementation serves as the policy module for subsequent model-based finetuning.

Contribution

Finetuning strategy with deterministic execution and physics-informed world model exploration

The authors propose a finetuning approach that separates deterministic policy execution in the real environment from stochastic exploration confined to a physics-informed world model. This design enhances safety during adaptation while maintaining exploratory coverage, enabling data-efficient in-distribution adaptation and stronger out-of-distribution generalization.

Contribution

Open-source LIFT pipeline for humanoid control

The authors provide a complete open-source framework that integrates large-scale pretraining, zero-shot sim-to-real transfer, and efficient finetuning for humanoid robots, offering a practical baseline for the robotics research community.