Abstract:

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high update-to-data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate in simulation that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for policy improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: efficient finetuning of humanoid locomotion policies using model-based reinforcement learning.

The field organizes around several complementary strategies for learning robust locomotion controllers. Model-Based Pretraining and Finetuning Frameworks explore how learned dynamics models can accelerate policy adaptation, often combining simulation-based pretraining with real-world deployment. World Model and Dynamics Learning focuses on building predictive representations of robot behavior, ranging from neural network approximations like Neural Network Dynamics[30] to more structured approaches such as Denoising World Model[3]. Model Predictive Control and Planning Integration emphasizes optimization-driven methods that leverage models for online trajectory generation, exemplified by works like Temporal Difference MPC[17] and Adaptive MPC Terrain[16]. End-to-End Learning Approaches and Hierarchical and Task Space Control represent alternative paradigms that either bypass explicit modeling or decompose control into layered abstractions. Specialized Learning Techniques and Representations address domain-specific challenges such as terrain adaptation and contact modeling, while Comparative Studies and Benchmarking provide empirical assessments across methods.

Recent activity highlights tensions between sample efficiency and generalization. Many studies pursue hybrid strategies that blend model-based planning with model-free policy learning, as seen in Fusing Dynamics RL[21] and PTRL Prior Transfer[20], seeking to exploit the strengths of both paradigms. Within this landscape, Bridging Pretraining Finetuning[0] sits among works that use learned dynamics to bootstrap policy training, closely related to Decoupled Backpropagation[4] and Neural Network Dynamics[30], which similarly emphasize gradient-based optimization through differentiable models.
Compared to PPF Preservative Finetuning[2], which focuses on retaining pretrained knowledge during adaptation, the original work appears to prioritize efficient transfer from model-based pretraining to downstream tasks. Open questions persist around how to balance model accuracy with computational overhead, and whether hierarchical decompositions or end-to-end learning better handle the complexity of real-world humanoid locomotion.

Claimed Contributions

Scalable JAX implementation of SAC for humanoid pretraining and zero-shot deployment

The authors develop a JAX-based SAC implementation that enables large-batch updates and high update-to-data ratios, achieving fast pretraining in parallel simulation and successful zero-shot transfer to real humanoid robots. This implementation serves as the policy module for subsequent model-based finetuning.

2 retrieved papers
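To make the claimed training recipe concrete, the following is a minimal sketch of a high-UTD, large-batch off-policy loop: for every environment step collected, the agent performs several gradient updates on large replay batches. This is plain Python/NumPy rather than the authors' JAX implementation, and the names (`ReplayBuffer`, `sac_update`) and constants (`UTD_RATIO = 8`, `BATCH_SIZE = 256`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayBuffer:
    """Minimal FIFO replay buffer storing (s, a, r, s') transitions."""
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.size, self.ptr = capacity, 0, 0
        self.obs = np.zeros((capacity, obs_dim))
        self.act = np.zeros((capacity, act_dim))
        self.rew = np.zeros(capacity)
        self.next_obs = np.zeros((capacity, obs_dim))

    def add(self, s, a, r, s2):
        i = self.ptr
        self.obs[i], self.act[i], self.rew[i], self.next_obs[i] = s, a, r, s2
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = rng.integers(0, self.size, size=batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx], self.next_obs[idx]

def sac_update(batch):
    """Placeholder for one SAC gradient step (critic, actor, temperature)."""
    obs, act, rew, next_obs = batch
    return float(rew.mean())  # stands in for a loss/metric

OBS_DIM, ACT_DIM = 4, 2
UTD_RATIO = 8        # gradient updates per environment step (high UTD)
BATCH_SIZE = 256     # large-batch updates
buf = ReplayBuffer(10_000, OBS_DIM, ACT_DIM)

updates_done = 0
for step in range(300):                     # environment interaction loop
    s = rng.normal(size=OBS_DIM)            # dummy transition, standing in
    a = rng.normal(size=ACT_DIM)            # for parallel-simulation rollouts
    buf.add(s, a, rng.normal(), rng.normal(size=OBS_DIM))
    if buf.size >= BATCH_SIZE:
        for _ in range(UTD_RATIO):          # many updates per collected step
            sac_update(buf.sample(BATCH_SIZE))
            updates_done += 1

print(updates_done)  # 45 post-warmup steps x 8 updates = 360
```

The key structural point is the inner `UTD_RATIO` loop: compared with a standard one-update-per-step SAC loop, it trades extra compute per transition for fewer environment interactions, which is what makes off-policy pretraining competitive with massively parallel on-policy training in wall-clock terms.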
Finetuning strategy with deterministic execution and physics-informed world model exploration

The authors propose a finetuning approach that separates deterministic policy execution in the real environment from stochastic exploration confined to a physics-informed world model. This design enhances safety during adaptation while maintaining exploratory coverage, enabling data-efficient in-distribution adaptation and stronger out-of-distribution generalization.

2 retrieved papers
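The separation described above can be sketched as follows: the policy's mean action is executed in the (new) real environment, while action noise is injected only during imagined rollouts inside the learned world model. Everything here is a toy stand-in under stated assumptions: the linear-Gaussian policy, the `real_env_step` and `world_model_step` dynamics, and all constants are hypothetical, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian policy: mean from a fixed linear map (stand-in for
# the pretrained SAC actor); STD controls exploration noise.
W = rng.normal(size=(2, 4)) * 0.1
STD = 0.3

def policy_mean(obs):
    return np.tanh(W @ obs)

def act(obs, deterministic):
    mean = policy_mean(obs)
    if deterministic:
        return mean                                      # real env: no noise
    return mean + STD * rng.normal(size=mean.shape)      # model rollouts only

def real_env_step(obs, action):
    """Stand-in for the new environment's (unknown) dynamics."""
    return 0.9 * obs + 0.05 * np.pad(action, (0, 2))

def world_model_step(obs, action):
    """Stand-in for a learned, physics-informed dynamics model."""
    return 0.9 * obs + 0.05 * np.pad(action, (0, 2)) + 0.01

# 1) Data collection in the new environment executes the deterministic policy.
obs = np.ones(4)
real_traj = []
for _ in range(5):
    a = act(obs, deterministic=True)
    obs = real_env_step(obs, a)
    real_traj.append(obs)

# 2) Stochastic exploration is confined to rollouts inside the world model.
obs = np.ones(4)
imagined = []
for _ in range(5):
    a = act(obs, deterministic=False)   # noisy actions, but only in imagination
    obs = world_model_step(obs, a)
    imagined.append(obs)

print(len(real_traj), len(imagined))
```

The design point the sketch illustrates: because the deterministic controller is what actually touches the hardware (or new simulation), adaptation never executes random actions on the robot, yet the imagined, noisy rollouts still give the learner the exploratory coverage needed for policy improvement.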
Open-source LIFT pipeline for humanoid control

The authors provide a complete open-source framework that integrates large-scale pretraining, zero-shot sim-to-real transfer, and efficient finetuning for humanoid robots, offering a practical baseline for the robotics research community.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Scalable JAX implementation of SAC for humanoid pretraining and zero-shot deployment

The authors develop a JAX-based SAC implementation that enables large-batch updates and high update-to-data ratios, achieving fast pretraining in parallel simulation and successful zero-shot transfer to real humanoid robots. This implementation serves as the policy module for subsequent model-based finetuning.

Contribution

Finetuning strategy with deterministic execution and physics-informed world model exploration

The authors propose a finetuning approach that separates deterministic policy execution in the real environment from stochastic exploration confined to a physics-informed world model. This design enhances safety during adaptation while maintaining exploratory coverage, enabling data-efficient in-distribution adaptation and stronger out-of-distribution generalization.

Contribution

Open-source LIFT pipeline for humanoid control

The authors provide a complete open-source framework that integrates large-scale pretraining, zero-shot sim-to-real transfer, and efficient finetuning for humanoid robots, offering a practical baseline for the robotics research community.