Master Skill Learning with Policy-Grounded Synergy of LLM-based Reward Shaping and Exploring

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Robot Skill Acquisition, Dexterous Manipulation, Automatic Reward Design
Abstract:

The acquisition of robotic skills via reinforcement learning (RL) is crucial for advancing embodied intelligence, but designing effective reward functions for complex tasks remains challenging. Recent methods using large language models (LLMs) can generate reward functions from language instructions, but they often produce overly goal-oriented rewards that neglect state exploration, causing robots to get stuck in local optima. Traditional RL addresses this by adding exploration bonuses, but these are typically generic and inefficient, wasting resources on exploring task-irrelevant areas. To address these limitations, we propose Policy-grounded Synergy of Reward Shaping and Exploration (PoRSE), a novel and unified framework that guides LLMs to generate task-aware reward functions while constructing an abstract affordance space for efficient exploration bonuses. Given the vast number of possible reward-bonus combinations, it is impractical to exhaustively train a policy from scratch for each configuration to identify the best one. Instead, PoRSE employs an in-policy-improvement grounding process that continuously generates and filters reward-bonus pairs throughout policy improvement. This approach accelerates skill acquisition and fosters a mutually reinforcing relationship among reward shaping, exploration, and policy enhancement through tight feedback. Experiments show that PoRSE is highly effective, achieving significant improvements in average returns across all robotic tasks compared to previous state-of-the-art methods. It also achieves initial success on two highly challenging manipulation tasks, marking a significant breakthrough.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes PoRSE, a unified framework that combines LLM-driven reward shaping with task-aware exploration bonuses for robotic skill learning. It resides in the Code-Based Reward Synthesis leaf, which contains five papers including the original work. This leaf focuses on approaches where LLMs generate executable reward code through zero-shot or evolutionary optimization without human demonstrations. The presence of five papers in this specific leaf suggests a moderately active research direction within the broader taxonomy of fifty papers across approximately thirty-six topics, indicating neither extreme crowding nor isolation.

The taxonomy reveals that Code-Based Reward Synthesis sits within the larger LLM-Driven Reward Function Generation branch, which also includes Self-Refinement and Iterative Improvement (three papers), Reward Learning from Alternative Modalities (two papers), and Domain-Specific Reward Design (three papers). Neighboring branches address VLM-Based Reward Learning (seven papers across three leaves), LLM-Guided Task Structuring (five papers), and Exploration Guidance (two papers). The scope note for Code-Based Reward Synthesis explicitly excludes demonstration-assisted or video-to-reward approaches, while the Exploration Guidance branch exists as a separate category, suggesting that unifying these two aspects—as PoRSE attempts—crosses traditional categorical boundaries within the field.

Among thirty candidates examined through limited semantic search, none were found to clearly refute any of the three core contributions. The PoRSE framework itself was assessed against ten candidates with zero refutable matches. Similarly, the in-policy improvement grounding process and the task-aware affordance state space each faced ten candidates without clear prior overlap. These statistics suggest that within the examined scope, the specific combination of policy-grounded reward-exploration co-design appears relatively unexplored, though the limited search scale (thirty candidates from a potentially much larger literature) means substantial prior work could exist beyond the top-K semantic matches retrieved.

Based on the limited literature search covering thirty semantically similar papers, the work appears to occupy a distinctive position by bridging reward synthesis and exploration guidance—two areas typically treated separately in the taxonomy. However, the analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of relevant prior work in adjacent communities such as intrinsic motivation research or hierarchical RL that may not surface in semantic search focused on LLM-based methods.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: LLM-based reward shaping and exploration for robotic skill learning. The field has organized itself around several complementary directions. LLM-Driven Reward Function Generation focuses on synthesizing executable reward code from natural language or task descriptions, often leveraging iterative refinement and code-based representations (e.g., Language to Rewards[1], Eureka[3]). VLM-Based Reward Learning and Shaping exploits vision-language models to provide dense feedback from visual observations, enabling zero-shot or few-shot reward signals. LLM-Guided Task Structuring and Curriculum Learning emphasizes hierarchical decomposition and progressive skill sequencing, while Exploration Guidance and Bonus Design targets intrinsic motivation and curiosity-driven mechanisms. Integrated LLM-RL Frameworks and Co-Design studies the joint optimization of language-driven reward models and policy learning loops, and LLM-Enhanced Planning and Reasoning for Manipulation addresses high-level task planning interleaved with low-level control. Multi-Agent and Cooperative Learning with LLMs, Specialized Applications and Embodied Reasoning, and Foundational and Survey Works (e.g., LLM-RL Survey[4]) round out the taxonomy by covering collaborative settings, domain-specific challenges, and broad overviews.

A particularly active line of work centers on code-based reward synthesis, where methods like Eureka[3] and Text2Reward[5] generate Python reward functions that are iteratively refined through execution feedback. Policy-Grounded Synergy[0] sits squarely in this branch, emphasizing the interplay between policy rollouts and reward code updates to achieve tighter alignment with task semantics. Compared to Language to Rewards[1], which pioneered translating language into reward specifications, Policy-Grounded Synergy[0] places greater emphasis on grounding the synthesis process in actual policy behavior rather than relying solely on linguistic priors. Meanwhile, Reward Design LM[7] explores similar iterative refinement but with different prompting strategies.

Across these branches, key trade-offs emerge between the expressiveness of code-based rewards, the sample efficiency of VLM-based shaping, and the scalability of curriculum-driven decomposition, leaving open questions about how best to combine these complementary signals in complex, long-horizon robotic tasks.

Claimed Contributions

PoRSE framework for unified reward shaping and exploration

The authors introduce PoRSE, a framework that leverages LLMs to automatically generate both goal-oriented reward functions and an abstract affordance state space for task-relevant exploration. This unified approach addresses the limitation of prior LLM-based methods that focus solely on goal-oriented rewards while neglecting efficient state exploration.

10 retrieved papers
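The report describes PoRSE as pairing an LLM-generated goal reward with a curiosity bonus computed over an abstract affordance space. A minimal sketch of how such a combined signal could look, where the geometry, the count-based bonus, and all function names are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

def task_reward(state):
    """Goal-oriented shaping term: negative distance of the end effector
    (state[:3]) to a hypothetical target position."""
    target = np.array([0.5, 0.0, 0.2])
    return -np.linalg.norm(state[:3] - target)

def affordance_map(state):
    """Stand-in for an LLM-built affordance projection: gripper-object gap
    and object height, extracted from a higher-dimensional observation."""
    return np.array([np.linalg.norm(state[:3] - state[3:6]), state[5]])

class CountBonus:
    """Simple count-based curiosity bonus over the discretized affordance space."""
    def __init__(self, bin_size=0.05):
        self.bin_size = bin_size
        self.counts = {}

    def __call__(self, phi):
        key = tuple(np.floor(phi / self.bin_size).astype(int))
        self.counts[key] = self.counts.get(key, 0) + 1
        return 1.0 / np.sqrt(self.counts[key])

def shaped_reward(state, bonus, beta=0.1):
    """PoRSE-style combined signal: task reward plus weighted exploration
    bonus computed in the low-dimensional affordance space."""
    return task_reward(state) + beta * bonus(affordance_map(state))
```

Revisiting the same affordance cell makes the bonus decay, so the exploration pressure concentrates on task-relevant states the policy has not yet covered.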
In-policy improvement grounding process (IPG)

The authors develop an in-policy improvement grounding process that uses real-time policy feedback to guide LLMs in refining reward-bonus configurations and dynamically balancing their trade-offs. This process includes an LLM-bootstrapping elimination-expansion filtering mechanism and policy fusion approach, avoiding the need to retrain policies from scratch for each configuration.

10 retrieved papers
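The elimination-expansion idea can be sketched as a small selection loop running alongside training. Everything below is a stand-in assumption: `propose_variants` replaces the LLM's reward-code proposals, `evaluate` replaces in-training policy feedback, and configurations are scalars rather than reward programs:

```python
import random

random.seed(0)

def propose_variants(config, k=2):
    """Stand-in for the LLM proposing perturbed reward-bonus configurations;
    here a config is just a scalar weight."""
    return [config + random.uniform(-0.5, 0.5) for _ in range(k)]

def evaluate(config):
    """Stand-in for continuing policy training under `config` and measuring
    the resulting average return (toy objective peaked at config = 3.0)."""
    return -abs(config - 3.0)

def elimination_expansion(initial, rounds=20, pool_size=4):
    """Keep a small pool of configs; each round, score them with in-training
    policy feedback, drop the worst half (elimination), and expand the
    survivors with fresh variants instead of retraining from scratch."""
    pool = [initial] + propose_variants(initial, pool_size - 1)
    for _ in range(rounds):
        survivors = sorted(pool, key=evaluate, reverse=True)[: pool_size // 2]
        pool = survivors + [v for s in survivors for v in propose_variants(s, 1)]
    return max(pool, key=evaluate)

best = elimination_expansion(0.0)
```

The key property this toy loop shares with the described IPG process is that candidate configurations are scored against the same ongoing policy-improvement run, so poor reward-bonus pairs are discarded cheaply.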
Task-aware affordance state space for exploration bonuses

The authors propose using LLMs to automatically construct a low-dimensional affordance state space that maps high-dimensional environmental states to task-relevant dimensions. This enables curiosity-driven exploration bonuses that are tightly aligned with task objectives, improving exploration efficiency compared to generic exploration methods.

10 retrieved papers
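To illustrate why a low-dimensional affordance space helps, the toy comparison below tallies visit counts over a 30-dimensional raw observation versus a 2-dimensional projection of it; the projection, the noise model, and the chosen dimensions are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def affordance_map(state):
    """Hypothetical LLM-built projection: keep only two task-relevant
    coordinates out of a 30-dimensional raw observation."""
    return state[[0, 5]]

def novelty(counts, key):
    """Count-based bonus: 1/sqrt(visit count) for the discretized key."""
    counts[key] = counts.get(key, 0) + 1
    return 1.0 / np.sqrt(counts[key])

raw_counts, aff_counts = {}, {}
raw_total, aff_total = 0.0, 0.0
for _ in range(1000):
    # Task-relevant dims stay in a narrow band; the other 28 are noise.
    state = rng.normal(size=30)
    state[[0, 5]] = np.round(state[[0, 5]] * 0.2, 1)
    raw_total += novelty(raw_counts, tuple(np.round(state, 1)))
    aff_total += novelty(aff_counts, tuple(affordance_map(state)))
```

In the raw space nearly every state is "new", so the bonus never decays and rewards task-irrelevant noise; in the affordance space repeated task situations share bins, so the bonus decays and steers exploration toward genuinely novel task states.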

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: PoRSE framework for unified reward shaping and exploration

Contribution: In-policy improvement grounding process (IPG)

Contribution: Task-aware affordance state space for exploration bonuses