Master Skill Learning with Policy-Grounded Synergy of LLM-based Reward Shaping and Exploring

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Robot Skill Acquisition, Dexterous Manipulation, Automatic Reward Design
Abstract:

The acquisition of robotic skills via reinforcement learning (RL) is crucial for advancing embodied intelligence, but designing effective reward functions for complex tasks remains challenging. Recent methods using large language models (LLMs) can generate reward functions from language instructions, but they often produce overly goal-oriented rewards that neglect state exploration, causing robots to get stuck in local optima. Traditional RL addresses this by adding exploration bonuses, but these are typically generic and inefficient, wasting resources on exploring task-irrelevant areas. To address these limitations, we propose Policy-grounded Synergy of Reward Shaping and Exploration (PoRSE), a novel and unified framework that guides LLMs to generate task-aware reward functions while constructing an abstract affordance space for efficient exploration bonuses. Given the vast number of possible reward-bonus combinations, it is impractical to exhaustively train a policy from scratch for each configuration to identify the best one. Instead, PoRSE employs an in-policy-improvement grounding process that continuously generates and filters reward-bonus pairs throughout policy improvement. This approach accelerates skill acquisition and fosters a mutually reinforcing relationship among reward shaping, exploration, and policy enhancement through tight feedback. Experiments show that PoRSE is highly effective, achieving significant improvements in average returns across all robotic tasks compared to previous state-of-the-art methods. It also achieves initial success on two highly challenging manipulation tasks, marking a significant breakthrough.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes PoRSE, a unified framework that combines LLM-driven reward shaping with task-aware exploration bonuses for robotic skill learning. It resides in the Code-Based Reward Synthesis leaf, which contains five papers including the original work. This leaf focuses on approaches where LLMs generate executable reward code through zero-shot or evolutionary optimization without human demonstrations. The presence of five papers in this specific leaf suggests a moderately active research direction within the broader taxonomy of fifty papers across approximately thirty-six topics, indicating neither extreme crowding nor isolation.

The taxonomy reveals that Code-Based Reward Synthesis sits within the larger LLM-Driven Reward Function Generation branch, which also includes Self-Refinement and Iterative Improvement (three papers), Reward Learning from Alternative Modalities (two papers), and Domain-Specific Reward Design (three papers). Neighboring branches address VLM-Based Reward Learning (seven papers across three leaves), LLM-Guided Task Structuring (five papers), and Exploration Guidance (two papers). The scope note for Code-Based Reward Synthesis explicitly excludes demonstration-assisted or video-to-reward approaches, while the Exploration Guidance branch exists as a separate category, suggesting that unifying these two aspects—as PoRSE attempts—crosses traditional categorical boundaries within the field.

Among thirty candidates examined through limited semantic search, none were found to clearly refute any of the three core contributions. The PoRSE framework itself was assessed against ten candidates with zero refutable matches. Similarly, the in-policy improvement grounding process and the task-aware affordance state space each faced ten candidates without clear prior overlap. These statistics suggest that within the examined scope, the specific combination of policy-grounded reward-exploration co-design appears relatively unexplored, though the limited search scale (thirty candidates from a potentially much larger literature) means substantial prior work could exist beyond the top-K semantic matches retrieved.

Based on the limited literature search covering thirty semantically similar papers, the work appears to occupy a distinctive position by bridging reward synthesis and exploration guidance—two areas typically treated separately in the taxonomy. However, the analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of relevant prior work in adjacent communities such as intrinsic motivation research or hierarchical RL that may not surface in semantic search focused on LLM-based methods.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: LLM-based reward shaping and exploration for robotic skill learning. The field has organized itself around several complementary directions. LLM-Driven Reward Function Generation focuses on synthesizing executable reward code from natural language or task descriptions, often leveraging iterative refinement and code-based representations (e.g., Language to Rewards[1], Eureka[3]). VLM-Based Reward Learning and Shaping exploits vision-language models to provide dense feedback from visual observations, enabling zero-shot or few-shot reward signals. LLM-Guided Task Structuring and Curriculum Learning emphasizes hierarchical decomposition and progressive skill sequencing, while Exploration Guidance and Bonus Design targets intrinsic motivation and curiosity-driven mechanisms. Integrated LLM-RL Frameworks and Co-Design studies the joint optimization of language-driven reward models and policy learning loops, and LLM-Enhanced Planning and Reasoning for Manipulation addresses high-level task planning interleaved with low-level control. Multi-Agent and Cooperative Learning with LLMs, Specialized Applications and Embodied Reasoning, and Foundational and Survey Works (e.g., LLM-RL Survey[4]) round out the taxonomy by covering collaborative settings, domain-specific challenges, and broad overviews.

A particularly active line of work centers on code-based reward synthesis, where methods like Eureka[3] and Text2Reward[5] generate Python reward functions that are iteratively refined through execution feedback. Policy-Grounded Synergy[0] sits squarely in this branch, emphasizing the interplay between policy rollouts and reward code updates to achieve tighter alignment with task semantics. Compared to Language to Rewards[1], which pioneered translating language into reward specifications, Policy-Grounded Synergy[0] places greater emphasis on grounding the synthesis process in actual policy behavior rather than relying solely on linguistic priors. Meanwhile, Reward Design LM[7] explores similar iterative refinement but with different prompting strategies.

Across these branches, key trade-offs emerge between the expressiveness of code-based rewards, the sample efficiency of VLM-based shaping, and the scalability of curriculum-driven decomposition, leaving open questions about how best to combine these complementary signals in complex, long-horizon robotic tasks.

Claimed Contributions

PoRSE framework for unified reward shaping and exploration

The authors introduce PoRSE, a framework that leverages LLMs to automatically generate both goal-oriented reward functions and an abstract affordance state space for task-relevant exploration. This unified approach addresses the limitation of prior LLM-based methods that focus solely on goal-oriented rewards while neglecting efficient state exploration.

10 retrieved papers
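The report describes PoRSE as pairing an LLM-generated goal reward with a curiosity bonus computed over an abstract affordance space. A minimal sketch of how such a combined signal could look, where the geometry, the count-based bonus, and all function names are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

def task_reward(state):
    """Goal-oriented shaping term: negative distance of the end effector
    (state[:3]) to a hypothetical target position."""
    target = np.array([0.5, 0.0, 0.2])
    return -np.linalg.norm(state[:3] - target)

def affordance_map(state):
    """Stand-in for an LLM-built affordance projection: gripper-object gap
    and object height, extracted from a higher-dimensional observation."""
    return np.array([np.linalg.norm(state[:3] - state[3:6]), state[5]])

class CountBonus:
    """Simple count-based curiosity bonus over the discretized affordance space."""
    def __init__(self, bin_size=0.05):
        self.bin_size = bin_size
        self.counts = {}

    def __call__(self, phi):
        key = tuple(np.floor(phi / self.bin_size).astype(int))
        self.counts[key] = self.counts.get(key, 0) + 1
        return 1.0 / np.sqrt(self.counts[key])

def shaped_reward(state, bonus, beta=0.1):
    """PoRSE-style combined signal: task reward plus weighted exploration
    bonus computed in the low-dimensional affordance space."""
    return task_reward(state) + beta * bonus(affordance_map(state))
```

Revisiting the same affordance cell makes the bonus decay, so the exploration pressure concentrates on task-relevant states the policy has not yet covered.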
In-policy improvement grounding process (IPG)

The authors develop an in-policy improvement grounding process that uses real-time policy feedback to guide LLMs in refining reward-bonus configurations and dynamically balancing their trade-offs. This process includes an LLM-bootstrapping elimination-expansion filtering mechanism and policy fusion approach, avoiding the need to retrain policies from scratch for each configuration.

10 retrieved papers
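The elimination-expansion idea can be sketched as a small selection loop running alongside training. Everything below is a stand-in assumption: `propose_variants` replaces the LLM's reward-code proposals, `evaluate` replaces in-training policy feedback, and configurations are scalars rather than reward programs:

```python
import random

random.seed(0)

def propose_variants(config, k=2):
    """Stand-in for the LLM proposing perturbed reward-bonus configurations;
    here a config is just a scalar weight."""
    return [config + random.uniform(-0.5, 0.5) for _ in range(k)]

def evaluate(config):
    """Stand-in for continuing policy training under `config` and measuring
    the resulting average return (toy objective peaked at config = 3.0)."""
    return -abs(config - 3.0)

def elimination_expansion(initial, rounds=20, pool_size=4):
    """Keep a small pool of configs; each round, score them with in-training
    policy feedback, drop the worst half (elimination), and expand the
    survivors with fresh variants instead of retraining from scratch."""
    pool = [initial] + propose_variants(initial, pool_size - 1)
    for _ in range(rounds):
        survivors = sorted(pool, key=evaluate, reverse=True)[: pool_size // 2]
        pool = survivors + [v for s in survivors for v in propose_variants(s, 1)]
    return max(pool, key=evaluate)

best = elimination_expansion(0.0)
```

The key property this toy loop shares with the described IPG process is that candidate configurations are scored against the same ongoing policy-improvement run, so poor reward-bonus pairs are discarded cheaply.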
Task-aware affordance state space for exploration bonuses

The authors propose using LLMs to automatically construct a low-dimensional affordance state space that maps high-dimensional environmental states to task-relevant dimensions. This enables curiosity-driven exploration bonuses that are tightly aligned with task objectives, improving exploration efficiency compared to generic exploration methods.

10 retrieved papers
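To illustrate why a low-dimensional affordance space helps, the toy comparison below tallies visit counts over a 30-dimensional raw observation versus a 2-dimensional projection of it; the projection, the noise model, and the chosen dimensions are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def affordance_map(state):
    """Hypothetical LLM-built projection: keep only two task-relevant
    coordinates out of a 30-dimensional raw observation."""
    return state[[0, 5]]

def novelty(counts, key):
    """Count-based bonus: 1/sqrt(visit count) for the discretized key."""
    counts[key] = counts.get(key, 0) + 1
    return 1.0 / np.sqrt(counts[key])

raw_counts, aff_counts = {}, {}
raw_total, aff_total = 0.0, 0.0
for _ in range(1000):
    # Task-relevant dims stay in a narrow band; the other 28 are noise.
    state = rng.normal(size=30)
    state[[0, 5]] = np.round(state[[0, 5]] * 0.2, 1)
    raw_total += novelty(raw_counts, tuple(np.round(state, 1)))
    aff_total += novelty(aff_counts, tuple(affordance_map(state)))
```

In the raw space nearly every state is "new", so the bonus never decays and rewards task-irrelevant noise; in the affordance space repeated task situations share bins, so the bonus decays and steers exploration toward genuinely novel task states.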

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: PoRSE framework for unified reward shaping and exploration

Contribution: In-policy improvement grounding process (IPG)

Contribution: Task-aware affordance state space for exploration bonuses