Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement Learning, LLM Agent, Exploration
Abstract:

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose EMPO², a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates so that LLMs perform well with memory while remaining robust without it. On ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO² demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EMPO², a hybrid reinforcement learning framework combining memory-augmented exploration with on- and off-policy optimization for LLM agents. It resides in the 'On-Policy and Hybrid RL Optimization' leaf under 'Policy Optimization Frameworks', alongside two sibling papers (Multi Agent Reflection and TROLL). This leaf contains only three papers within a taxonomy of 50 total works across 26 leaf nodes, suggesting a moderately sparse research direction. The framework targets exploration bottlenecks in environments requiring novel state discovery, positioning itself at the intersection of policy optimization and exploration strategy design.

The taxonomy reveals that EMPO² sits adjacent to several related but distinct research directions. Neighboring leaves include 'Experience Replay and Off-Policy Learning' (2 papers) and 'Structured Policy Optimization' (2 papers), while the broader 'Exploration Strategy Design' branch contains 15 papers across four subcategories—LLM-guided exploration, tree search methods, intrinsic motivation, and exploration-exploitation balance. The paper's hybrid on-off-policy approach bridges these areas: it shares on-policy stability concerns with its siblings while incorporating off-policy efficiency mechanisms more common in the experience replay cluster. Its memory mechanism connects conceptually to intrinsic motivation methods, though the taxonomy explicitly separates LLM-guided exploration from curiosity-driven approaches.

Among three identified contributions, the literature search examined 19 candidates total. The core EMPO² framework (8 candidates examined, 0 refutable) and hybrid optimization mechanism (1 candidate examined, 0 refutable) show no clear prior overlap within this limited scope. However, the self-generated memory mechanism for exploration (10 candidates examined, 1 refutable) encounters at least one overlapping work among the candidates reviewed. This suggests the memory-augmented exploration component may have more substantial precedent, while the hybrid optimization strategy combining on- and off-policy updates appears less directly anticipated in the examined literature. The analysis reflects top-K semantic search results, not exhaustive coverage.

Based on the limited search scope of 19 candidates, EMPO² appears to occupy a relatively sparse position within hybrid RL optimization for LLM agents, though its memory mechanism shows some overlap with prior exploration work. The taxonomy structure indicates this is an active but not overcrowded research direction, with the paper's combination of memory, hybrid optimization, and exploration focus distinguishing it from immediate neighbors. A more comprehensive literature review would be needed to assess novelty conclusively, particularly regarding memory-augmented exploration techniques outside the top-K semantic matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 1

Research Landscape Overview

Core task: Exploration in reinforcement learning for large language model agents. The field organizes itself around several complementary dimensions. At the highest level, one finds branches dedicated to exploration strategy design and guidance—methods that explicitly shape how agents discover novel states or actions—alongside policy optimization frameworks that determine how agents update their behavior from collected experience. Reward design and shaping addresses the challenge of defining meaningful feedback signals, while trajectory and sample management focuses on efficiently selecting and reusing interaction data. Task-specific applications span domains from interactive environments to reasoning tasks, and theoretical foundations provide formal guarantees. Surveys and optimization overviews synthesize progress, while adversarial and collaborative branches address security concerns and multi-agent coordination. Representative works such as ExploRLLM[15] and Efficient Exploration[10] illustrate targeted exploration mechanisms, whereas AgentGym[31] and Long Horizon Interactive[2] demonstrate large-scale interactive training setups.

Within policy optimization frameworks, a particularly active line of work examines on-policy and hybrid methods that balance sample efficiency with stable learning. Some approaches, like TROLL[24] and Multi Agent Reflection[21], emphasize iterative refinement and reflection mechanisms to guide exploration without straying too far from the current policy. In contrast, methods such as Tree Search Agents[4] and TreeRL[9] incorporate planning-based lookahead to expand the search frontier more aggressively. The Exploratory Memory Agent[0] sits within this on-policy and hybrid cluster, sharing an emphasis on structured exploration with neighbors like Multi Agent Reflection[21] and TROLL[24], yet it distinguishes itself by leveraging memory mechanisms to retain and reuse exploratory insights across episodes. This design choice addresses a common trade-off: while reflection-based methods can be sample-efficient within a single episode, memory-augmented approaches aim to transfer exploration knowledge across tasks, potentially accelerating long-horizon learning at the cost of additional architectural complexity.

Claimed Contributions

EMPO²: Exploratory Memory-Augmented On- and Off-Policy Optimization framework

A novel reinforcement learning framework that integrates both parametric policy updates and non-parametric memory updates to enhance exploration in LLM agents. The framework operates in dual modes during rollout (with and without memory) and update phases (on-policy and off-policy learning), enabling agents to leverage memory during training while remaining robust without it at inference time.
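The dual-mode structure described above can be made concrete with a toy sketch. This is an illustrative reconstruction, not the paper's implementation: the `Memory` class, the `rollout` and `train_step` helpers, and the scalar "skill" stand-in for policy parameters are all assumptions introduced here for clarity.

```python
import random

class Memory:
    """Non-parametric store of self-generated exploration tips."""
    def __init__(self):
        self.tips = []
    def add(self, tip):
        self.tips.append(tip)
    def retrieve(self, k=3):
        return self.tips[-k:]  # most recent tips

def rollout(policy_params, tips):
    """Toy environment interaction: tips shift rewards upward."""
    bias = 0.1 * len(tips)
    reward = min(1.0, policy_params["skill"] + bias + random.random() * 0.1)
    return {"tips": list(tips), "reward": reward}

def train_step(policy_params, memory):
    # Dual-mode rollout: once with memory (exploratory) and once
    # without (keeping the bare policy robust at inference time).
    traj_with = rollout(policy_params, memory.retrieve())
    traj_without = rollout(policy_params, [])

    # On-policy update from the memory-free rollout.
    policy_params["skill"] += 0.01 * traj_without["reward"]

    # Off-policy update: fold the high-reward memory-augmented
    # trajectory back into the base policy.
    if traj_with["reward"] > traj_without["reward"]:
        policy_params["skill"] += 0.05 * traj_with["reward"]

    # Non-parametric update: store a reflective tip for later episodes.
    memory.add(f"episode reward was {traj_with['reward']:.2f}")
    return policy_params

random.seed(0)
params, mem = {"skill": 0.2}, Memory()
for _ in range(10):
    params = train_step(params, mem)
```

The key design point mirrored here is that the parametric update always consumes a memory-free context, so nothing at test time depends on the memory being present.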

8 retrieved papers
Self-generated memory mechanism for exploration

A non-parametric update mechanism where the policy itself generates reflective tips from past trajectory rollouts and stores them in external memory. These self-generated tips are retrieved and used to condition subsequent rollouts, promoting exploration by maintaining continuity across episodes and helping agents discover novel states.
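A minimal sketch of this reflect-store-retrieve loop follows. The `reflect` function standing in for the LLM's self-reflection and the `condition_prompt` formatting are hypothetical placeholders, not the paper's actual prompts.

```python
def reflect(trajectory):
    """Policy-generated tip summarizing what a past rollout revealed."""
    last_state = trajectory["states"][-1]
    return f"visiting {last_state} gave reward {trajectory['reward']}"

def condition_prompt(task, tips):
    """Prepend retrieved tips so the next rollout starts from past insight."""
    header = "\n".join(f"Tip: {t}" for t in tips)
    return f"{header}\nTask: {task}" if tips else f"Task: {task}"

memory = []
trajectory = {"states": ["kitchen", "greenhouse"], "reward": 0.8}
memory.append(reflect(trajectory))              # non-parametric update
prompt = condition_prompt("grow a plant", memory[-3:])
```

Because the tips live in an external store rather than in the weights, continuity across episodes costs no gradient steps.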

10 retrieved papers (1 refutable)
Hybrid on- and off-policy optimization with reward-guided knowledge distillation

A training approach that combines on-policy updates (retaining memory tips) with off-policy updates (removing tips during parameter updates). The off-policy mode functions as reward-guided knowledge distillation, where high-reward memory-augmented trajectories serve as teacher demonstrations that are selectively distilled into the base policy, internalizing exploration benefits without requiring memory at test time.
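The distillation step described above can be sketched as a filter-and-strip pipeline. This is a hedged illustration under assumed names (`strip_tips`, `select_teachers`, a fixed reward threshold); the paper's actual selection criterion and loss are not reproduced here.

```python
def strip_tips(context):
    """Remove memory tips so the student sees a memory-free prompt."""
    return [line for line in context if not line.startswith("Tip:")]

def select_teachers(trajectories, threshold=0.7):
    """Keep only high-reward memory-augmented rollouts as teachers."""
    return [t for t in trajectories if t["reward"] >= threshold]

trajectories = [
    {"context": ["Tip: water first", "Task: grow"],
     "actions": ["water"], "reward": 0.9},
    {"context": ["Tip: dig", "Task: grow"],
     "actions": ["dig"], "reward": 0.3},
]

# Distillation batch: teacher actions paired with tip-free contexts,
# so the base policy internalizes the behavior without the memory.
batch = [
    {"context": strip_tips(t["context"]), "actions": t["actions"]}
    for t in select_teachers(trajectories)
]
```

Reward-guided selection plays the role of the teacher filter, and stripping the tips from the context is what makes the resulting update "off-policy" with respect to the memory-free student.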

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
