Meta-RL Induces Exploration in Language Agents

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Agent, Reinforcement Learning, Meta Learning
Abstract:

Reinforcement learning (RL) has enabled the training of Large Language Model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing agents to adapt their policies from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer demonstrates better generalization to more challenging or previously unseen tasks than RL-trained agents. Overall, our results show that meta-reinforcement learning provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LaMer, a meta-reinforcement learning framework that combines cross-episode training with in-context policy adaptation via reflection to improve exploration in LLM agents. According to the taxonomy, this work resides in the 'Cross-Episode Meta-RL with In-Context Adaptation' leaf, which contains only two papers total: the original submission and one sibling work (LLM In-Context Exploration). This positioning suggests the paper targets a relatively sparse but emerging research direction within the broader meta-RL for language agents landscape, which encompasses fourteen papers across multiple branches addressing exploration, curiosity mechanisms, and specialized applications.

The taxonomy reveals that neighboring research directions include meta-learned exploration policies using decision transformers, curiosity algorithm search via program synthesis, and reflective memory systems for reusable meta-policies. The original paper's leaf explicitly excludes methods lacking cross-episode training or in-context adaptation mechanisms, distinguishing it from branches focused on emergent exploration through exploitation-only objectives or offline meta-RL with natural language supervision. This structural context indicates the work bridges meta-RL training paradigms with LLM in-context learning capabilities, occupying a niche between traditional meta-RL frameworks and pure prompt-based adaptation approaches that do not employ cross-episode optimization.

Among the three contributions analyzed, the literature search examined twenty-seven candidates total, identifying refutable prior work for each component. The core LaMer framework (ten candidates examined, one refutable) and cross-episode training with trajectory discounting (seven candidates, one refutable) each show limited overlap within the search scope. In-context policy adaptation via self-reflection (ten candidates, two refutable) appears to have more substantial prior work among the examined papers. These statistics reflect a targeted semantic search rather than exhaustive coverage, suggesting that while some overlap exists in the examined subset, the specific combination of cross-episode meta-RL with reflection-based adaptation may represent a novel integration within the limited candidate pool.

Based on the analysis of twenty-seven candidates from top-K semantic search, the work appears to occupy a sparsely populated research direction with one closely related sibling paper in its taxonomy leaf. The contribution-level statistics indicate varying degrees of prior work across components, with reflection-based adaptation showing more overlap than the cross-episode training mechanism. However, the limited search scope means this assessment captures only a snapshot of the most semantically similar work, not a comprehensive field survey, and the novelty evaluation remains contingent on this bounded literature examination.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 4

Research Landscape Overview

Core task: inducing exploration in language agents through meta-reinforcement learning. The field structure reflects diverse approaches to enabling language-based agents to explore effectively across tasks. At the highest level, the taxonomy distinguishes between frameworks that directly integrate meta-RL with language agent architectures, methods that meta-learn exploration strategies or curiosity mechanisms, and specialized applications such as recommendation systems or domain-specific deployments. Some branches focus on offline meta-RL with natural language supervision, while others address the balance between imitation and exploration or develop reflective memory systems that allow agents to reuse meta-policies. Representative works span from foundational meta-RL text game environments like Meta-RL Text Games[8] to more recent efforts such as MetaEvo-Rec[2] for content discovery and Text-to-Decision Agent[3] for offline decision-making with language supervision.

A particularly active line of work centers on cross-episode meta-RL with in-context adaptation, where agents leverage large language models to rapidly adjust exploration behavior based on accumulated experience within and across episodes. Meta-RL Language Exploration[0] exemplifies this direction by combining meta-reinforcement learning with in-context learning to induce exploratory behavior in language agents. This approach contrasts with LLM In-Context Exploration[1], which similarly exploits in-context mechanisms but may differ in how meta-level policies are structured or updated. Meanwhile, methods like Meta-Learning Curiosity[11] and Exploitation for Exploration[10] emphasize learning intrinsic motivation signals, and Meta-Policy Reflexion[12] integrates reflective memory to enable reusable exploration strategies.
The original paper sits within the cross-episode adaptation cluster, sharing the emphasis on in-context learning with LLM In-Context Exploration[1] while potentially offering distinct meta-RL formulations or exploration incentives that differentiate its contribution from closely related contemporaries.

Claimed Contributions

LaMer: A meta-RL framework for training LLM agents

The authors propose LaMer, a meta-reinforcement learning framework for training large language model agents. The framework enables agents to balance exploration and exploitation across multiple episodes, actively explore environments, and adapt their policies at test time without gradient updates.

10 retrieved papers
Can Refute
Cross-episode training framework with trajectory discount factor

The authors introduce a training scheme that optimizes rewards across multiple sequential episodes rather than single episodes. This approach uses a trajectory discount factor to assign credit across episodes, encouraging the agent to explore in early episodes and exploit gathered information in later ones.

7 retrieved papers
Can Refute
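The trajectory-discounting idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the backward-recursion form, and the default discount value are all hypothetical.

```python
def cross_episode_returns(episode_rewards, traj_discount=0.9):
    """Discounted returns over a sequence of per-episode rewards.

    Credit flows backward across episodes: reward earned in a later
    episode contributes to the return of every earlier one, shrunk by
    traj_discount per episode of separation. This makes unrewarded
    exploration in early episodes worthwhile whenever it enables a
    later success.
    """
    returns = []
    running = 0.0
    for r in reversed(episode_rewards):  # accumulate from the last episode back
        running = r + traj_discount * running
        returns.append(running)
    returns.reverse()  # restore original episode order
    return returns
```

With rewards `[0.0, 0.0, 1.0]` and a discount of 0.5, the first, purely exploratory episode still receives a return of 0.25, so a policy-gradient update would reinforce the exploration that led to the eventual success.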
In-context policy adaptation via self-reflection

The authors develop a mechanism where the agent generates textual reflections after each episode to summarize experiences and adjust strategy. This enables policy adaptation through context modification rather than parameter updates, naturally leveraging the in-context learning capabilities of large language models.

10 retrieved papers
Can Refute
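The reflection loop described above can be sketched as follows. The `ToyAgent` class and its success rule are hypothetical stand-ins for an LLM agent; only the loop structure (act, reflect, append to context, repeat) mirrors the mechanism described.

```python
class ToyAgent:
    """Hypothetical stand-in for an LLM agent: it succeeds once its
    context contains a reflection mentioning the goal."""

    def act_episode(self, context):
        # An LLM would condition its actions on the accumulated context;
        # here, success is simulated by checking for a useful reflection.
        if any("goal" in note for note in context):
            return ["exploit known path"], 1.0
        return ["explore"], 0.0

    def reflect(self, trajectory, reward):
        # An LLM would summarize the trajectory in natural language.
        if reward == 0.0:
            return "failed, but observed the goal in the north room"
        return "reused the known path to the goal"


def run_with_reflection(agent, num_episodes=3):
    """Policy adaptation via context modification: after each episode,
    a textual reflection is appended to the context consumed by the
    next episode. No model parameters are updated."""
    context, rewards = [], []
    for _ in range(num_episodes):
        trajectory, reward = agent.act_episode(context)
        context.append(agent.reflect(trajectory, reward))
        rewards.append(reward)
    return rewards, context
```

In this toy run, the first episode fails, its reflection carries the missing information forward, and subsequent episodes succeed purely through the enriched context, illustrating gradient-free adaptation.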

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

LaMer: A meta-RL framework for training LLM agents

The authors propose LaMer, a meta-reinforcement learning framework for training large language model agents. The framework enables agents to balance exploration and exploitation across multiple episodes, actively explore environments, and adapt their policies at test time without gradient updates.

Contribution

Cross-episode training framework with trajectory discount factor

The authors introduce a training scheme that optimizes rewards across multiple sequential episodes rather than single episodes. This approach uses a trajectory discount factor to assign credit across episodes, encouraging the agent to explore in early episodes and exploit gathered information in later ones.

Contribution

In-context policy adaptation via self-reflection

The authors develop a mechanism where the agent generates textual reflections after each episode to summarize experiences and adjust strategy. This enables policy adaptation through context modification rather than parameter updates, naturally leveraging the in-context learning capabilities of large language models.