Meta-RL Induces Exploration in Language Agents

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Agent, Reinforcement Learning, Meta Learning
Abstract:

Reinforcement learning (RL) has enabled the training of Large Language Model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing agents to adapt their policies from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer demonstrates better generalization to more challenging or previously unseen tasks than RL-trained agents. Overall, our results show that meta-reinforcement learning provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LaMer, a meta-reinforcement learning framework that combines cross-episode training with in-context policy adaptation via reflection to improve exploration in LLM agents. According to the taxonomy, this work resides in the 'Cross-Episode Meta-RL with In-Context Adaptation' leaf, which contains only two papers total: the original submission and one sibling work (LLM In-Context Exploration). This positioning suggests the paper targets a relatively sparse but emerging research direction within the broader meta-RL for language agents landscape, which encompasses fourteen papers across multiple branches addressing exploration, curiosity mechanisms, and specialized applications.

The taxonomy reveals that neighboring research directions include meta-learned exploration policies using decision transformers, curiosity algorithm search via program synthesis, and reflective memory systems for reusable meta-policies. The original paper's leaf explicitly excludes methods lacking cross-episode training or in-context adaptation mechanisms, distinguishing it from branches focused on emergent exploration through exploitation-only objectives or offline meta-RL with natural language supervision. This structural context indicates the work bridges meta-RL training paradigms with LLM in-context learning capabilities, occupying a niche between traditional meta-RL frameworks and pure prompt-based adaptation approaches that do not employ cross-episode optimization.

Among the three contributions analyzed, the literature search examined twenty-seven candidates total, identifying refutable prior work for each component. The core LaMer framework (ten candidates examined, one refutable) and cross-episode training with trajectory discounting (seven candidates, one refutable) each show limited overlap within the search scope. In-context policy adaptation via self-reflection (ten candidates, two refutable) appears to have more substantial prior work among the examined papers. These statistics reflect a targeted semantic search rather than exhaustive coverage, suggesting that while some overlap exists in the examined subset, the specific combination of cross-episode meta-RL with reflection-based adaptation may represent a novel integration within the limited candidate pool.

Based on the analysis of twenty-seven candidates from top-K semantic search, the work appears to occupy a sparsely populated research direction with one closely related sibling paper in its taxonomy leaf. The contribution-level statistics indicate varying degrees of prior work across components, with reflection-based adaptation showing more overlap than the cross-episode training mechanism. However, the limited search scope means this assessment captures only a snapshot of the most semantically similar work, not a comprehensive field survey, and the novelty evaluation remains contingent on this bounded literature examination.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 4

Research Landscape Overview

Core task: inducing exploration in language agents through meta-reinforcement learning. The field structure reflects diverse approaches to enabling language-based agents to explore effectively across tasks. At the highest level, the taxonomy distinguishes between frameworks that directly integrate meta-RL with language agent architectures, methods that meta-learn exploration strategies or curiosity mechanisms, and specialized applications such as recommendation systems or domain-specific deployments. Some branches focus on offline meta-RL with natural language supervision, while others address the balance between imitation and exploration or develop reflective memory systems that allow agents to reuse meta-policies. Representative works span from foundational meta-RL text game environments like Meta-RL Text Games[8] to more recent efforts such as MetaEvo-Rec[2] for content discovery and Text-to-Decision Agent[3] for offline decision-making with language supervision.

A particularly active line of work centers on cross-episode meta-RL with in-context adaptation, where agents leverage large language models to rapidly adjust exploration behavior based on accumulated experience within and across episodes. Meta-RL Language Exploration[0] exemplifies this direction by combining meta-reinforcement learning with in-context learning to induce exploratory behavior in language agents. This approach contrasts with LLM In-Context Exploration[1], which similarly exploits in-context mechanisms but may differ in how meta-level policies are structured or updated. Meanwhile, methods like Meta-Learning Curiosity[11] and Exploitation for Exploration[10] emphasize learning intrinsic motivation signals, and Meta-Policy Reflexion[12] integrates reflective memory to enable reusable exploration strategies.
The original paper sits within the cross-episode adaptation cluster, sharing the emphasis on in-context learning with LLM In-Context Exploration[1] while potentially offering distinct meta-RL formulations or exploration incentives that differentiate its contribution from closely related contemporaries.

Claimed Contributions

LaMer: A meta-RL framework for training LLM agents

The authors propose LaMer, a meta-reinforcement learning framework for training large language model agents. The framework enables agents to balance exploration and exploitation across multiple episodes, actively explore environments, and adapt their policies at test time without gradient updates.

10 retrieved papers
Can Refute
Cross-episode training framework with trajectory discount factor

The authors introduce a training scheme that optimizes rewards across multiple sequential episodes rather than single episodes. This approach uses a trajectory discount factor to assign credit across episodes, encouraging the agent to explore in early episodes and exploit gathered information in later ones.

7 retrieved papers
Can Refute
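The trajectory-discounting idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the backward-recursion form, and the default discount value are all hypothetical.

```python
def cross_episode_returns(episode_rewards, traj_discount=0.9):
    """Discounted returns over a sequence of per-episode rewards.

    Credit flows backward across episodes: reward earned in a later
    episode contributes to the return of every earlier one, shrunk by
    traj_discount per episode of separation. This makes unrewarded
    exploration in early episodes worthwhile whenever it enables a
    later success.
    """
    returns = []
    running = 0.0
    for r in reversed(episode_rewards):  # accumulate from the last episode back
        running = r + traj_discount * running
        returns.append(running)
    returns.reverse()  # restore original episode order
    return returns
```

With rewards `[0.0, 0.0, 1.0]` and a discount of 0.5, the first, purely exploratory episode still receives a return of 0.25, so a policy-gradient update would reinforce the exploration that led to the eventual success.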
In-context policy adaptation via self-reflection

The authors develop a mechanism where the agent generates textual reflections after each episode to summarize experiences and adjust strategy. This enables policy adaptation through context modification rather than parameter updates, naturally leveraging the in-context learning capabilities of large language models.

10 retrieved papers
Can Refute
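The reflection loop described above can be sketched as follows. The `ToyAgent` class and its success rule are hypothetical stand-ins for an LLM agent; only the loop structure (act, reflect, append to context, repeat) mirrors the mechanism described.

```python
class ToyAgent:
    """Hypothetical stand-in for an LLM agent: it succeeds once its
    context contains a reflection mentioning the goal."""

    def act_episode(self, context):
        # An LLM would condition its actions on the accumulated context;
        # here, success is simulated by checking for a useful reflection.
        if any("goal" in note for note in context):
            return ["exploit known path"], 1.0
        return ["explore"], 0.0

    def reflect(self, trajectory, reward):
        # An LLM would summarize the trajectory in natural language.
        if reward == 0.0:
            return "failed, but observed the goal in the north room"
        return "reused the known path to the goal"


def run_with_reflection(agent, num_episodes=3):
    """Policy adaptation via context modification: after each episode,
    a textual reflection is appended to the context consumed by the
    next episode. No model parameters are updated."""
    context, rewards = [], []
    for _ in range(num_episodes):
        trajectory, reward = agent.act_episode(context)
        context.append(agent.reflect(trajectory, reward))
        rewards.append(reward)
    return rewards, context
```

In this toy run, the first episode fails, its reflection carries the missing information forward, and subsequent episodes succeed purely through the enriched context, illustrating gradient-free adaptation.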

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

LaMer: A meta-RL framework for training LLM agents

The authors propose LaMer, a meta-reinforcement learning framework for training large language model agents. The framework enables agents to balance exploration and exploitation across multiple episodes, actively explore environments, and adapt their policies at test time without gradient updates.

Contribution

Cross-episode training framework with trajectory discount factor

The authors introduce a training scheme that optimizes rewards across multiple sequential episodes rather than single episodes. This approach uses a trajectory discount factor to assign credit across episodes, encouraging the agent to explore in early episodes and exploit gathered information in later ones.

Contribution

In-context policy adaptation via self-reflection

The authors develop a mechanism where the agent generates textual reflections after each episode to summarize experiences and adjust strategy. This enables policy adaptation through context modification rather than parameter updates, naturally leveraging the in-context learning capabilities of large language models.