EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models · Agents · Test-Time Learning
Abstract:

A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time; they often behave like “clever but clueless interns” in novel environments, which severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup in which an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods such as reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. The revised configuration rewrites the prompt, updates memory by logging effective state–action choices, tunes hyperparameters, and refines tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
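As a concrete reading of the Actor/Evolver loop described above, here is a minimal, self-contained sketch. The function names (`run_episode`, `evolve_config`) and the toy scoring model are illustrative assumptions, not the paper's implementation; a real Actor and Evolver would each query an LLM.

```python
import random

def run_episode(config, seed):
    """Toy stand-in for the Actor Agent playing one game episode.
    The score grows with accumulated memory, mimicking in-session learning."""
    random.seed(seed)
    score = len(config["memory"]) * 10 + random.randint(0, 5)
    transcript = [f"step-{i}" for i in range(3)]  # placeholder game log
    return score, transcript

def evolve_config(config, transcript, score):
    """Toy stand-in for the Evolver Agent: read the transcript and
    propose a revised configuration for the next run."""
    return dict(config, memory=config["memory"] + [f"lesson-from-score-{score}"])

# One test session: the same game, played for several consecutive episodes.
config = {"prompt": "You are playing a text adventure.", "memory": [], "temperature": 0.7}
scores = []
for episode in range(4):
    score, transcript = run_episode(config, seed=episode)
    scores.append(score)
    config = evolve_config(config, transcript, score)
```

In this toy model the score curve rises because each episode adds a memory entry worth more than the noise term; the real system instead relies on transcript-level analysis to produce the next configuration.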

Disclaimer
This report was AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a test-time learning benchmark (J-TTL) and an evolutionary framework (EvoTest) for improving agents across consecutive episodes without gradient-based updates. It resides in the 'Self-Evolving and Introspective Agents' leaf, which contains only four papers total. This leaf sits within the broader 'Test-Time Adaptation and Evolution Mechanisms' branch, indicating a relatively sparse research direction compared to other areas like LLM-based agents or offline-to-online transfer. The small sibling set suggests this specific combination of evolutionary mechanisms and test-time learning is not yet heavily explored.

The taxonomy reveals neighboring leaves focused on structured test-time scaling via search and planning (five papers) and environment-grounded adaptation (four papers). EvoTest diverges from these by emphasizing whole-system evolution rather than fixed search protocols or environment-specific grounding signals. The broader 'Test-Time Adaptation' branch excludes offline pretraining methods, positioning this work as fundamentally about online improvement. The taxonomy structure shows that while test-time adaptation is an active area, evolutionary approaches within it remain underrepresented compared to meta-learning or multi-agent coordination branches.

Among the 27 candidates examined, no contribution was clearly refuted. For the J-TTL benchmark, 10 candidates were examined with zero refutable overlaps, suggesting limited prior work on consecutive-episode learning benchmarks in text-based games. The EvoTest framework was likewise compared against 10 candidates without refutation, indicating that evolutionary test-time learning without gradients is relatively unexplored. The UCB-based configuration selection was compared against 7 candidates, again with no refutations. These statistics reflect a focused search scope rather than exhaustive coverage, but the absence of overlaps across all contributions suggests meaningful novelty within the examined literature.

Based on the limited search scope of 27 semantically similar papers, the work appears to occupy a relatively novel position. The sparse taxonomy leaf and zero refutations across contributions suggest that evolutionary test-time learning for sequential decision-making is not yet well-established. However, this assessment is constrained by the top-K semantic search methodology and does not capture potential overlaps in broader evolutionary computation or reinforcement learning literature outside the examined candidates.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 27
- Refutable papers: 0

Research Landscape Overview

Core task: test-time learning for autonomous agents in sequential decision-making tasks. The field encompasses methods that enable agents to adapt, refine, or evolve their policies during deployment rather than relying solely on pre-trained models. The taxonomy reveals five main branches:

- Test-Time Adaptation and Evolution Mechanisms focuses on introspective and self-evolving strategies that allow agents to improve through reflection or online feedback, as seen in works like Recursive Introspection[14] and Learning on the Job[26].
- Offline-to-Online Transfer and Deployment addresses the challenge of bridging simulation or offline training with real-world operation, exemplified by Deployable Reinforcement Learning with[13] and Pacer and Runner[3].
- Multi-Agent Coordination and Deployment examines scenarios where multiple agents must collaborate or coordinate in dynamic environments, such as Field Deployment of Multi-Agent[6].
- LLM-Based Autonomous Agents and Agentic Systems explores how large language models can serve as reasoning engines or planners for sequential tasks, with representative efforts including Reasoningbank[7] and Appagent v2[18].
- Specialized Control and Optimization Applications targets domain-specific problems like UAV path planning or robotic manipulation, illustrated by LLM-based UAV Path Planning[1] and related control frameworks.

A central tension across these branches is the trade-off between sample efficiency during deployment and the complexity of adaptation mechanisms. Self-evolving approaches often require sophisticated introspection or meta-learning loops, while offline-to-online methods must balance safety constraints with exploration. EvoTest[0] sits within the Self-Evolving and Introspective Agents cluster, emphasizing evolutionary or iterative refinement at test time.
Compared to Recursive Introspection[14], which leverages explicit reasoning traces, and Learning on the Job[26], which focuses on continual skill acquisition, EvoTest[0] appears to prioritize evolutionary search or population-based strategies for policy improvement. This positions it as a complement to gradient-based online adaptation methods like Pacer and Runner[3], offering an alternative when differentiable updates are impractical or when diverse exploration is beneficial. Open questions remain about scalability, the cost of test-time computation, and how to integrate these evolving agents with safety-critical deployment constraints.

Claimed Contributions

Jericho Test-Time Learning (J-TTL) benchmark

The authors propose J-TTL, a benchmark using Jericho text-based games to systematically measure an agent's ability to learn and improve on the fly across multiple consecutive attempts at the same task within a single test session, addressing the lack of standardized testbeds for rapid in-session improvement.

10 retrieved papers
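A J-TTL-style evaluation harness can be sketched as below. `evaluate_ttl`, `CountingAgent`, and the summary fields are hypothetical names chosen for illustration; the real benchmark scores Jericho games rather than a toy environment.

```python
def evaluate_ttl(agent, make_env, episodes=5):
    """Run one test session: the *same* game played `episodes` times in a row.
    The agent may carry state across episodes; the result is its score curve."""
    scores = []
    for ep in range(episodes):
        env = make_env()  # fresh copy of the same game each episode
        scores.append(agent.play(env, episode=ep))
    return {"curve": scores, "first": scores[0],
            "best": max(scores), "mean": sum(scores) / len(scores)}

class CountingAgent:
    """Toy learner whose score grows with experience, for demonstration only."""
    def __init__(self):
        self.experience = 0
    def play(self, env, episode):
        self.experience += 1
        return env["base"] + 5 * (self.experience - 1)

result = evaluate_ttl(CountingAgent(), lambda: {"base": 10}, episodes=4)
# result["curve"] is [10, 15, 20, 25]: a rising curve signals in-session learning
```

The key design point the benchmark isolates is that improvement must come from the agent's own carried state, since the environment is reset identically every episode.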
EvoTest evolutionary test-time learning framework

The authors introduce EvoTest, a gradient-free framework that enables test-time learning by evolving the complete agent configuration (prompt, memory, hyperparameters, and tool-use routines) between episodes through transcript-level analysis using an Actor Agent and an Evolver Agent, without requiring weight updates or fine-tuning.

10 retrieved papers
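One way to read "evolving the complete agent configuration" is as a single object whose channels (prompt, memory, hyperparameters, tool routines) are all revised between episodes. The sketch below is an assumption about the shape of that object; the field names and the fixed temperature-decay rule are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentConfig:
    prompt: str
    memory: tuple = ()                        # logged state–action choices
    temperature: float = 0.7                  # example hyperparameter
    tools: tuple = ("look", "inventory")      # tool-use routines

def evolve(config: AgentConfig, notes: dict) -> AgentConfig:
    """Whole-system evolution: revise several channels at once, driven by
    (hypothetical) notes the Evolver extracted from the episode transcript."""
    return replace(
        config,
        prompt=(config.prompt + " " + notes.get("prompt_hint", "")).strip(),
        memory=config.memory + tuple(notes.get("good_moves", ())),
        temperature=max(0.1, config.temperature - 0.1),  # shift toward exploitation
        tools=config.tools + tuple(notes.get("new_tools", ())),
    )

base = AgentConfig(prompt="Play the text adventure carefully.")
evolved = evolve(base, {"good_moves": [("hallway", "go north")],
                        "new_tools": ["draw_map"]})
```

An immutable configuration, as sketched here, keeps every past version intact, which is what later makes selection over a population of configurations possible.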
Whole-system evolution via configuration selection with UCB

The authors develop a holistic adaptation mechanism where the Evolver Agent performs whole-system evolution by concurrently optimizing multiple components of the agentic configuration and uses Upper Confidence Bound (UCB) selection to manage the exploration-exploitation trade-off when choosing configurations, enabling more comprehensive and stable learning than single-channel adaptation methods.

7 retrieved papers
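The exploration–exploitation trade-off over candidate configurations can be handled with the standard UCB1 rule. The sketch below assumes the classic formula (mean reward plus a logarithmic bonus) and an exploration constant of 1.4; these details are assumptions the report does not specify.

```python
import math

def ucb_select(stats, c=1.4):
    """Pick the configuration maximizing mean reward + exploration bonus.
    `stats` maps config id -> (total_reward, times_played); any configuration
    that has never been played is selected first."""
    total_plays = sum(n for _, n in stats.values())
    def ucb(item):
        _, (reward, n) = item
        if n == 0:
            return float("inf")  # force initial exploration of untried configs
        return reward / n + c * math.sqrt(math.log(total_plays) / n)
    return max(stats.items(), key=ucb)[0]

stats = {"cfg_a": (30.0, 3), "cfg_b": (8.0, 1), "cfg_c": (0.0, 0)}
first = ucb_select(stats)  # "cfg_c": untried configurations are explored first
```

After every configuration has been tried, the rule favors the one with the best mean episode reward unless a rarely played alternative's bonus overtakes it, which is the stability-versus-exploration balance the contribution claims.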

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Jericho Test-Time Learning (J-TTL) benchmark


Contribution

EvoTest evolutionary test-time learning framework


Contribution

Whole-system evolution via configuration selection with UCB

