EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models · Agents · Test-Time Learning
Abstract:

A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time; they often behave like “clever but clueless interns” in novel environments, which severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup in which an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods such as reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. The revised configuration rewrites the prompt, updates memory by logging effective state–action choices, tunes hyperparameters, and refines tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
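As a concrete reading of the Actor/Evolver loop described above, here is a minimal, self-contained sketch. The function names (`run_episode`, `evolve_config`) and the toy scoring model are illustrative assumptions, not the paper's implementation; a real Actor and Evolver would each query an LLM.

```python
import random

def run_episode(config, seed):
    """Toy stand-in for the Actor Agent playing one game episode.
    The score grows with accumulated memory, mimicking in-session learning."""
    random.seed(seed)
    score = len(config["memory"]) * 10 + random.randint(0, 5)
    transcript = [f"step-{i}" for i in range(3)]  # placeholder game log
    return score, transcript

def evolve_config(config, transcript, score):
    """Toy stand-in for the Evolver Agent: read the transcript and
    propose a revised configuration for the next run."""
    return dict(config, memory=config["memory"] + [f"lesson-from-score-{score}"])

# One test session: the same game, played for several consecutive episodes.
config = {"prompt": "You are playing a text adventure.", "memory": [], "temperature": 0.7}
scores = []
for episode in range(4):
    score, transcript = run_episode(config, seed=episode)
    scores.append(score)
    config = evolve_config(config, transcript, score)
```

In this toy model the score curve rises because each episode adds a memory entry worth more than the noise term; the real system instead relies on transcript-level analysis to produce the next configuration.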

Disclaimer
This report was AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a test-time learning benchmark (J-TTL) and an evolutionary framework (EvoTest) for improving agents across consecutive episodes without gradient-based updates. It resides in the 'Self-Evolving and Introspective Agents' leaf, which contains only four papers total. This leaf sits within the broader 'Test-Time Adaptation and Evolution Mechanisms' branch, indicating a relatively sparse research direction compared to other areas like LLM-based agents or offline-to-online transfer. The small sibling set suggests this specific combination of evolutionary mechanisms and test-time learning is not yet heavily explored.

The taxonomy reveals neighboring leaves focused on structured test-time scaling via search and planning (five papers) and environment-grounded adaptation (four papers). EvoTest diverges from these by emphasizing whole-system evolution rather than fixed search protocols or environment-specific grounding signals. The broader 'Test-Time Adaptation' branch excludes offline pretraining methods, positioning this work as fundamentally about online improvement. The taxonomy structure shows that while test-time adaptation is an active area, evolutionary approaches within it remain underrepresented compared to meta-learning or multi-agent coordination branches.

Among the 27 candidates examined, no contribution was clearly refuted. For the J-TTL benchmark, 10 candidates were examined with zero refutable overlaps, suggesting limited prior work on consecutive-episode learning benchmarks in text-based games. The EvoTest framework was likewise compared against 10 candidates without refutation, indicating that evolutionary test-time learning without gradients is relatively unexplored. The UCB-based configuration selection was compared against 7 candidates, again with no refutations. These statistics reflect a focused search scope rather than exhaustive coverage, but the absence of overlaps across all contributions suggests meaningful novelty within the examined literature.

Based on the limited search scope of 27 semantically similar papers, the work appears to occupy a relatively novel position. The sparse taxonomy leaf and zero refutations across contributions suggest that evolutionary test-time learning for sequential decision-making is not yet well-established. However, this assessment is constrained by the top-K semantic search methodology and does not capture potential overlaps in broader evolutionary computation or reinforcement learning literature outside the examined candidates.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 27
- Refutable papers: 0

Research Landscape Overview

Core task: test-time learning for autonomous agents in sequential decision-making tasks. The field encompasses methods that enable agents to adapt, refine, or evolve their policies during deployment rather than relying solely on pre-trained models. The taxonomy reveals five main branches:

- Test-Time Adaptation and Evolution Mechanisms focuses on introspective and self-evolving strategies that allow agents to improve through reflection or online feedback, as seen in works like Recursive Introspection[14] and Learning on the Job[26].
- Offline-to-Online Transfer and Deployment addresses the challenge of bridging simulation or offline training with real-world operation, exemplified by Deployable Reinforcement Learning with[13] and Pacer and Runner[3].
- Multi-Agent Coordination and Deployment examines scenarios where multiple agents must collaborate or coordinate in dynamic environments, such as Field Deployment of Multi-Agent[6].
- LLM-Based Autonomous Agents and Agentic Systems explores how large language models can serve as reasoning engines or planners for sequential tasks, with representative efforts including Reasoningbank[7] and Appagent v2[18].
- Specialized Control and Optimization Applications targets domain-specific problems like UAV path planning or robotic manipulation, illustrated by LLM-based UAV Path Planning[1] and related control frameworks.

A central tension across these branches is the trade-off between sample efficiency during deployment and the complexity of adaptation mechanisms. Self-evolving approaches often require sophisticated introspection or meta-learning loops, while offline-to-online methods must balance safety constraints with exploration. EvoTest[0] sits within the Self-Evolving and Introspective Agents cluster, emphasizing evolutionary or iterative refinement at test time.
Compared to Recursive Introspection[14], which leverages explicit reasoning traces, and Learning on the Job[26], which focuses on continual skill acquisition, EvoTest[0] appears to prioritize evolutionary search or population-based strategies for policy improvement. This positions it as a complement to gradient-based online adaptation methods like Pacer and Runner[3], offering an alternative when differentiable updates are impractical or when diverse exploration is beneficial. Open questions remain about scalability, the cost of test-time computation, and how to integrate these evolving agents with safety-critical deployment constraints.

Claimed Contributions

Jericho Test-Time Learning (J-TTL) benchmark

The authors propose J-TTL, a benchmark using Jericho text-based games to systematically measure an agent's ability to learn and improve on the fly across multiple consecutive attempts at the same task within a single test session, addressing the lack of standardized testbeds for rapid in-session improvement.

10 retrieved papers
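A J-TTL-style evaluation harness can be sketched as below. `evaluate_ttl`, `CountingAgent`, and the summary fields are hypothetical names chosen for illustration; the real benchmark scores Jericho games rather than a toy environment.

```python
def evaluate_ttl(agent, make_env, episodes=5):
    """Run one test session: the *same* game played `episodes` times in a row.
    The agent may carry state across episodes; the result is its score curve."""
    scores = []
    for ep in range(episodes):
        env = make_env()  # fresh copy of the same game each episode
        scores.append(agent.play(env, episode=ep))
    return {"curve": scores, "first": scores[0],
            "best": max(scores), "mean": sum(scores) / len(scores)}

class CountingAgent:
    """Toy learner whose score grows with experience, for demonstration only."""
    def __init__(self):
        self.experience = 0
    def play(self, env, episode):
        self.experience += 1
        return env["base"] + 5 * (self.experience - 1)

result = evaluate_ttl(CountingAgent(), lambda: {"base": 10}, episodes=4)
# result["curve"] is [10, 15, 20, 25]: a rising curve signals in-session learning
```

The key design point the benchmark isolates is that improvement must come from the agent's own carried state, since the environment is reset identically every episode.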
EvoTest evolutionary test-time learning framework

The authors introduce EvoTest, a gradient-free framework that enables test-time learning by evolving the complete agent configuration (prompt, memory, hyperparameters, and tool-use routines) between episodes through transcript-level analysis using an Actor Agent and an Evolver Agent, without requiring weight updates or fine-tuning.

10 retrieved papers
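One way to read "evolving the complete agent configuration" is as a single object whose channels (prompt, memory, hyperparameters, tool routines) are all revised between episodes. The sketch below is an assumption about the shape of that object; the field names and the fixed temperature-decay rule are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentConfig:
    prompt: str
    memory: tuple = ()                        # logged state–action choices
    temperature: float = 0.7                  # example hyperparameter
    tools: tuple = ("look", "inventory")      # tool-use routines

def evolve(config: AgentConfig, notes: dict) -> AgentConfig:
    """Whole-system evolution: revise several channels at once, driven by
    (hypothetical) notes the Evolver extracted from the episode transcript."""
    return replace(
        config,
        prompt=(config.prompt + " " + notes.get("prompt_hint", "")).strip(),
        memory=config.memory + tuple(notes.get("good_moves", ())),
        temperature=max(0.1, config.temperature - 0.1),  # shift toward exploitation
        tools=config.tools + tuple(notes.get("new_tools", ())),
    )

base = AgentConfig(prompt="Play the text adventure carefully.")
evolved = evolve(base, {"good_moves": [("hallway", "go north")],
                        "new_tools": ["draw_map"]})
```

An immutable configuration, as sketched here, keeps every past version intact, which is what later makes selection over a population of configurations possible.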
Whole-system evolution via configuration selection with UCB

The authors develop a holistic adaptation mechanism where the Evolver Agent performs whole-system evolution by concurrently optimizing multiple components of the agentic configuration and uses Upper Confidence Bound (UCB) selection to manage the exploration-exploitation trade-off when choosing configurations, enabling more comprehensive and stable learning than single-channel adaptation methods.

7 retrieved papers
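The exploration–exploitation trade-off over candidate configurations can be handled with the standard UCB1 rule. The sketch below assumes the classic formula (mean reward plus a logarithmic bonus) and an exploration constant of 1.4; these details are assumptions the report does not specify.

```python
import math

def ucb_select(stats, c=1.4):
    """Pick the configuration maximizing mean reward + exploration bonus.
    `stats` maps config id -> (total_reward, times_played); any configuration
    that has never been played is selected first."""
    total_plays = sum(n for _, n in stats.values())
    def ucb(item):
        _, (reward, n) = item
        if n == 0:
            return float("inf")  # force initial exploration of untried configs
        return reward / n + c * math.sqrt(math.log(total_plays) / n)
    return max(stats.items(), key=ucb)[0]

stats = {"cfg_a": (30.0, 3), "cfg_b": (8.0, 1), "cfg_c": (0.0, 0)}
first = ucb_select(stats)  # "cfg_c": untried configurations are explored first
```

After every configuration has been tried, the rule favors the one with the best mean episode reward unless a rarely played alternative's bonus overtakes it, which is the stability-versus-exploration balance the contribution claims.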

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Jericho Test-Time Learning (J-TTL) benchmark


Contribution

EvoTest evolutionary test-time learning framework


Contribution

Whole-system evolution via configuration selection with UCB

