EVALUATING MEMORY IN LLM AGENTS VIA INCREMENTAL MULTI-TURN INTERACTIONS

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM Agents; Agents with Memory; Memory Agents Benchmark; Evaluation for Memory
Abstract:

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component—memory, encompassing how agents memorize, update, and retrieve long-term information—is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmark covers all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MemoryAgentBench, a benchmark targeting four core memory competencies in LLM agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. It resides in the Core Memory Competency Evaluation leaf, which contains only two papers: this one and MemBench. This leaf sits within the broader Memory Evaluation and Benchmarking branch, a relatively focused area compared to the more crowded Memory Architecture and Mechanisms branch. The positioning suggests the paper addresses a recognized gap in systematic memory assessment, though the leaf's small size indicates this remains an emerging rather than saturated research direction.

The taxonomy reveals neighboring evaluation approaches in sibling leaves: Conversational and Long-Term Memory Evaluation examines dialogue continuity and temporal tracking, while Multi-Platform Memory Tracking addresses asynchronous enterprise environments. These adjacent directions emphasize context-specific memory challenges, whereas Core Memory Competency Evaluation focuses on fundamental, domain-agnostic abilities. The paper's multi-turn format and incremental information processing distinguish it from static long-context benchmarks in General Agent Evaluation, which assess broader capabilities like planning and tool use. This structural positioning highlights the paper's attempt to bridge foundational memory science with interactive agent scenarios.

Among twenty-five candidates examined, the benchmark contribution encountered one refutable candidate from ten reviewed, while the unified evaluation framework faced two refutable candidates from ten. The two new datasets (EventQA and FactConsolidation) showed no refutable prior work among five candidates examined. These statistics suggest moderate prior work overlap for the benchmark and framework contributions, though the limited search scope—top-K semantic matches plus citation expansion—means undiscovered relevant work may exist. The dataset contributions appear more novel within this search window, though the small candidate pool (five papers) limits confidence in this assessment.

The analysis reflects a targeted but not exhaustive literature review. The taxonomy structure indicates memory evaluation remains less explored than architectural design, with only three leaves versus six in Memory Architecture and Mechanisms. However, the presence of MemBench as a direct sibling and the refutable candidates for core contributions suggest the paper builds incrementally on recognized foundations rather than opening entirely new ground. The scope limitations mean this assessment captures visible trends but cannot rule out overlooked parallel efforts in the broader literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: Evaluating memory mechanisms in large language model agents.

The field has evolved into a rich landscape organized around several complementary perspectives. Memory Architecture and Mechanisms explores how agents store and organize information, ranging from hierarchical structures to cognitive-inspired designs such as those in Cognitive memory in large[2] and Mirix[3]. Memory Evaluation and Benchmarking focuses on systematic assessment frameworks, including core competency tests like MemBench[32] and long-term conversational evaluations such as Evaluating the Long-Term Memory[7]. General Agent Evaluation addresses broader performance metrics beyond memory alone, while Surveys and Taxonomies synthesize the growing body of knowledge, as seen in A survey on the[1] and Survey on Evaluation of[6]. Application Domains demonstrate memory systems in specialized contexts such as chemistry (Chemagent[12]) and economics (EconAgent[46]), and Memory Security and Privacy tackles vulnerabilities such as those examined in Unveiling Privacy Risks in[34]. Memory Enhancement Techniques propose methods to improve retrieval and storage efficiency, exemplified by Empowering Working Memory for[8] and Memtool[42], while Theoretical Foundations grounds these systems in cognitive science principles.

A particularly active line of work contrasts architectural innovation with evaluation rigor. Some studies emphasize novel memory designs, such as the hierarchical or multi-system approaches in Multiple memory systems for[29] and Hierarchical memory for high-efficiency[37], while others prioritize robust benchmarking to measure whether these designs truly enhance agent capabilities. EVALUATING MEMORY IN LLM[0] sits squarely within the Memory Evaluation and Benchmarking branch, specifically targeting core memory competencies. Its emphasis aligns closely with MemBench[32], which also provides a systematic evaluation framework, yet EVALUATING MEMORY IN LLM[0] focuses more directly on fundamental memory operations than on the domain-specific or conversational scenarios explored in works like Evaluating Very Long-Term Conversational[10]. This positioning reflects an ongoing tension in the field: whether to prioritize general-purpose memory assessment or task-specific validation, and how to balance architectural creativity with empirical grounding.

Claimed Contributions

MemoryAgentBench benchmark for evaluating memory in LLM agents

The authors propose a unified benchmark framework that transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format to evaluate memory agents across four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting.

10 retrieved papers
Can Refute

Two new datasets: EventQA and FactConsolidation

The authors create EventQA, a reasoning-style NIAH (needle-in-a-haystack) task for evaluating temporal event recall in long narratives, and FactConsolidation, a dataset built from counterfactual edit pairs to assess whether agents can forget outdated memories and reason over contradictory information.

5 retrieved papers
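To make the two dataset designs concrete, the sketch below shows what individual items might look like. The field names and example values are illustrative assumptions, not the authors' actual schema; the point is only the structure each competency test implies: EventQA probes recall of temporal ordering within a long narrative, while FactConsolidation pairs an outdated fact with a counterfactual edit that must supersede it.

```python
# Hypothetical item structures for the two new datasets.
# All field names and values are assumptions for illustration only.

eventqa_item = {
    # Long narrative, delivered to the agent in incremental chunks.
    "context": "<long narrative text>",
    "question": "Which event happened immediately after the protagonist left the harbor?",
    "choices": ["Met the merchant", "Boarded the train", "Lost the map"],
    "answer": "Boarded the train",  # requires recalling temporal order of events
}

factconsolidation_item = {
    # An outdated fact followed by a counterfactual edit that supersedes it.
    "original_fact": "The lab is located in Building A.",
    "edited_fact": "Correction: the lab moved to Building C.",
    "question": "Where is the lab located?",
    "answer": "Building C",  # the agent must discard the outdated fact
}
```

Under this reading, selective forgetting is scored by whether the agent answers with the post-edit fact rather than the earlier, contradicted one.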
Unified evaluation framework for memory agents

The authors develop a systematic evaluation protocol that presents agents with sequences of textual inputs simulating multi-turn interactions, where inputs are incrementally fed to agents in temporal order, enabling comprehensive assessment of memory mechanisms across diverse agent architectures.

10 retrieved papers
Can Refute
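The protocol described above, in which inputs are fed to the agent incrementally in temporal order before any question is asked, can be sketched as a short loop. The `BufferMemoryAgent` class, its `ingest`/`answer` interface, and the keyword-overlap retrieval are assumptions made for illustration; MemoryAgentBench's actual agent API and scoring will differ.

```python
class BufferMemoryAgent:
    """Toy agent that appends every incoming chunk to an internal memory buffer."""

    def __init__(self):
        self.memory = []

    def ingest(self, chunk: str) -> None:
        # A real memory agent might summarize, index, or consolidate
        # the chunk here rather than store it verbatim.
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        # Placeholder retrieval: return the most recent chunk sharing
        # any word with the question.
        for chunk in reversed(self.memory):
            if any(word in chunk for word in question.split()):
                return chunk
        return ""


def evaluate(agent, chunks, qa_pairs):
    """Feed chunks incrementally in temporal order, then query the agent."""
    for chunk in chunks:  # multi-turn, incremental delivery
        agent.ingest(chunk)
    correct = sum(expected in agent.answer(q) for q, expected in qa_pairs)
    return correct / len(qa_pairs)


agent = BufferMemoryAgent()
chunks = ["The lab is in Building A.", "Update: the lab moved to Building C."]
score = evaluate(agent, chunks, [("Where is the lab?", "Building C")])
```

Because the agent never sees the full context at once, this setup rewards whatever consolidation happens at ingestion time, which is exactly what distinguishes the multi-turn protocol from static long-context evaluation.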

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MemoryAgentBench benchmark for evaluating memory in LLM agents


Contribution

Two new datasets: EventQA and FactConsolidation


Contribution

Unified evaluation framework for memory agents
