EVALUATING MEMORY IN LLM AGENTS VIA INCREMENTAL MULTI-TURN INTERACTIONS

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM Agents; Agents with Memory; Memory Agents Benchmark; Evaluation for Memory
Abstract:

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component—memory, encompassing how agents memorize, update, and retrieve long-term information—is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmark covers all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MemoryAgentBench, a benchmark targeting four core memory competencies in LLM agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. It resides in the Core Memory Competency Evaluation leaf, which contains only two papers: this one and MemBench. This leaf sits within the broader Memory Evaluation and Benchmarking branch, a relatively focused area compared to the more crowded Memory Architecture and Mechanisms branch. The positioning suggests the paper addresses a recognized gap in systematic memory assessment, though the leaf's small size indicates this remains an emerging rather than saturated research direction.

The taxonomy reveals neighboring evaluation approaches in sibling leaves: Conversational and Long-Term Memory Evaluation examines dialogue continuity and temporal tracking, while Multi-Platform Memory Tracking addresses asynchronous enterprise environments. These adjacent directions emphasize context-specific memory challenges, whereas Core Memory Competency Evaluation focuses on fundamental, domain-agnostic abilities. The paper's multi-turn format and incremental information processing distinguish it from static long-context benchmarks in General Agent Evaluation, which assess broader capabilities like planning and tool use. This structural positioning highlights the paper's attempt to bridge foundational memory science with interactive agent scenarios.

Among twenty-five candidates examined, the benchmark contribution encountered one refutable candidate from ten reviewed, while the unified evaluation framework faced two refutable candidates from ten. The two new datasets (EventQA and FactConsolidation) showed no refutable prior work among five candidates examined. These statistics suggest moderate prior work overlap for the benchmark and framework contributions, though the limited search scope—top-K semantic matches plus citation expansion—means undiscovered relevant work may exist. The dataset contributions appear more novel within this search window, though the small candidate pool (five papers) limits confidence in this assessment.

The analysis reflects a targeted but not exhaustive literature review. The taxonomy structure indicates memory evaluation remains less explored than architectural design, with only three leaves versus six in Memory Architecture and Mechanisms. However, the presence of MemBench as a direct sibling and the refutable candidates for core contributions suggest the paper builds incrementally on recognized foundations rather than opening entirely new ground. The scope limitations mean this assessment captures visible trends but cannot rule out overlooked parallel efforts in the broader literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: Evaluating memory mechanisms in large language model agents.

The field has evolved into a rich landscape organized around several complementary perspectives. Memory Architecture and Mechanisms explores how agents store and organize information, ranging from hierarchical structures to cognitive-inspired designs such as those in Cognitive memory in large[2] and Mirix[3]. Memory Evaluation and Benchmarking focuses on systematic assessment frameworks, including core competency tests like MemBench[32] and long-term conversational evaluations such as Evaluating the Long-Term Memory[7]. General Agent Evaluation addresses broader performance metrics beyond memory alone, while Surveys and Taxonomies synthesize the growing body of knowledge, as seen in A survey on the[1] and Survey on Evaluation of[6]. Application Domains demonstrate memory systems in specialized contexts such as chemistry (Chemagent[12]) and economics (EconAgent[46]), and Memory Security and Privacy tackles vulnerabilities such as those examined in Unveiling Privacy Risks in[34]. Memory Enhancement Techniques propose methods to improve retrieval and storage efficiency, exemplified by Empowering Working Memory for[8] and Memtool[42], while Theoretical Foundations grounds these systems in cognitive science principles.

A particularly active line of work contrasts architectural innovation with evaluation rigor. Some studies emphasize novel memory designs, such as the hierarchical or multi-system approaches in Multiple memory systems for[29] and Hierarchical memory for high-efficiency[37], while others prioritize robust benchmarking to measure whether these designs truly enhance agent capabilities. EVALUATING MEMORY IN LLM[0] sits squarely within the Memory Evaluation and Benchmarking branch, specifically targeting core memory competencies. Its emphasis aligns closely with MemBench[32], which also provides a systematic evaluation framework, yet EVALUATING MEMORY IN LLM[0] focuses more directly on fundamental memory operations than on the domain-specific or conversational scenarios explored in works like Evaluating Very Long-Term Conversational[10]. This positioning reflects an ongoing tension in the field: whether to prioritize general-purpose memory assessment or task-specific validation, and how to balance architectural creativity with empirical grounding.

Claimed Contributions

MemoryAgentBench benchmark for evaluating memory in LLM agents

The authors propose a unified benchmark framework that transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format to evaluate memory agents across four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting.

10 retrieved papers
Can Refute

Two new datasets: EventQA and FactConsolidation

The authors create EventQA, a reasoning-style NIAH (needle-in-a-haystack) task for evaluating temporal event recall in long narratives, and FactConsolidation, a dataset built from counterfactual edit pairs to assess whether agents can forget outdated memories and reason over contradictory information.

5 retrieved papers
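To make the two dataset designs concrete, the sketch below shows what individual items might look like. The field names and example values are illustrative assumptions, not the authors' actual schema; the point is only the structure each competency test implies: EventQA probes recall of temporal ordering within a long narrative, while FactConsolidation pairs an outdated fact with a counterfactual edit that must supersede it.

```python
# Hypothetical item structures for the two new datasets.
# All field names and values are assumptions for illustration only.

eventqa_item = {
    # Long narrative, delivered to the agent in incremental chunks.
    "context": "<long narrative text>",
    "question": "Which event happened immediately after the protagonist left the harbor?",
    "choices": ["Met the merchant", "Boarded the train", "Lost the map"],
    "answer": "Boarded the train",  # requires recalling temporal order of events
}

factconsolidation_item = {
    # An outdated fact followed by a counterfactual edit that supersedes it.
    "original_fact": "The lab is located in Building A.",
    "edited_fact": "Correction: the lab moved to Building C.",
    "question": "Where is the lab located?",
    "answer": "Building C",  # the agent must discard the outdated fact
}
```

Under this reading, selective forgetting is scored by whether the agent answers with the post-edit fact rather than the earlier, contradicted one.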
Unified evaluation framework for memory agents

The authors develop a systematic evaluation protocol that presents agents with sequences of textual inputs simulating multi-turn interactions, where inputs are incrementally fed to agents in temporal order, enabling comprehensive assessment of memory mechanisms across diverse agent architectures.

10 retrieved papers
Can Refute
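The protocol described above, in which inputs are fed to the agent incrementally in temporal order before any question is asked, can be sketched as a short loop. The `BufferMemoryAgent` class, its `ingest`/`answer` interface, and the keyword-overlap retrieval are assumptions made for illustration; MemoryAgentBench's actual agent API and scoring will differ.

```python
class BufferMemoryAgent:
    """Toy agent that appends every incoming chunk to an internal memory buffer."""

    def __init__(self):
        self.memory = []

    def ingest(self, chunk: str) -> None:
        # A real memory agent might summarize, index, or consolidate
        # the chunk here rather than store it verbatim.
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        # Placeholder retrieval: return the most recent chunk sharing
        # any word with the question.
        for chunk in reversed(self.memory):
            if any(word in chunk for word in question.split()):
                return chunk
        return ""


def evaluate(agent, chunks, qa_pairs):
    """Feed chunks incrementally in temporal order, then query the agent."""
    for chunk in chunks:  # multi-turn, incremental delivery
        agent.ingest(chunk)
    correct = sum(expected in agent.answer(q) for q, expected in qa_pairs)
    return correct / len(qa_pairs)


agent = BufferMemoryAgent()
chunks = ["The lab is in Building A.", "Update: the lab moved to Building C."]
score = evaluate(agent, chunks, [("Where is the lab?", "Building C")])
```

Because the agent never sees the full context at once, this setup rewards whatever consolidation happens at ingestion time, which is exactly what distinguishes the multi-turn protocol from static long-context evaluation.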

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MemoryAgentBench benchmark for evaluating memory in LLM agents


Contribution

Two new datasets: EventQA and FactConsolidation


Contribution

Unified evaluation framework for memory agents
