Abstract:

Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base -- a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop MIKASA-Robo -- a novel benchmark of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation. Our work introduces a unified framework to advance memory RL research, enabling more robust systems for real-world use.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MIKASA, a benchmark suite for evaluating memory capabilities in reinforcement learning, with particular emphasis on tabletop robotic manipulation. It resides in the 'Memory-Intensive Task Benchmarks' leaf of the taxonomy, which contains only two papers total (including this one). This sparse population suggests the research direction—systematic memory evaluation in RL—remains relatively underdeveloped compared to the broader field of memory mechanisms and architectures, where multiple crowded subtopics exist (e.g., Transformer-Based Memory with four papers, Episodic Memory with three).

The taxonomy reveals that MIKASA sits within 'Memory Benchmarking and Evaluation,' a branch containing three leaves: Memory-Intensive Task Benchmarks, Memory Interpretability and Analysis, and Partially Observable and Control Tasks. Neighboring branches focus on memory mechanisms (External and Episodic Memory Systems, Recurrent and Sequence Models) and applications (Cognitive and Neuroscience-Inspired Memory, Embodied and Robotic Agents). The paper's dual focus on general memory evaluation (MIKASA-Base) and robotic manipulation (MIKASA-Robo) positions it at the intersection of benchmarking and embodied applications, bridging two otherwise separate research directions.

Among the 30 candidates examined (ten per contribution), the classification framework contribution shows one refutable candidate, suggesting some prior taxonomic work exists. The MIKASA-Base unified benchmark found no clear refutations, indicating potential novelty in its cross-scenario evaluation approach. The MIKASA-Robo robotic benchmark identified one refutable candidate, likely reflecting existing robotic memory tasks that may differ in scope or design. The limited search scale (30 candidates in total) means these statistics capture only the most semantically similar prior work, not an exhaustive field survey.

Based on the top-30 semantic matches examined, the work appears to occupy a relatively sparse research area within memory benchmarking, particularly for robotic manipulation scenarios. The taxonomy structure confirms that systematic memory evaluation remains less explored than memory mechanism design. However, the presence of at least one overlapping work for two of three contributions suggests the paper builds incrementally on emerging benchmarking efforts rather than pioneering an entirely new direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: memory-intensive reinforcement learning tasks evaluation. The field has organized itself around several complementary perspectives. Memory Mechanisms and Architectures explores how agents represent and update internal state, ranging from recurrent networks to episodic retrieval systems like those in Amago[3] and Efficient Episodic Memory[6]. Memory Benchmarking and Evaluation develops standardized testbeds -- such as Memory Gym[20] -- that systematically probe an agent's ability to retain and recall information over extended horizons. Memory Applications and Specialized Domains applies these techniques to concrete settings like robotics, molecular design, and interactive environments. Optimization and Systems for Memory-Intensive RL addresses computational bottlenecks through hardware acceleration and efficient replay management, while RLHF and Policy Optimization for LLMs adapts memory-aware methods to large language models. Finally, Architectural Comparisons and Network Selection investigates trade-offs among different backbone designs, including transformers and world models.

Within this landscape, a particularly active line of work focuses on creating rigorous benchmarks that expose the limits of current memory architectures. Memory Gym[20] exemplifies this effort by offering a suite of tasks with controllable memory demands, enabling systematic comparison across methods. Memory Benchmark Robots[0] sits squarely in this benchmarking tradition, providing evaluation protocols tailored to robotic scenarios where long-term dependencies are critical. Compared to Memory Gym[20], which emphasizes breadth across diverse memory challenges, Memory Benchmark Robots[0] appears to specialize in embodied settings, potentially incorporating physical constraints and sensor noise that are less prominent in abstract grid-world suites.

Meanwhile, works like Amago[3] and Remax[4] demonstrate how meta-learning and retrieval-augmented policies can exploit these benchmarks to improve generalization, highlighting an ongoing tension between designing better evaluation tools and developing architectures that can master them.

Claimed Contributions

Comprehensive classification framework for memory-intensive RL tasks

The authors introduce a systematic taxonomy that organizes memory-intensive tasks into four key categories: object memory, spatial memory, sequential memory, and memory capacity. This framework enables systematic evaluation of memory-enhanced agents across diverse scenarios without added complexity.
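A four-category taxonomy like this can be made machine-readable so that an evaluation suite is filterable by memory type. The sketch below is illustrative only: the task names and their category assignments are hypothetical and may not match the paper's actual mapping.

```python
from enum import Enum

class MemoryType(Enum):
    OBJECT = "object"          # recall identity/properties of objects
    SPATIAL = "spatial"        # recall locations of objects
    SEQUENTIAL = "sequential"  # recall the order of events
    CAPACITY = "capacity"      # retain many items simultaneously

# Hypothetical task-to-category assignment for illustration;
# the paper's exact labeling may differ.
TASK_TAXONOMY = {
    "RememberColor": MemoryType.OBJECT,
    "RememberShape": MemoryType.OBJECT,
    "ShellGame": MemoryType.SPATIAL,
    "SeqOfColors": MemoryType.CAPACITY,
}

def tasks_of(kind: MemoryType) -> list[str]:
    """Return the task names tagged with the given memory type."""
    return sorted(name for name, k in TASK_TAXONOMY.items() if k is kind)
```

With such a mapping, an experiment runner can select, say, only the spatial-memory tasks when probing a specific agent capability.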

10 retrieved papers
Can Refute
MIKASA-Base unified benchmark for memory RL

The authors present MIKASA-Base, a Gymnasium-based framework that consolidates widely used open-source memory-intensive environments under a common API. This benchmark standardizes task access and evaluation, supporting fair comparisons and reproducible research in memory-centric RL.
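The value of a common API is that any memory-intensive task, regardless of its origin, exposes the same reset/step loop. The toy environment below mimics the Gymnasium-style interface with a minimal delayed-recall task (a cue is shown once, then hidden); it is a self-contained illustration of that interface convention, not an actual MIKASA-Base task.

```python
import random

class RememberColorToy:
    """Toy delayed-recall task in the Gymnasium reset/step style.

    A color index is observable only on reset; after a delay the agent
    must act with the remembered color to earn a reward. Illustrative
    only -- not part of MIKASA-Base.
    """
    N_COLORS = 3

    def __init__(self, delay: int = 2):
        self.delay = delay

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.target = self.rng.randrange(self.N_COLORS)
        self.t = 0
        return self.target, {}  # cue visible only at t=0

    def step(self, action):
        self.t += 1
        if self.t <= self.delay:  # cue hidden during the delay
            return -1, 0.0, False, False, {}
        reward = 1.0 if action == self.target else 0.0
        return -1, reward, True, False, {}  # obs, reward, terminated, truncated, info

env = RememberColorToy(delay=2)
obs, info = env.reset(seed=0)
memory = obs  # a memory-equipped agent stores the cue
terminated = False
while not terminated:
    obs, reward, terminated, truncated, info = env.step(memory)
```

Because every environment follows the same five-tuple step contract, evaluation code written once runs unchanged across the whole suite, which is precisely what makes cross-scenario comparisons fair and reproducible.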

10 retrieved papers
MIKASA-Robo benchmark of memory-intensive robotic manipulation tasks

The authors develop MIKASA-Robo, an open-source benchmark comprising 32 robotic tabletop manipulation tasks across 12 categories. These tasks target specific memory-dependent skills in realistic settings and address the gap in standardized benchmarks for memory evaluation in robotic manipulation.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comprehensive classification framework for memory-intensive RL tasks

The authors introduce a systematic taxonomy that organizes memory-intensive tasks into four key categories: object memory, spatial memory, sequential memory, and memory capacity. This framework enables systematic evaluation of memory-enhanced agents across diverse scenarios without added complexity.

Contribution

MIKASA-Base unified benchmark for memory RL

The authors present MIKASA-Base, a Gymnasium-based framework that consolidates widely used open-source memory-intensive environments under a common API. This benchmark standardizes task access and evaluation, supporting fair comparisons and reproducible research in memory-centric RL.

Contribution

MIKASA-Robo benchmark of memory-intensive robotic manipulation tasks

The authors develop MIKASA-Robo, an open-source benchmark comprising 32 robotic tabletop manipulation tasks across 12 categories. These tasks target specific memory-dependent skills in realistic settings and address the gap in standardized benchmarks for memory evaluation in robotic manipulation.