Abstract:

Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base -- a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop MIKASA-Robo -- a novel benchmark of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation. Our work introduces a unified framework to advance memory RL research, enabling more robust systems for real-world use.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MIKASA, a benchmark suite for evaluating memory capabilities in reinforcement learning, with particular emphasis on tabletop robotic manipulation. It resides in the 'Memory-Intensive Task Benchmarks' leaf of the taxonomy, which contains only two papers total (including this one). This sparse population suggests the research direction—systematic memory evaluation in RL—remains relatively underdeveloped compared to the broader field of memory mechanisms and architectures, where multiple crowded subtopics exist (e.g., Transformer-Based Memory with four papers, Episodic Memory with three).

The taxonomy reveals that MIKASA sits within 'Memory Benchmarking and Evaluation,' a branch containing three leaves: Memory-Intensive Task Benchmarks, Memory Interpretability and Analysis, and Partially Observable and Control Tasks. Neighboring branches focus on memory mechanisms (External and Episodic Memory Systems, Recurrent and Sequence Models) and applications (Cognitive and Neuroscience-Inspired Memory, Embodied and Robotic Agents). The paper's dual focus on general memory evaluation (MIKASA-Base) and robotic manipulation (MIKASA-Robo) positions it at the intersection of benchmarking and embodied applications, bridging two otherwise separate research directions.

Among the 30 candidates examined (ten per contribution), the classification framework contribution shows one refutable candidate, suggesting some prior taxonomic work exists. The MIKASA-Base unified benchmark found no clear refutations, indicating potential novelty in its cross-scenario evaluation approach. The MIKASA-Robo robotic benchmark identified one refutable candidate, likely reflecting existing robotic memory tasks that may differ in scope or design. The limited search scale (30 candidates in total) means these statistics capture only the most semantically similar prior work, not an exhaustive field survey.

Based on the top-30 semantic matches examined, the work appears to occupy a relatively sparse research area within memory benchmarking, particularly for robotic manipulation scenarios. The taxonomy structure confirms that systematic memory evaluation remains less explored than memory mechanism design. However, the presence of at least one overlapping work for two of three contributions suggests the paper builds incrementally on emerging benchmarking efforts rather than pioneering an entirely new direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: memory-intensive reinforcement learning tasks evaluation. The field has organized itself around several complementary perspectives. Memory Mechanisms and Architectures explores how agents represent and update internal state, ranging from recurrent networks to episodic retrieval systems like those in Amago[3] and Efficient Episodic Memory[6]. Memory Benchmarking and Evaluation develops standardized testbeds -- such as Memory Gym[20] -- that systematically probe an agent's ability to retain and recall information over extended horizons. Memory Applications and Specialized Domains applies these techniques to concrete settings like robotics, molecular design, and interactive environments. Optimization and Systems for Memory-Intensive RL addresses computational bottlenecks through hardware acceleration and efficient replay management, while RLHF and Policy Optimization for LLMs adapts memory-aware methods to large language models. Finally, Architectural Comparisons and Network Selection investigates trade-offs among different backbone designs, including transformers and world models.

Within this landscape, a particularly active line of work focuses on creating rigorous benchmarks that expose the limits of current memory architectures. Memory Gym[20] exemplifies this effort by offering a suite of tasks with controllable memory demands, enabling systematic comparison across methods. Memory Benchmark Robots[0] sits squarely in this benchmarking tradition, providing evaluation protocols tailored to robotic scenarios where long-term dependencies are critical. Compared to Memory Gym[20], which emphasizes breadth across diverse memory challenges, Memory Benchmark Robots[0] appears to specialize in embodied settings, potentially incorporating physical constraints and sensor noise that are less prominent in abstract grid-world suites.

Meanwhile, works like Amago[3] and Remax[4] demonstrate how meta-learning and retrieval-augmented policies can exploit these benchmarks to improve generalization, highlighting an ongoing tension between designing better evaluation tools and developing architectures that can master them.

Claimed Contributions

Comprehensive classification framework for memory-intensive RL tasks

The authors introduce a systematic taxonomy that organizes memory-intensive tasks into four key categories: object memory, spatial memory, sequential memory, and memory capacity. This framework enables systematic evaluation of memory-enhanced agents across diverse scenarios without added complexity.
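A four-category taxonomy like this can be made machine-readable so that an evaluation suite is filterable by memory type. The sketch below is illustrative only: the task names and their category assignments are hypothetical and may not match the paper's actual mapping.

```python
from enum import Enum

class MemoryType(Enum):
    OBJECT = "object"          # recall identity/properties of objects
    SPATIAL = "spatial"        # recall locations of objects
    SEQUENTIAL = "sequential"  # recall the order of events
    CAPACITY = "capacity"      # retain many items simultaneously

# Hypothetical task-to-category assignment for illustration;
# the paper's exact labeling may differ.
TASK_TAXONOMY = {
    "RememberColor": MemoryType.OBJECT,
    "RememberShape": MemoryType.OBJECT,
    "ShellGame": MemoryType.SPATIAL,
    "SeqOfColors": MemoryType.CAPACITY,
}

def tasks_of(kind: MemoryType) -> list[str]:
    """Return the task names tagged with the given memory type."""
    return sorted(name for name, k in TASK_TAXONOMY.items() if k is kind)
```

With such a mapping, an experiment runner can select, say, only the spatial-memory tasks when probing a specific agent capability.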

10 retrieved papers
Can Refute
MIKASA-Base unified benchmark for memory RL

The authors present MIKASA-Base, a Gymnasium-based framework that consolidates widely used open-source memory-intensive environments under a common API. This benchmark standardizes task access and evaluation, supporting fair comparisons and reproducible research in memory-centric RL.
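The value of a common API is that any memory-intensive task, regardless of its origin, exposes the same reset/step loop. The toy environment below mimics the Gymnasium-style interface with a minimal delayed-recall task (a cue is shown once, then hidden); it is a self-contained illustration of that interface convention, not an actual MIKASA-Base task.

```python
import random

class RememberColorToy:
    """Toy delayed-recall task in the Gymnasium reset/step style.

    A color index is observable only on reset; after a delay the agent
    must act with the remembered color to earn a reward. Illustrative
    only -- not part of MIKASA-Base.
    """
    N_COLORS = 3

    def __init__(self, delay: int = 2):
        self.delay = delay

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.target = self.rng.randrange(self.N_COLORS)
        self.t = 0
        return self.target, {}  # cue visible only at t=0

    def step(self, action):
        self.t += 1
        if self.t <= self.delay:  # cue hidden during the delay
            return -1, 0.0, False, False, {}
        reward = 1.0 if action == self.target else 0.0
        return -1, reward, True, False, {}  # obs, reward, terminated, truncated, info

env = RememberColorToy(delay=2)
obs, info = env.reset(seed=0)
memory = obs  # a memory-equipped agent stores the cue
terminated = False
while not terminated:
    obs, reward, terminated, truncated, info = env.step(memory)
```

Because every environment follows the same five-tuple step contract, evaluation code written once runs unchanged across the whole suite, which is precisely what makes cross-scenario comparisons fair and reproducible.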

10 retrieved papers
MIKASA-Robo benchmark of memory-intensive robotic manipulation tasks

The authors develop MIKASA-Robo, an open-source benchmark comprising 32 robotic tabletop manipulation tasks across 12 categories. These tasks target specific memory-dependent skills in realistic settings and address the gap in standardized benchmarks for memory evaluation in robotic manipulation.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comprehensive classification framework for memory-intensive RL tasks

The authors introduce a systematic taxonomy that organizes memory-intensive tasks into four key categories: object memory, spatial memory, sequential memory, and memory capacity. This framework enables systematic evaluation of memory-enhanced agents across diverse scenarios without added complexity.

Contribution

MIKASA-Base unified benchmark for memory RL

The authors present MIKASA-Base, a Gymnasium-based framework that consolidates widely used open-source memory-intensive environments under a common API. This benchmark standardizes task access and evaluation, supporting fair comparisons and reproducible research in memory-centric RL.

Contribution

MIKASA-Robo benchmark of memory-intensive robotic manipulation tasks

The authors develop MIKASA-Robo, an open-source benchmark comprising 32 robotic tabletop manipulation tasks across 12 categories. These tasks target specific memory-dependent skills in realistic settings and address the gap in standardized benchmarks for memory evaluation in robotic manipulation.