Abstract:

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets, but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters out of 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper presents UltraMemV2, a memory-layer architecture designed to match the performance of 8-expert Mixture of Experts models while reducing memory access costs. It resides in the 'Memory-Layer Architectures for Sparse Computation' leaf, which contains only two papers total: the original UltraMem and this work. This is a notably sparse research direction within the broader taxonomy of 50 papers across seven major branches, suggesting that memory-layer replacements for standard transformer components remain an emerging and under-explored approach compared to more established sparse methods like pruning or MoE routing.

The taxonomy reveals that UltraMemV2 sits within the 'Memory-Augmented Sparse Neural Network Architectures' branch, which also includes external memory-augmented networks, sparse distributed memory models, and video temporal memory systems. Neighboring branches cover recurrent sparsity (sparse LSTMs, state-space models), sparse training frameworks, and hardware accelerators. The scope note for the parent branch explicitly excludes MoE models without memory layers, positioning this work as an alternative paradigm rather than an incremental MoE refinement. The taxonomy structure shows that while memory-augmented approaches are well-represented overall, the specific strategy of replacing feedforward layers with memory modules is far less crowded than attention-based or recurrent sparsity methods.

Among the 25 candidates examined across three contributions, the core architectural claim (achieving 8-expert MoE parity) has one refutable candidate out of the 10 examined, indicating some prior overlap in performance targets or design principles. For the five architectural improvements, 5 candidates were examined with no refutations, suggesting these specific design choices may be novel within the limited search scope. For the third contribution (activation density vs. parameter count), 10 candidates were examined with no refutations, implying this empirical finding has not been explicitly demonstrated in the retrieved literature. The limited search scale (25 papers, not hundreds) means these statistics reflect top-K semantic matches and immediate citations, not exhaustive coverage of all memory-layer or MoE research.

Based on the top-25 semantic search results, UltraMemV2 appears to advance a sparsely populated research direction with some architectural novelty, though the core performance claim has at least one overlapping prior work. The taxonomy context suggests this work addresses a gap between memory-layer efficiency and MoE performance that few papers have tackled directly. However, the analysis does not cover the full MoE literature or all memory-augmented architectures, so definitive novelty claims require broader verification beyond this limited candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: efficient sparse model architectures with memory layers. The field spans a diverse set of approaches that combine sparsity (in connectivity, activation, or storage) with explicit memory mechanisms to reduce computational and storage costs while preserving model capacity. At the highest level, the taxonomy organizes work into memory-augmented sparse neural network architectures (which directly integrate sparse memory layers into feedforward or modular designs), recurrent and sequence models with sparsity (including sparse LSTMs and temporal memory systems), sparse training frameworks and compression methods (focused on pruning and dynamic sparsity during learning), hardware accelerators for sparse models (covering FPGA, GPU, and emerging in-memory computing substrates), sparse attention and modular architectures (which exploit structured sparsity in transformers and mixtures of experts), application-specific sparse memory networks (tailored to domains like imaging or speech), and theoretical and biological memory models (drawing on neuroscience-inspired principles such as Kanerva's Sparse Distributed Memory[12]). Representative works illustrate these divisions: Sparse LSTM FPGA[4] and Intrinsic Sparse LSTM[10] exemplify recurrent sparsity, MRAM-SRAM Hybrid[7] and Cambricon-X[28] address hardware acceleration, and Sparse Modular Activation[6] explores modular sparse gating.

Several active lines of work reveal key trade-offs and open questions. One thread investigates ultra-sparse memory layers that drastically reduce parameter counts while maintaining retrieval accuracy, as seen in Ultra-Sparse Memory[36] and extended by UltraMemV2[0], which pushes sparsity levels further and refines indexing strategies for faster lookups. Another contrasts static pruning methods, where sparsity patterns are fixed after training, with dynamic or adaptive approaches like Adaptive Sparse Memory[5] and Recovery Sparse Networks[22], which adjust connectivity on the fly to balance efficiency and task performance. A third theme examines how memory-augmented architectures scale to long sequences or lifelong learning scenarios, with works such as Lifelong Compressive Memory[34] and Dual Memory Lifelong[15] exploring hierarchical or episodic storage.

UltraMemV2[0] sits squarely within the memory-layer architectures for sparse computation branch, closely neighboring Ultra-Sparse Memory[36] and sharing design principles with Adaptive Sparse Memory[5]. It distinguishes itself by achieving higher sparsity ratios and introducing novel compression techniques that reduce both memory footprint and access latency, positioning it as a next-generation solution for resource-constrained deployments.

Claimed Contributions

UltraMemV2 architecture achieving performance parity with 8-expert MoE models

The authors introduce UltraMemV2, a memory-layer architecture that for the first time matches the performance of state-of-the-art 8-expert Mixture of Experts models while maintaining lower memory access costs. This closes a significant performance gap that previously limited memory-layer architectures to matching only 2-expert MoE performance.

10 retrieved papers · Can Refute
Five key architectural improvements to memory-layer design

The authors present five specific architectural innovations: placing memory layers in every transformer block, using single linear projections for value expansion, adopting PEER's FFN-based value processing, developing principled parameter initialization to prevent training divergence, and adjusting memory-to-FFN computation ratios.

5 retrieved papers
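To make the memory-layer setting behind these improvements concrete, here is a toy, pure-Python sketch of the product-key lookup that architectures in this family (UltraMem, PEER) build on: each slot's score decomposes as the sum of two sub-key scores, and only the top-k slots contribute values. All names and dimensions here are hypothetical illustrations, not the authors' implementation.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def product_key_lookup(row_scores, col_scores, values, k):
    """Score every (row, col) slot as row_score + col_score, keep the
    top-k slots, and return a softmax-weighted sum of their value vectors.
    `values[(i, j)]` is the value vector stored at slot (i, j)."""
    n_rows, n_cols = len(row_scores), len(col_scores)
    # Real product-key memories only combine top candidates per axis;
    # for clarity this sketch scores all n_rows * n_cols slots.
    scored = [
        (row_scores[i] + col_scores[j], (i, j))
        for i in range(n_rows) for j in range(n_cols)
    ]
    scored.sort(reverse=True)
    top = scored[:k]
    weights = softmax([s for s, _ in top])
    dim = len(next(iter(values.values())))
    out = [0.0] * dim
    for w, (_, slot) in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[slot][d]
    return out, [slot for _, slot in top]

# Toy example: a 4x4 grid of 2-dimensional values, activating k=2 slots.
random.seed(0)
values = {(i, j): [random.random(), random.random()]
          for i in range(4) for j in range(4)}
out, slots = product_key_lookup([0.1, 0.9, 0.2, 0.3],
                                [0.4, 0.0, 0.8, 0.1], values, k=2)
print(slots)  # the two highest-scoring (row, col) slots: [(1, 2), (1, 0)]
```

Real implementations learn the sub-keys and values from the hidden state, combine only the top candidates from each axis, and, per the improvements listed above, expand and process the retrieved values with linear projections and FFN-style computation rather than a plain weighted sum.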
Demonstration that activation density impacts performance more than total sparse parameter count

Through experiments scaling to 120B total parameters, the authors establish a key design principle: the number of activated values per layer (activation density) has a greater impact on model performance than simply increasing the total number of sparse parameters.

10 retrieved papers
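For a sense of scale, the headline figures imply a very small activated fraction. The back-of-the-envelope ratio below is only illustrative: the paper's own density metric counts activated values per layer, which need not equal this global parameter ratio.

```python
def activated_fraction(activated_params, total_params):
    """Fraction of all parameters touched per token; a crude proxy for
    activation density (the paper's metric is activated values per layer)."""
    return activated_params / total_params

# Headline configuration from the report: 2.5B activated of 120B total.
print(f"{activated_fraction(2.5e9, 120e9):.2%}")  # → 2.08%
```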

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

UltraMemV2 architecture achieving performance parity with 8-expert MoE models

The authors introduce UltraMemV2, a memory-layer architecture that for the first time matches the performance of state-of-the-art 8-expert Mixture of Experts models while maintaining lower memory access costs. This closes a significant performance gap that previously limited memory-layer architectures to matching only 2-expert MoE performance.

Contribution

Five key architectural improvements to memory-layer design

The authors present five specific architectural innovations: placing memory layers in every transformer block, using single linear projections for value expansion, adopting PEER's FFN-based value processing, developing principled parameter initialization to prevent training divergence, and adjusting memory-to-FFN computation ratios.

Contribution

Demonstration that activation density impacts performance more than total sparse parameter count

Through experiments scaling to 120B total parameters, the authors establish a key design principle: the number of activated values per layer (activation density) has a greater impact on model performance than simply increasing the total number of sparse parameters.