Abstract:

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets, but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters out of 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper presents UltraMemV2, a memory-layer architecture designed to match the performance of 8-expert Mixture of Experts models while reducing memory access costs. It resides in the 'Memory-Layer Architectures for Sparse Computation' leaf, which contains only two papers total: the original UltraMem and this work. This is a notably sparse research direction within the broader taxonomy of 50 papers across seven major branches, suggesting that memory-layer replacements for standard transformer components remain an emerging and under-explored approach compared to more established sparse methods like pruning or MoE routing.

The taxonomy reveals that UltraMemV2 sits within the 'Memory-Augmented Sparse Neural Network Architectures' branch, which also includes external memory-augmented networks, sparse distributed memory models, and video temporal memory systems. Neighboring branches cover recurrent sparsity (sparse LSTMs, state-space models), sparse training frameworks, and hardware accelerators. The scope note for the parent branch explicitly excludes MoE models without memory layers, positioning this work as an alternative paradigm rather than an incremental MoE refinement. The taxonomy structure shows that while memory-augmented approaches are well-represented overall, the specific strategy of replacing feedforward layers with memory modules is far less crowded than attention-based or recurrent sparsity methods.

Among the 25 candidates examined across three contributions, the core architectural claim (achieving 8-expert MoE parity) has one refutable candidate out of the 10 examined, indicating some prior overlap in performance targets or design principles. For the five architectural improvements, 5 candidates were examined with no refutations, suggesting these specific design choices may be novel within the limited search scope. For the third contribution (activation density vs. parameter count), 10 candidates were examined with no refutations, implying this empirical finding has not been explicitly demonstrated in the retrieved literature. The limited search scale (25 papers, not hundreds) means these statistics reflect top-K semantic matches and immediate citations, not exhaustive coverage of all memory-layer or MoE research.

Based on the top-25 semantic search results, UltraMemV2 appears to advance a sparsely populated research direction with some architectural novelty, though the core performance claim has at least one overlapping prior work. The taxonomy context suggests this work addresses a gap between memory-layer efficiency and MoE performance that few papers have tackled directly. However, the analysis does not cover the full MoE literature or all memory-augmented architectures, so definitive novelty claims require broader verification beyond this limited candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: efficient sparse model architectures with memory layers. The field spans a diverse set of approaches that combine sparsity (in connectivity, activation, or storage) with explicit memory mechanisms to reduce computational and storage costs while preserving model capacity. At the highest level, the taxonomy organizes work into memory-augmented sparse neural network architectures (which directly integrate sparse memory layers into feedforward or modular designs), recurrent and sequence models with sparsity (including sparse LSTMs and temporal memory systems), sparse training frameworks and compression methods (focused on pruning and dynamic sparsity during learning), hardware accelerators for sparse models (covering FPGA, GPU, and emerging in-memory computing substrates), sparse attention and modular architectures (which exploit structured sparsity in transformers and mixtures of experts), application-specific sparse memory networks (tailored to domains like imaging or speech), and theoretical and biological memory models (drawing on neuroscience-inspired principles such as Kanerva's Sparse Distributed Memory[12]). Representative works illustrate these divisions: Sparse LSTM FPGA[4] and Intrinsic Sparse LSTM[10] exemplify recurrent sparsity, MRAM-SRAM Hybrid[7] and Cambricon-X[28] address hardware acceleration, and Sparse Modular Activation[6] explores modular sparse gating.

Several active lines of work reveal key trade-offs and open questions. One thread investigates ultra-sparse memory layers that drastically reduce parameter counts while maintaining retrieval accuracy, as seen in Ultra-Sparse Memory[36] and extended by UltraMemV2[0], which pushes sparsity levels further and refines indexing strategies for faster lookups. Another contrasts static pruning methods, where sparsity patterns are fixed after training, with dynamic or adaptive approaches like Adaptive Sparse Memory[5] and Recovery Sparse Networks[22], which adjust connectivity on the fly to balance efficiency and task performance. A third theme examines how memory-augmented architectures scale to long sequences or lifelong learning scenarios, with works such as Lifelong Compressive Memory[34] and Dual Memory Lifelong[15] exploring hierarchical or episodic storage.

UltraMemV2[0] sits squarely within the memory-layer architectures for sparse computation branch, closely neighboring Ultra-Sparse Memory[36] and sharing design principles with Adaptive Sparse Memory[5]. It distinguishes itself by achieving higher sparsity ratios and introducing novel compression techniques that reduce both memory footprint and access latency, positioning it as a next-generation solution for resource-constrained deployments.

Claimed Contributions

UltraMemV2 architecture achieving performance parity with 8-expert MoE models

The authors introduce UltraMemV2, a memory-layer architecture that for the first time matches the performance of state-of-the-art 8-expert Mixture of Experts models while maintaining lower memory access costs. This closes a significant performance gap that previously limited memory-layer architectures to matching only 2-expert MoE performance.

10 retrieved papers · Can Refute
Five key architectural improvements to memory-layer design

The authors present five specific architectural innovations: placing memory layers in every transformer block, using single linear projections for value expansion, adopting PEER's FFN-based value processing, developing principled parameter initialization to prevent training divergence, and adjusting memory-to-FFN computation ratios.

5 retrieved papers
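To make the memory-layer setting behind these improvements concrete, here is a toy, pure-Python sketch of the product-key lookup that architectures in this family (UltraMem, PEER) build on: each slot's score decomposes as the sum of two sub-key scores, and only the top-k slots contribute values. All names and dimensions here are hypothetical illustrations, not the authors' implementation.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def product_key_lookup(row_scores, col_scores, values, k):
    """Score every (row, col) slot as row_score + col_score, keep the
    top-k slots, and return a softmax-weighted sum of their value vectors.
    `values[(i, j)]` is the value vector stored at slot (i, j)."""
    n_rows, n_cols = len(row_scores), len(col_scores)
    # Real product-key memories only combine top candidates per axis;
    # for clarity this sketch scores all n_rows * n_cols slots.
    scored = [
        (row_scores[i] + col_scores[j], (i, j))
        for i in range(n_rows) for j in range(n_cols)
    ]
    scored.sort(reverse=True)
    top = scored[:k]
    weights = softmax([s for s, _ in top])
    dim = len(next(iter(values.values())))
    out = [0.0] * dim
    for w, (_, slot) in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[slot][d]
    return out, [slot for _, slot in top]

# Toy example: a 4x4 grid of 2-dimensional values, activating k=2 slots.
random.seed(0)
values = {(i, j): [random.random(), random.random()]
          for i in range(4) for j in range(4)}
out, slots = product_key_lookup([0.1, 0.9, 0.2, 0.3],
                                [0.4, 0.0, 0.8, 0.1], values, k=2)
print(slots)  # the two highest-scoring (row, col) slots: [(1, 2), (1, 0)]
```

Real implementations learn the sub-keys and values from the hidden state, combine only the top candidates from each axis, and, per the improvements listed above, expand and process the retrieved values with linear projections and FFN-style computation rather than a plain weighted sum.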
Demonstration that activation density impacts performance more than total sparse parameter count

Through experiments scaling to 120B total parameters, the authors establish a key design principle: the number of activated values per layer (activation density) has a greater impact on model performance than simply increasing the total number of sparse parameters.

10 retrieved papers
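For a sense of scale, the headline figures imply a very small activated fraction. The back-of-the-envelope ratio below is only illustrative: the paper's own density metric counts activated values per layer, which need not equal this global parameter ratio.

```python
def activated_fraction(activated_params, total_params):
    """Fraction of all parameters touched per token; a crude proxy for
    activation density (the paper's metric is activated values per layer)."""
    return activated_params / total_params

# Headline configuration from the report: 2.5B activated of 120B total.
print(f"{activated_fraction(2.5e9, 120e9):.2%}")  # → 2.08%
```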

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

UltraMemV2 architecture achieving performance parity with 8-expert MoE models

The authors introduce UltraMemV2, a memory-layer architecture that for the first time matches the performance of state-of-the-art 8-expert Mixture of Experts models while maintaining lower memory access costs. This closes a significant performance gap that previously limited memory-layer architectures to matching only 2-expert MoE performance.

Contribution

Five key architectural improvements to memory-layer design

The authors present five specific architectural innovations: placing memory layers in every transformer block, using single linear projections for value expansion, adopting PEER's FFN-based value processing, developing principled parameter initialization to prevent training divergence, and adjusting memory-to-FFN computation ratios.

Contribution

Demonstration that activation density impacts performance more than total sparse parameter count

Through experiments scaling to 120B total parameters, the authors establish a key design principle: the number of activated values per layer (activation density) has a greater impact on model performance than simply increasing the total number of sparse parameters.