MoM: Linear Sequence Modeling with Mixture-of-Memories

ICLR 2026 Conference Submission · Anonymous Authors
Efficient/Low-Resource Methods for NLP · Linear Sequence Modeling · Machine Learning for NLP
Abstract:

Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mixture-of-Memories (MoM), an architecture that employs multiple independent memory states with a router network to direct tokens to specific states, thereby increasing overall memory capacity while reducing interference. Within the taxonomy, this work occupies a unique position: it is the sole paper in the 'Mixture-of-Memories and Multi-State Architectures' leaf, which itself is a distinct branch among ten major research directions. This isolation suggests the paper explores a relatively sparse research direction—multi-state memory systems with routing—compared to more crowded areas like selective state space models or memory-augmented transformers.

The taxonomy reveals that neighboring research directions include selective state space models (e.g., Mamba) that use content-based selection within a single state, gated linear attention mechanisms that incorporate slot-based or gating strategies, and memory-augmented transformers that rely on external memory modules. MoM diverges from these by distributing memory across multiple independent states rather than enhancing a single memory mechanism. The scope notes clarify that multi-scale state space models without routing belong elsewhere, emphasizing that MoM's routing-based multi-state approach is architecturally distinct from both single-state selective models and external memory augmentation strategies.

Among the 23 candidates examined via limited semantic search, none were found to clearly refute any of the three main contributions. For the core MoM architecture, 10 candidates were reviewed with zero refutable overlaps; for the general framework claim, 7 candidates yielded no refutations; and for the hardware-efficient implementation, 6 candidates showed no prior work that directly anticipates this approach. This suggests that within the examined scope, the multi-state routing concept and its integration with diverse memory update mechanisms appear relatively novel, though the search was not exhaustive and focused on top-K semantic matches.

Overall, the analysis indicates that MoM occupies a sparsely populated research niche within linear sequence modeling. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among examined candidates suggest the approach is architecturally distinct from existing methods. However, this assessment is based on a limited literature search of 23 papers, and a broader survey might reveal related multi-state or routing-based memory systems not captured by the current semantic search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: linear sequence modeling with enhanced memory capacity. The field has evolved into a rich landscape of approaches that balance computational efficiency with the ability to retain and retrieve information over long sequences.

At the highest level, the taxonomy reveals several major branches: selective state space and gated recurrence architectures (exemplified by Mamba[1]) that dynamically filter inputs; memory-augmented architectures with external storage (such as Memformer[3]) that maintain explicit memory banks; linear attention and associative recall mechanisms (including Gated Linear Attention[4]) that achieve subquadratic complexity; unified frameworks and theoretical foundations that analyze these models' expressive power; extended LSTM and recurrent architectures (like Extended LSTM[2]) that build on classical designs; mixture-of-memories and multi-state architectures that combine multiple memory systems; efficient training and parallelization techniques (e.g., Delta Rule Parallelization[6]); domain-specific applications; cognitive and theoretical models; LSTM optimization; and survey literature. These branches reflect a tension between architectural simplicity, memory capacity, and computational cost.

Recent work has explored how to combine multiple memory mechanisms to capture different temporal scales and types of dependencies. Mixture of Memories[0] sits within the mixture-of-memories and multi-state architectures branch, proposing to integrate diverse memory components rather than relying on a single recurrent or attention-based mechanism. This contrasts with approaches like Mamba[1], which uses a single selective state space, and Memformer[3], which augments transformers with a unified external memory. The central question across these branches is whether richer memory architectures, despite added complexity, can unlock better long-range reasoning and generalization than simpler linear models. Mixture of Memories[0] addresses this by orchestrating multiple memory types, positioning itself as a flexible alternative to both monolithic state space models and purely attention-driven designs.

Claimed Contributions

Mixture-of-Memories (MoM) architecture

The authors propose MoM, a new architecture that uses multiple independent memory states instead of a single fixed-size memory state. A router network selectively directs input tokens to specific memory states, which enhances overall memory capacity while minimizing memory interference in linear sequence models.

10 retrieved papers
General framework compatible with diverse memory update mechanisms

MoM is designed as a flexible framework that can integrate various memory update mechanisms from different linear sequence modeling methods, such as linear attention, state space models, and linear RNNs, making it broadly applicable across existing approaches.

7 retrieved papers
Hardware-efficient implementation using varlen operations

The authors develop a hardware-efficient implementation that reorders tokens according to routing results and uses varlen (variable-length) operations with Triton kernels. This approach enables MoM to retain linear-time training complexity and constant-time inference complexity while efficiently processing multiple memory states.

6 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture-of-Memories (MoM) architecture

The authors propose MoM, a new architecture that uses multiple independent memory states instead of a single fixed-size memory state. A router network selectively directs input tokens to specific memory states, which enhances overall memory capacity while minimizing memory interference in linear sequence models.
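The per-token mechanism described above can be sketched in plain NumPy. This is a minimal illustration under our own assumptions (top-k softmax routing, a simple additive linear-attention update, and illustrative projection names `Wq`/`Wk`/`Wv`/`Wr`), not the paper's actual implementation:

```python
import numpy as np

def mom_step(S, x, Wq, Wk, Wv, Wr, top_k=2):
    """One token step of a hypothetical Mixture-of-Memories layer.

    S: (M, d, d) array of M independent memory states.
    x: (d,) input token embedding.
    Wr: (d, M) router projection mapping the token to one score per memory.
    """
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    logits = Wr.T @ x                        # router scores, one per memory
    active = np.argsort(logits)[-top_k:]     # route to the top-k memories only
    w = np.exp(logits[active] - logits[active].max())
    w /= w.sum()                             # softmax over selected memories
    out = np.zeros_like(x)
    for weight, m in zip(w, active):
        S[m] += np.outer(k, v)               # additive linear-attention update
        out += weight * (S[m].T @ q)         # read from the updated memory
    return out, S
```

Because only `top_k` of the `M` states are touched per token, each state still evolves with a linear-time recurrence; the untouched memories are left intact, which is the mechanism behind the reduced-interference claim.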

Contribution

General framework compatible with diverse memory update mechanisms

MoM is designed as a flexible framework that can integrate various memory update mechanisms from different linear sequence modeling methods, such as linear attention, state space models, and linear RNNs, making it broadly applicable across existing approaches.
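To illustrate the claimed pluggability, the sketch below treats the per-memory update rule as an interchangeable function. The two rules shown (a plain additive update and a scalar-decay gated update) are simplified stand-ins for the linear-attention and gated variants such a framework would target; the fixed `alpha` is our assumption, since real gated models compute the decay from the input:

```python
import numpy as np

# Two illustrative memory-update rules; the framework treats the rule as a
# pluggable component, so any linear-recurrence update could be slotted in.
def additive_update(S, k, v):
    """Vanilla linear-attention style: S <- S + k v^T."""
    return S + np.outer(k, v)

def gated_update(S, k, v, alpha=0.9):
    """Decay-gated style: S <- alpha * S + k v^T (alpha fixed here for
    illustration only)."""
    return alpha * S + np.outer(k, v)

def run_memory(tokens, update_rule, d):
    """Drive one memory state over a (q, k, v) token stream with the
    chosen update rule, reading out with the query at each step."""
    S = np.zeros((d, d))
    outs = []
    for q, k, v in tokens:
        S = update_rule(S, k, v)
        outs.append(S.T @ q)
    return np.stack(outs)
```

Swapping `additive_update` for `gated_update` changes only the recurrence, not the surrounding routing or readout logic, which is the sense in which the framework is agnostic to the memory update mechanism.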

Contribution

Hardware-efficient implementation using varlen operations

The authors develop a hardware-efficient implementation that reorders tokens according to routing results and uses varlen (variable-length) operations with Triton kernels. This approach enables MoM to retain linear-time training complexity and constant-time inference complexity while efficiently processing multiple memory states.
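A rough NumPy sketch of the reordering step: tokens are permuted so that each memory's tokens form one contiguous segment, with cumulative offsets in the `cu_seqlens` style that varlen attention kernels typically consume. Top-1 assignment, the function name, and the offset convention are our simplifications; the actual Triton kernels are not reproduced here:

```python
import numpy as np

def regroup_by_memory(x, assign, num_mems):
    """Reorder tokens so each memory's tokens are contiguous.

    x: (T, d) tokens; assign: (T,) memory index per token (top-1 routing
    for brevity). Returns the permuted tokens, cu_seqlens-style cumulative
    segment offsets, and the permutation for scattering results back.
    """
    order = np.argsort(assign, kind="stable")  # stable sort keeps time order
    counts = np.bincount(assign, minlength=num_mems)
    cu_seqlens = np.concatenate([[0], np.cumsum(counts)])
    return x[order], cu_seqlens, order

# Each contiguous slice x_sorted[cu_seqlens[m]:cu_seqlens[m+1]] can then be
# handed to a single variable-length kernel launch as memory m's subsequence.
```

The stable sort matters: within each memory's segment, tokens must keep their original temporal order so the linear recurrence sees them in sequence.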