MoM: Linear Sequence Modeling with Mixture-of-Memories

ICLR 2026 Conference Submission · Anonymous Authors
Efficient/Low-Resource Methods for NLP · Linear Sequence Modeling · Machine Learning for NLP
Abstract:

Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mixture-of-Memories (MoM), an architecture that employs multiple independent memory states with a router network to direct tokens to specific states, thereby increasing overall memory capacity while reducing interference. Within the taxonomy, this work occupies a unique position: it is the sole paper in the 'Mixture-of-Memories and Multi-State Architectures' leaf, which itself is a distinct branch among ten major research directions. This isolation suggests the paper explores a relatively sparse research direction—multi-state memory systems with routing—compared to more crowded areas like selective state space models or memory-augmented transformers.

The taxonomy reveals that neighboring research directions include selective state space models (e.g., Mamba) that use content-based selection within a single state, gated linear attention mechanisms that incorporate slot-based or gating strategies, and memory-augmented transformers that rely on external memory modules. MoM diverges from these by distributing memory across multiple independent states rather than enhancing a single memory mechanism. The scope notes clarify that multi-scale state space models without routing belong elsewhere, emphasizing that MoM's routing-based multi-state approach is architecturally distinct from both single-state selective models and external memory augmentation strategies.

Among the 23 candidates examined via limited semantic search, none were found to clearly refute any of the three main contributions. For the core MoM architecture, 10 candidates were reviewed with zero refutable overlaps; for the general framework claim, 7 candidates yielded no refutations; and for the hardware-efficient implementation, 6 candidates showed no prior work that directly anticipates this approach. This suggests that within the examined scope, the multi-state routing concept and its integration with diverse memory update mechanisms appear relatively novel, though the search was not exhaustive and focused on top-K semantic matches.

Overall, the analysis indicates that MoM occupies a sparsely populated research niche within linear sequence modeling. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among examined candidates suggest the approach is architecturally distinct from existing methods. However, this assessment is based on a limited literature search of 23 papers, and a broader survey might reveal related multi-state or routing-based memory systems not captured by the current semantic search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: linear sequence modeling with enhanced memory capacity. The field has evolved into a rich landscape of approaches that balance computational efficiency with the ability to retain and retrieve information over long sequences.

At the highest level, the taxonomy reveals several major branches: selective state space and gated recurrence architectures (exemplified by Mamba[1]) that dynamically filter inputs; memory-augmented architectures with external storage (such as Memformer[3]) that maintain explicit memory banks; linear attention and associative recall mechanisms (including Gated Linear Attention[4]) that achieve subquadratic complexity; unified frameworks and theoretical foundations that analyze these models' expressive power; extended LSTM and recurrent architectures (like Extended LSTM[2]) that build on classical designs; mixture-of-memories and multi-state architectures that combine multiple memory systems; efficient training and parallelization techniques (e.g., Delta Rule Parallelization[6]); domain-specific applications; cognitive and theoretical models; LSTM optimization; and survey literature. These branches reflect a tension between architectural simplicity, memory capacity, and computational cost.

Recent work has explored how to combine multiple memory mechanisms to capture different temporal scales and types of dependencies. Mixture of Memories[0] sits within the mixture-of-memories and multi-state architectures branch, proposing to integrate diverse memory components rather than relying on a single recurrent or attention-based mechanism. This contrasts with approaches like Mamba[1], which uses a single selective state space, and Memformer[3], which augments transformers with a unified external memory. The central question across these branches is whether richer memory architectures, despite added complexity, can unlock better long-range reasoning and generalization than simpler linear models. Mixture of Memories[0] addresses this by orchestrating multiple memory types, positioning itself as a flexible alternative to both monolithic state space models and purely attention-driven designs.

Claimed Contributions

Mixture-of-Memories (MoM) architecture

The authors propose MoM, a new architecture that uses multiple independent memory states instead of a single fixed-size memory state. A router network selectively directs input tokens to specific memory states, which enhances overall memory capacity while minimizing memory interference in linear sequence models.

10 retrieved papers
General framework compatible with diverse memory update mechanisms

MoM is designed as a flexible framework that can integrate various memory update mechanisms from different linear sequence modeling methods, such as linear attention, state space models, and linear RNNs, making it broadly applicable across existing approaches.

7 retrieved papers
Hardware-efficient implementation using varlen operations

The authors develop a hardware-efficient implementation that reorders tokens according to routing results and uses varlen (variable-length) operations with Triton kernels. This approach enables MoM to retain linear-time training complexity and constant-time inference complexity while efficiently processing multiple memory states.

6 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture-of-Memories (MoM) architecture

The authors propose MoM, a new architecture that uses multiple independent memory states instead of a single fixed-size memory state. A router network selectively directs input tokens to specific memory states, which enhances overall memory capacity while minimizing memory interference in linear sequence models.
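The per-token mechanism described above can be sketched in plain NumPy. This is a minimal illustration under our own assumptions (top-k softmax routing, a simple additive linear-attention update, and illustrative projection names `Wq`/`Wk`/`Wv`/`Wr`), not the paper's actual implementation:

```python
import numpy as np

def mom_step(S, x, Wq, Wk, Wv, Wr, top_k=2):
    """One token step of a hypothetical Mixture-of-Memories layer.

    S: (M, d, d) array of M independent memory states.
    x: (d,) input token embedding.
    Wr: (d, M) router projection mapping the token to one score per memory.
    """
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    logits = Wr.T @ x                        # router scores, one per memory
    active = np.argsort(logits)[-top_k:]     # route to the top-k memories only
    w = np.exp(logits[active] - logits[active].max())
    w /= w.sum()                             # softmax over selected memories
    out = np.zeros_like(x)
    for weight, m in zip(w, active):
        S[m] += np.outer(k, v)               # additive linear-attention update
        out += weight * (S[m].T @ q)         # read from the updated memory
    return out, S
```

Because only `top_k` of the `M` states are touched per token, each state still evolves with a linear-time recurrence; the untouched memories are left intact, which is the mechanism behind the reduced-interference claim.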

Contribution

General framework compatible with diverse memory update mechanisms

MoM is designed as a flexible framework that can integrate various memory update mechanisms from different linear sequence modeling methods, such as linear attention, state space models, and linear RNNs, making it broadly applicable across existing approaches.
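To illustrate the claimed pluggability, the sketch below treats the per-memory update rule as an interchangeable function. The two rules shown (a plain additive update and a scalar-decay gated update) are simplified stand-ins for the linear-attention and gated variants such a framework would target; the fixed `alpha` is our assumption, since real gated models compute the decay from the input:

```python
import numpy as np

# Two illustrative memory-update rules; the framework treats the rule as a
# pluggable component, so any linear-recurrence update could be slotted in.
def additive_update(S, k, v):
    """Vanilla linear-attention style: S <- S + k v^T."""
    return S + np.outer(k, v)

def gated_update(S, k, v, alpha=0.9):
    """Decay-gated style: S <- alpha * S + k v^T (alpha fixed here for
    illustration only)."""
    return alpha * S + np.outer(k, v)

def run_memory(tokens, update_rule, d):
    """Drive one memory state over a (q, k, v) token stream with the
    chosen update rule, reading out with the query at each step."""
    S = np.zeros((d, d))
    outs = []
    for q, k, v in tokens:
        S = update_rule(S, k, v)
        outs.append(S.T @ q)
    return np.stack(outs)
```

Swapping `additive_update` for `gated_update` changes only the recurrence, not the surrounding routing or readout logic, which is the sense in which the framework is agnostic to the memory update mechanism.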

Contribution

Hardware-efficient implementation using varlen operations

The authors develop a hardware-efficient implementation that reorders tokens according to routing results and uses varlen (variable-length) operations with Triton kernels. This approach enables MoM to retain linear-time training complexity and constant-time inference complexity while efficiently processing multiple memory states.
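A rough NumPy sketch of the reordering step: tokens are permuted so that each memory's tokens form one contiguous segment, with cumulative offsets in the `cu_seqlens` style that varlen attention kernels typically consume. Top-1 assignment, the function name, and the offset convention are our simplifications; the actual Triton kernels are not reproduced here:

```python
import numpy as np

def regroup_by_memory(x, assign, num_mems):
    """Reorder tokens so each memory's tokens are contiguous.

    x: (T, d) tokens; assign: (T,) memory index per token (top-1 routing
    for brevity). Returns the permuted tokens, cu_seqlens-style cumulative
    segment offsets, and the permutation for scattering results back.
    """
    order = np.argsort(assign, kind="stable")  # stable sort keeps time order
    counts = np.bincount(assign, minlength=num_mems)
    cu_seqlens = np.concatenate([[0], np.cumsum(counts)])
    return x[order], cu_seqlens, order

# Each contiguous slice x_sorted[cu_seqlens[m]:cu_seqlens[m+1]] can then be
# handed to a single variable-length kernel launch as memory m's subsequence.
```

The stable sort matters: within each memory's segment, tokens must keep their original temporal order so the linear recurrence sees them in sequence.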