Mamba-3: Improved Sequence Modeling using State Space Principles

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: State Space Models, Mamba, LLMs, Subquadratic Models
Abstract:

The recent scaling of test-time compute for LLMs has restricted practical deployment to models that can generate high-quality outputs in an inference-efficient manner. While Transformer-based models are the current standard, their quadratic compute and linear memory bottlenecks have spurred the development of sub-quadratic models with linearly scaling compute and constant memory requirements. However, many recent linear-style models lack certain capabilities or lag behind in quality, and even their linear-time inference is not hardware-efficient. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state-space-model viewpoint of linear models: 1) a more expressive recurrence, 2) a complex state-update rule that enables richer state tracking, and 3) a multi-input, multi-output formulation, together resulting in a stronger model that better exploits hardware parallelism during decoding. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language-modeling tasks. The new architecture sets the Pareto frontier for performance under a fixed inference budget and outperforms strong baselines in head-to-head comparisons.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mamba-3, a state space model architecture that combines trapezoidal discretization, complex-valued state updates with data-dependent rotary position embeddings, and a multi-input multi-output formulation to improve inference efficiency. It resides in the State Space Model Foundations leaf, which contains four papers including foundational work like Mamba and Structured Linear CDEs. This leaf represents a moderately populated research direction within the broader Linear-Time Sequence Model Architectures branch, indicating active but not overcrowded exploration of core SSM design principles.

The taxonomy reveals that State Space Model Foundations sits alongside Linear Attention Mechanisms (three papers), Recurrent and Convolutional Sequence Models (four papers), and Hybrid and Multi-Modal Architectures (four papers). These neighboring leaves explore alternative paths to linear complexity: attention approximations, gated recurrence, and architectural fusion. Mamba-3's focus on enriching the SSM recurrence and state update rules positions it as an evolution within the SSM paradigm rather than a hybrid approach, distinguishing it from multi-modal extensions or attention-based alternatives in sibling categories.

Among 30 candidates examined, the trapezoidal discretization contribution shows no clear refutation across 10 candidates, suggesting relative novelty in this specific discretization scheme. The complex-valued state update rule encountered one refutable candidate among 10 examined, indicating some prior exploration of complex state mechanisms. The MIMO formulation found two refutable candidates among 10, suggesting more substantial prior work on multi-channel or parallel processing strategies. These statistics reflect a limited semantic search scope, not exhaustive coverage, and indicate that the discretization method appears least explored while the MIMO approach has more documented precedents.

Based on the top-30 semantic matches and taxonomy structure, the work appears to advance an active but not saturated research direction. The contribution-level analysis suggests incremental refinement of existing SSM concepts rather than entirely novel primitives, though the specific combination and hardware-oriented design may offer practical value. The limited search scope means potentially relevant work outside the top-30 candidates or in adjacent subfields may not be captured in this assessment.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: efficient inference for linear-time sequence models. The field encompasses architectures and techniques that process sequences in linear time, avoiding the quadratic complexity of standard attention mechanisms. The taxonomy reveals several major branches: foundational architectures (including state space models and their variants), optimization techniques for faster inference (such as hardware-aware kernels and memory-efficient implementations), training and compression methods that reduce model size or computational overhead, domain-specific applications spanning speech, vision, and time-series forecasting, error correction and decoding theory from coding and communication systems, theoretical analyses of convergence and expressiveness, and survey literature synthesizing recent progress.

Representative works like Mamba[49] and xLSTM 7B[10] illustrate how state space model foundations enable scalable sequence modeling, while efforts such as Flash Inference[12] and Hardware-efficient Attention[4] demonstrate inference optimization in practice. Meanwhile, compression approaches like In-Training Compression SSMs[43] and modular designs such as Sparse Modular Activation[2] address efficiency from complementary angles.

A particularly active line of work centers on state space model architectures that balance expressiveness with computational efficiency, contrasting with traditional recurrent and attention-based methods. Mamba-3[0] sits within this dense branch of state space model foundations, building on the selective state space framework introduced by Mamba[49] and exploring structured parameterizations akin to Structured Linear CDEs[42]. Compared to nearby efforts like In-Training Compression SSMs[43], which emphasizes reducing memory footprint during training, Mamba-3[0] focuses more directly on architectural innovations that preserve linear-time inference guarantees while enhancing modeling capacity.
This positioning reflects a broader tension in the field: whether to prioritize architectural expressiveness, aggressive compression, or hardware-specific optimizations. Open questions remain around the trade-offs between these dimensions, especially as models scale and as domain-specific constraints—from real-time speech decoding to long-context document understanding—demand tailored solutions.

Claimed Contributions

Trapezoidal discretization for state-space models

The authors introduce a generalized trapezoidal discretization method for state-space models that provides a second-order accurate approximation, yielding a more expressive recurrence than Mamba-2's Euler-based approach. This discretization can be viewed as applying a data-dependent convolution and, combined with biases applied to B and C, empirically eliminates the need for the short causal convolution.
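As numerical intuition for the second-order claim, a minimal sketch (a generic scalar ODE, not the paper's parameterization) compares a first-order Euler update against the trapezoidal rule:

```python
import numpy as np

# Scalar test ODE dx/dt = a*x with known solution x(t) = exp(a*t).
# Euler is first-order accurate (as in Mamba-2-style discretization);
# the trapezoidal rule is second-order. Generic numerical sketch only.

def euler_step(x, a, dt):
    # x_{t+1} = (1 + dt*a) * x_t
    return (1.0 + dt * a) * x

def trapezoidal_step(x, a, dt):
    # Implicit trapezoidal rule, solved in closed form for the scalar case:
    # x_{t+1} = x_t + (dt/2) * (a*x_t + a*x_{t+1})
    return (1.0 + 0.5 * dt * a) / (1.0 - 0.5 * dt * a) * x

def rollout(step, a=-1.0, dt=0.1, n=100):
    x = 1.0
    for _ in range(n):
        x = step(x, a, dt)
    return x

exact = np.exp(-1.0 * 0.1 * 100)  # true x(T) at T = n*dt = 10
err_euler = abs(rollout(euler_step) - exact)
err_trap = abs(rollout(trapezoidal_step) - exact)
print(err_trap < err_euler)  # True: trapezoidal is markedly more accurate
```

Halving dt cuts the Euler error roughly in half but the trapezoidal error roughly by four, which is what "second-order accurate" means in practice.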

10 retrieved papers
Complex-valued state update rule with data-dependent RoPE

The authors propose using complex-valued state-space models that enable rotational hidden state dynamics, addressing state-tracking limitations in prior linear models. They show this is equivalent to applying data-dependent rotary embeddings (RoPE) on input and output projections, enabling efficient implementation while recovering capabilities like parity and modular arithmetic that Mamba-2 cannot solve.
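The claimed equivalence between a complex-valued diagonal recurrence and rotations of a real state can be checked numerically. A minimal sketch with hypothetical toy dimensions and data-dependent angles:

```python
import numpy as np

# A complex diagonal recurrence h_t = exp(i*theta_t)*h_{t-1} + b_t*x_t
# is equivalent to a real 2x2 rotation acting on the stacked (Re, Im)
# state -- the "data-dependent RoPE" view. Toy sketch, not the paper's
# exact parameterization.

rng = np.random.default_rng(0)
T = 16
theta = rng.uniform(-np.pi, np.pi, T)          # data-dependent angles
x = rng.normal(size=T)
b = rng.normal(size=T) + 1j * rng.normal(size=T)

# Complex recurrence
h_c = 0j
for t in range(T):
    h_c = np.exp(1j * theta[t]) * h_c + b[t] * x[t]

# Equivalent real recurrence with a 2x2 rotation matrix
def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

h_r = np.zeros(2)
for t in range(T):
    h_r = rot(theta[t]) @ h_r + np.array([b[t].real, b[t].imag]) * x[t]

ok = np.allclose(h_r, [h_c.real, h_c.imag])
print(ok)  # True
```

Unrolling the real recurrence expresses the state as input contributions rotated by cumulative angles, i.e. rotary embeddings with data-dependent angles applied to the input and output projections, matching the description above.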

10 retrieved papers
Can Refute
Multi-input multi-output (MIMO) formulation for improved hardware utilization

The authors introduce a MIMO variant that shifts from outer-product-based to matrix-multiplication-based state updates, increasing arithmetic intensity and improving hardware utilization during decoding. This formulation packs more compute into each state update without increasing the state size, pushing out the Pareto frontier of inference efficiency while maintaining or improving model quality.
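The shift from outer-product to matrix-multiplication updates can be illustrated with toy shapes (hypothetical dimensions, not the paper's kernel):

```python
import numpy as np

# Single-input update: the state H (n x d) receives a rank-1 outer
# product per step. MIMO update: r input/output channels turn this
# into a rank-r matrix multiply -- r times the FLOPs against the same
# state memory, i.e. higher arithmetic intensity at fixed state size.
# Toy shapes only.

n, d, r = 8, 16, 4
rng = np.random.default_rng(1)
H = np.zeros((n, d))

# Rank-1 (outer-product) update: n*d multiply-adds
B1, x1 = rng.normal(size=n), rng.normal(size=d)
H_siso = H + np.outer(B1, x1)

# Rank-r (matrix-multiply) update: r*n*d multiply-adds, same state shape
B, X = rng.normal(size=(n, r)), rng.normal(size=(r, d))
H_mimo = H + B @ X

assert H_siso.shape == H_mimo.shape == (n, d)  # state size unchanged
```

During decoding the state must be read from memory either way, so performing r·n·d multiply-adds instead of n·d per read raises the compute-to-bytes ratio, which is what better saturates matrix-multiply hardware.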

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
