FlashRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: RNN, Mamba, SSM, Transformers, Parallelization, Parallel scan, Nonlinear
Abstract:

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present FlashRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665× over naïve sequential application, enabling the training of nonlinear RNNs at unprecedented scales. To showcase this, we apply FlashRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the FlashRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FlashRNN, a framework enabling parallel training of nonlinear RNNs by casting recurrence relationships as a system of equations solved via Newton's method and custom parallel reductions. It resides in the 'Sequence-Level Parallelization via Fixed-Point Formulations' leaf, which contains only three papers total. This leaf sits within the broader 'Parallelization Frameworks and Algorithms' branch, indicating a relatively sparse but well-defined research direction focused on iterative fixed-point methods rather than data-parallel or non-iterative strategies.
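The core idea can be made concrete with a toy scalar recurrence: stack all hidden states into one residual system F(h) = 0 and apply Newton's method, whose bidiagonal Jacobian solve is the step the paper accelerates with parallel reductions. Everything below (the tanh cell, the coefficient `a`, the sequential back-substitution) is an illustrative sketch under assumed toy choices, not the paper's implementation.

```python
import numpy as np

def f(h_prev, x, a=0.5):
    # toy nonlinear recurrence cell: h_t = tanh(a * h_{t-1} + x_t)
    return np.tanh(a * h_prev + x)

def sequential(x, a=0.5):
    # naive step-by-step evaluation: the O(T) serial baseline
    h, out = 0.0, []
    for xt in x:
        h = f(h, xt, a)
        out.append(h)
    return np.array(out)

def newton_parallel(x, n_iters=8, a=0.5):
    # solve F(h) = 0 with F_t(h) = h_t - f(h_{t-1}, x_t), all t at once
    T = len(x)
    h = np.zeros(T)                                   # joint initial guess
    for _ in range(n_iters):
        hp = np.concatenate(([0.0], h[:-1]))          # shifted states h_{t-1}
        F = h - f(hp, x, a)                           # residual, parallel over t
        sub = -a * (1.0 - np.tanh(a * hp + x) ** 2)   # Jacobian subdiagonal dF_t/dh_{t-1}
        dh = np.empty(T)                              # solve J dh = -F (lower bidiagonal)
        dh[0] = -F[0]
        for t in range(1, T):                         # this linear recurrence is the part
            dh[t] = -F[t] - sub[t] * dh[t - 1]        # a parallel reduction would replace
        h = h + dh
    return h
```

Each Newton sweep makes at least one more prefix state exact, so at most T sweeps recover the serial answer, and for contractive cells far fewer suffice; the parallel speedup then hinges on evaluating the inner bidiagonal solve with a log-depth reduction instead of the serial loop shown here.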

The taxonomy reveals neighboring leaves addressing complementary parallelization strategies: 'Data-Parallel and Distributed Training Strategies' (five papers) focuses on multi-worker synchronization, while 'Non-Iterative and Extreme Learning Machine Approaches' (one paper) eliminates backpropagation through time entirely. The 'Novel Nonlinear RNN Architectures' branch explores architectural innovations like minimal convolutional variants and hybrid fusion models, which often assume or enable parallelization but do not primarily contribute algorithmic frameworks. FlashRNN's fixed-point formulation bridges algorithmic parallelization with architectural adaptations (LSTM/GRU), connecting these two major branches.

Among thirty candidates examined, the FlashRNN framework itself (Contribution A) and adapted LSTM/GRU architectures (Contribution B) show no clear refutation across ten candidates each, suggesting limited direct overlap in the examined literature. However, the open-source PyTorch+CUDA library (Contribution C) encountered two refutable candidates among ten examined, indicating prior implementations or tools with overlapping functionality. The framework and architectural contributions appear more novel within this search scope, while the software artifact faces stronger prior work in the examined sample.

Based on the limited top-30 semantic search, FlashRNN occupies a sparsely populated methodological niche—sequence-level fixed-point parallelization—with only two sibling papers in its taxonomy leaf. The framework and architectural adaptations show stronger novelty signals than the software library component. This assessment reflects the examined candidate set and does not claim exhaustive coverage of all relevant prior work in parallel RNN training or open-source tooling.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: parallel training of nonlinear recurrent neural networks. The field organizes around three main branches that reflect distinct research emphases. The Parallelization Frameworks and Algorithms branch focuses on computational strategies to accelerate RNN training, including data-parallel methods, sequence-level parallelization via fixed-point formulations, and hardware-aware optimizations that exploit GPU or distributed architectures. The Novel Nonlinear RNN Architectures branch explores new model designs, such as minimal convolutional variants, self-organizing structures, and hybrid series-parallel topologies, that balance expressiveness with trainability. The Applications and Domain-Specific Architectures branch addresses how RNNs are tailored to particular domains, from control and system identification to vision and language tasks, often incorporating domain constraints or specialized loss functions. Together, these branches capture the interplay between algorithmic innovation, architectural design, and practical deployment.

A particularly active line of work within parallelization explores sequence-level methods that reformulate recurrent dependencies as fixed-point problems, enabling greater concurrency during training. FlashRNN [0] sits squarely in this cluster, proposing a fixed-point formulation that allows parallel computation across time steps. It shares conceptual ground with Pararnn [2], which also targets sequence-level parallelism, and contrasts with more traditional data-parallel approaches like Optimized Parallel RNN [1] or hardware-centric schemes such as Single Stream GPU [32]. Meanwhile, works like Scalable Parallel RNNs [31] emphasize scalability across distributed systems, highlighting trade-offs between communication overhead and per-device computation.
The central tension across these efforts is whether to parallelize over sequences, layers, or data batches, and how to manage the inherent sequential dependencies of recurrence without sacrificing model fidelity or convergence guarantees.
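The simplest version of the fixed-point reformulation mentioned above treats the whole trajectory as the fixed point of one vectorized map and iterates sweeps until it stops changing. The cell and sweep count below are illustrative assumptions, not any specific paper's method; Newton-type schemes like FlashRNN's converge faster than these plain sweeps.

```python
import numpy as np

def cell(h_prev, x):
    # toy contractive cell; stands in for any nonlinear recurrence step
    return np.tanh(0.8 * h_prev + x)

def serial_scan(x):
    # the inherently sequential evaluation the reformulation avoids
    h, out = 0.0, []
    for xt in x:
        h = cell(h, xt)
        out.append(h)
    return np.array(out)

def fixed_point_sweeps(x, n_sweeps):
    # view the trajectory as the fixed point of h = cell(shift(h), x):
    # each sweep is fully parallel over t, and sweep k makes h_0..h_{k-1}
    # exact, so T sweeps always suffice (far fewer for contractive cells)
    h = np.zeros(len(x))
    for _ in range(n_sweeps):
        h = cell(np.concatenate(([0.0], h[:-1])), x)
    return h
```

This is exactly the trade-off the paragraph above describes: each sweep is embarrassingly parallel across the sequence, at the cost of repeating work until the sequential dependency has propagated.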

Claimed Contributions

FlashRNN framework for parallel training of nonlinear RNNs

The authors introduce FlashRNN, a framework that enables parallel training of nonlinear recurrent neural networks by casting the sequence of nonlinear recurrence relationships as a system of equations solved using Newton iterations combined with custom parallel reductions. This overcomes the traditional sequential computation barrier that has limited RNN scalability.

10 retrieved papers

Adapted LSTM and GRU architectures for large-scale training

The authors demonstrate that classical nonlinear RNN models (LSTM and GRU) can be trained at unprecedented scales of 7 billion parameters using FlashRNN, achieving competitive performance with Transformers and Mamba2 on language modeling tasks. This shows that nonlinear RNNs remain viable alternatives when computational barriers are removed.

10 retrieved papers

Open-source PyTorch+CUDA library for automatic RNN parallelization

The authors provide a high-performance PyTorch and CUDA library that automates sequence-parallel training for any nonlinear RNN cell from only the specification of its recurrence step. This enables researchers to explore new nonlinear RNN architectures at scale without manually implementing the underlying parallelization complexity.

10 retrieved papers, of which 2 can refute this contribution
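The "specify only the recurrence step" workflow claimed above might look roughly like the following sketch. The name `parallelize` and its signature are hypothetical illustrations, not FlashRNN's actual API; the Jacobian subdiagonal is approximated by finite differences so that the user supplies nothing beyond the step function, and the serial forward-substitution stands in for the library's parallel reductions.

```python
import numpy as np

def parallelize(step, h0=0.0, eps=1e-6, n_iters=10):
    """Given only a scalar recurrence step h_t = step(h_{t-1}, x_t), build a
    solver that computes every h_t jointly via Newton iterations on the
    stacked residual system F_t(h) = h_t - step(h_{t-1}, x_t)."""
    def run(x):
        T = len(x)
        h = np.full(T, h0, dtype=float)
        for _ in range(n_iters):
            hp = np.concatenate(([h0], h[:-1]))       # shifted states h_{t-1}
            F = h - step(hp, x)                       # stacked residual
            # finite-difference estimate of the Jacobian subdiagonal
            sub = -(step(hp + eps, x) - step(hp - eps, x)) / (2 * eps)
            dh = np.empty(T)                          # forward-substitute J dh = -F
            dh[0] = -F[0]                             # (a serial stand-in for the
            for t in range(1, T):                     # parallel linear solve)
                dh[t] = -F[t] - sub[t] * dh[t - 1]
            h = h + dh
        return h
    return run

# usage: hand over just the cell's step function, get a sequence-level solver
run_toy_cell = parallelize(lambda h, x: np.tanh(0.5 * h + x))
```

The design point this sketch tries to capture is the contribution's claim: the recurrence step is the only user-facing specification, and the solver machinery is generated around it.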

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: FlashRNN framework for parallel training of nonlinear RNNs

Contribution B: Adapted LSTM and GRU architectures for large-scale training

Contribution C: Open-source PyTorch+CUDA library for automatic RNN parallelization