FlashRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: RNN, Mamba, SSM, Transformers, Parallelization, Parallel scan, Nonlinear
Abstract:

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present FlashRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665× over naïve sequential application, enabling the training of nonlinear RNNs at unprecedented scales. To showcase this, we apply FlashRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the FlashRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FlashRNN, a framework enabling parallel training of nonlinear RNNs by casting recurrence relationships as a system of equations solved via Newton's method and custom parallel reductions. It resides in the 'Sequence-Level Parallelization via Fixed-Point Formulations' leaf, which contains only three papers total. This leaf sits within the broader 'Parallelization Frameworks and Algorithms' branch, indicating a relatively sparse but well-defined research direction focused on iterative fixed-point methods rather than data-parallel or non-iterative strategies.
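The core idea can be made concrete with a toy scalar recurrence: stack all hidden states into one residual system F(h) = 0 and apply Newton's method, whose bidiagonal Jacobian solve is the step the paper accelerates with parallel reductions. Everything below (the tanh cell, the coefficient `a`, the sequential back-substitution) is an illustrative sketch under assumed toy choices, not the paper's implementation.

```python
import numpy as np

def f(h_prev, x, a=0.5):
    # toy nonlinear recurrence cell: h_t = tanh(a * h_{t-1} + x_t)
    return np.tanh(a * h_prev + x)

def sequential(x, a=0.5):
    # naive step-by-step evaluation: the O(T) serial baseline
    h, out = 0.0, []
    for xt in x:
        h = f(h, xt, a)
        out.append(h)
    return np.array(out)

def newton_parallel(x, n_iters=8, a=0.5):
    # solve F(h) = 0 with F_t(h) = h_t - f(h_{t-1}, x_t), all t at once
    T = len(x)
    h = np.zeros(T)                                   # joint initial guess
    for _ in range(n_iters):
        hp = np.concatenate(([0.0], h[:-1]))          # shifted states h_{t-1}
        F = h - f(hp, x, a)                           # residual, parallel over t
        sub = -a * (1.0 - np.tanh(a * hp + x) ** 2)   # Jacobian subdiagonal dF_t/dh_{t-1}
        dh = np.empty(T)                              # solve J dh = -F (lower bidiagonal)
        dh[0] = -F[0]
        for t in range(1, T):                         # this linear recurrence is the part
            dh[t] = -F[t] - sub[t] * dh[t - 1]        # a parallel reduction would replace
        h = h + dh
    return h
```

Each Newton sweep makes at least one more prefix state exact, so at most T sweeps recover the serial answer, and for contractive cells far fewer suffice; the parallel speedup then hinges on evaluating the inner bidiagonal solve with a log-depth reduction instead of the serial loop shown here.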

The taxonomy reveals neighboring leaves addressing complementary parallelization strategies: 'Data-Parallel and Distributed Training Strategies' (five papers) focuses on multi-worker synchronization, while 'Non-Iterative and Extreme Learning Machine Approaches' (one paper) eliminates backpropagation through time entirely. The 'Novel Nonlinear RNN Architectures' branch explores architectural innovations like minimal convolutional variants and hybrid fusion models, which often assume or enable parallelization but do not primarily contribute algorithmic frameworks. FlashRNN's fixed-point formulation bridges algorithmic parallelization with architectural adaptations (LSTM/GRU), connecting these two major branches.

Among thirty candidates examined, the FlashRNN framework itself (Contribution A) and adapted LSTM/GRU architectures (Contribution B) show no clear refutation across ten candidates each, suggesting limited direct overlap in the examined literature. However, the open-source PyTorch+CUDA library (Contribution C) encountered two refutable candidates among ten examined, indicating prior implementations or tools with overlapping functionality. The framework and architectural contributions appear more novel within this search scope, while the software artifact faces stronger prior work in the examined sample.

Based on the limited top-30 semantic search, FlashRNN occupies a sparsely populated methodological niche—sequence-level fixed-point parallelization—with only two sibling papers in its taxonomy leaf. The framework and architectural adaptations show stronger novelty signals than the software library component. This assessment reflects the examined candidate set and does not claim exhaustive coverage of all relevant prior work in parallel RNN training or open-source tooling.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: parallel training of nonlinear recurrent neural networks. The field organizes around three main branches that reflect distinct research emphases. The Parallelization Frameworks and Algorithms branch focuses on computational strategies to accelerate RNN training, including data-parallel methods, sequence-level parallelization via fixed-point formulations, and hardware-aware optimizations that exploit GPU or distributed architectures. The Novel Nonlinear RNN Architectures branch explores new model designs, such as minimal convolutional variants, self-organizing structures, and hybrid series-parallel topologies, that balance expressiveness with trainability. The Applications and Domain-Specific Architectures branch addresses how RNNs are tailored to particular domains, from control and system identification to vision and language tasks, often incorporating domain constraints or specialized loss functions. Together, these branches capture the interplay between algorithmic innovation, architectural design, and practical deployment.

A particularly active line of work within parallelization explores sequence-level methods that reformulate recurrent dependencies as fixed-point problems, enabling greater concurrency during training. FlashRNN [0] sits squarely in this cluster, proposing a fixed-point formulation that allows parallel computation across time steps. It shares conceptual ground with Pararnn [2], which also targets sequence-level parallelism, and contrasts with more traditional data-parallel approaches like Optimized Parallel RNN [1] or hardware-centric schemes such as Single Stream GPU [32]. Meanwhile, works like Scalable Parallel RNNs [31] emphasize scalability across distributed systems, highlighting trade-offs between communication overhead and per-device computation.
The central tension across these efforts is whether to parallelize over sequences, layers, or data batches, and how to manage the inherent sequential dependencies of recurrence without sacrificing model fidelity or convergence guarantees.
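The simplest version of the fixed-point reformulation mentioned above treats the whole trajectory as the fixed point of one vectorized map and iterates sweeps until it stops changing. The cell and sweep count below are illustrative assumptions, not any specific paper's method; Newton-type schemes like FlashRNN's converge faster than these plain sweeps.

```python
import numpy as np

def cell(h_prev, x):
    # toy contractive cell; stands in for any nonlinear recurrence step
    return np.tanh(0.8 * h_prev + x)

def serial_scan(x):
    # the inherently sequential evaluation the reformulation avoids
    h, out = 0.0, []
    for xt in x:
        h = cell(h, xt)
        out.append(h)
    return np.array(out)

def fixed_point_sweeps(x, n_sweeps):
    # view the trajectory as the fixed point of h = cell(shift(h), x):
    # each sweep is fully parallel over t, and sweep k makes h_0..h_{k-1}
    # exact, so T sweeps always suffice (far fewer for contractive cells)
    h = np.zeros(len(x))
    for _ in range(n_sweeps):
        h = cell(np.concatenate(([0.0], h[:-1])), x)
    return h
```

This is exactly the trade-off the paragraph above describes: each sweep is embarrassingly parallel across the sequence, at the cost of repeating work until the sequential dependency has propagated.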

Claimed Contributions

FlashRNN framework for parallel training of nonlinear RNNs

The authors introduce FlashRNN, a framework that enables parallel training of nonlinear recurrent neural networks by casting the sequence of nonlinear recurrence relationships as a system of equations solved using Newton iterations combined with custom parallel reductions. This overcomes the traditional sequential computation barrier that has limited RNN scalability.

10 retrieved papers

Adapted LSTM and GRU architectures for large-scale training

The authors demonstrate that classical nonlinear RNN models (LSTM and GRU) can be trained at unprecedented scales of 7 billion parameters using FlashRNN, achieving competitive performance with Transformers and Mamba2 on language modeling tasks. This shows that nonlinear RNNs remain viable alternatives when computational barriers are removed.

10 retrieved papers

Open-source PyTorch+CUDA library for automatic RNN parallelization

The authors provide a high-performance PyTorch and CUDA library that automates sequence-parallel training for any nonlinear RNN cell from only the specification of its recurrence step. This enables researchers to explore new nonlinear RNN architectures at scale without manually implementing the underlying parallelization complexity.

10 retrieved papers, of which 2 can refute this contribution
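The "specify only the recurrence step" workflow claimed above might look roughly like the following sketch. The name `parallelize` and its signature are hypothetical illustrations, not FlashRNN's actual API; the Jacobian subdiagonal is approximated by finite differences so that the user supplies nothing beyond the step function, and the serial forward-substitution stands in for the library's parallel reductions.

```python
import numpy as np

def parallelize(step, h0=0.0, eps=1e-6, n_iters=10):
    """Given only a scalar recurrence step h_t = step(h_{t-1}, x_t), build a
    solver that computes every h_t jointly via Newton iterations on the
    stacked residual system F_t(h) = h_t - step(h_{t-1}, x_t)."""
    def run(x):
        T = len(x)
        h = np.full(T, h0, dtype=float)
        for _ in range(n_iters):
            hp = np.concatenate(([h0], h[:-1]))       # shifted states h_{t-1}
            F = h - step(hp, x)                       # stacked residual
            # finite-difference estimate of the Jacobian subdiagonal
            sub = -(step(hp + eps, x) - step(hp - eps, x)) / (2 * eps)
            dh = np.empty(T)                          # forward-substitute J dh = -F
            dh[0] = -F[0]                             # (a serial stand-in for the
            for t in range(1, T):                     # parallel linear solve)
                dh[t] = -F[t] - sub[t] * dh[t - 1]
            h = h + dh
        return h
    return run

# usage: hand over just the cell's step function, get a sequence-level solver
run_toy_cell = parallelize(lambda h, x: np.tanh(0.5 * h + x))
```

The design point this sketch tries to capture is the contribution's claim: the recurrence step is the only user-facing specification, and the solver machinery is generated around it.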

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: FlashRNN framework for parallel training of nonlinear RNNs

Contribution B: Adapted LSTM and GRU architectures for large-scale training

Contribution C: Open-source PyTorch+CUDA library for automatic RNN parallelization