Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: sequential recommendation systems, generative recommendation, production-scale data, user interaction history
Abstract:

Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance model performance. The advent of large language models and sequential modeling techniques, particularly transformer architectures, has led to significant advancements (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges in latency, queries per second (QPS), and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely \emph{VIrtual Sequential Target Attention} (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens; followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in a storage system and utilized as sequence features for downstream model training and inference. This novel design for scalability enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry. Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industrial platform serving billions of users.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes VISTA, a two-stage framework that decomposes target attention into user history summarization followed by candidate-item attention to cached tokens. It resides in the Caching and Precomputed Summarization leaf, which contains only two papers total. This leaf sits within the broader Sequence Compression and Retrieval Architectures branch, indicating a relatively sparse research direction focused on precomputing and storing compressed user representations. The small sibling count suggests this specific caching-based approach is less crowded than neighboring retrieval-based or attention-mechanism categories.

The taxonomy reveals that VISTA's parent branch, Sequence Compression and Retrieval Architectures, includes sibling leaves such as Two-Stage Retrieval-Based Frameworks (eight papers) and Clustering-Based Sequence Aggregation (one paper). These neighbors emphasize coarse retrieval or clustering before fine-grained modeling, whereas VISTA's caching strategy focuses on precomputed summarization tokens stored for downstream inference. Adjacent branches like Efficient Attention and Sequence Modeling Mechanisms explore linear-complexity attention and state-space models, offering alternative routes to scalability without explicit caching. VISTA's position bridges compression and efficient serving, diverging from pure retrieval or memory-augmented designs.

Among the thirty candidates examined (ten per contribution), the analysis identified one refutable pair for the generative sequential reconstruction loss contribution, while the two-stage attention framework and the quasi-linear attention formulation each yielded zero refutations across their ten candidates. This suggests the core architectural innovation—decomposing attention into summarization and candidate-item stages—appears less overlapped in the limited search scope, whereas the reconstruction loss aligns with existing generative or self-supervised methods. The small refutation count across contributions indicates either genuine novelty or gaps in the candidate pool, given the modest search scale.

Based on the limited thirty-candidate search, VISTA's caching-based two-stage design occupies a sparsely populated niche within sequence compression. The taxonomy structure and low sibling count suggest this direction is less explored than retrieval-heavy or attention-mechanism branches. However, the analysis does not cover exhaustive prior work, and the single refutation for the reconstruction loss highlights potential overlap in auxiliary training objectives. Overall, the framework's industrial focus on latency and QPS distinguishes it from academic prototypes, though the search scope leaves room for undiscovered related work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Scaling sequential recommendation systems with ultra-long user interaction histories. The field addresses the challenge of modeling user behavior when interaction sequences grow to thousands or even millions of events, which strains both computational resources and model capacity.

The taxonomy reveals a diverse landscape organized around several complementary strategies. Sequence Compression and Retrieval Architectures focus on distilling or selecting salient interactions through caching, summarization, or retrieval mechanisms, as seen in works like Hierarchical Temporal Convolutional[43]. Efficient Attention and Sequence Modeling Mechanisms explore alternatives to standard transformers, including linear-complexity designs and state-space models such as Mamba Sequential[47]. Meanwhile, Large Language Model Integration for Sequential Recommendation and Foundation Models and Pretraining Strategies investigate how pretrained representations and generative frameworks can capture long-range dependencies, while System-Level Parallelism and Infrastructure branches like Context Parallelism[7] and Sequence Parallelism[28] tackle the engineering challenges of training at scale. Graph-Based and Relational Sequence Modeling, Memory and Temporal Dynamics Modeling, and Unified and Multi-Task Modeling Frameworks round out the taxonomy by addressing relational structure, explicit memory modules, and multi-objective learning.

A central tension across these branches is the trade-off between expressiveness and efficiency: some methods prioritize capturing fine-grained temporal patterns through dense attention or memory networks, while others emphasize computational feasibility via aggressive compression or retrieval.
The original paper[0] sits within the Sequence Compression and Retrieval Architectures branch, specifically under Caching and Precomputed Summarization, suggesting an approach that precomputes or caches compact user representations to avoid reprocessing entire histories at inference time. This contrasts with neighbors like Hierarchical Temporal Convolutional[43], which uses hierarchical convolutions to encode multi-scale temporal structure directly. Compared to memory-augmented approaches or full-sequence transformers, caching strategies offer a pragmatic middle ground: they sacrifice some modeling flexibility to achieve lower latency and memory footprint, making them particularly attractive for industrial deployment scenarios where real-time serving constraints dominate.

Claimed Contributions

Two-stage attention framework (VISTA) for scalable sequential recommendation

The authors introduce VISTA, a novel framework that decomposes traditional target attention into two stages: user history summarization into cached tokens, followed by candidate item attention to those tokens. This design enables scaling to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed.
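The two-stage decomposition can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation; the seed embeddings, toy attention, and dot-product scoring below are illustrative assumptions that merely show why stage 2's per-request cost is independent of history length once the summary tokens are cached.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Standard dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return out

def summarize_history(seeds, history):
    """Stage 1: learned seed embeddings attend once over the full history,
    compressing it into a handful of summarization tokens."""
    return attend(seeds, history, history)

def score_candidate(candidate, summary):
    """Stage 2: the candidate attends only to the cached summary tokens,
    so serving cost no longer depends on the history length."""
    ctx = attend([candidate], summary, summary)[0]
    return sum(c * x for c, x in zip(candidate, ctx))  # toy dot-product score

random.seed(0)
dim, n_hist, n_seeds = 8, 1000, 4
history = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_hist)]
seeds = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_seeds)]

cache = {}  # user_id -> summary tokens, computed offline and stored
cache["user_42"] = summarize_history(seeds, history)

candidate = [random.gauss(0, 1) for _ in range(dim)]
score = score_candidate(candidate, cache["user_42"])
```

Stage 1 touches all 1000 history items but runs offline once per user; stage 2 touches only the 4 cached tokens per candidate at serving time, which is the property the paper's fixed-cost claim rests on.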

10 retrieved papers
Quasi-linear attention formulation for recommendation models

The authors propose a linear time complexity attention mechanism specifically designed for recommendation systems that avoids attention among candidate items to prevent label leakage. This includes the Quasi Linear Unit (QLU) module with non-linear activations to address expressive power limitations of standard linear attention.
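The paper's QLU module is not specified in this report, but the general linear-attention pattern it builds on can be sketched. The sketch below uses the common elu(x)+1 feature map as a stand-in non-linearity (an assumption, not the paper's QLU), and shows the key structural points: cost linear in sequence length, and candidate queries that attend only to history, never to each other.

```python
import math

def phi(v):
    """Non-linear feature map (elu(x)+1), a common stand-in for the
    non-linear activation a QLU-style module would supply."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]

def linear_attention(queries, keys, values):
    """O(n * d * dv) attention: accumulate K/V statistics once over the
    history, then reuse them for every query. Candidate queries never
    attend to one another, which avoids label leakage between candidates
    scored in the same batch."""
    d = len(keys[0])
    dv = len(values[0])
    # S[i][j] = sum_k phi(k)_i * v_j ;  z[i] = sum_k phi(k)_i
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(keys, values):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in queries:
        fq = phi(q)
        denom = sum(fqi * zi for fqi, zi in zip(fq, z))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom for j in range(dv)])
    return out

# Two candidates share the precomputed (S, z) statistics of the history.
out = linear_attention([[0.1] * 4, [0.3] * 4],
                       [[0.2] * 4] * 3,
                       [[1.0, 2.0]] * 3)
```

Because phi is strictly positive, the normalizer is always well-defined; and since the history statistics (S, z) are built once, adding more candidates costs O(d * dv) each rather than O(n) each.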

10 retrieved papers
Generative sequential reconstruction loss for recommendation

The authors introduce a reconstruction loss that encourages the sequence summarization module to fully reproduce the user interaction history sequence. This loss uses a causal decoder network to reconstruct item embeddings, forcing personalized seed embeddings to maximize information retention from the user history.
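The shape of such a causal reconstruction objective can be sketched as follows. The decoder here is a hypothetical placeholder (the paper presumably uses a learned causal network), and mean-squared error stands in for whatever embedding-reconstruction loss the authors actually use; the point is only the causal structure: at step t the decoder sees the summary tokens plus items before t and must reproduce item t.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def reconstruction_loss(summary_tokens, history, decode_step):
    """Causal reconstruction: at each position t, `decode_step` sees the
    summary tokens plus items < t and predicts item t's embedding.
    Minimizing this pushes the summary to retain the full history."""
    total = 0.0
    for t, target in enumerate(history):
        pred = decode_step(summary_tokens, history[:t])
        total += mse(pred, target)
    return total / len(history)

def toy_decoder(summary, prefix):
    """Hypothetical stand-in decoder: averages the visible context."""
    ctx = summary + prefix
    d, n = len(ctx[0]), len(ctx)
    return [sum(v[j] for v in ctx) / n for j in range(d)]

summary = [[0.5, 0.5], [0.1, 0.9]]          # cached summarization tokens
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # item embeddings to reproduce
loss = reconstruction_loss(summary, history, toy_decoder)
```

In training, the gradient of this loss would flow back into the summarization module (the personalized seed embeddings), which is how the objective forces the summary to maximize information retention.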

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Two-stage attention framework (VISTA) for scalable sequential recommendation

The authors introduce VISTA, a novel framework that decomposes traditional target attention into two stages: user history summarization into cached tokens, followed by candidate item attention to those tokens. This design enables scaling to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed.

Contribution

Quasi-linear attention formulation for recommendation models

The authors propose a linear time complexity attention mechanism specifically designed for recommendation systems that avoids attention among candidate items to prevent label leakage. This includes the Quasi Linear Unit (QLU) module with non-linear activations to address expressive power limitations of standard linear attention.

Contribution

Generative sequential reconstruction loss for recommendation

The authors introduce a reconstruction loss that encourages the sequence summarization module to fully reproduce the user interaction history sequence. This loss uses a causal decoder network to reconstruct item embeddings, forcing personalized seed embeddings to maximize information retention from the user history.