Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: sequential recommendation systems, generative recommendation, production-scale data, user interaction history
Abstract:

Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance model performance. The advent of large language models and sequential modeling techniques, particularly transformer architectures, has led to significant advancements (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges in latency, queries per second (QPS), and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely \emph{VIrtual Sequential Target Attention} (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens; followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in a storage system and utilized as sequence features for downstream model training and inference. This novel design for scalability enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry. Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industrial platform serving billions of users.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes VISTA, a two-stage framework that decomposes target attention into user history summarization followed by candidate-item attention to cached tokens. It resides in the Caching and Precomputed Summarization leaf, which contains only two papers total. This leaf sits within the broader Sequence Compression and Retrieval Architectures branch, indicating a relatively sparse research direction focused on precomputing and storing compressed user representations. The small sibling count suggests this specific caching-based approach is less crowded than neighboring retrieval-based or attention-mechanism categories.

The taxonomy reveals that VISTA's parent branch, Sequence Compression and Retrieval Architectures, includes sibling leaves such as Two-Stage Retrieval-Based Frameworks (eight papers) and Clustering-Based Sequence Aggregation (one paper). These neighbors emphasize coarse retrieval or clustering before fine-grained modeling, whereas VISTA's caching strategy focuses on precomputed summarization tokens stored for downstream inference. Adjacent branches like Efficient Attention and Sequence Modeling Mechanisms explore linear-complexity attention and state-space models, offering alternative routes to scalability without explicit caching. VISTA's position bridges compression and efficient serving, diverging from pure retrieval or memory-augmented designs.

Among the thirty candidates examined (ten per contribution), the analysis identified one refutable pair for the generative sequential reconstruction loss contribution, while the two-stage attention framework and the quasi-linear attention formulation each yielded zero refutations across their ten candidates. This suggests the core architectural innovation—decomposing attention into summarization and candidate-item stages—appears less overlapped in the limited search scope, whereas the reconstruction loss aligns with existing generative or self-supervised methods. The small refutation count across contributions indicates either genuine novelty or gaps in the candidate pool, given the modest search scale.

Based on the limited thirty-candidate search, VISTA's caching-based two-stage design occupies a sparsely populated niche within sequence compression. The taxonomy structure and low sibling count suggest this direction is less explored than retrieval-heavy or attention-mechanism branches. However, the analysis does not cover exhaustive prior work, and the single refutation for the reconstruction loss highlights potential overlap in auxiliary training objectives. Overall, the framework's industrial focus on latency and QPS distinguishes it from academic prototypes, though the search scope leaves room for undiscovered related work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Scaling sequential recommendation systems with ultra-long user interaction histories. The field addresses the challenge of modeling user behavior when interaction sequences grow to thousands or even millions of events, which strains both computational resources and model capacity.

The taxonomy reveals a diverse landscape organized around several complementary strategies. Sequence Compression and Retrieval Architectures focus on distilling or selecting salient interactions through caching, summarization, or retrieval mechanisms, as seen in works like Hierarchical Temporal Convolutional[43]. Efficient Attention and Sequence Modeling Mechanisms explore alternatives to standard transformers, including linear-complexity designs and state-space models such as Mamba Sequential[47]. Meanwhile, Large Language Model Integration for Sequential Recommendation and Foundation Models and Pretraining Strategies investigate how pretrained representations and generative frameworks can capture long-range dependencies, while System-Level Parallelism and Infrastructure branches like Context Parallelism[7] and Sequence Parallelism[28] tackle the engineering challenges of training at scale. Graph-Based and Relational Sequence Modeling, Memory and Temporal Dynamics Modeling, and Unified and Multi-Task Modeling Frameworks round out the taxonomy by addressing relational structure, explicit memory modules, and multi-objective learning.

A central tension across these branches is the trade-off between expressiveness and efficiency: some methods prioritize capturing fine-grained temporal patterns through dense attention or memory networks, while others emphasize computational feasibility via aggressive compression or retrieval.
The original paper[0] sits within the Sequence Compression and Retrieval Architectures branch, specifically under Caching and Precomputed Summarization, suggesting an approach that precomputes or caches compact user representations to avoid reprocessing entire histories at inference time. This contrasts with neighbors like Hierarchical Temporal Convolutional[43], which uses hierarchical convolutions to encode multi-scale temporal structure directly. Compared to memory-augmented approaches or full-sequence transformers, caching strategies offer a pragmatic middle ground: they sacrifice some modeling flexibility to achieve lower latency and memory footprint, making them particularly attractive for industrial deployment scenarios where real-time serving constraints dominate.

Claimed Contributions

Two-stage attention framework (VISTA) for scalable sequential recommendation

The authors introduce VISTA, a novel framework that decomposes traditional target attention into two stages: user history summarization into cached tokens, followed by candidate item attention to those tokens. This design enables scaling to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed.
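The two-stage decomposition can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation; the seed embeddings, toy attention, and dot-product scoring below are illustrative assumptions that merely show why stage 2's per-request cost is independent of history length once the summary tokens are cached.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Standard dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return out

def summarize_history(seeds, history):
    """Stage 1: learned seed embeddings attend once over the full history,
    compressing it into a handful of summarization tokens."""
    return attend(seeds, history, history)

def score_candidate(candidate, summary):
    """Stage 2: the candidate attends only to the cached summary tokens,
    so serving cost no longer depends on the history length."""
    ctx = attend([candidate], summary, summary)[0]
    return sum(c * x for c, x in zip(candidate, ctx))  # toy dot-product score

random.seed(0)
dim, n_hist, n_seeds = 8, 1000, 4
history = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_hist)]
seeds = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_seeds)]

cache = {}  # user_id -> summary tokens, computed offline and stored
cache["user_42"] = summarize_history(seeds, history)

candidate = [random.gauss(0, 1) for _ in range(dim)]
score = score_candidate(candidate, cache["user_42"])
```

Stage 1 touches all 1000 history items but runs offline once per user; stage 2 touches only the 4 cached tokens per candidate at serving time, which is the property the paper's fixed-cost claim rests on.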

10 retrieved papers
Quasi-linear attention formulation for recommendation models

The authors propose a linear time complexity attention mechanism specifically designed for recommendation systems that avoids attention among candidate items to prevent label leakage. This includes the Quasi Linear Unit (QLU) module with non-linear activations to address expressive power limitations of standard linear attention.
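The paper's QLU module is not specified in this report, but the general linear-attention pattern it builds on can be sketched. The sketch below uses the common elu(x)+1 feature map as a stand-in non-linearity (an assumption, not the paper's QLU), and shows the key structural points: cost linear in sequence length, and candidate queries that attend only to history, never to each other.

```python
import math

def phi(v):
    """Non-linear feature map (elu(x)+1), a common stand-in for the
    non-linear activation a QLU-style module would supply."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]

def linear_attention(queries, keys, values):
    """O(n * d * dv) attention: accumulate K/V statistics once over the
    history, then reuse them for every query. Candidate queries never
    attend to one another, which avoids label leakage between candidates
    scored in the same batch."""
    d = len(keys[0])
    dv = len(values[0])
    # S[i][j] = sum_k phi(k)_i * v_j ;  z[i] = sum_k phi(k)_i
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(keys, values):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in queries:
        fq = phi(q)
        denom = sum(fqi * zi for fqi, zi in zip(fq, z))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom for j in range(dv)])
    return out

# Two candidates share the precomputed (S, z) statistics of the history.
out = linear_attention([[0.1] * 4, [0.3] * 4],
                       [[0.2] * 4] * 3,
                       [[1.0, 2.0]] * 3)
```

Because phi is strictly positive, the normalizer is always well-defined; and since the history statistics (S, z) are built once, adding more candidates costs O(d * dv) each rather than O(n) each.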

10 retrieved papers
Generative sequential reconstruction loss for recommendation

The authors introduce a reconstruction loss that encourages the sequence summarization module to fully reproduce the user interaction history sequence. This loss uses a causal decoder network to reconstruct item embeddings, forcing personalized seed embeddings to maximize information retention from the user history.
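The shape of such a causal reconstruction objective can be sketched as follows. The decoder here is a hypothetical placeholder (the paper presumably uses a learned causal network), and mean-squared error stands in for whatever embedding-reconstruction loss the authors actually use; the point is only the causal structure: at step t the decoder sees the summary tokens plus items before t and must reproduce item t.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def reconstruction_loss(summary_tokens, history, decode_step):
    """Causal reconstruction: at each position t, `decode_step` sees the
    summary tokens plus items < t and predicts item t's embedding.
    Minimizing this pushes the summary to retain the full history."""
    total = 0.0
    for t, target in enumerate(history):
        pred = decode_step(summary_tokens, history[:t])
        total += mse(pred, target)
    return total / len(history)

def toy_decoder(summary, prefix):
    """Hypothetical stand-in decoder: averages the visible context."""
    ctx = summary + prefix
    d, n = len(ctx[0]), len(ctx)
    return [sum(v[j] for v in ctx) / n for j in range(d)]

summary = [[0.5, 0.5], [0.1, 0.9]]          # cached summarization tokens
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # item embeddings to reproduce
loss = reconstruction_loss(summary, history, toy_decoder)
```

In training, the gradient of this loss would flow back into the summarization module (the personalized seed embeddings), which is how the objective forces the summary to maximize information retention.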

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Two-stage attention framework (VISTA) for scalable sequential recommendation

The authors introduce VISTA, a novel framework that decomposes traditional target attention into two stages: user history summarization into cached tokens, followed by candidate item attention to those tokens. This design enables scaling to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed.

Contribution

Quasi-linear attention formulation for recommendation models

The authors propose a linear time complexity attention mechanism specifically designed for recommendation systems that avoids attention among candidate items to prevent label leakage. This includes the Quasi Linear Unit (QLU) module with non-linear activations to address expressive power limitations of standard linear attention.

Contribution

Generative sequential reconstruction loss for recommendation

The authors introduce a reconstruction loss that encourages the sequence summarization module to fully reproduce the user interaction history sequence. This loss uses a causal decoder network to reconstruct item embeddings, forcing personalized seed embeddings to maximize information retention from the user history.