LongLive: Real-time Interactive Long Video Generation

ICLR 2026 Conference Submission (Anonymous Authors)
Abstract:

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation is challenging for both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference but often degrade in quality on long videos because of memory challenges during long-video training. Beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence across prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates three key components: a KV-recache mechanism that refreshes cached states with the new prompt for smooth, prompt-adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long–test-long); and short window attention paired with a frame-level attention sink, which preserves long-range consistency while enabling faster generation. With these designs, LongLive fine-tunes a 1.3B-parameter short-clip model for minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short- and long-video settings. LongLive supports videos up to 240 seconds on a single H100 GPU, and with FP8 quantization it reaches 24.8 FPS with marginal quality loss.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LongLive proposes a frame-level autoregressive framework for real-time interactive long video generation, combining KV-recache for prompt switching, streaming long tuning for extended temporal coherence, and short window attention with attention sinks. The paper resides in the 'Frame-Level Autoregressive Models with Memory Mechanisms' leaf, which contains only three papers total (including LongLive itself). This is a relatively sparse research direction within the broader taxonomy of fifty papers across twenty-nine leaf nodes, suggesting the work targets a specific niche at the intersection of causal autoregressive generation and long-horizon interactive synthesis.

The taxonomy reveals that LongLive's leaf sits within the 'Streaming Autoregressive Generation Architectures' branch, which also includes chunk-based and adversarial autoregressive methods. Neighboring branches explore real-time diffusion frameworks (e.g., flow matching, pipeline parallelism) and interactive world models with action conditioning. While diffusion-based methods prioritize visual quality through bidirectional attention, LongLive's causal design trades some modeling capacity for KV-caching efficiency. The sibling papers VideoSSM and RELIC address similar memory challenges but through state-space models and retrieval augmentation respectively, whereas LongLive emphasizes recache mechanisms and streaming tuning for interactive prompt transitions.

Among the twenty-three candidates examined, the contribution-level analysis shows varied overlap with prior work. For the KV-recache mechanism for interactive prompt switching, seven candidates were examined with no clear refutations, suggesting relative novelty in this specific interactive control paradigm. For the streaming long tuning strategy, ten candidates were examined and one refutable match was found, indicating some existing work on long-sequence training alignment. For the short window attention with frame-level attention sink, six candidates were examined and one refutable prior was identified, suggesting that attention sink techniques for autoregressive video models have been explored previously, though possibly in different architectural contexts or application domains.

Based on the limited search scope of twenty-three semantically similar candidates, LongLive appears to combine several existing techniques (attention sinks, streaming training) with a novel interactive control mechanism (KV-recache). The work's position in a sparse taxonomy leaf and the absence of refutations for the KV-recache contribution suggest some originality in the interactive prompt-switching aspect. However, the analysis does not cover the full breadth of autoregressive video generation literature, and the two contributions with refutable priors indicate that components of the approach build on established methods for long-sequence modeling and attention optimization.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 2

Research Landscape Overview

Core task: real-time interactive long video generation. The field encompasses diverse architectural paradigms and application domains, organized into eight main branches. Streaming Autoregressive Generation Architectures focus on frame-by-frame synthesis with memory mechanisms to maintain temporal coherence, as seen in works like VideoSSM[7] and RELIC[49]. Real-Time Diffusion-Based Interactive Generation explores efficient diffusion sampling strategies such as StreamDiffusionV2[35] and Streamdit[2] that enable low-latency synthesis. Interactive World Models and Game Engines, exemplified by Diffusion Game Engines[3] and Hunyuan GameCraft[10], build controllable environments for interactive simulation. Multimodal Interactive Avatar and Digital Human Synthesis addresses real-time character animation and conversational agents, while Domain-Specific Real-Time Interactive Video Applications target specialized use cases like autonomous driving scenarios (AIGC Traffic Scene[6]) and robotics. Supporting Techniques provide foundational methods for compression, scheduling, and optimization, and Surveys, Benchmarks, and Foundational Frameworks (Interactive Generative Video Survey[5]) offer structured evaluations. Specialized Real-Time Synthesis and Rendering Techniques handle rendering pipelines and view synthesis for immersive experiences.

Several active lines of work reveal key trade-offs between generation quality, latency, and controllability. Autoregressive models with memory mechanisms balance long-range coherence against computational overhead, while diffusion-based approaches trade sampling steps for real-time responsiveness. Interactive world models emphasize user control and physical plausibility, often at the cost of visual fidelity compared to purely generative methods.

LongLive[0] sits within the Streaming Autoregressive Generation branch, specifically among Frame-Level Autoregressive Models with Memory Mechanisms, alongside VideoSSM[7] and RELIC[49]. While VideoSSM[7] leverages state-space models for efficient temporal modeling and RELIC[49] emphasizes retrieval-augmented context, LongLive[0] appears to prioritize extended temporal consistency across very long sequences, addressing the challenge of maintaining coherent narratives and visual stability over extended interactive sessions without catastrophic drift.

Claimed Contributions

KV-recache mechanism for interactive prompt switching

The authors introduce a KV-recache technique that refreshes cached key-value states at prompt boundaries by recomputing them using previously generated frames and the new prompt. This enables smooth visual transitions while maintaining semantic alignment with the new prompt during interactive video generation.

7 retrieved papers (no refutations found)
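The recache step described above can be sketched in toy form: at a prompt boundary, the key/value states are rebuilt by re-encoding the already-generated frames together with the new prompt, rather than reusing states computed under the old prompt. All names, shapes, and the single linear projection below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size (assumed)

# Hypothetical key/value projection weights of one attention layer.
W_k = rng.standard_normal((D, D))
W_v = rng.standard_normal((D, D))

def compute_kv(tokens):
    """Project tokens into the key/value states stored in the cache."""
    return tokens @ W_k, tokens @ W_v

def recache(frame_tokens, prompt_tokens):
    """KV-recache (sketch): rebuild cached key/value states from the
    retained frames plus the CURRENT prompt tokens."""
    context = np.concatenate([prompt_tokens, frame_tokens], axis=0)
    k, v = compute_kv(context)
    return {"k": k, "v": v}

# Toy usage: 4 generated frame tokens, old and new prompts of 2 tokens.
frames = rng.standard_normal((4, D))
old_prompt = rng.standard_normal((2, D))
new_prompt = rng.standard_normal((2, D))

stale_cache = recache(frames, old_prompt)  # what a naive cache would keep
fresh_cache = recache(frames, new_prompt)  # refreshed at the prompt switch

# The prompt-dependent entries change, so later frames follow the new
# instruction, while the frame entries are preserved for visual continuity.
assert not np.allclose(stale_cache["k"][:2], fresh_cache["k"][:2])
assert np.allclose(stale_cache["k"][2:], fresh_cache["k"][2:])
```

In a real model the recomputation would run through every attention layer with cross-attention to the new prompt embedding; the point of the sketch is only the cache-refresh pattern at the boundary.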
Streaming long tuning strategy

The authors propose a train-long-test-long training procedure that exposes the model to extended self-generated sequences during training. This approach aligns training with inference conditions by iteratively generating short clips conditioned on previously cached states, mitigating error accumulation and quality degradation in long videos.

10 retrieved papers (1 refutable prior found)
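The train-long-test-long rollout can be sketched as a loop that generates clip-by-clip from the model's own outputs, carrying a cache across clips so training conditions match long-horizon inference. The toy "generator", cache update, and dummy loss below are placeholders for illustration, not the authors' training code.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_clip(cache, clip_len=3):
    """Toy stand-in for the AR generator: emits clip_len frames
    conditioned on the running cached state."""
    base = cache["state"]
    return base + 0.1 * rng.standard_normal((clip_len, base.shape[-1]))

def update_cache(cache, frames):
    """Fold the newly generated frames into the cached state,
    standing in for the KV cache carried across clips."""
    cache["state"] = 0.5 * cache["state"] + 0.5 * frames.mean(axis=0)
    return cache

def streaming_long_tuning_step(n_clips=4, dim=8):
    """One training rollout (sketch): each clip is generated from
    self-produced history, then supervised, so the model sees the
    same drift at training time that it faces at inference time."""
    cache = {"state": np.zeros(dim)}
    target = np.zeros(dim)  # dummy supervision target (assumed)
    losses = []
    for _ in range(n_clips):
        frames = generate_clip(cache)        # conditioned on cached history
        losses.append(float(((frames - target) ** 2).mean()))
        cache = update_cache(cache, frames)  # carry state to the next clip
    return losses

losses = streaming_long_tuning_step()
```

A real implementation would backpropagate a per-clip loss and detach or truncate gradients across clip boundaries; the loop structure above is the part the contribution describes.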
Short window attention with frame-level attention sink

The authors introduce a combination of local short-window attention and a frame-level attention sink (frame sink) that maintains persistent global anchor tokens. This design reduces computational cost while preserving long-range temporal consistency in video generation.

6 retrieved papers (1 refutable prior found)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: KV-recache mechanism for interactive prompt switching

The authors introduce a KV-recache technique that refreshes cached key-value states at prompt boundaries by recomputing them using previously generated frames and the new prompt. This enables smooth visual transitions while maintaining semantic alignment with the new prompt during interactive video generation.

Contribution: Streaming long tuning strategy

The authors propose a train-long-test-long training procedure that exposes the model to extended self-generated sequences during training. This approach aligns training with inference conditions by iteratively generating short clips conditioned on previously cached states, mitigating error accumulation and quality degradation in long videos.

Contribution: Short window attention with frame-level attention sink

The authors introduce a combination of local short-window attention and a frame-level attention sink (frame sink) that maintains persistent global anchor tokens. This design reduces computational cost while preserving long-range temporal consistency in video generation.