LongLive: Real-time Interactive Long Video Generation

ICLR 2026 Conference Submission (Anonymous Authors)
Abstract:

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation is challenging for both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference but often degrade in quality on long videos because of memory challenges during long-video training. Beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence across prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates three key components: a KV-recache mechanism that refreshes cached states with the new prompt for smooth, prompt-adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long–test-long); and short window attention paired with a frame-level attention sink, which preserves long-range consistency while enabling faster generation. With these designs, LongLive fine-tunes a 1.3B-parameter short-clip model for minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short- and long-video settings. LongLive supports videos up to 240 seconds on a single H100 GPU, and with FP8 quantization it reaches 24.8 FPS with marginal quality loss.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LongLive proposes a frame-level autoregressive framework for real-time interactive long video generation, combining KV-recache for prompt switching, streaming long tuning for extended temporal coherence, and short window attention with attention sinks. The paper resides in the 'Frame-Level Autoregressive Models with Memory Mechanisms' leaf, which contains only three papers total (including LongLive itself). This is a relatively sparse research direction within the broader taxonomy of fifty papers across twenty-nine leaf nodes, suggesting the work targets a specific niche at the intersection of causal autoregressive generation and long-horizon interactive synthesis.

The taxonomy reveals that LongLive's leaf sits within the 'Streaming Autoregressive Generation Architectures' branch, which also includes chunk-based and adversarial autoregressive methods. Neighboring branches explore real-time diffusion frameworks (e.g., flow matching, pipeline parallelism) and interactive world models with action conditioning. While diffusion-based methods prioritize visual quality through bidirectional attention, LongLive's causal design trades some modeling capacity for KV-caching efficiency. The sibling papers VideoSSM and RELIC address similar memory challenges but through state-space models and retrieval augmentation respectively, whereas LongLive emphasizes recache mechanisms and streaming tuning for interactive prompt transitions.

Among the twenty-three candidates examined, the contribution-level analysis shows varied overlap with prior work. For the KV-recache mechanism for interactive prompt switching, seven candidates were examined with no clear refutations, suggesting relative novelty in this specific interactive control paradigm. For the streaming long tuning strategy, ten candidates were examined and one refutable match was found, indicating some existing work on long-sequence training alignment. For the short window attention with frame-level attention sink, six candidates were examined and one refutable prior was identified, suggesting that attention sink techniques for autoregressive video models have been explored previously, though possibly in different architectural contexts or application domains.

Based on the limited search scope of twenty-three semantically similar candidates, LongLive appears to combine several existing techniques (attention sinks, streaming training) with a novel interactive control mechanism (KV-recache). The work's position in a sparse taxonomy leaf and the absence of refutations for the KV-recache contribution suggest some originality in the interactive prompt-switching aspect. However, the analysis does not cover the full breadth of autoregressive video generation literature, and the two contributions with refutable priors indicate that components of the approach build on established methods for long-sequence modeling and attention optimization.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 2

Research Landscape Overview

Core task: real-time interactive long video generation. The field encompasses diverse architectural paradigms and application domains, organized into eight main branches. Streaming Autoregressive Generation Architectures focus on frame-by-frame synthesis with memory mechanisms to maintain temporal coherence, as seen in works like VideoSSM[7] and RELIC[49]. Real-Time Diffusion-Based Interactive Generation explores efficient diffusion sampling strategies such as StreamDiffusionV2[35] and Streamdit[2] that enable low-latency synthesis. Interactive World Models and Game Engines, exemplified by Diffusion Game Engines[3] and Hunyuan GameCraft[10], build controllable environments for interactive simulation. Multimodal Interactive Avatar and Digital Human Synthesis addresses real-time character animation and conversational agents, while Domain-Specific Real-Time Interactive Video Applications target specialized use cases like autonomous driving scenarios (AIGC Traffic Scene[6]) and robotics. Supporting Techniques provide foundational methods for compression, scheduling, and optimization, and Surveys, Benchmarks, and Foundational Frameworks (Interactive Generative Video Survey[5]) offer structured evaluations. Specialized Real-Time Synthesis and Rendering Techniques handle rendering pipelines and view synthesis for immersive experiences.

Several active lines of work reveal key trade-offs between generation quality, latency, and controllability. Autoregressive models with memory mechanisms balance long-range coherence against computational overhead, while diffusion-based approaches trade sampling steps for real-time responsiveness. Interactive world models emphasize user control and physical plausibility, often at the cost of visual fidelity compared to purely generative methods.

LongLive[0] sits within the Streaming Autoregressive Generation branch, specifically among Frame-Level Autoregressive Models with Memory Mechanisms, alongside VideoSSM[7] and RELIC[49]. While VideoSSM[7] leverages state-space models for efficient temporal modeling and RELIC[49] emphasizes retrieval-augmented context, LongLive[0] appears to prioritize extended temporal consistency across very long sequences, addressing the challenge of maintaining coherent narratives and visual stability over extended interactive sessions without catastrophic drift.

Claimed Contributions

KV-recache mechanism for interactive prompt switching

The authors introduce a KV-recache technique that refreshes cached key-value states at prompt boundaries by recomputing them using previously generated frames and the new prompt. This enables smooth visual transitions while maintaining semantic alignment with the new prompt during interactive video generation.

7 retrieved papers (no refutations found)
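The recache step described above can be sketched in toy form: at a prompt boundary, the key/value states are rebuilt by re-encoding the already-generated frames together with the new prompt, rather than reusing states computed under the old prompt. All names, shapes, and the single linear projection below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size (assumed)

# Hypothetical key/value projection weights of one attention layer.
W_k = rng.standard_normal((D, D))
W_v = rng.standard_normal((D, D))

def compute_kv(tokens):
    """Project tokens into the key/value states stored in the cache."""
    return tokens @ W_k, tokens @ W_v

def recache(frame_tokens, prompt_tokens):
    """KV-recache (sketch): rebuild cached key/value states from the
    retained frames plus the CURRENT prompt tokens."""
    context = np.concatenate([prompt_tokens, frame_tokens], axis=0)
    k, v = compute_kv(context)
    return {"k": k, "v": v}

# Toy usage: 4 generated frame tokens, old and new prompts of 2 tokens.
frames = rng.standard_normal((4, D))
old_prompt = rng.standard_normal((2, D))
new_prompt = rng.standard_normal((2, D))

stale_cache = recache(frames, old_prompt)  # what a naive cache would keep
fresh_cache = recache(frames, new_prompt)  # refreshed at the prompt switch

# The prompt-dependent entries change, so later frames follow the new
# instruction, while the frame entries are preserved for visual continuity.
assert not np.allclose(stale_cache["k"][:2], fresh_cache["k"][:2])
assert np.allclose(stale_cache["k"][2:], fresh_cache["k"][2:])
```

In a real model the recomputation would run through every attention layer with cross-attention to the new prompt embedding; the point of the sketch is only the cache-refresh pattern at the boundary.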
Streaming long tuning strategy

The authors propose a train-long-test-long training procedure that exposes the model to extended self-generated sequences during training. This approach aligns training with inference conditions by iteratively generating short clips conditioned on previously cached states, mitigating error accumulation and quality degradation in long videos.

10 retrieved papers (1 refutable prior found)
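The train-long-test-long rollout can be sketched as a loop that generates clip-by-clip from the model's own outputs, carrying a cache across clips so training conditions match long-horizon inference. The toy "generator", cache update, and dummy loss below are placeholders for illustration, not the authors' training code.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_clip(cache, clip_len=3):
    """Toy stand-in for the AR generator: emits clip_len frames
    conditioned on the running cached state."""
    base = cache["state"]
    return base + 0.1 * rng.standard_normal((clip_len, base.shape[-1]))

def update_cache(cache, frames):
    """Fold the newly generated frames into the cached state,
    standing in for the KV cache carried across clips."""
    cache["state"] = 0.5 * cache["state"] + 0.5 * frames.mean(axis=0)
    return cache

def streaming_long_tuning_step(n_clips=4, dim=8):
    """One training rollout (sketch): each clip is generated from
    self-produced history, then supervised, so the model sees the
    same drift at training time that it faces at inference time."""
    cache = {"state": np.zeros(dim)}
    target = np.zeros(dim)  # dummy supervision target (assumed)
    losses = []
    for _ in range(n_clips):
        frames = generate_clip(cache)        # conditioned on cached history
        losses.append(float(((frames - target) ** 2).mean()))
        cache = update_cache(cache, frames)  # carry state to the next clip
    return losses

losses = streaming_long_tuning_step()
```

A real implementation would backpropagate a per-clip loss and detach or truncate gradients across clip boundaries; the loop structure above is the part the contribution describes.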
Short window attention with frame-level attention sink

The authors introduce a combination of local short-window attention and a frame-level attention sink (frame sink) that maintains persistent global anchor tokens. This design reduces computational cost while preserving long-range temporal consistency in video generation.

6 retrieved papers (1 refutable prior found)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: KV-recache mechanism for interactive prompt switching

The authors introduce a KV-recache technique that refreshes cached key-value states at prompt boundaries by recomputing them using previously generated frames and the new prompt. This enables smooth visual transitions while maintaining semantic alignment with the new prompt during interactive video generation.

Contribution: Streaming long tuning strategy

The authors propose a train-long-test-long training procedure that exposes the model to extended self-generated sequences during training. This approach aligns training with inference conditions by iteratively generating short clips conditioned on previously cached states, mitigating error accumulation and quality degradation in long videos.

Contribution: Short window attention with frame-level attention sink

The authors introduce a combination of local short-window attention and a frame-level attention sink (frame sink) that maintains persistent global anchor tokens. This design reduces computational cost while preserving long-range temporal consistency in video generation.