Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
Overview
Overall Novelty Assessment
The paper proposes Stable Video Infinity (SVI), a system for generating ultra-long, non-looping videos with per-clip prompt control through Error-Recycling Fine-Tuning. Within the taxonomy, it occupies the 'Error Recycling and Self-Correction' leaf under 'Error Mitigation and Consistency Mechanisms'. Notably, this leaf contains only the original paper itself: no sibling papers exist in this specific category. This positioning suggests the work addresses a relatively sparse research direction focused explicitly on training models to identify and correct their own errors through closed-loop feedback.
The taxonomy reveals that error mitigation strategies are distributed across multiple branches. The parent branch 'Error Mitigation and Consistency Mechanisms' also includes 'Adaptive Caching and Inference Optimization', which addresses error propagation through computational techniques rather than self-correction. Neighboring branches like 'Autoregressive and Causal Generation Frameworks' (containing StreamingT2V and FreeLong) tackle error accumulation through architectural choices and conditioning strategies, while 'Diffusion-Based Temporal Modeling' methods prioritize temporal coherence through attention mechanisms. SVI's approach diverges by explicitly training on self-generated errors rather than relying on inference-time modifications or architectural constraints.
Across the three contributions analyzed, the literature search examined 27 candidates in total. The core SVI system and the Error-Recycling Fine-Tuning method each had 10 candidates examined with zero refutable matches, suggesting these specific mechanisms appear novel within the limited search scope. However, the formalization of the training-test hypothesis gap had 7 candidates examined and 5 refutable matches, indicating substantial prior work on exposure bias and distribution shift in autoregressive generation. These statistics reflect a focused semantic search rather than exhaustive coverage, so contributions appearing novel here may have relevant precedents outside the top-27 candidates examined.
Based on the limited search scope of 27 semantically similar papers, the error recycling mechanism appears to occupy underexplored territory within the taxonomy's sparse 'Error Recycling and Self-Correction' leaf. The formalization of hypothesis gap, however, connects to established literature on exposure bias. The analysis captures relationships within top-ranked semantic matches but does not claim comprehensive coverage of all relevant prior work in autoregressive video generation or error mitigation strategies.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Stable Video Infinity, a system that generates arbitrarily long videos without looping artifacts while maintaining stable quality. It supports per-clip prompt control and diverse multi-modal conditions such as audio and skeleton inputs.
A novel training approach that repurposes the model's own prediction errors as supervisory signals. This method enables the Diffusion Transformer to learn to identify and correct its mistakes through autoregressive error feedback, bridging the gap between error-free training and error-prone inference.
The authors provide a systematic analysis identifying the fundamental discrepancy between training assumptions (clean data) and test-time reality (error-prone outputs). They formally define two error types: single-clip predictive error and cross-clip conditional error.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Stable Video Infinity for infinite-length video generation
The authors introduce Stable Video Infinity, a system that generates arbitrarily long videos without looping artifacts while maintaining stable quality. It supports per-clip prompt control and diverse multi-modal conditions such as audio and skeleton inputs.
[8] Longlive: Real-time interactive long video generation PDF
[33] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation PDF
[51] Structure and content-guided video synthesis with diffusion models PDF
[52] Dreampose: Fashion image-to-video synthesis via stable diffusion PDF
[53] Align your latents: High-resolution video synthesis with latent diffusion models PDF
[54] Frame-Level Captions for Long Video Generation with Complex Multi Scenes PDF
[55] Make-your-video: Customized video generation using textual and structural guidance PDF
[56] Storydiffusion: Consistent self-attention for long-range image and video generation PDF
[57] Dreampose: Fashion video synthesis with stable diffusion PDF
[58] Text2story: Advancing video storytelling with text guidance PDF
Error-Recycling Fine-Tuning method
A novel training approach that repurposes the model's own prediction errors as supervisory signals. This method enables the Diffusion Transformer to learn to identify and correct its mistakes through autoregressive error feedback, bridging the gap between error-free training and error-prone inference.
[47] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time PDF
[64] Self-guided diffusion models PDF
[65] Cm-gan: Stabilizing gan training with consistency models PDF
[66] Diffrect: Latent diffusion label rectification for semi-supervised medical image segmentation PDF
[67] Your diffusion model is secretly a noise classifier and benefits from contrastive training PDF
[68] ETC: training-free diffusion models acceleration with Error-aware Trend Consistency PDF
[69] Rethinking Training Dynamics in Scale-wise Autoregressive Generation PDF
[70] Towards dynamic modeling of visual-vestibular conflict detection PDF
[71] OViP: Online Vision-Language Preference Learning PDF
[72] One to Two, Two to All: Towards Multimodal Self-supervised Learning for Earth Observation PDF
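The closed-loop training idea described above can be sketched in miniature. This is a hedged illustration, not the paper's implementation: SVI applies the mechanism to a Diffusion Transformer over video clips, whereas here a toy linear autoregressive model over a 1-D sequence plays that role. The key move is the same, though: during training, the model is conditioned on its own previous (erroneous) prediction, with the error kept and recycled, while supervision still comes from clean targets, so training-time conditions match error-prone inference.

```python
# Hedged sketch of error-recycling fine-tuning (toy stand-in, assumed setup).
import numpy as np

def make_sequence(length=64, decay=0.9):
    """Toy 'video': a geometric signal; each step stands in for one clip."""
    return decay ** np.arange(length)

def train_error_recycling(seq, epochs=500, lr=0.05):
    """Fit x[t+1] ~ w * ctx + b, where ctx is the model's OWN previous
    output (error recycled) rather than the ground-truth value."""
    w, b = 0.5, 0.0
    for _ in range(epochs):
        ctx = seq[0]                 # the first clip is given clean
        for t in range(len(seq) - 1):
            pred = w * ctx + b       # predict next step from own output
            err = pred - seq[t + 1]  # loss is against the clean target
            w -= lr * err * ctx      # SGD step on the squared error
            b -= lr * err
            ctx = pred               # recycle the erroneous prediction
    return w, b

def rollout(w, b, seq):
    """Free-running inference: condition on own outputs, as at test time."""
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        out.append(w * out[-1] + b)
    return np.array(out)

seq = make_sequence()
w, b = train_error_recycling(seq)
drift = float(np.mean(np.abs(rollout(w, b, seq) - seq)))
print(f"w={w:.3f}, b={b:.3f}, mean rollout error={drift:.3f}")
```

Because the recycled context drifts away from the clean trajectory, the gradient updates explicitly teach the model to correct accumulated error, which is the property that standard teacher-forced training (always conditioning on clean data) never exercises.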
Formalization of training-test hypothesis gap and error types
The authors provide a systematic analysis identifying the fundamental discrepancy between training assumptions (clean data) and test-time reality (error-prone outputs). They formally define two error types: single-clip predictive error and cross-clip conditional error.
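The two error types can be written schematically as follows. The notation here is assumed for illustration rather than taken from the paper: $x^{(i)}$ denotes the $i$-th clip, $c_{i-1}$ the conditioning context it receives, and $d(\cdot,\cdot)$ some discrepancy measure.

```latex
% Schematic formalization (assumed notation, not the paper's exact symbols).

% Single-clip predictive error: the model's output deviates from the data
% distribution even when conditioned on a clean context.
\varepsilon_{\text{pred}}^{(i)}
  = d\!\left(p_\theta\big(x^{(i)} \mid c_{i-1}^{\text{clean}}\big),\;
             p_{\text{data}}\big(x^{(i)} \mid c_{i-1}^{\text{clean}}\big)\right)

% Cross-clip conditional error: at inference, the context is built from a
% generated clip rather than a clean one, shifting the conditioning input.
\varepsilon_{\text{cond}}^{(i)}
  = d\!\left(c_{i-1}^{\text{gen}},\; c_{i-1}^{\text{clean}}\right)

% Training-test hypothesis gap: standard fine-tuning assumes
% \varepsilon_{\text{cond}}^{(i)} = 0 for all i (clean conditioning),
% while autoregressive inference compounds both error types as i grows.
```

Under this framing, error-recycling fine-tuning can be read as deliberately injecting a nonzero cross-clip conditional error during training so that the model learns to suppress both terms rather than encountering them for the first time at inference.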