Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Infinite-Length Video Generation, Error Accumulation
Abstract:

We propose Stable Video Infinity (SVI), a method that generates non-looping, ultra-long videos with stable visual quality while supporting per-clip prompt control and multi-modal conditioning. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting techniques (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI introduces Error-Recycling Fine-Tuning, an efficient training scheme that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory signals, encouraging the DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, so that the model learns autoregressively from error-injected feedback. Specifically, we (i) inject historical errors made by the DiT into clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and compute errors as residuals; (iii) dynamically bank errors into a replay memory across discretized timesteps and resample them for new inputs. SVI scales videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks covering consistent, creative, and conditional settings, verifying its versatility and state-of-the-art performance.
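The three steps in the abstract can be sketched as a toy training loop. This is a schematic reconstruction of the described mechanism, not the authors' implementation: the class and function names (`ErrorBank`, `toy_velocity_model`, `train_step`), the bin counts, and the stand-in velocity predictor are all illustrative assumptions; a real system would operate on video latents with an actual DiT.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class ErrorBank:
    """Replay memory of self-generated errors, bucketed by discretized timestep."""
    def __init__(self, num_bins=10, max_per_bin=100):
        self.num_bins = num_bins
        self.max_per_bin = max_per_bin
        self.bins = defaultdict(list)

    def _bin(self, t):
        return min(int(t * self.num_bins), self.num_bins - 1)

    def push(self, t, err):
        bucket = self.bins[self._bin(t)]
        bucket.append(err)
        if len(bucket) > self.max_per_bin:
            bucket.pop(0)  # drop the oldest error when the bin is full

    def sample(self, t):
        bucket = self.bins[self._bin(t)]
        if not bucket:
            return None
        return bucket[rng.integers(len(bucket))]

def toy_velocity_model(x_t, t):
    # Stand-in for the DiT velocity predictor (deliberately imperfect).
    return -x_t + 0.05 * rng.standard_normal(x_t.shape)

def train_step(x0, bank):
    t = float(rng.uniform(0.05, 0.95))
    # (i) inject a banked historical error to simulate an error-accumulated input
    past_err = bank.sample(t)
    x0_in = x0 + past_err if past_err is not None else x0
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0_in + t * noise      # flow-matching interpolant
    # (ii) one-step approximation of the clean prediction from the velocity
    v_pred = toy_velocity_model(x_t, t)
    x0_hat = x_t - t * v_pred
    # (iii) bank the residual as a reusable error sample for future inputs
    bank.push(t, x0_hat - x0_in)
    target_v = noise - x0_in                 # flow-matching velocity target
    return float(np.mean((v_pred - target_v) ** 2))

bank = ErrorBank()
x0 = rng.standard_normal((4, 8))             # toy "clean clip" latent
losses = [train_step(x0, bank) for _ in range(50)]
print(f"mean loss over 50 steps: {np.mean(losses):.3f}")
```

The closed loop is visible in `train_step`: each step consumes a previously banked error and deposits a fresh one, so later steps increasingly train on error-perturbed inputs, mimicking autoregressive drift at test time.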

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Stable Video Infinity (SVI), a system for generating ultra-long, non-looping videos with per-clip prompt control through Error-Recycling Fine-Tuning. Within the taxonomy, it occupies the 'Error Recycling and Self-Correction' leaf under 'Error Mitigation and Consistency Mechanisms'. Notably, this leaf contains only the original paper itself; no sibling papers exist in this specific category. This positioning suggests the work addresses a relatively sparse research direction focused explicitly on training models to identify and correct their own errors through closed-loop feedback.

The taxonomy reveals that error mitigation strategies are distributed across multiple branches. The parent branch 'Error Mitigation and Consistency Mechanisms' also includes 'Adaptive Caching and Inference Optimization', which addresses error propagation through computational techniques rather than self-correction. Neighboring branches like 'Autoregressive and Causal Generation Frameworks' (containing StreamingT2V and FreeLong) tackle error accumulation through architectural choices and conditioning strategies, while 'Diffusion-Based Temporal Modeling' methods prioritize temporal coherence through attention mechanisms. SVI's approach diverges by explicitly training on self-generated errors rather than relying on inference-time modifications or architectural constraints.

Among the three contributions analyzed, the literature search examined 27 candidates in total. The core SVI system and the Error-Recycling Fine-Tuning method each examined 10 candidates with zero refutable matches, suggesting these specific mechanisms appear novel within the limited search scope. However, the formalization of the training-test hypothesis gap examined 7 candidates and found 5 refutable matches, indicating substantial prior work on exposure bias and distribution shift in autoregressive generation. These statistics reflect a focused semantic search rather than exhaustive coverage, so contributions appearing novel here may have relevant precedents outside the top 27 candidates examined.

Based on the limited search scope of 27 semantically similar papers, the error recycling mechanism appears to occupy underexplored territory within the taxonomy's sparse 'Error Recycling and Self-Correction' leaf. The formalization of hypothesis gap, however, connects to established literature on exposure bias. The analysis captures relationships within top-ranked semantic matches but does not claim comprehensive coverage of all relevant prior work in autoregressive video generation or error mitigation strategies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 5

Research Landscape Overview

Core task: infinite-length video generation with stable quality. The field addresses the challenge of producing arbitrarily long video sequences without accumulating visual artifacts or degrading coherence over time. The taxonomy reveals a diverse landscape organized around ten major branches. Autoregressive and causal generation frameworks focus on sequential frame prediction, while diffusion-based temporal modeling leverages iterative denoising processes for high-quality synthesis. Hybrid autoregressive-diffusion architectures combine both paradigms to balance efficiency and fidelity. Error mitigation and consistency mechanisms tackle drift and quality decay through techniques like error recycling and self-correction. Structured scene and world modeling approaches build explicit representations of environments, whereas planning and hierarchical decomposition methods break generation into manageable subproblems. Compositional and object-centric approaches decompose scenes into reusable elements, and specialized application domains target specific use cases such as avatars or driving scenarios. Evaluation and quality assessment branches develop metrics for long-form coherence, while auxiliary enhancement techniques provide supporting tools like caching or reward shaping.

Several active lines of work highlight contrasting strategies for maintaining stability. Autoregressive methods like StreamingT2V[4] and FreeLong[16] emphasize efficient temporal extension through sliding windows and memory mechanisms, but face compounding error challenges. Diffusion-based approaches such as LaVie[5] and MotionStream[2] prioritize visual quality but require careful temporal conditioning.

Stable Video Infinity[0] sits within the error mitigation and consistency mechanisms branch, specifically addressing error recycling and self-correction. Its emphasis on actively detecting and correcting accumulated errors distinguishes it from purely autoregressive methods like StreamingT2V[4], which rely on conditioning strategies, and from diffusion-heavy approaches like SkyReels[3], which focus on temporal coherence through architectural design. The work reflects a growing recognition that infinite-length generation demands explicit mechanisms to counteract drift rather than relying solely on model capacity or temporal attention.

Claimed Contributions

Stable Video Infinity for infinite-length video generation

The authors introduce Stable Video Infinity, a system that generates arbitrarily long videos without looping artifacts while maintaining stable quality. It supports per-clip prompt control and diverse multi-modal conditions such as audio and skeleton inputs.

10 retrieved papers
Error-Recycling Fine-Tuning method

A novel training approach that repurposes the model's own prediction errors as supervisory signals. This method enables the Diffusion Transformer to learn to identify and correct its mistakes through autoregressive error feedback, bridging the gap between error-free training and error-prone inference.

10 retrieved papers
Formalization of training-test hypothesis gap and error types

The authors provide a systematic analysis identifying the fundamental discrepancy between training assumptions (clean data) and test-time reality (error-prone outputs). They formally define two error types: single-clip predictive error and cross-clip conditional error.

7 retrieved papers
Can Refute
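The two error types named above can be written schematically. The notation below is our own reconstruction for illustration, not the paper's formalization: $x_k$ denotes the $k$-th clean clip, $\hat{x}_k$ the generated clip, and $G$ the generator conditioned on the previous clip.

```latex
% Single-clip predictive error: the model's own prediction residual
% when conditioned on a clean previous clip.
\epsilon^{\mathrm{pred}}_{k} = \hat{x}_{k} - x_{k},
  \qquad \hat{x}_{k} = G(x_{k-1})

% Cross-clip conditional error: the additional drift incurred by
% conditioning on a self-generated, error-prone clip instead of a clean one.
\epsilon^{\mathrm{cond}}_{k} = G(\hat{x}_{k-1}) - G(x_{k-1})
```

Under this reading, the training-test hypothesis gap is that standard training only ever exposes $G$ to $x_{k-1}$, so $\epsilon^{\mathrm{cond}}_{k}$ is never seen or penalized, yet it compounds across clips at inference.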

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Stable Video Infinity for infinite-length video generation

The authors introduce Stable Video Infinity, a system that generates arbitrarily long videos without looping artifacts while maintaining stable quality. It supports per-clip prompt control and diverse multi-modal conditions such as audio and skeleton inputs.

Contribution

Error-Recycling Fine-Tuning method

A novel training approach that repurposes the model's own prediction errors as supervisory signals. This method enables the Diffusion Transformer to learn to identify and correct its mistakes through autoregressive error feedback, bridging the gap between error-free training and error-prone inference.

Contribution

Formalization of training-test hypothesis gap and error types

The authors provide a systematic analysis identifying the fundamental discrepancy between training assumptions (clean data) and test-time reality (error-prone outputs). They formally define two error types: single-clip predictive error and cross-clip conditional error.