Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: long video generation, diffusion model, autoregressive video generation
Abstract:

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, because teacher models cannot synthesize long videos, extrapolating student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors in the continuous latent space. In this paper, we propose a simple yet effective approach that mitigates quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach exploits the rich knowledge of teacher models to guide the student through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length up to 20× beyond the teacher's capability, avoiding common failure modes such as over-exposure and error accumulation without recomputing overlapping frames as previous methods do. When computation is scaled up, our method can generate videos up to 4 minutes and 15 seconds long, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50× longer than our baseline model's output. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Demos of our long-horizon videos are available at https://self-forcing-pp.github.io.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a self-guidance framework for long-horizon video generation, enabling autoregressive models to scale beyond their training horizon without long-video supervision. It resides in the 'Self-Guidance and Iterative Refinement' leaf under 'Temporal Consistency and Error Mitigation', which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 18 leaf nodes, suggesting the specific approach of using teacher-guided self-refinement for extrapolation is not yet heavily explored. The core contribution centers on leveraging short-horizon teacher models to guide student generation through sampled segments from self-generated long videos.

The taxonomy tree reveals that neighboring leaves address temporal consistency through alternative mechanisms: 'Memory-Augmented Temporal Modeling' uses explicit state-space models or memory banks, 'Hierarchical and Multi-Stage Planning' employs macro-micro decomposition, and 'Contextual Conditioning and Transition Modeling' focuses on explicit transition generation. The paper's approach diverges from these by avoiding architectural memory extensions or hierarchical planning, instead relying on iterative teacher guidance during inference. Adjacent branches like 'Hybrid Autoregressive-Diffusion Paradigms' and 'Continuous Latent Autoregression' explore different modeling families, while the paper operates within a distillation-based autoregressive framework, positioning it at the intersection of consistency mechanisms and teacher-student paradigms.

Among 20 candidates examined, the 'Self-Forcing++ training framework' contribution shows one refutable candidate out of 10 examined, indicating some prior work overlap in the self-guidance training methodology. The 'Extended Distribution Matching Distillation' contribution was not examined against candidates, leaving its novelty assessment incomplete. The 'Visual Stability metric' contribution examined 10 candidates with zero refutations, suggesting this evaluation approach may be more novel within the limited search scope. The analysis explicitly covers top-K semantic matches and citation expansion, not an exhaustive literature review, so these statistics reflect a bounded exploration of the immediate research neighborhood.

Given the limited search scope of 20 candidates, the paper appears to occupy a moderately explored niche within temporal consistency mechanisms. The self-guidance training framework shows some overlap with existing methods, while the evaluation metric contribution appears less contested among examined candidates. The sparse population of the 'Self-Guidance and Iterative Refinement' leaf suggests the specific combination of teacher-guided extrapolation and segment-based refinement may offer incremental novelty, though the analysis cannot definitively assess whether similar approaches exist beyond the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 1

Research Landscape Overview

Core task: long-horizon video generation with autoregressive models. The field has evolved around several complementary directions. One major branch focuses on autoregressive architectures and prediction units, exploring how to structure temporal dependencies, whether through token-level generation, block-based prediction, or state-space mechanisms. Another branch addresses temporal consistency and error mitigation, developing techniques such as self-guidance and iterative refinement to prevent drift over extended sequences. Hybrid autoregressive-diffusion paradigms combine the strengths of both modeling families, while inference optimization and efficiency work targets practical deployment through faster sampling or reduced memory footprints. Domain-specific applications adapt these methods to autonomous driving, egocentric video, or other specialized settings, and multimodal extensions integrate text, audio, or other modalities. Latent space video modeling compresses representations to enable scalable generation, and surveys or benchmarking efforts provide systematic evaluations across these diverse approaches.

A particularly active line of work centers on mitigating error accumulation in long rollouts. Self-Forcing++[0] exemplifies this direction by introducing iterative self-guidance mechanisms that refine predictions at each step, closely related to Rolling Forcing[40] and Infinity-RoPE[46], which also tackle drift through alternative training or positional encoding strategies. Meanwhile, hybrid methods such as Fast Autoregressive Diffusion[2] and Ar-diffusion[35] blend autoregressive scheduling with diffusion-based refinement, offering a contrasting trade-off between generation speed and sample quality.

Domain-specific efforts like Streetscapes[6] and LiDAR Sequences[47] demonstrate how these core techniques adapt to structured environments, while works such as Beyond Next Frames[5] and Long-Context Autoregressive[13] push the boundaries of temporal span and context modeling. Self-Forcing++[0] sits squarely within the self-guidance cluster, emphasizing iterative correction to maintain coherence across hundreds of frames, distinguishing itself from simpler rollout schemes and aligning with the broader goal of stable, high-fidelity long-horizon synthesis.

Claimed Contributions

Self-Forcing++ training framework for long-horizon video generation

The authors introduce Self-Forcing++, a training method that enables autoregressive video generation models to produce videos up to 100 seconds (20× beyond the teacher model's capability) by having the student model generate long rollouts with accumulated errors and then using the teacher model to correct these errors through distribution matching distillation on sampled segments.

10 retrieved papers (can refute)
Extended Distribution Matching Distillation with backward noise initialization

The method extends distribution matching distillation beyond the teacher's training horizon by rolling out the student model to generate long videos (N frames, far exceeding the teacher's M-frame horizon), re-injecting noise into these rollouts via backward noise initialization, and then uniformly sampling contiguous windows on which to compute the distributional discrepancy between student and teacher models.

0 retrieved papers
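The procedure described above reduces to four steps: roll the student out far past the teacher's horizon, re-noise the rollout, sample a contiguous teacher-length window, and measure the student-teacher discrepancy on that window. The toy Python sketch below is our own illustration, not the authors' code: frames are plain float lists, `student_step`, the noise level `sigma`, and the window length are hypothetical stand-ins, and the distributional discrepancy is simplified to a mean-squared gap between student and teacher denoising predictions.

```python
import random

def rollout_student(student_step, first_frame, n_frames):
    """Autoregressively roll the student out to n_frames (errors accumulate)."""
    frames = [first_frame]
    for _ in range(n_frames - 1):
        frames.append(student_step(frames[-1]))
    return frames

def backward_noise_init(frame, sigma, rng):
    """Re-inject Gaussian noise at level sigma into a generated frame."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in frame]

def sample_window(frames, window_len, rng):
    """Uniformly sample a contiguous window of teacher-horizon length."""
    start = rng.randrange(len(frames) - window_len + 1)
    return frames[start:start + window_len]

def dmd_gap(student_denoise, teacher_denoise, noisy_window):
    """Toy stand-in for the distributional discrepancy: mean squared gap
    between student and teacher denoising predictions on the same window."""
    total, count = 0.0, 0
    for frame in noisy_window:
        s, t = student_denoise(frame), teacher_denoise(frame)
        total += sum((a - b) ** 2 for a, b in zip(s, t))
        count += len(frame)
    return total / count

# Usage sketch: 80-frame rollout, 16-frame window, then the window gap.
rng = random.Random(0)
frames = rollout_student(lambda f: [x + 0.1 for x in f], [0.0, 0.0], 80)
noisy = [backward_noise_init(f, 0.5, rng) for f in sample_window(frames, 16, rng)]
gap = dmd_gap(lambda f: f, lambda f: f, noisy)  # identical models -> gap 0.0
```

In the real method the denoisers are full video diffusion models and the gap is a score-based distribution matching loss; the sketch only fixes the control flow implied by the contribution statement.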
Visual Stability metric for long video evaluation

The authors identify biases in existing benchmarks like VBench that favor over-exposed and degraded videos, and propose Visual Stability as an improved evaluation metric that uses Gemini-2.5-Pro to assess key long-video issues such as over-exposure and error accumulation on a 0-100 scale.

10 retrieved papers
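A judge-based metric of this shape comes down to prompting a multimodal model with a rubric and parsing its numeric reply. The sketch below is a hypothetical harness under our own assumptions: `judge` stands in for a Gemini-2.5-Pro call (stubbed here), the rubric wording is invented, and only the score parsing and clamping to the 0-100 range are concrete.

```python
import re

# Invented rubric; the paper's actual prompt to Gemini-2.5-Pro is not quoted here.
STABILITY_PROMPT = (
    "Rate the visual stability of this long video on a 0-100 scale. "
    "Penalize over-exposure drift and accumulated artifacts. "
    "Reply with 'Score: <number>'."
)

def parse_stability_score(reply: str) -> float:
    """Extract the numeric score from the judge's reply, clamped to [0, 100]."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"no score in reply: {reply!r}")
    return max(0.0, min(100.0, float(match.group(1))))

def visual_stability(video_frames, judge) -> float:
    """Send the video plus the rubric to a multimodal judge and parse the score."""
    return parse_stability_score(judge(video_frames, STABILITY_PROMPT))

# Usage sketch with a stubbed judge in place of a real API call.
score = visual_stability([], lambda frames, prompt: "Score: 87")  # -> 87.0
```

Clamping guards against a judge that ignores the requested range; averaging scores over several judge calls per video would be the natural next step, but is omitted here.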
