Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: long video generation, diffusion model, autoregressive video generation
Abstract:

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, because teacher models cannot synthesize long videos, extrapolating student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors in the continuous latent space. In this paper, we propose a simple yet effective approach that mitigates quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach exploits the rich knowledge of teacher models to guide the student through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length up to 20× beyond the teacher's capability, avoiding common failure modes such as over-exposure and error accumulation without recomputing overlapping frames as previous methods do. When computation is scaled up, our method can generate videos up to 4 minutes and 15 seconds long, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50× longer than our baseline model's output. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Demos of our long-horizon videos are available at https://self-forcing-pp.github.io.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a self-guidance framework for long-horizon video generation, enabling autoregressive models to scale beyond their training horizon without long-video supervision. It resides in the 'Self-Guidance and Iterative Refinement' leaf under 'Temporal Consistency and Error Mitigation', which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 18 leaf nodes, suggesting the specific approach of using teacher-guided self-refinement for extrapolation is not yet heavily explored. The core contribution centers on leveraging short-horizon teacher models to guide student generation through sampled segments from self-generated long videos.

The taxonomy tree reveals that neighboring leaves address temporal consistency through alternative mechanisms: 'Memory-Augmented Temporal Modeling' uses explicit state-space models or memory banks, 'Hierarchical and Multi-Stage Planning' employs macro-micro decomposition, and 'Contextual Conditioning and Transition Modeling' focuses on explicit transition generation. The paper's approach diverges from these by avoiding architectural memory extensions or hierarchical planning, instead relying on iterative teacher guidance during inference. Adjacent branches like 'Hybrid Autoregressive-Diffusion Paradigms' and 'Continuous Latent Autoregression' explore different modeling families, while the paper operates within a distillation-based autoregressive framework, positioning it at the intersection of consistency mechanisms and teacher-student paradigms.

Among 20 candidates examined, the 'Self-Forcing++ training framework' contribution shows one refutable candidate out of 10 examined, indicating some prior work overlap in the self-guidance training methodology. The 'Extended Distribution Matching Distillation' contribution was not examined against candidates, leaving its novelty assessment incomplete. The 'Visual Stability metric' contribution examined 10 candidates with zero refutations, suggesting this evaluation approach may be more novel within the limited search scope. The analysis explicitly covers top-K semantic matches and citation expansion, not an exhaustive literature review, so these statistics reflect a bounded exploration of the immediate research neighborhood.

Given the limited search scope of 20 candidates, the paper appears to occupy a moderately explored niche within temporal consistency mechanisms. The self-guidance training framework shows some overlap with existing methods, while the evaluation metric contribution appears less contested among examined candidates. The sparse population of the 'Self-Guidance and Iterative Refinement' leaf suggests the specific combination of teacher-guided extrapolation and segment-based refinement may offer incremental novelty, though the analysis cannot definitively assess whether similar approaches exist beyond the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 1

Research Landscape Overview

Core task: long-horizon video generation with autoregressive models. The field has evolved around several complementary directions. One major branch focuses on autoregressive architectures and prediction units, exploring how to structure temporal dependencies, whether through token-level generation, block-based prediction, or state-space mechanisms. Another branch addresses temporal consistency and error mitigation, developing techniques such as self-guidance and iterative refinement to prevent drift over extended sequences. Hybrid autoregressive-diffusion paradigms combine the strengths of both modeling families, while inference optimization and efficiency work targets practical deployment through faster sampling or reduced memory footprints. Domain-specific applications adapt these methods to autonomous driving, egocentric video, or other specialized settings, and multimodal extensions integrate text, audio, or other modalities. Latent space video modeling compresses representations to enable scalable generation, and surveys or benchmarking efforts provide systematic evaluations across these diverse approaches.

A particularly active line of work centers on mitigating error accumulation in long rollouts. Self-Forcing++[0] exemplifies this direction by introducing iterative self-guidance mechanisms that refine predictions at each step, closely related to Rolling Forcing[40] and Infinity-RoPE[46], which also tackle drift through alternative training or positional encoding strategies. Meanwhile, hybrid methods such as Fast Autoregressive Diffusion[2] and Ar-diffusion[35] blend autoregressive scheduling with diffusion-based refinement, offering a contrasting trade-off between generation speed and sample quality.

Domain-specific efforts like Streetscapes[6] and LiDAR Sequences[47] demonstrate how these core techniques adapt to structured environments, while works such as Beyond Next Frames[5] and Long-Context Autoregressive[13] push the boundaries of temporal span and context modeling. Self-Forcing++[0] sits squarely within the self-guidance cluster, emphasizing iterative correction to maintain coherence across hundreds of frames, distinguishing itself from simpler rollout schemes and aligning with the broader goal of stable, high-fidelity long-horizon synthesis.

Claimed Contributions

Self-Forcing++ training framework for long-horizon video generation

The authors introduce Self-Forcing++, a training method that enables autoregressive video generation models to produce videos up to 100 seconds (20× beyond the teacher model's capability) by having the student model generate long rollouts with accumulated errors and then using the teacher model to correct these errors through distribution matching distillation on sampled segments.

10 retrieved papers (can refute)
Extended Distribution Matching Distillation with backward noise initialization

The method extends distribution matching distillation beyond the teacher's training horizon by rolling out the student model to generate long videos (N frames, far exceeding the teacher's M-frame horizon), re-injecting noise into these rollouts via backward noise initialization, and then uniformly sampling contiguous windows on which to compute the distributional discrepancy between student and teacher models.

0 retrieved papers
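The procedure described above reduces to four steps: roll the student out far past the teacher's horizon, re-noise the rollout, sample a contiguous teacher-length window, and measure the student-teacher discrepancy on that window. The toy Python sketch below is our own illustration, not the authors' code: frames are plain float lists, `student_step`, the noise level `sigma`, and the window length are hypothetical stand-ins, and the distributional discrepancy is simplified to a mean-squared gap between student and teacher denoising predictions.

```python
import random

def rollout_student(student_step, first_frame, n_frames):
    """Autoregressively roll the student out to n_frames (errors accumulate)."""
    frames = [first_frame]
    for _ in range(n_frames - 1):
        frames.append(student_step(frames[-1]))
    return frames

def backward_noise_init(frame, sigma, rng):
    """Re-inject Gaussian noise at level sigma into a generated frame."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in frame]

def sample_window(frames, window_len, rng):
    """Uniformly sample a contiguous window of teacher-horizon length."""
    start = rng.randrange(len(frames) - window_len + 1)
    return frames[start:start + window_len]

def dmd_gap(student_denoise, teacher_denoise, noisy_window):
    """Toy stand-in for the distributional discrepancy: mean squared gap
    between student and teacher denoising predictions on the same window."""
    total, count = 0.0, 0
    for frame in noisy_window:
        s, t = student_denoise(frame), teacher_denoise(frame)
        total += sum((a - b) ** 2 for a, b in zip(s, t))
        count += len(frame)
    return total / count

# Usage sketch: 80-frame rollout, 16-frame window, then the window gap.
rng = random.Random(0)
frames = rollout_student(lambda f: [x + 0.1 for x in f], [0.0, 0.0], 80)
noisy = [backward_noise_init(f, 0.5, rng) for f in sample_window(frames, 16, rng)]
gap = dmd_gap(lambda f: f, lambda f: f, noisy)  # identical models -> gap 0.0
```

In the real method the denoisers are full video diffusion models and the gap is a score-based distribution matching loss; the sketch only fixes the control flow implied by the contribution statement.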
Visual Stability metric for long video evaluation

The authors identify biases in existing benchmarks like VBench that favor over-exposed and degraded videos, and propose Visual Stability as an improved evaluation metric that uses Gemini-2.5-Pro to assess key long-video issues such as over-exposure and error accumulation on a 0-100 scale.

10 retrieved papers
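A judge-based metric of this shape comes down to prompting a multimodal model with a rubric and parsing its numeric reply. The sketch below is a hypothetical harness under our own assumptions: `judge` stands in for a Gemini-2.5-Pro call (stubbed here), the rubric wording is invented, and only the score parsing and clamping to the 0-100 range are concrete.

```python
import re

# Invented rubric; the paper's actual prompt to Gemini-2.5-Pro is not quoted here.
STABILITY_PROMPT = (
    "Rate the visual stability of this long video on a 0-100 scale. "
    "Penalize over-exposure drift and accumulated artifacts. "
    "Reply with 'Score: <number>'."
)

def parse_stability_score(reply: str) -> float:
    """Extract the numeric score from the judge's reply, clamped to [0, 100]."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"no score in reply: {reply!r}")
    return max(0.0, min(100.0, float(match.group(1))))

def visual_stability(video_frames, judge) -> float:
    """Send the video plus the rubric to a multimodal judge and parse the score."""
    return parse_stability_score(judge(video_frames, STABILITY_PROMPT))

# Usage sketch with a stubbed judge in place of a real API call.
score = visual_stability([], lambda frames, prompt: "Score: 87")  # -> 87.0
```

Clamping guards against a judge that ignores the requested range; averaging scores over several judge calls per video would be the natural next step, but is omitted here.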
