Video-GPT via Next Clip Diffusion

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video; Diffusion; LLM
Abstract:

GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe the spatial-temporal details of the visual world; video sequences, in contrast, capture such details well. Motivated by this fact, we propose a concise Video-GPT that treats video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, a key factor in world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning both video generation and understanding, showing strong generalization capacity on downstream tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Video-GPT, a self-supervised model treating video as language for visual world modeling through a next clip diffusion paradigm. According to the taxonomy tree, this work resides in the 'Next Clip Diffusion for General Video Prediction' leaf under 'Autoregressive Video Prediction and World Modeling'. Notably, this leaf contains only the original paper itself with zero sibling papers, indicating a relatively sparse research direction within the broader taxonomy of eleven total papers across multiple branches. This positioning suggests the specific combination of autoregressive clip-level diffusion for general-purpose prediction represents an underexplored niche in the field.

The taxonomy reveals neighboring work primarily in 'Temporal Modeling Innovations in Video Diffusion', which explores vectorized timesteps and frame-aware approaches rather than clip-level autoregressive methods. The broader parent category encompasses general autoregressive prediction and world modeling, while sibling branches address distinct problems: single-image multi-view synthesis, camera-controlled scene exploration, audio-conditioned human generation, and domain-specific applications. The scope note explicitly excludes vectorized timestep approaches and non-autoregressive methods, positioning this work as focusing on sequential clip generation rather than alternative temporal modeling strategies explored in adjacent research directions.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core Video-GPT framework (Contribution A), ten candidates were examined and none constituted refutable prior work, suggesting relative novelty in the overall architecture. For the next clip diffusion paradigm (Contribution B), ten candidates were examined and one appears to provide overlapping prior work, indicating some precedent within the limited search scope. The hierarchical masking method (Contribution C) likewise had ten candidates examined with no refutations. These statistics suggest the paradigm itself has partial precedent, while the specific implementation and masking approach appear more distinctive within the examined literature.

Based on the limited top-thirty semantic search, the work appears to occupy a sparsely populated research direction with some methodological overlap in its core diffusion paradigm. The taxonomy structure confirms this sits at the intersection of autoregressive prediction and diffusion-based generation, where few papers directly combine these approaches for general video forecasting. The analysis covers semantic neighbors and citation-expanded candidates but cannot claim exhaustive coverage of all relevant prior work in video generation and prediction.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: autoregressive video generation and prediction via next clip diffusion. The field encompasses diverse approaches to generating and predicting video sequences, organized into several main branches. Autoregressive Video Prediction and World Modeling focuses on methods that iteratively generate future frames or clips, often for general-purpose prediction or interactive simulation. Single-Image to Multi-View Video Synthesis explores techniques that expand a static image into dynamic multi-view sequences, as seen in works like SV3D[1] and ViVid-1-to-3[3]. Camera-Controlled Dynamic Scene Exploration emphasizes user-guided camera trajectories to navigate and render scenes, exemplified by Cameractrl ii[2]. Audio-Conditioned Human Video Generation targets speech-driven or audio-synchronized human motion, such as FantasyTalking[4]. Finally, Domain-Specific Video Generation addresses specialized applications in fashion, anime, and anomaly detection, including FashionFlow[8], Trajectory-guided Anime Video Synthesis[9], and Pseudo Anomalies Are All[11]. These branches reflect a spectrum from general-purpose world models to highly constrained, domain-tailored synthesis.

Within Autoregressive Video Prediction and World Modeling, a central theme is balancing temporal coherence with computational efficiency when predicting extended sequences. Some studies, like Loopy[5] and Contrastive Sequential-Diffusion Learning[6], explore novel training objectives or contrastive mechanisms to improve long-range consistency, while others, such as Redefining Temporal Modeling in[7], revisit architectural choices for temporal dependencies. Video-GPT via Next Clip[0] sits squarely in this branch, emphasizing a next-clip diffusion paradigm for general video prediction.
Compared to neighboring works that may focus on contrastive losses or specific architectural innovations, Video-GPT via Next Clip[0] adopts an autoregressive clip-level generation strategy, positioning it as a flexible framework for iterative forecasting. This approach contrasts with domain-specific methods like Dynamic Fashion Video Synthesis[10], which tailor generation to particular visual domains, highlighting the trade-off between generality and specialization in video synthesis research.

Claimed Contributions

Video-GPT: A self-supervised pretrained model treating video as language for visual world modeling

The authors propose Video-GPT, a foundation model that treats video sequences as a new language for modeling the visual world, analogous to how GPT models language. Unlike GPT, which processes discrete text tokens, Video-GPT processes video clips to capture rich spatial-temporal details through self-supervised pretraining.

10 retrieved papers
Next clip diffusion paradigm for pretraining Video-GPT

The authors introduce a novel pretraining paradigm called next clip diffusion that combines autoregressive modeling between clips with diffusion modeling within clips. This hybrid approach enables the model to handle both short-term video generation and long-term prediction by autoregressively denoising noisy clips conditioned on clean historical clips.

10 retrieved papers
Can Refute
Hierarchical masking method for noise-clean interleaved clip sequences

The authors design a hierarchical attention masking strategy operating at clip, frame, and patch levels to establish dependencies in an interleaved sequence of noisy and clean video clips. This masking allows noisy clips to attend to previous clean clips as correct temporal context while maintaining causal relationships for autoregressive generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Video-GPT: A self-supervised pretrained model treating video as language for visual world modeling

The authors propose Video-GPT, a foundation model that treats video sequences as a new language for modeling the visual world, analogous to how GPT models language. Unlike GPT, which processes discrete text tokens, Video-GPT processes video clips to capture rich spatial-temporal details through self-supervised pretraining.

Contribution

Next clip diffusion paradigm for pretraining Video-GPT

The authors introduce a novel pretraining paradigm called next clip diffusion that combines autoregressive modeling between clips with diffusion modeling within clips. This hybrid approach enables the model to handle both short-term video generation and long-term prediction by autoregressively denoising noisy clips conditioned on clean historical clips.
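The sampling side of this paradigm can be sketched in a few lines. The sketch below is a hypothetical, simplified illustration, not the authors' implementation: `sample_next_clip`, `rollout`, and the toy denoiser are illustrative names and stand-ins (the toy denoiser merely pulls the noisy clip toward the mean of the clean history so the loop is runnable); a real model would run a learned diffusion denoiser conditioned on the history.

```python
import numpy as np

def sample_next_clip(denoise, history, clip_shape, num_steps=4, rng=None):
    """Next clip diffusion: start from pure noise and iteratively denoise,
    conditioning each step on the clean clips generated so far."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(clip_shape)     # noisy clip x_T
    for t in reversed(range(1, num_steps + 1)):
        x = denoise(x, t, history)          # one denoising step
    return x                                # clean clip x_0

def rollout(denoise, first_clip, clip_shape, horizon):
    """Long-term prediction: each denoised clip joins the clean history
    and conditions the generation of the next clip (autoregression)."""
    history = [first_clip]
    for _ in range(horizon):
        history.append(sample_next_clip(denoise, list(history), clip_shape))
    return history

# Toy stand-in denoiser: pull the noisy clip toward the history mean.
def toy_denoise(x, t, history):
    target = np.mean(history, axis=0)
    return x + (target - x) / t
```

The key structural point is the outer loop: autoregression operates between clips (each finished clip becomes context), while diffusion operates within a clip (the inner denoising steps).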

Contribution

Hierarchical masking method for noise-clean interleaved clip sequences

The authors design a hierarchical attention masking strategy operating at clip, frame, and patch levels to establish dependencies in an interleaved sequence of noisy and clean video clips. This masking allows noisy clips to attend to previous clean clips as correct temporal context while maintaining causal relationships for autoregressive generation.
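As a rough illustration of the clip-level portion of such a scheme (the paper's full design also operates at frame and patch levels), the hypothetical sketch below builds a token-level boolean attention mask for an interleaved clean/noisy clip sequence. Function and variable names are assumptions for illustration, not the authors' API.

```python
import numpy as np

def clip_attention_mask(roles, tokens_per_clip):
    """Clip-level attention mask for an interleaved clean/noisy sequence.

    roles[i] is 'clean' or 'noisy'. Tokens in clip q may attend to:
      * every token within clip q itself (bidirectional intra-clip), and
      * all tokens of earlier *clean* clips (correct temporal context).
    Noisy clips are never visible as context, preserving causality.
    Returns a boolean matrix where True means attention is allowed.
    """
    n = len(roles)
    block = np.zeros((n, n), dtype=bool)
    for q in range(n):
        block[q, q] = True                     # attend within own clip
        for k in range(q):
            block[q, k] = roles[k] == "clean"  # only clean history is context
    # expand each clip-level entry to a tokens_per_clip x tokens_per_clip tile
    token_mask = np.repeat(block, tokens_per_clip, axis=0)
    return np.repeat(token_mask, tokens_per_clip, axis=1)
```

Under this masking, a noisy clip being denoised sees only the clean clips before it, which matches the report's description of noisy clips attending to clean history as temporal context while the overall generation order stays causal.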