Video-GPT via Next Clip Diffusion

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video; Diffusion; LLM
Abstract:

GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe the spatial-temporal details of the visual world; video sequences, in contrast, capture such details well. Motivated by this fact, we propose a concise Video-GPT that treats video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, a key factor in world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning both video generation and understanding, showing strong generalization capacity on downstream tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Video-GPT, a self-supervised model treating video as language for visual world modeling through a next clip diffusion paradigm. According to the taxonomy tree, this work resides in the 'Next Clip Diffusion for General Video Prediction' leaf under 'Autoregressive Video Prediction and World Modeling'. Notably, this leaf contains only the original paper itself with zero sibling papers, indicating a relatively sparse research direction within the broader taxonomy of eleven total papers across multiple branches. This positioning suggests the specific combination of autoregressive clip-level diffusion for general-purpose prediction represents an underexplored niche in the field.

The taxonomy reveals neighboring work primarily in 'Temporal Modeling Innovations in Video Diffusion', which explores vectorized timesteps and frame-aware approaches rather than clip-level autoregressive methods. The broader parent category encompasses general autoregressive prediction and world modeling, while sibling branches address distinct problems: single-image multi-view synthesis, camera-controlled scene exploration, audio-conditioned human generation, and domain-specific applications. The scope note explicitly excludes vectorized timestep approaches and non-autoregressive methods, positioning this work as focusing on sequential clip generation rather than alternative temporal modeling strategies explored in adjacent research directions.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core Video-GPT framework (Contribution A), ten candidates were examined and none constituted refutable prior work, suggesting relative novelty in the overall architecture. For the next clip diffusion paradigm (Contribution B), ten candidates were examined and one appears to provide overlapping prior work, indicating some precedent within the limited search scope. The hierarchical masking method (Contribution C) likewise had ten candidates examined with no refutations. These statistics suggest the paradigm itself has partial precedent, while the specific implementation and masking approach appear more distinctive within the examined literature.

Based on the limited top-thirty semantic search, the work appears to occupy a sparsely populated research direction with some methodological overlap in its core diffusion paradigm. The taxonomy structure confirms this sits at the intersection of autoregressive prediction and diffusion-based generation, where few papers directly combine these approaches for general video forecasting. The analysis covers semantic neighbors and citation-expanded candidates but cannot claim exhaustive coverage of all relevant prior work in video generation and prediction.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: autoregressive video generation and prediction via next clip diffusion. The field encompasses diverse approaches to generating and predicting video sequences, organized into several main branches. Autoregressive Video Prediction and World Modeling focuses on methods that iteratively generate future frames or clips, often for general-purpose prediction or interactive simulation. Single-Image to Multi-View Video Synthesis explores techniques that expand a static image into dynamic multi-view sequences, as seen in works like SV3D[1] and ViVid-1-to-3[3]. Camera-Controlled Dynamic Scene Exploration emphasizes user-guided camera trajectories to navigate and render scenes, exemplified by Cameractrl ii[2]. Audio-Conditioned Human Video Generation targets speech-driven or audio-synchronized human motion, such as FantasyTalking[4]. Finally, Domain-Specific Video Generation addresses specialized applications in fashion, anime, and anomaly detection, including FashionFlow[8], Trajectory-guided Anime Video Synthesis[9], and Pseudo Anomalies Are All[11]. These branches reflect a spectrum from general-purpose world models to highly constrained, domain-tailored synthesis.

Within Autoregressive Video Prediction and World Modeling, a central theme is balancing temporal coherence with computational efficiency when predicting extended sequences. Some studies, like Loopy[5] and Contrastive Sequential-Diffusion Learning[6], explore novel training objectives or contrastive mechanisms to improve long-range consistency, while others, such as Redefining Temporal Modeling in[7], revisit architectural choices for temporal dependencies. Video-GPT via Next Clip[0] sits squarely in this branch, emphasizing a next-clip diffusion paradigm for general video prediction.
Compared to neighboring works that may focus on contrastive losses or specific architectural innovations, Video-GPT via Next Clip[0] adopts an autoregressive clip-level generation strategy, positioning it as a flexible framework for iterative forecasting. This approach contrasts with domain-specific methods like Dynamic Fashion Video Synthesis[10], which tailor generation to particular visual domains, highlighting the trade-off between generality and specialization in video synthesis research.

Claimed Contributions

Video-GPT: A self-supervised pretrained model treating video as language for visual world modeling

The authors propose Video-GPT, a foundation model that treats video sequences as a new language for modeling the visual world, analogous to how GPT models language. Unlike GPT, which processes discrete text tokens, Video-GPT processes video clips to capture rich spatial-temporal details through self-supervised pretraining.

10 retrieved papers
Next clip diffusion paradigm for pretraining Video-GPT

The authors introduce a novel pretraining paradigm called next clip diffusion that combines autoregressive modeling between clips with diffusion modeling within clips. This hybrid approach enables the model to handle both short-term video generation and long-term prediction by autoregressively denoising noisy clips conditioned on clean historical clips.

10 retrieved papers
Can Refute
Hierarchical masking method for noise-clean interleaved clip sequences

The authors design a hierarchical attention masking strategy operating at clip, frame, and patch levels to establish dependencies in an interleaved sequence of noisy and clean video clips. This masking allows noisy clips to attend to previous clean clips as correct temporal context while maintaining causal relationships for autoregressive generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Video-GPT: A self-supervised pretrained model treating video as language for visual world modeling

The authors propose Video-GPT, a foundation model that treats video sequences as a new language for modeling the visual world, analogous to how GPT models language. Unlike GPT, which processes discrete text tokens, Video-GPT processes video clips to capture rich spatial-temporal details through self-supervised pretraining.

Contribution

Next clip diffusion paradigm for pretraining Video-GPT

The authors introduce a novel pretraining paradigm called next clip diffusion that combines autoregressive modeling between clips with diffusion modeling within clips. This hybrid approach enables the model to handle both short-term video generation and long-term prediction by autoregressively denoising noisy clips conditioned on clean historical clips.
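The sampling side of this paradigm can be sketched in a few lines. The sketch below is a hypothetical, simplified illustration, not the authors' implementation: `sample_next_clip`, `rollout`, and the toy denoiser are illustrative names and stand-ins (the toy denoiser merely pulls the noisy clip toward the mean of the clean history so the loop is runnable); a real model would run a learned diffusion denoiser conditioned on the history.

```python
import numpy as np

def sample_next_clip(denoise, history, clip_shape, num_steps=4, rng=None):
    """Next clip diffusion: start from pure noise and iteratively denoise,
    conditioning each step on the clean clips generated so far."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(clip_shape)     # noisy clip x_T
    for t in reversed(range(1, num_steps + 1)):
        x = denoise(x, t, history)          # one denoising step
    return x                                # clean clip x_0

def rollout(denoise, first_clip, clip_shape, horizon):
    """Long-term prediction: each denoised clip joins the clean history
    and conditions the generation of the next clip (autoregression)."""
    history = [first_clip]
    for _ in range(horizon):
        history.append(sample_next_clip(denoise, list(history), clip_shape))
    return history

# Toy stand-in denoiser: pull the noisy clip toward the history mean.
def toy_denoise(x, t, history):
    target = np.mean(history, axis=0)
    return x + (target - x) / t
```

The key structural point is the outer loop: autoregression operates between clips (each finished clip becomes context), while diffusion operates within a clip (the inner denoising steps).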

Contribution

Hierarchical masking method for noise-clean interleaved clip sequences

The authors design a hierarchical attention masking strategy operating at clip, frame, and patch levels to establish dependencies in an interleaved sequence of noisy and clean video clips. This masking allows noisy clips to attend to previous clean clips as correct temporal context while maintaining causal relationships for autoregressive generation.
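As a rough illustration of the clip-level portion of such a scheme (the paper's full design also operates at frame and patch levels), the hypothetical sketch below builds a token-level boolean attention mask for an interleaved clean/noisy clip sequence. Function and variable names are assumptions for illustration, not the authors' API.

```python
import numpy as np

def clip_attention_mask(roles, tokens_per_clip):
    """Clip-level attention mask for an interleaved clean/noisy sequence.

    roles[i] is 'clean' or 'noisy'. Tokens in clip q may attend to:
      * every token within clip q itself (bidirectional intra-clip), and
      * all tokens of earlier *clean* clips (correct temporal context).
    Noisy clips are never visible as context, preserving causality.
    Returns a boolean matrix where True means attention is allowed.
    """
    n = len(roles)
    block = np.zeros((n, n), dtype=bool)
    for q in range(n):
        block[q, q] = True                     # attend within own clip
        for k in range(q):
            block[q, k] = roles[k] == "clean"  # only clean history is context
    # expand each clip-level entry to a tokens_per_clip x tokens_per_clip tile
    token_mask = np.repeat(block, tokens_per_clip, axis=0)
    return np.repeat(token_mask, tokens_per_clip, axis=1)
```

Under this masking, a noisy clip being denoised sees only the clean clips before it, which matches the report's description of noisy clips attending to clean history as temporal context while the overall generation order stays causal.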