Captain Cinema: Towards Short Movie Generation

ICLR 2026 Conference Submission · Anonymous Authors
Video Generation · Diffusion Transformer
Abstract:

We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, ensuring long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model that supports long-context learning, producing the spatio-temporal dynamics between them. We refer to this step as bottom-up video synthesis. To support stable and efficient generation of multi-scene, long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a curated cinematic dataset consisting of interleaved samples for video generation. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narratively consistent short films.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Captain Cinema contributes a hierarchical framework for short movie generation, combining top-down keyframe planning with bottom-up video synthesis via Multimodal Diffusion Transformers. The paper resides in the 'Multi-Scene Cinematic Video Synthesis with Keyframe Planning' leaf, which contains only two papers total (including this one). This indicates a relatively sparse research direction within the broader taxonomy of thirteen papers, suggesting the specific combination of keyframe planning and long-context cinematic synthesis remains an emerging area rather than a crowded subfield.

The taxonomy reveals that Captain Cinema's parent branch—End-to-End Automated Movie Generation Systems—encompasses three distinct approaches: keyframe planning methods, direct text-to-video synthesis, and agent-based orchestration. Neighboring leaves include direct synthesis approaches that bypass intermediate planning stages and agent-based systems using large multimodal models for workflow orchestration. The scope notes clarify that Captain Cinema's explicit keyframe planning distinguishes it from direct synthesis methods, while its diffusion-based architecture separates it from agent-orchestrated pipelines. Adjacent branches address personalized storytelling and specialized animation, indicating the field balances general cinematic synthesis with domain-specific adaptations.

Among the thirty candidates examined, the Captain Cinema framework contribution has two refutable candidates out of the ten examined for it, while the GoldenMem memory mechanism has one out of ten. The interleaved training strategy for Multimodal Diffusion Transformers appears more novel, with zero refutable candidates among its ten. Given the limited search scope, these statistics reflect top-K semantic matches rather than exhaustive coverage. The framework-level contribution faces more substantial prior-work overlap, whereas the training strategy component appears less explored in the examined literature, though this assessment remains constrained by the search methodology.

Given the sparse taxonomy leaf (two papers) and limited search scope (thirty candidates), the work appears to occupy a relatively underexplored niche combining keyframe planning with long-context diffusion models. The framework-level contribution encounters some overlap with existing multi-scene synthesis approaches, while the training strategy shows fewer direct precedents among examined papers. This analysis reflects semantic proximity within the search space rather than comprehensive field coverage, leaving open the possibility of relevant work outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: short movie generation from textual storylines. The field has coalesced around several complementary directions. End-to-End Automated Movie Generation Systems focus on transforming narrative text into coherent multi-scene videos, often employing keyframe planning and cinematic synthesis pipelines to maintain visual and temporal consistency across shots. Personalized and Context-Enhanced Video Storytelling emphasizes tailoring generated content to user preferences or contextual cues, integrating character-specific details and narrative context into the synthesis process. Specialized Animation and Story Adaptation Techniques address domain-specific challenges such as animating characters from literary sources or adapting short stories into visual formats, while Evaluation and Survey Resources provide benchmarks and overviews that help researchers assess progress and identify open problems.

Representative works like Moviefactory[2] and Script to Screen[4] illustrate how end-to-end pipelines handle scene decomposition and visual grounding, whereas Personalised Video Generation[1] and ContextualStory[11] highlight the growing interest in user-driven customization. Within the automated generation branch, a central tension revolves around balancing creative control with computational efficiency: some methods prioritize detailed keyframe planning to ensure cinematic coherence, while others explore more direct text-to-video mappings that sacrifice fine-grained shot composition for speed.

Captain Cinema[0] sits squarely in the multi-scene cinematic synthesis cluster, sharing with Moviefactory[2] an emphasis on keyframe-driven planning but extending the approach to handle richer narrative structures and longer sequences. Compared to Multimodal Cinematic Synthesis[9], which integrates audio and visual modalities more tightly, Captain Cinema[0] appears to concentrate on visual storytelling fidelity and scene-level consistency. Meanwhile, works like Anim Director[5] and Adaptation Short Story[7] tackle specialized animation challenges that complement but differ from the broader cinematic synthesis focus, underscoring the diversity of techniques required to bridge text and moving images effectively.

Claimed Contributions

Captain Cinema framework for short movie generation

The authors introduce a two-stage framework that combines top-down keyframe planning to generate narrative-consistent keyframes and bottom-up video synthesis to produce spatio-temporal dynamics between keyframes, enabling coherent multi-scene movie generation.

10 retrieved papers
Can Refute
GoldenMem memory mechanism for long-context compression

The authors propose GoldenMem, which uses inverse Fibonacci downsampling to compress visual context from historical frames, maintaining a fixed token budget while preserving character and scene consistency across super-long contexts.

10 retrieved papers
Can Refute
Interleaved training strategy for Multimodal Diffusion Transformers

The authors develop a progressive long-context tuning strategy with hybrid attention masking and dynamic stride sampling for MM-DiT models, enabling stable and efficient training on large-scale cinematic datasets for multi-scene video generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Captain Cinema framework for short movie generation

The authors introduce a two-stage framework that combines top-down keyframe planning to generate narrative-consistent keyframes and bottom-up video synthesis to produce spatio-temporal dynamics between keyframes, enabling coherent multi-scene movie generation.
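To make the two-stage design concrete, the sketch below outlines the control flow implied by the description above. All class and function names (KeyframePlanner, VideoSynthesizer, generate_short_movie, and so on) are hypothetical placeholders, not the authors' actual API; in the real system each stage would be a diffusion sampler rather than an abstract method.

```python
# Minimal sketch of the two-stage pipeline: top-down keyframe planning
# followed by bottom-up video synthesis between consecutive keyframes.
# All names here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import List


@dataclass
class Keyframe:
    image: object          # e.g., a decoded latent or a PIL image
    scene_caption: str     # per-scene text used as conditioning


class KeyframePlanner:
    """Top-down stage: maps a full storyline to an ordered keyframe set."""
    def plan(self, storyline: str, num_keyframes: int) -> List[Keyframe]:
        raise NotImplementedError  # e.g., an MM-DiT keyframe sampler


class VideoSynthesizer:
    """Bottom-up stage: fills in motion between consecutive keyframes."""
    def interpolate(self, start: Keyframe, end: Keyframe) -> List[object]:
        raise NotImplementedError  # e.g., keyframe-conditioned video diffusion


def generate_short_movie(storyline: str, planner: KeyframePlanner,
                         synthesizer: VideoSynthesizer,
                         num_keyframes: int = 16) -> List[object]:
    """Chain planning and synthesis: keyframes anchor long-range
    consistency while the synthesizer supplies local dynamics."""
    keyframes = planner.plan(storyline, num_keyframes)
    frames: List[object] = []
    for start, end in zip(keyframes, keyframes[1:]):
        frames.extend(synthesizer.interpolate(start, end))
    return frames
```

A design consequence worth noting: because the keyframes act as fixed anchors, any drift the synthesizer introduces is confined to a single inter-keyframe segment, which is how the framework claims long-range coherence across scenes.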

Contribution

GoldenMem memory mechanism for long-context compression

The authors propose GoldenMem, which uses inverse Fibonacci downsampling to compress visual context from historical frames, maintaining a fixed token budget while preserving character and scene consistency across super-long contexts.
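The summary does not spell out the exact schedule, but one plausible reading of "inverse Fibonacci downsampling" is that older frames are pooled by successively larger Fibonacci factors, keeping recent context dense and distant context sparse. The sketch below implements that reading under a fixed token budget; the function name golden_mem_compress and the tail-truncation rule are assumptions for illustration.

```python
# Sketch of Fibonacci-scheduled context compression, assuming older frames
# are downsampled by larger Fibonacci factors. This reading, and all names
# below, are assumptions; the paper's exact schedule may differ.
import torch
import torch.nn.functional as F


def fibonacci(n: int) -> list:
    """First n Fibonacci numbers: 1, 1, 2, 3, 5, 8, ..."""
    fibs = [1, 1]
    while len(fibs) < n:
        fibs.append(fibs[-1] + fibs[-2])
    return fibs[:n]


def golden_mem_compress(frame_tokens: list, token_budget: int) -> torch.Tensor:
    """Compress a history of per-frame token grids into a fixed budget.

    frame_tokens: list of (tokens, dim) tensors, oldest first. Older frames
    get larger pooling factors, so memory stays dense for recent context
    and sparse for the distant past.
    """
    factors = fibonacci(len(frame_tokens))[::-1]   # oldest -> largest factor
    kept = []
    for tokens, factor in zip(frame_tokens, factors):
        # Average-pool each historical frame down to 1/factor of its tokens.
        pooled = F.adaptive_avg_pool1d(
            tokens.t().unsqueeze(0),                        # (1, dim, T)
            output_size=max(1, tokens.shape[0] // factor),
        ).squeeze(0).t()                                    # (T // factor, dim)
        kept.append(pooled)
    memory = torch.cat(kept, dim=0)
    # Enforce the fixed budget by dropping the oldest surplus tokens.
    return memory[-token_budget:]


# Example: 8 historical frames of 256 tokens each, 512-token budget.
history = [torch.randn(256, 64) for _ in range(8)]
memory = golden_mem_compress(history, token_budget=512)
```

Under this schedule the retained token count decays roughly geometrically with age (the Fibonacci ratio approaches the golden ratio, presumably the source of the name), so total memory stays bounded no matter how long the generated movie grows.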

Contribution

Interleaved training strategy for Multimodal Diffusion Transformers

The authors develop a progressive long-context tuning strategy with hybrid attention masking and dynamic stride sampling for MM-DiT models, enabling stable and efficient training on large-scale cinematic datasets for multi-scene video generation.
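The terms "hybrid attention masking" and "dynamic stride sampling" are not defined in the material above, so the sketch below illustrates one common interpretation: bidirectional attention within a shot combined with block-causal attention across shots, and a temporal stride drawn at random per training sample. Both readings and all identifiers are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of two assumed training-time ingredients: a hybrid attention mask
# (bidirectional within a shot, causal across shots) and dynamic stride
# sampling of frame indices. Illustrative only, not the paper's code.
import random
import torch


def hybrid_attention_mask(shot_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (T, T) mask where True means attention is allowed.

    Tokens attend bidirectionally inside their own shot and causally to
    tokens of earlier shots, never to later shots.
    """
    same_shot = shot_ids.unsqueeze(0) == shot_ids.unsqueeze(1)
    earlier_shot = shot_ids.unsqueeze(1) > shot_ids.unsqueeze(0)
    return same_shot | earlier_shot


def dynamic_stride_sample(video_len: int, num_frames: int,
                          max_stride: int = 8) -> list:
    """Pick num_frames indices with a stride drawn per call, so the model
    sees both fine-grained motion (stride 1) and long-range context."""
    cap = min(max_stride, (video_len - 1) // max(1, num_frames - 1))
    stride = random.randint(1, max(1, cap))
    start = random.randint(0, video_len - (num_frames - 1) * stride - 1)
    return [start + i * stride for i in range(num_frames)]


# Example: 3 shots of 4 tokens each, and a 16-frame clip from 300 frames.
shot_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
mask = hybrid_attention_mask(shot_ids)    # (12, 12) boolean mask
frames = dynamic_stride_sample(300, 16)   # e.g., stride-7 frame indices
```

Pairing the two is plausible for the stated goal: the block-causal structure keeps cross-shot attention cheap and ordered over long contexts, while randomized strides expose the model to both local dynamics and scene-scale spacing during progressive long-context tuning.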