LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Visual Storytelling, Multi-Image Sequence Generation, Story Planning, Visual Logic Consistency, Causal Reasoning, Narrative Coherence
Abstract:

Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate generation capability, we construct LogicTale, a benchmark comprising richly annotated stories that emphasize causal reasoning and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LogiStory, a framework for multi-image story visualization that explicitly models visual logic—defined as perceptual and causal coherence among characters, actions, and scenes. It resides in the 'Visual Logic and Causal Reasoning' leaf of the taxonomy, which contains only two papers total. This is a notably sparse research direction within the broader Story Visualization Generation Methods branch, suggesting the paper targets an underexplored niche. The sibling paper in this leaf addresses related reasoning challenges but does not appear to provide identical technical solutions.

The taxonomy reveals that neighboring leaves focus on character consistency modeling, knowledge-enhanced generation, and discourse-constrained approaches. These directions emphasize surface-level visual coherence or external knowledge integration rather than explicit causal chain modeling. The scope note for the Visual Logic leaf explicitly excludes general consistency methods, positioning LogiStory's contribution as distinct from prior work that treats narrative coherence as an implicit byproduct. The framework's multi-agent design for grounding roles and extracting causal chains appears to bridge structured planning with visual synthesis in a manner not directly addressed by sibling branches.

Among thirty candidates examined, none clearly refute the three core contributions: the visual logic concept and framework, the LogicTale benchmark dataset, and the logic-aware multi-agent system with visual enhancement module. Each contribution was assessed against ten candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, the paper's technical approach and evaluation resources appear novel. However, the analysis does not claim exhaustive coverage; it reflects top-K semantic matches and citation expansion, not a comprehensive field survey.

Based on the limited literature search, LogiStory appears to occupy a sparsely populated research direction with minimal direct prior work among the examined candidates. The taxonomy structure and contribution-level statistics indicate that explicit visual logic modeling remains underexplored relative to adjacent consistency and knowledge-integration methods. Readers should interpret these findings as preliminary signals rather than definitive judgments, given the constrained search scope.

Taxonomy

Core-task taxonomy papers: 37
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: multi-image story visualization with visual logic modeling. The field encompasses methods for generating coherent visual narratives from textual descriptions, analyzing the structural and causal relationships that underpin storytelling, building interactive tools for authors and creators, establishing rigorous evaluation protocols, and drawing on narrative theory from disciplines such as cognitive science and media studies. The main branches reflect these complementary concerns. Story Visualization Generation Methods focuses on algorithmic approaches, ranging from diffusion-based image synthesis to large multimodal models, that produce sequences of images faithful to plot progression and character consistency. Narrative Structure Analysis and Comprehension investigates how humans and machines parse temporal order, causal links, and semantic coherence in visual sequences, often leveraging insights from comics grammar and cognitive frameworks. Interactive Authoring Systems and Applications targets practical platforms that enable users to co-create stories with computational assistance, while Evaluation and Benchmarking develops datasets and metrics to measure fidelity, coherence, and narrative quality. Finally, Narrative Theory and Interdisciplinary Perspectives situates computational work within broader humanistic and cognitive traditions, examining how visual storytelling conventions shape interpretation.

Within Story Visualization Generation Methods, a particularly active line of work emphasizes visual logic and causal reasoning: ensuring that generated image sequences respect not only surface-level consistency but also the underlying cause-and-effect relationships that drive narrative progression. LogiStory[0] exemplifies this direction by explicitly modeling logical dependencies between story events, aiming to produce images that reflect coherent temporal and causal chains. This contrasts with earlier efforts such as Visual Chain Thought[5], which introduced stepwise reasoning for visual narratives but did not fully integrate causal constraints into the generation pipeline. Meanwhile, works like StoryGPT[3] leverage large language models to guide image synthesis, prioritizing fluency and diversity over strict logical grounding. LogiStory[0] thus occupies a niche that bridges generative modeling with structured reasoning, addressing a key challenge in producing narratives that are both visually compelling and logically sound, a trade-off that remains an open question as the field matures.

Claimed Contributions

Visual logic concept and LogiStory framework

The authors formally introduce visual logic as the perceptual and causal coherence among characters, actions, and scenes over time in visual sequences. They propose LogiStory, a framework that explicitly models visual logic through a multi-agent system for structured story planning and a visual logic enhancement module with global and local causal verification components.

10 retrieved papers
LogicTale benchmark dataset

The authors construct LogicTale, a benchmark comprising 60 richly annotated stories with causal annotations, action-state flows, and panel-level breakdowns. They design comprehensive automatic and human evaluation protocols specifically measuring visual logic dimensions including instance consistency, narrative causality, and story readability beyond standard perceptual quality metrics.

10 retrieved papers
Logic-aware multi-agent system with visual enhancement module

The framework includes a logic-aware multi-agent system with three collaborative agents (SceneCrafter, LogicMiner, ShotPlanner) that decompose narratives into structured representations, combined with a visual logic enhancement module featuring a Global Causal Verifier and Local Causal Monitor to enforce causal consistency during generation.

10 retrieved papers
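For concreteness, the reported three-agent decomposition can be caricatured as a plain-Python control-flow sketch. This is a hypothetical illustration only: the component names (SceneCrafter, LogicMiner, ShotPlanner, Global Causal Verifier) come from the report, but every function body below is an assumed stub, since the paper's agents are presumably LLM-backed, and the `Shot` type and all signatures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Shot:
    """A hypothetical per-panel plan: where it happens, what happens, and why."""
    scene: str
    action: str
    cause: Optional[str] = None  # description of the preceding cause, if any

def scene_crafter(story_sentences):
    # SceneCrafter stand-in: ground each sentence in a scene slot (stub mapping).
    return [{"scene": f"scene-{i}", "text": s} for i, s in enumerate(story_sentences)]

def logic_miner(scenes):
    # LogicMiner stand-in: extract a causal chain; here each event is naively
    # linked to its predecessor, whereas a real agent would infer true causes.
    return [{"effect": sc["text"],
             "cause": scenes[i - 1]["text"] if i > 0 else None}
            for i, sc in enumerate(scenes)]

def shot_planner(scenes, chain):
    # ShotPlanner stand-in: merge scene grounding and causal links into shots.
    return [Shot(scene=sc["scene"], action=sc["text"], cause=link["cause"])
            for sc, link in zip(scenes, chain)]

def global_causal_verifier(shots):
    # Story-level check: every non-initial shot must be backed by a cause.
    return all(shot.cause is not None for shot in shots[1:])
```

A usage pass might feed `["A boy plants a seed.", "The seed sprouts.", "A flower blooms."]` through the three stages and run the verifier before any image generation; the report's Local Causal Monitor would then act per-shot during synthesis, which this sketch does not model.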

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual logic concept and LogiStory framework

The authors formally introduce visual logic as the perceptual and causal coherence among characters, actions, and scenes over time in visual sequences. They propose LogiStory, a framework that explicitly models visual logic through a multi-agent system for structured story planning and a visual logic enhancement module with global and local causal verification components.

Contribution

LogicTale benchmark dataset

The authors construct LogicTale, a benchmark comprising 60 richly annotated stories with causal annotations, action-state flows, and panel-level breakdowns. They design comprehensive automatic and human evaluation protocols specifically measuring visual logic dimensions including instance consistency, narrative causality, and story readability beyond standard perceptual quality metrics.

Contribution

Logic-aware multi-agent system with visual enhancement module

The framework includes a logic-aware multi-agent system with three collaborative agents (SceneCrafter, LogicMiner, ShotPlanner) that decompose narratives into structured representations, combined with a visual logic enhancement module featuring a Global Causal Verifier and Local Causal Monitor to enforce causal consistency during generation.