LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Visual Storytelling, Multi-Image Sequence Generation, Story Planning, Visual Logic Consistency, Causal Reasoning, Narrative Coherence
Abstract:

Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate generation capability, we construct LogicTale, a benchmark comprising richly annotated stories that emphasize causal reasoning and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LogiStory, a framework for multi-image story visualization that explicitly models visual logic—defined as perceptual and causal coherence among characters, actions, and scenes. It resides in the 'Visual Logic and Causal Reasoning' leaf of the taxonomy, which contains only two papers total. This is a notably sparse research direction within the broader Story Visualization Generation Methods branch, suggesting the paper targets an underexplored niche. The sibling paper in this leaf addresses related reasoning challenges but does not appear to provide identical technical solutions.

The taxonomy reveals that neighboring leaves focus on character consistency modeling, knowledge-enhanced generation, and discourse-constrained approaches. These directions emphasize surface-level visual coherence or external knowledge integration rather than explicit causal chain modeling. The scope note for the Visual Logic leaf explicitly excludes general consistency methods, positioning LogiStory's contribution as distinct from prior work that treats narrative coherence as an implicit byproduct. The framework's multi-agent design for grounding roles and extracting causal chains appears to bridge structured planning with visual synthesis in a manner not directly addressed by sibling branches.

Among thirty candidates examined, none clearly refute the three core contributions: the visual logic concept and framework, the LogicTale benchmark dataset, and the logic-aware multi-agent system with visual enhancement module. Each contribution was assessed against ten candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, the paper's technical approach and evaluation resources appear novel. However, the analysis does not claim exhaustive coverage; it reflects top-K semantic matches and citation expansion, not a comprehensive field survey.

Based on the limited literature search, LogiStory appears to occupy a sparsely populated research direction with minimal direct prior work among the examined candidates. The taxonomy structure and contribution-level statistics indicate that explicit visual logic modeling remains underexplored relative to adjacent consistency and knowledge-integration methods. Readers should interpret these findings as preliminary signals rather than definitive judgments, given the constrained search scope.

Taxonomy

Core-task taxonomy papers: 37
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: multi-image story visualization with visual logic modeling. The field encompasses methods for generating coherent visual narratives from textual descriptions, analyzing the structural and causal relationships that underpin storytelling, building interactive tools for authors and creators, establishing rigorous evaluation protocols, and drawing on narrative theory from disciplines such as cognitive science and media studies. The main branches reflect these complementary concerns. Story Visualization Generation Methods focuses on algorithmic approaches, ranging from diffusion-based image synthesis to large multimodal models, that produce sequences of images faithful to plot progression and character consistency. Narrative Structure Analysis and Comprehension investigates how humans and machines parse temporal order, causal links, and semantic coherence in visual sequences, often leveraging insights from comics grammar and cognitive frameworks. Interactive Authoring Systems and Applications targets practical platforms that enable users to co-create stories with computational assistance, while Evaluation and Benchmarking develops datasets and metrics to measure fidelity, coherence, and narrative quality. Finally, Narrative Theory and Interdisciplinary Perspectives situates computational work within broader humanistic and cognitive traditions, examining how visual storytelling conventions shape interpretation.

Within Story Visualization Generation Methods, a particularly active line of work emphasizes visual logic and causal reasoning: ensuring that generated image sequences respect not only surface-level consistency but also the underlying cause-and-effect relationships that drive narrative progression. LogiStory[0] exemplifies this direction by explicitly modeling logical dependencies between story events, aiming to produce images that reflect coherent temporal and causal chains. This contrasts with earlier efforts such as Visual Chain Thought[5], which introduced stepwise reasoning for visual narratives but did not fully integrate causal constraints into the generation pipeline. Meanwhile, works like StoryGPT[3] leverage large language models to guide image synthesis, prioritizing fluency and diversity over strict logical grounding. LogiStory[0] thus occupies a niche that bridges generative modeling with structured reasoning, addressing a key challenge in producing narratives that are both visually compelling and logically sound, a trade-off that remains an open question as the field matures.

Claimed Contributions

Visual logic concept and LogiStory framework

The authors formally introduce visual logic as the perceptual and causal coherence among characters, actions, and scenes over time in visual sequences. They propose LogiStory, a framework that explicitly models visual logic through a multi-agent system for structured story planning and a visual logic enhancement module with global and local causal verification components.

10 retrieved papers
LogicTale benchmark dataset

The authors construct LogicTale, a benchmark comprising 60 richly annotated stories with causal annotations, action-state flows, and panel-level breakdowns. They design comprehensive automatic and human evaluation protocols specifically measuring visual logic dimensions including instance consistency, narrative causality, and story readability beyond standard perceptual quality metrics.

10 retrieved papers
Logic-aware multi-agent system with visual enhancement module

The framework includes a logic-aware multi-agent system with three collaborative agents (SceneCrafter, LogicMiner, ShotPlanner) that decompose narratives into structured representations, combined with a visual logic enhancement module featuring a Global Causal Verifier and Local Causal Monitor to enforce causal consistency during generation.

10 retrieved papers
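For concreteness, the reported three-agent decomposition can be caricatured as a plain-Python control-flow sketch. This is a hypothetical illustration only: the component names (SceneCrafter, LogicMiner, ShotPlanner, Global Causal Verifier) come from the report, but every function body below is an assumed stub, since the paper's agents are presumably LLM-backed, and the `Shot` type and all signatures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Shot:
    """A hypothetical per-panel plan: where it happens, what happens, and why."""
    scene: str
    action: str
    cause: Optional[str] = None  # description of the preceding cause, if any

def scene_crafter(story_sentences):
    # SceneCrafter stand-in: ground each sentence in a scene slot (stub mapping).
    return [{"scene": f"scene-{i}", "text": s} for i, s in enumerate(story_sentences)]

def logic_miner(scenes):
    # LogicMiner stand-in: extract a causal chain; here each event is naively
    # linked to its predecessor, whereas a real agent would infer true causes.
    return [{"effect": sc["text"],
             "cause": scenes[i - 1]["text"] if i > 0 else None}
            for i, sc in enumerate(scenes)]

def shot_planner(scenes, chain):
    # ShotPlanner stand-in: merge scene grounding and causal links into shots.
    return [Shot(scene=sc["scene"], action=sc["text"], cause=link["cause"])
            for sc, link in zip(scenes, chain)]

def global_causal_verifier(shots):
    # Story-level check: every non-initial shot must be backed by a cause.
    return all(shot.cause is not None for shot in shots[1:])
```

A usage pass might feed `["A boy plants a seed.", "The seed sprouts.", "A flower blooms."]` through the three stages and run the verifier before any image generation; the report's Local Causal Monitor would then act per-shot during synthesis, which this sketch does not model.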

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual logic concept and LogiStory framework

The authors formally introduce visual logic as the perceptual and causal coherence among characters, actions, and scenes over time in visual sequences. They propose LogiStory, a framework that explicitly models visual logic through a multi-agent system for structured story planning and a visual logic enhancement module with global and local causal verification components.

Contribution

LogicTale benchmark dataset

The authors construct LogicTale, a benchmark comprising 60 richly annotated stories with causal annotations, action-state flows, and panel-level breakdowns. They design comprehensive automatic and human evaluation protocols specifically measuring visual logic dimensions including instance consistency, narrative causality, and story readability beyond standard perceptual quality metrics.

Contribution

Logic-aware multi-agent system with visual enhancement module

The framework includes a logic-aware multi-agent system with three collaborative agents (SceneCrafter, LogicMiner, ShotPlanner) that decompose narratives into structured representations, combined with a visual logic enhancement module featuring a Global Causal Verifier and Local Causal Monitor to enforce causal consistency during generation.