Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Generative AI Evaluation, Diffusion Models, Synthetic Imagery, Cultural Bias in AI, Historical Representation
Abstract:

As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work takes an initial step toward building more historically accurate TTI models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes each paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HistVis, a benchmark dataset of 30,000 synthetic images generated from prompts spanning universal human activities across historical periods, alongside a reproducible evaluation protocol. It resides in the Historical Context Accuracy Benchmarks leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader taxonomy of 21 papers across 11 leaf nodes, suggesting the systematic evaluation of historical accuracy in text-to-image models remains an emerging area with limited prior benchmarking infrastructure.

The taxonomy reveals that most related work clusters in adjacent branches: Heritage Reconstruction focuses on practical restoration applications (5 papers across 3 leaves), while Demographic Bias Analysis examines identity representation (2 papers). The paper's leaf explicitly excludes demographic-focused benchmarks and domain-specific applications, positioning it as foundational measurement infrastructure rather than applied heritage work. Its sibling paper examines factual grounding, indicating the leaf emphasizes systematic accuracy assessment over reconstruction or bias critique, though these concerns intersect in the paper's demographic representation component.

Among 20 candidates examined across three contributions, none were identified as clearly refuting the work. The HistVis dataset examined 7 candidates with 0 refutable, the evaluation protocol examined 10 with 0 refutable, and the anachronism detection methodology examined 3 with 0 refutable. This suggests that within the limited search scope, no prior work provides directly overlapping benchmark infrastructure combining large-scale synthetic historical image generation with multi-aspect evaluation protocols. The anachronism detection component appears particularly underexplored, with the smallest candidate pool examined.

Based on the top-20 semantic matches, the work appears to occupy relatively novel ground in creating systematic benchmarks for historical accuracy in generative models. The limited search scope means potentially relevant work in adjacent domains (historical image analysis, temporal reasoning in vision models) may not have been captured. The sparse population of its taxonomy leaf and absence of refuting candidates suggest this represents a genuine gap in evaluation infrastructure, though the field's overall small size (21 papers) indicates this research direction is still crystallizing.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating historical representation in text-to-image diffusion models. The field has organized itself around five major branches that reflect both technical and socio-cultural concerns. Benchmark Development and Evaluation Frameworks focuses on creating systematic methods to assess how accurately generative models depict historical contexts, with works like Synthetic History[0] and The factuality tax of[4] examining factual grounding. Heritage Reconstruction and Preservation Applications explores practical uses in restoring or visualizing historical sites and artifacts, as seen in Preserving architectural heritage in[3] and From ruins to reconstruction[6]. Demographic Bias and Representation Analysis investigates how these models encode and reproduce biases related to identity and culture across time periods, with studies such as Multi-Group Proportional Representations for[2] and Majoritarian Signals[18]. Technical Methods and Model Architectures addresses the underlying algorithmic innovations, including specialized architectures like HST-GAN[13] and History-Guided Video Diffusion[14]. Finally, Socio-Technical and Cultural Analysis examines broader implications of AI-generated historical imagery, exemplified by On the Historical Gaze[17].

A particularly active tension emerges between works prioritizing factual accuracy and those emphasizing cultural sensitivity and representation. Some studies develop rigorous benchmarks to measure historical fidelity, while others like Biases in Generative AI[12] highlight how models perpetuate anachronistic or stereotypical depictions of marginalized groups. Synthetic History[0] sits squarely within the benchmark development branch, sharing methodological concerns with The factuality tax of[4] regarding how to systematically evaluate whether generated images align with documented historical evidence.

However, where some heritage-focused works like Bringing Rome to life[5] prioritize visual plausibility for reconstruction tasks, Synthetic History[0] emphasizes creating evaluation protocols that can detect subtle historical inaccuracies. This positions it as a foundational contribution to measurement infrastructure, complementing but distinct from applied heritage projects or bias-focused critiques that interrogate whose histories get represented and how.

Claimed Contributions

HistVis dataset of synthetic historical images

The authors introduce HistVis, a dataset containing 30,000 synthetic images generated from 100 prompts describing universal human activities across 10 historical periods using three diffusion models (SDXL, SD3, FLUX.1). This dataset enables systematic evaluation of how text-to-image models represent different historical contexts.

7 retrieved papers
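The prompt-grid construction behind HistVis (activities crossed with historical periods, rendered by each model) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the activity phrases, period names, and prompt template below are hypothetical placeholders, since the real lists are defined in the HistVis paper.

```python
from itertools import product

# Hypothetical examples -- the actual 100 activities and 10 periods
# come from the HistVis dataset and are not reproduced here.
ACTIVITIES = ["a person cooking a meal", "a person playing music"]
PERIODS = ["the 17th century", "the 1920s"]
MODELS = ["SDXL", "SD3", "FLUX.1"]

def build_prompts(activities, periods):
    """Cross each activity with each historical period to form prompts."""
    return [f"{activity} in {period}"
            for activity, period in product(activities, periods)]

prompts = build_prompts(ACTIVITIES, PERIODS)
# At full scale, 100 activities x 10 periods = 1,000 prompts; with 10 images
# per prompt from each of the 3 models, this yields the reported 30,000 images.
```

The same grid structure makes per-period and per-model slicing of the evaluation results straightforward.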
Reproducible evaluation protocol for historical representation

The authors develop a reproducible evaluation framework that assesses text-to-image models across three dimensions: implicit stylistic associations with historical periods, historical consistency through anachronism detection, and demographic representation compared to historically plausible patterns.

10 retrieved papers
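The demographic-representation dimension compares generated distributions against historically plausible baselines. One common way to score such a gap, shown here purely as an illustrative sketch (the paper's exact metric and the numbers below are assumptions, not taken from it), is total variation distance between two categorical distributions:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    each given as a {category: probability} dict."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical numbers for illustration only -- not results from the paper.
generated_dist = {"men": 0.8, "women": 0.2}   # measured over generated images
baseline_dist  = {"men": 0.5, "women": 0.5}   # historically plausible baseline
gap = total_variation(generated_dist, baseline_dist)
```

A gap of 0 means the generated distribution matches the baseline exactly; a gap of 1 means the two distributions share no mass.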
Automated anachronism detection methodology

The authors propose a two-stage automated method for detecting anachronisms in generated images, combining LLM-guided anachronism proposal with VQA-based detection, validated through a user study showing 75% agreement with human judgments.

3 retrieved papers
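The two-stage pipeline described above (LLM-guided anachronism proposal followed by VQA-based detection) can be sketched as below. This is an assumed structure, not the authors' implementation: `llm` and `vqa` are hypothetical callables standing in for whatever language and vision-question-answering models the pipeline uses.

```python
def propose_anachronisms(period, llm):
    """Stage 1: ask an LLM (a hypothetical callable taking a prompt string
    and returning a list of artifact names) which objects would be
    anachronistic in images set in the given period."""
    prompt = (f"List modern objects that would be anachronistic "
              f"in an image set in {period}.")
    return llm(prompt)

def detect_anachronisms(image, candidates, vqa):
    """Stage 2: query a VQA model (a hypothetical callable taking an image
    and a question, returning a yes/no answer string) about each candidate."""
    found = []
    for artifact in candidates:
        answer = vqa(image, f"Is there a {artifact} visible in this image?")
        if answer.strip().lower().startswith("yes"):
            found.append(artifact)
    return found
```

Separating proposal from detection keeps the candidate list period-specific and lets the VQA queries stay simple yes/no questions, which is consistent with the reported validation against human judgments.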

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HistVis dataset of synthetic historical images

The authors introduce HistVis, a dataset containing 30,000 synthetic images generated from 100 prompts describing universal human activities across 10 historical periods using three diffusion models (SDXL, SD3, FLUX.1). This dataset enables systematic evaluation of how text-to-image models represent different historical contexts.

Contribution

Reproducible evaluation protocol for historical representation

The authors develop a reproducible evaluation framework that assesses text-to-image models across three dimensions: implicit stylistic associations with historical periods, historical consistency through anachronism detection, and demographic representation compared to historically plausible patterns.

Contribution

Automated anachronism detection methodology

The authors propose a two-stage automated method for detecting anachronisms in generated images, combining LLM-guided anachronism proposal with VQA-based detection, validated through a user study showing 75% agreement with human judgments.