Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Generative AI Evaluation, Diffusion Models, Synthetic Imagery, Cultural Bias in AI, Historical Representation
Abstract:

As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work takes an initial step toward building more historically accurate TTI models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes each paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HistVis, a benchmark dataset of 30,000 synthetic images generated from prompts spanning universal human activities across historical periods, alongside a reproducible evaluation protocol. It resides in the Historical Context Accuracy Benchmarks leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader taxonomy of 21 papers across 11 leaf nodes, suggesting the systematic evaluation of historical accuracy in text-to-image models remains an emerging area with limited prior benchmarking infrastructure.

The taxonomy reveals that most related work clusters in adjacent branches: Heritage Reconstruction focuses on practical restoration applications (5 papers across 3 leaves), while Demographic Bias Analysis examines identity representation (2 papers). The paper's leaf explicitly excludes demographic-focused benchmarks and domain-specific applications, positioning it as foundational measurement infrastructure rather than applied heritage work. Its sibling paper examines factual grounding, indicating the leaf emphasizes systematic accuracy assessment over reconstruction or bias critique, though these concerns intersect in the paper's demographic representation component.

Among 20 candidates examined across three contributions, none were identified as clearly refuting the work. The HistVis dataset examined 7 candidates with 0 refutable, the evaluation protocol examined 10 with 0 refutable, and the anachronism detection methodology examined 3 with 0 refutable. This suggests that within the limited search scope, no prior work provides directly overlapping benchmark infrastructure combining large-scale synthetic historical image generation with multi-aspect evaluation protocols. The anachronism detection component appears particularly underexplored, with the smallest candidate pool examined.

Based on the top-20 semantic matches, the work appears to occupy relatively novel ground in creating systematic benchmarks for historical accuracy in generative models. The limited search scope means potentially relevant work in adjacent domains (historical image analysis, temporal reasoning in vision models) may not have been captured. The sparse population of its taxonomy leaf and absence of refuting candidates suggest this represents a genuine gap in evaluation infrastructure, though the field's overall small size (21 papers) indicates this research direction is still crystallizing.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating historical representation in text-to-image diffusion models. The field has organized itself around five major branches that reflect both technical and socio-cultural concerns. Benchmark Development and Evaluation Frameworks focuses on creating systematic methods to assess how accurately generative models depict historical contexts, with works like Synthetic History[0] and The factuality tax of[4] examining factual grounding. Heritage Reconstruction and Preservation Applications explores practical uses in restoring or visualizing historical sites and artifacts, as seen in Preserving architectural heritage in[3] and From ruins to reconstruction[6]. Demographic Bias and Representation Analysis investigates how these models encode and reproduce biases related to identity and culture across time periods, with studies such as Multi-Group Proportional Representations for[2] and Majoritarian Signals[18]. Technical Methods and Model Architectures addresses the underlying algorithmic innovations, including specialized architectures like HST-GAN[13] and History-Guided Video Diffusion[14]. Finally, Socio-Technical and Cultural Analysis examines broader implications of AI-generated historical imagery, exemplified by On the Historical Gaze[17].

A particularly active tension emerges between works prioritizing factual accuracy and those emphasizing cultural sensitivity and representation. Some studies develop rigorous benchmarks to measure historical fidelity, while others like Biases in Generative AI[12] highlight how models perpetuate anachronistic or stereotypical depictions of marginalized groups. Synthetic History[0] sits squarely within the benchmark development branch, sharing methodological concerns with The factuality tax of[4] regarding how to systematically evaluate whether generated images align with documented historical evidence.

However, where some heritage-focused works like Bringing Rome to life[5] prioritize visual plausibility for reconstruction tasks, Synthetic History[0] emphasizes creating evaluation protocols that can detect subtle historical inaccuracies. This positions it as a foundational contribution to measurement infrastructure, complementing but distinct from applied heritage projects or bias-focused critiques that interrogate whose histories get represented and how.

Claimed Contributions

HistVis dataset of synthetic historical images

The authors introduce HistVis, a dataset containing 30,000 synthetic images generated from 100 prompts describing universal human activities across 10 historical periods using three diffusion models (SDXL, SD3, FLUX.1). This dataset enables systematic evaluation of how text-to-image models represent different historical contexts.

7 retrieved papers
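The prompt-grid construction behind HistVis (activities crossed with historical periods, rendered by each model) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the activity phrases, period names, and prompt template below are hypothetical placeholders, since the real lists are defined in the HistVis paper.

```python
from itertools import product

# Hypothetical examples -- the actual 100 activities and 10 periods
# come from the HistVis dataset and are not reproduced here.
ACTIVITIES = ["a person cooking a meal", "a person playing music"]
PERIODS = ["the 17th century", "the 1920s"]
MODELS = ["SDXL", "SD3", "FLUX.1"]

def build_prompts(activities, periods):
    """Cross each activity with each historical period to form prompts."""
    return [f"{activity} in {period}"
            for activity, period in product(activities, periods)]

prompts = build_prompts(ACTIVITIES, PERIODS)
# At full scale, 100 activities x 10 periods = 1,000 prompts; with 10 images
# per prompt from each of the 3 models, this yields the reported 30,000 images.
```

The same grid structure makes per-period and per-model slicing of the evaluation results straightforward.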
Reproducible evaluation protocol for historical representation

The authors develop a reproducible evaluation framework that assesses text-to-image models across three dimensions: implicit stylistic associations with historical periods, historical consistency through anachronism detection, and demographic representation compared to historically plausible patterns.

10 retrieved papers
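The demographic-representation dimension compares generated distributions against historically plausible baselines. One common way to score such a gap, shown here purely as an illustrative sketch (the paper's exact metric and the numbers below are assumptions, not taken from it), is total variation distance between two categorical distributions:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    each given as a {category: probability} dict."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical numbers for illustration only -- not results from the paper.
generated_dist = {"men": 0.8, "women": 0.2}   # measured over generated images
baseline_dist  = {"men": 0.5, "women": 0.5}   # historically plausible baseline
gap = total_variation(generated_dist, baseline_dist)
```

A gap of 0 means the generated distribution matches the baseline exactly; a gap of 1 means the two distributions share no mass.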
Automated anachronism detection methodology

The authors propose a two-stage automated method for detecting anachronisms in generated images, combining LLM-guided anachronism proposal with VQA-based detection, validated through a user study showing 75% agreement with human judgments.

3 retrieved papers
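The two-stage pipeline described above (LLM-guided anachronism proposal followed by VQA-based detection) can be sketched as below. This is an assumed structure, not the authors' implementation: `llm` and `vqa` are hypothetical callables standing in for whatever language and vision-question-answering models the pipeline uses.

```python
def propose_anachronisms(period, llm):
    """Stage 1: ask an LLM (a hypothetical callable taking a prompt string
    and returning a list of artifact names) which objects would be
    anachronistic in images set in the given period."""
    prompt = (f"List modern objects that would be anachronistic "
              f"in an image set in {period}.")
    return llm(prompt)

def detect_anachronisms(image, candidates, vqa):
    """Stage 2: query a VQA model (a hypothetical callable taking an image
    and a question, returning a yes/no answer string) about each candidate."""
    found = []
    for artifact in candidates:
        answer = vqa(image, f"Is there a {artifact} visible in this image?")
        if answer.strip().lower().startswith("yes"):
            found.append(artifact)
    return found
```

Separating proposal from detection keeps the candidate list period-specific and lets the VQA queries stay simple yes/no questions, which is consistent with the reported validation against human judgments.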

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HistVis dataset of synthetic historical images

The authors introduce HistVis, a dataset containing 30,000 synthetic images generated from 100 prompts describing universal human activities across 10 historical periods using three diffusion models (SDXL, SD3, FLUX.1). This dataset enables systematic evaluation of how text-to-image models represent different historical contexts.

Contribution

Reproducible evaluation protocol for historical representation

The authors develop a reproducible evaluation framework that assesses text-to-image models across three dimensions: implicit stylistic associations with historical periods, historical consistency through anachronism detection, and demographic representation compared to historically plausible patterns.

Contribution

Automated anachronism detection methodology

The authors propose a two-stage automated method for detecting anachronisms in generated images, combining LLM-guided anachronism proposal with VQA-based detection, validated through a user study showing 75% agreement with human judgments.