Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Anomaly Detection, AI-Generated Images
Abstract:

The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle semantic anomalies, including unrealistic object configurations, violations of physical laws, and commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper, we formalize semantic anomaly detection and reasoning for AIGC images and introduce AnomReason, a large-scale benchmark with structured annotations as quadruples (Name, Phenomenon, Reasoning, Severity). Annotations are produced by a modular multi-agent pipeline (AnomAgent) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17B GPT-4o tokens, evidence of the scale behind the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (SemAP and SemF1). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formalizes semantic anomaly detection and reasoning for AIGC images and introduces AnomReason, a benchmark with structured quadruple annotations, alongside AnomAgent, a multi-agent annotation pipeline. It resides in the Multimodal Large Language Model-Based Reasoning leaf, which contains seven papers including the original work. This leaf sits within the broader Semantic Anomaly Detection and Reasoning branch, indicating a moderately populated research direction focused on leveraging MLLMs for explainable semantic inconsistency detection, as opposed to low-level artifact-based methods.

The taxonomy reveals neighboring leaves such as Non-MLLM Semantic Detection (five papers) and branches like Artifact-Based Detection (multiple sub-leaves) and Explainability and Interpretability. The paper's MLLM-based approach diverges from traditional deep learning methods in the sibling Non-MLLM leaf and complements explainability work by providing structured reasoning outputs. The taxonomy's scope notes clarify that this work emphasizes vision-language reasoning and commonsense error detection, distinguishing it from purely visual or frequency-domain methods in adjacent branches.

Of the thirty candidates examined (ten per contribution), the formalization contribution has two refutable candidates, suggesting some prior work on the task definition exists within the limited search scope. The ten candidates examined for each of the AnomReason benchmark and AnomAgent pipeline contributions yielded zero refutable matches, indicating that these specific structured-annotation and multi-agent pipeline designs are less directly overlapped in the sampled literature. These statistics reflect a focused semantic search rather than exhaustive coverage, so unexamined work may exist beyond the top thirty matches.

Based on the limited search scope of thirty semantically similar papers, the benchmark and pipeline contributions appear more distinctive than the task formalization, which has identifiable prior work among examined candidates. The taxonomy context shows the paper occupies a moderately active MLLM-based reasoning niche within a broader field spanning artifact detection, explainability, and safety. This analysis covers top-ranked semantic matches and does not claim exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: semantic anomaly detection and reasoning in AI-generated images. The field has evolved beyond simple artifact-based detection to encompass a rich taxonomy of approaches. At the top level, the taxonomy divides into Semantic Anomaly Detection and Reasoning, which focuses on logical inconsistencies and content-level errors; Artifact-Based Detection, which targets low-level forensic traces; Multimodal and Hybrid Detection Frameworks, which combine visual and textual cues; Explainability and Interpretability, which aims to make detection decisions transparent; Safety and Content Moderation, which addresses harmful content; Domain Adaptation and Generalization, which tackles cross-dataset robustness; and Related Image Analysis Tasks, which situates this work within broader computer vision challenges.

Representative works such as Explainable Fake Detection[4] and Grounded Reasoning Detection[6] illustrate how semantic reasoning and explainability have become central themes, while methods like SemID Inpainting[3] and Synthetic Photography Detection[5] highlight the diversity of technical strategies. A particularly active line of work leverages multimodal large language models to perform reasoning about semantic inconsistencies, moving beyond pixel-level cues to higher-level understanding.

Semantic Visual Anomaly[0] sits squarely within this branch, emphasizing the use of advanced reasoning capabilities to identify logical flaws in generated images. This approach contrasts with nearby works such as Seeing Before Reasoning[8], which explores the interplay between visual perception and reasoning stages, and FakeReasoning[24], which also employs reasoning but may differ in architectural choices or dataset focus.
The trade-offs here revolve around balancing computational cost, interpretability, and generalization: while reasoning-based methods offer richer explanations and can capture subtle semantic errors, they may require more resources and careful prompt engineering compared to purely visual or hybrid approaches like GPT Forensics[17] or ForgerySleuth[30]. Open questions include how to scale these reasoning frameworks across diverse generative models and how to ensure robustness when adversaries adapt to semantic detection strategies.

Claimed Contributions

Formalization of semantic anomaly detection and reasoning task for AIGC images

The authors formally define a new task that requires detecting and explaining semantic-level anomalies in AI-generated images through structured outputs comprising Name, Phenomenon, Reasoning, and Severity Score. This formulation goes beyond surface-level artifact detection to capture commonsense violations, physical implausibilities, and logical inconsistencies.
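As a concrete illustration, the structured output described above can be modeled as a simple record type. This is a minimal sketch: the field names mirror the (Name, Phenomenon, Reasoning, Severity) quadruple, but the exact schema, including the 1-5 severity scale used here, is an assumption rather than the authors' specification.

```python
from dataclasses import dataclass

@dataclass
class SemanticAnomaly:
    """One structured anomaly annotation for an AI-generated image.

    Field names follow the (Name, Phenomenon, Reasoning, Severity)
    quadruple; the 1-5 severity scale is a hypothetical choice.
    """
    name: str        # short label, e.g. "extra finger"
    phenomenon: str  # what is visibly wrong in the image
    reasoning: str   # why it violates physics or commonsense
    severity: int    # hypothetical scale: 1 (minor) .. 5 (scene-breaking)

# Example annotation for a common generation failure mode.
anomaly = SemanticAnomaly(
    name="extra finger",
    phenomenon="the left hand shows six fingers",
    reasoning="human hands have five fingers; a sixth violates anatomy",
    severity=4,
)
print(anomaly.name, anomaly.severity)
```

A structured record like this is what allows a semantic matching metric to compare predicted and gold anomalies field by field, rather than scoring free-form text.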

10 retrieved papers (2 refutable)
AnomReason benchmark with structured quadruple annotations

The authors construct a large-scale benchmark dataset containing 21,539 AI-generated images annotated with structured semantic anomalies. Each anomaly is represented as a quadruple capturing what is wrong, why it is wrong, and how severe it is, enabling interpretable semantic analysis.

10 retrieved papers
AnomAgent multi-agent annotation pipeline with human-in-the-loop verification

The authors develop a modular multi-agent framework that decomposes anomaly reasoning into specialized stages (entity parsing, attribute analysis, relational reasoning, and anomaly consolidation). This pipeline is combined with lightweight human verification to balance annotation scale and quality.
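The staged decomposition described above can be sketched as a sequential pipeline in which each agent reads and extends a shared analysis state. This is a hypothetical illustration: the stage signatures, the shared state dictionary, and the placeholder logic are assumptions; the actual AnomAgent stages are MLLM-backed rather than stubs.

```python
# Hypothetical sketch of a staged annotation pipeline mirroring the
# decomposition into entity parsing, attribute analysis, relational
# reasoning, and anomaly consolidation. Real stages would call an
# MLLM; here each is a stub that records a placeholder result.

def entity_parsing(state):
    state["entities"] = ["hand", "shadow"]  # placeholder detections
    return state

def attribute_analysis(state):
    state["attributes"] = {"hand": "six fingers"}  # placeholder
    return state

def relational_reasoning(state):
    state["relations"] = ["shadow detached from hand"]  # placeholder
    return state

def anomaly_consolidation(state):
    # Merge stage outputs into candidate anomalies for human review.
    state["anomalies"] = [
        {"name": "extra finger", "severity": 4},
        {"name": "floating shadow", "severity": 2},
    ]
    return state

PIPELINE = [entity_parsing, attribute_analysis,
            relational_reasoning, anomaly_consolidation]

def run_pipeline(image_id):
    """Run every stage in order over a fresh state for one image."""
    state = {"image": image_id}
    for stage in PIPELINE:
        state = stage(state)
    return state

result = run_pipeline("img_0001")
print(len(result["anomalies"]))  # candidates passed to human verification
```

The design choice sketched here, a linear chain over a shared state, keeps each stage independently replaceable and makes the final consolidation step a natural insertion point for lightweight human verification.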

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formalization of semantic anomaly detection and reasoning task for AIGC images


Contribution

AnomReason benchmark with structured quadruple annotations


Contribution

AnomAgent multi-agent annotation pipeline with human-in-the-loop verification
