THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Multimodal Large Language Model, Vision Fraud Reasoning, Scientific Paper Fraud Detection, Benchmark
Abstract:

We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate Multimodal Large Language Models (MLLMs) on visual fraud reasoning in real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advancements. (1) Real-world Scenarios & Complexity: Our benchmark comprises over 4K questions spanning 7 scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 73.73% of images containing complex textures, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Task Diversity & Granularity: THEMIS systematically covers five challenging tasks and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulations, whose diversity and difficulty demand a high level of visual fraud reasoning from models. (3) Multi-dimensional Capability Evaluation: We establish a mapping from fraud tasks to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 11 leading MLLMs show that even the best-performing model falls below the passing threshold, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud detection tasks. The data and code will be made available at https://anonymous.4open.science/r/themis1638.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces THEMIS, a multi-task benchmark for evaluating multimodal large language models on visual fraud reasoning in scientific publications. It resides in the 'Benchmarking and Evaluation Frameworks' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Scientific Image Integrity Tools and Systems', a branch focused on practical detection platforms rather than algorithmic development. The small sibling set suggests that comprehensive MLLM-focused benchmarks for scientific fraud remain underexplored compared to traditional computer vision detection methods.

The taxonomy reveals neighboring leaves addressing automated screening systems and duplication detection methods, while sibling branches cover computational forgery detection and AI-generated fraud. THEMIS diverges from these by targeting MLLM evaluation rather than developing detection algorithms or screening tools. The 'Computational Image Forgery Detection Methods' branch contains numerous papers on copy-move and general manipulation detection, but these focus on algorithmic performance rather than model capability assessment. The benchmark's emphasis on real-world retracted cases and multi-dimensional reasoning connects it to the 'Fraud Detection and Characterization' branch, yet its evaluation-centric design keeps it firmly within the benchmarking category.

Among fourteen candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (the THEMIS benchmark) was compared against four candidates with zero refutations, suggesting limited prior work on MLLM-specific fraud benchmarks. The second contribution (the multi-dimensional capability framework) was compared against ten candidates without refutation, indicating novelty in mapping fraud tasks to reasoning abilities. The third contribution (real-world complexity) had zero candidates retrieved, leaving its novelty assessment incomplete. This limited search scope of fourteen papers from semantic matching means the analysis captures only a narrow slice of potentially relevant literature, particularly missing broader MLLM evaluation work outside scientific fraud contexts.
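The per-contribution counting above can be expressed as a minimal tally. This is an illustration only: `Contribution`, the candidate lists, and the field names are hypothetical stand-ins for the actual WisPaper retrieval and LLM comparison steps, which are not described in detail here.

```python
# Illustrative sketch (not the actual WisPaper pipeline) of tallying
# retrieved candidates and refutations per claimed contribution.
from dataclasses import dataclass, field


@dataclass
class Contribution:
    claim: str
    candidates: list = field(default_factory=list)  # retrieved prior works
    refutations: int = 0                            # candidates judged to refute


def tally(contributions):
    """Summarize total candidates examined and refutations found."""
    examined = sum(len(c.candidates) for c in contributions)
    refuted = sum(c.refutations for c in contributions)
    return examined, refuted


# Candidate counts mirror the report: 4, 10, and 0 per contribution.
contribs = [
    Contribution("THEMIS benchmark", candidates=["p1", "p2", "p3", "p4"]),
    Contribution("Capability framework",
                 candidates=[f"p{i}" for i in range(5, 15)]),
    Contribution("Real-world complexity"),  # zero candidates retrieved
]
examined, refuted = tally(contribs)  # → (14, 0)
```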

Based on the constrained search of fourteen semantically similar papers, THEMIS appears to occupy a relatively novel position within scientific image fraud benchmarking, particularly for MLLM evaluation. However, the small candidate pool and sparse taxonomy leaf suggest either genuine novelty or incomplete coverage of adjacent MLLM benchmarking literature. The analysis does not address whether similar multi-task reasoning benchmarks exist in other domains that could inform or overlap with this work's methodological contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: visual fraud reasoning in scientific paper images. The field addresses the growing challenge of detecting and characterizing manipulated or fabricated images in scholarly publications, spanning multiple interconnected branches. Fraud Detection and Characterization in Scientific Publications examines the scope and nature of misconduct, from systematic reviews of fraudulent studies (Fraudulent Studies Systematic Reviews[1]) to analyses of figure plagiarism (Figure Plagiarism Academia[3]) and paper mill operations (Unveiling Paper Mills[34]). AI-Generated and Deepfake Image Fraud focuses on emerging threats from generative models (AI Generated Images Era[4], Deepfakes Threat Publications[21]), while Computational Image Forgery Detection Methods develops algorithmic approaches, including copy-move detection (Binary Pattern Copy Move[5], DBSCAN Copy Move Detection[28]) and deep learning techniques (Deep Learning Image Manipulation[8], Learning Rich Features[30]). Scientific Image Integrity Tools and Systems translates these methods into practical screening platforms, and Image Integrity Standards and Guidelines establishes best practices for researchers and publishers (Guard Against Image Fraud[14], Teaching Ethics Imaging[6]).

A particularly active line of work centers on building robust benchmarking frameworks that can systematically evaluate detection methods across diverse manipulation types. THEMIS[0] sits squarely within this Benchmarking and Evaluation Frameworks cluster, alongside works such as Benchmarking Scientific Image Forgery[41] and Context Aware Semantic Forgery[38]. While many computational approaches focus narrowly on specific forgery techniques, such as copy-move detection or compression artifact analysis, THEMIS emphasizes comprehensive evaluation across multiple fraud categories relevant to scientific publishing. This contrasts with neighboring efforts like Context Aware Semantic Forgery[38], which targets semantic-level manipulations, and Benchmarking Scientific Image Forgery[41], which may prioritize a different manipulation taxonomy. The central tension across these branches involves balancing detection sensitivity against false-positive rates, especially as AI-generated content (AI Enabled Image Fraud[20]) blurs traditional forensic signatures, making standardized benchmarks increasingly critical for validating new detection systems.

Claimed Contributions

THEMIS benchmark for evaluating MLLMs on scientific paper fraud detection

The authors introduce THEMIS, a benchmark comprising over 4,000 questions across seven academic scenarios derived from authentic retracted papers and synthetic data. It systematically evaluates MLLMs on five fraud detection tasks (AI-Generated, Splicing, Copy-Move, Duplication, Text-Image Inconsistency) with 16 fine-grained manipulation operations.

4 retrieved papers
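To make the benchmark's composition concrete, one plausible shape for a single THEMIS question sample is sketched below. The field names, scenario label, and operation names are illustrative assumptions; only the five task names, the two data sources, and the notion of stacked manipulation operations come from the paper's description.

```python
# Hypothetical schema for one THEMIS benchmark sample. Field names and
# example values are assumptions, not the authors' released format.
from dataclasses import dataclass


@dataclass
class FraudSample:
    image_path: str
    scenario: str          # one of the 7 academic scenarios
    task: str              # AI-Generated | Splicing | Copy-Move |
                           # Duplication | Text-Image Inconsistency
    operations: list[str]  # stacked fine-grained manipulations (16 types total)
    question: str
    answer: str
    source: str            # "retracted-paper" or "synthetic"


sample = FraudSample(
    image_path="images/0001.png",
    scenario="western-blot",                 # assumed scenario label
    task="Copy-Move",
    operations=["copy", "rotate", "paste"],  # assumed operation names
    question="Does this figure contain a duplicated region?",
    answer="Yes",
    source="synthetic",
)
```

A flat record like this would let an evaluation harness filter by task, scenario, or manipulation depth when computing per-slice accuracies.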
Multi-dimensional capability evaluation framework mapping fraud tasks to core reasoning abilities

The authors develop a principled framework that maps fraud detection tasks to five core reasoning capabilities: Expert Knowledge Utilization, Visual Recognition, Spatial Reasoning, Region Localization, and Comparative Reasoning. This enables fine-grained diagnosis of MLLMs' strengths and weaknesses in expert-level visual reasoning.

10 retrieved papers
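The task-to-capability mapping described above can be sketched as a simple aggregation: each task contributes its accuracy to the capabilities it is taken to exercise, and per-capability scores are the averages over contributing tasks. The mapping weights below are illustrative assumptions, not the authors' published matrix; only the task and capability names come from the paper.

```python
# Minimal sketch of aggregating per-task accuracies into per-capability
# scores. TASK_TO_CAPS is an assumed mapping for illustration.
from collections import defaultdict

TASK_TO_CAPS = {
    "AI-Generated": ["Expert Knowledge Utilization", "Visual Recognition"],
    "Splicing": ["Visual Recognition", "Region Localization"],
    "Copy-Move": ["Spatial Reasoning", "Region Localization"],
    "Duplication": ["Comparative Reasoning", "Visual Recognition"],
    "Text-Image Inconsistency": ["Expert Knowledge Utilization",
                                 "Comparative Reasoning"],
}


def capability_scores(task_accuracy):
    """Average each capability's score over the tasks that exercise it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task, acc in task_accuracy.items():
        for cap in TASK_TO_CAPS[task]:
            totals[cap] += acc
            counts[cap] += 1
    return {cap: totals[cap] / counts[cap] for cap in totals}


# Hypothetical per-task accuracies for one model.
scores = capability_scores({
    "AI-Generated": 0.62, "Splicing": 0.48, "Copy-Move": 0.41,
    "Duplication": 0.55, "Text-Image Inconsistency": 0.50,
})
```

Under this sketch, a low "Region Localization" average paired with a high "Comparative Reasoning" average would be exactly the kind of fine-grained diagnosis the framework aims at.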
Real-world complexity through authentic retracted cases and complex-texture images

The authors construct a benchmark grounded in realistic academic fraud scenarios by incorporating authentic retracted paper cases and ensuring 73.73% of images contain complex textures. This design bridges the gap between existing benchmarks and real-world academic fraud complexity.

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

