PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Large Multimodal Models, Scientific document understanding, Evaluation benchmark
Abstract:

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1–54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants. We provide the source code and dataset viewer in the appendix, and will release the full source code, dataset, and annotation tool publicly upon acceptance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PRISMM-Bench, a benchmark dataset of real multimodal inconsistencies in scientific papers sourced from peer reviews, alongside three evaluation tasks (identification, remedy, pair matching) and a JSON-based debiasing method for multiple-choice evaluation. According to the taxonomy, this work resides in the 'Peer-Review-Grounded Inconsistency Benchmarking' leaf under 'Multimodal Inconsistency Detection and Correction in Scientific Documents'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this is a relatively sparse and emerging research direction within the broader field of multimodal inconsistency detection.

The taxonomy reveals that the broader parent category ('Multimodal Inconsistency Detection and Correction in Scientific Documents') includes a sibling leaf on 'Compositional Conflict Detection and Knowledge Debugging', which houses frameworks for diagnosing compositional failures and contradiction detection in multimodal documents. Meanwhile, neighboring branches address cross-modal knowledge conflicts in large multimodal models (e.g., parametric conflicts, temporal robustness) and application-specific mitigation (fake news detection, sentiment analysis). The scope notes clarify that peer-review-grounded benchmarking excludes synthetic conflict generation and general compositional failure diagnosis, positioning this work as focused on authentic, expert-annotated inconsistencies rather than artificially constructed errors.

Among the three contributions analyzed, the reviewer-sourced dataset was compared against ten candidates with zero refutable prior works, and the JSON-based debiasing method against seven candidates, also with zero refutations. The three-task benchmark suite was compared against ten candidates, one of which was flagged as potentially refutable; twenty-seven candidates were examined in total across the three contributions. These statistics indicate that, within the top-K semantic matches examined, the dataset and debiasing contributions appear relatively novel, whereas the task design has at least one overlapping prior effort. The analysis explicitly acknowledges its limited scope, covering a focused set of semantically similar papers rather than an exhaustive literature review.

Given the sparse taxonomy leaf (no sibling papers) and the limited refutation evidence across contributions, the work appears to occupy a relatively underexplored niche—peer-review-grounded multimodal inconsistency benchmarking in scientific documents. However, the analysis is constrained by the twenty-seven-candidate search scope and does not cover the full breadth of related work in compositional conflict detection or broader multimodal reasoning benchmarks. The findings suggest novelty in grounding and domain focus, though the task design shows some overlap with existing evaluation frameworks.

Taxonomy

32 Core-task Taxonomy Papers
3 Claimed Contributions
27 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Detecting and resolving multimodal inconsistencies in scientific papers. The field addresses situations where different modalities—text, images, tables, equations—present conflicting information, a challenge that spans multiple research communities. The taxonomy reveals several major branches: some focus on knowledge conflicts in large multimodal models (e.g., Multimodal Knowledge Conflicts[3], Vision Language Conflicts[5]), examining how pretrained systems handle contradictory cross-modal inputs; others target inconsistency detection and correction specifically within scientific documents, including peer-review-grounded benchmarking; a third set explores application-specific mitigation in domains like sentiment analysis (Conflict Aware Sentiment[17]) or medical imaging (Universal Medical Representation[11]); additional branches investigate fusion-level conflict resolution in perception tasks (BEV Cross Modal[4], Tracking Fusion Conflict[9]), noisy correspondence learning for retrieval (Noisy Correspondence Learning[10]), architectural optimization under modality imbalance (Modal Imbalance Mitigation[22]), and even cognitive models of cross-modal conflict (Crossmodal Conflict Resolution[26]).

A particularly active line of work centers on benchmarking and mitigating inconsistencies in vision-language models and scientific contexts. Crosscheck Bench[6] and PRISMM Bench[0] both provide evaluation frameworks, yet they differ in grounding: PRISMM Bench[0] leverages peer-review feedback to identify real-world inconsistencies in scientific manuscripts, offering a naturalistic testbed closely tied to document-level reasoning, whereas Crosscheck Bench[6] emphasizes broader cross-modal verification tasks. Nearby efforts like Seeing is Fixing[8] and Mitigating Multimodal Inconsistency[2] propose correction mechanisms, while Temporal Inconsistency Robustness[1] extends the problem to temporal dimensions. PRISMM Bench[0] thus occupies a niche at the intersection of scientific document analysis and multimodal conflict detection, distinguished by its use of expert reviewer annotations to ground inconsistency examples in authentic scholarly communication.

Claimed Contributions

PRISMM-Bench: A reviewer-sourced dataset of real multimodal inconsistencies in scientific papers

The authors propose a novel benchmark dataset containing 384 real-world inconsistencies across 353 scientific papers, sourced from peer reviews on OpenReview. In contrast to prior synthetic datasets, these are authentic reviewer-flagged errors spanning 15 categories of visual-textual and inter-visual mismatches.

10 retrieved papers
Three-task benchmark suite for multimodal inconsistency reasoning

The authors design three multiple-choice tasks: Inconsistency Identification (detecting what the inconsistency is), Inconsistency Remedy (determining how to fix it), and Inconsistency Pair Match (identifying which two elements conflict). These tasks form a tiered framework evaluating models' abilities to detect, propose remedies, and reason over relationships between different modality components.

10 retrieved papers
Can Refute
JSON-based debiasing method for multiple-choice evaluation

The authors introduce structured JSON-based answer representations (Evidence-Claim and Target-Action formats) that minimize linguistic biases in multiple-choice evaluation. This approach reduces models' reliance on superficial stylistic cues and suppresses choice-only shortcuts, compelling models to engage with actual multimodal content rather than exploiting surface patterns.

7 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PRISMM-Bench: A reviewer-sourced dataset of real multimodal inconsistencies in scientific papers

The authors propose a novel benchmark dataset containing 384 real-world inconsistencies across 353 scientific papers, sourced from peer reviews on OpenReview. In contrast to prior synthetic datasets, these are authentic reviewer-flagged errors spanning 15 categories of visual-textual and inter-visual mismatches.

Contribution

Three-task benchmark suite for multimodal inconsistency reasoning

The authors design three multiple-choice tasks: Inconsistency Identification (detecting what the inconsistency is), Inconsistency Remedy (determining how to fix it), and Inconsistency Pair Match (identifying which two elements conflict). These tasks form a tiered framework evaluating models' abilities to detect, propose remedies, and reason over relationships between different modality components.
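The tiered evaluation described above can be pictured as an ordinary multiple-choice scoring loop with per-task accuracy. The sketch below is an illustrative assumption, not the benchmark's actual interface: the item fields (`task`, `options`, `gold`) and the answerer callback are hypothetical names chosen for this example.

```python
import random

def evaluate(items, answer_fn):
    """Return per-task accuracy. `answer_fn(item)` is expected to
    return the index of the chosen option for that item."""
    correct, total = {}, {}
    for item in items:
        task = item["task"]
        total[task] = total.get(task, 0) + 1
        if answer_fn(item) == item["gold"]:
            correct[task] = correct.get(task, 0) + 1
    return {t: correct.get(t, 0) / n for t, n in total.items()}

# Toy items for the three tasks; four options each, so random guessing
# lands near 25% per task, far below the gap a capable model should show.
items = [
    {"task": "identification", "options": ["A", "B", "C", "D"], "gold": 2},
    {"task": "remedy", "options": ["A", "B", "C", "D"], "gold": 0},
    {"task": "pair_match", "options": ["A", "B", "C", "D"], "gold": 3},
]

guesser = lambda item: random.randrange(len(item["options"]))
oracle = lambda item: item["gold"]

print(evaluate(items, oracle))  # the oracle scores 1.0 on every task
```

Reporting accuracy separately per task (rather than one pooled number) preserves the tiered framing: a model can succeed at identification while failing at remedy or pair matching.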

Contribution

JSON-based debiasing method for multiple-choice evaluation

The authors introduce structured JSON-based answer representations (Evidence-Claim and Target-Action formats) that minimize linguistic biases in multiple-choice evaluation. This approach reduces models' reliance on superficial stylistic cues and suppresses choice-only shortcuts, compelling models to engage with actual multimodal content rather than exploiting surface patterns.
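To make the debiasing idea concrete, the sketch below shows what JSON-structured answer options in the spirit of the Evidence-Claim and Target-Action formats might look like. The field names (`evidence`, `claim`, `target`, `action`) and the example contents are assumptions for illustration; the benchmark's actual schemas are not specified in this report.

```python
import json

# Hypothetical Evidence-Claim option: the choice pairs a located piece
# of evidence with the claim it supports, instead of free-form prose.
evidence_claim_option = {
    "evidence": "Table 2, row 'Ours', column 'Accuracy': 91.3",
    "claim": "The text reports 93.1% accuracy, contradicting Table 2.",
}

# Hypothetical Target-Action option: a remedy names the element to
# change and the edit to apply, again avoiding stylistic prose cues.
target_action_option = {
    "target": "Section 4.2, paragraph 1",
    "action": "Change '93.1%' to '91.3%' to match Table 2.",
}

def serialize_options(options):
    """Render every option as canonical JSON (sorted keys) so all
    choices share one uniform surface form, leaving no length or
    style differences for a choice-only shortcut to exploit."""
    return [json.dumps(o, sort_keys=True) for o in options]

for line in serialize_options([evidence_claim_option, target_action_option]):
    print(line)
```

The intended effect is that distinguishing options requires checking the cited evidence or target against the paper's actual content, rather than picking the most fluent-sounding sentence.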