PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Large Multimodal Models, Scientific document understanding, Evaluation benchmark
Abstract:

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1–54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants. We provide the source code and dataset viewer in the appendix, and will release the full source code, dataset, and annotation tool publicly upon acceptance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PRISMM-Bench, a benchmark dataset of real multimodal inconsistencies in scientific papers sourced from peer reviews, alongside three evaluation tasks (identification, remedy, pair matching) and a JSON-based debiasing method for multiple-choice evaluation. According to the taxonomy, this work resides in the 'Peer-Review-Grounded Inconsistency Benchmarking' leaf under 'Multimodal Inconsistency Detection and Correction in Scientific Documents'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this is a relatively sparse and emerging research direction within the broader field of multimodal inconsistency detection.

The taxonomy reveals that the broader parent category ('Multimodal Inconsistency Detection and Correction in Scientific Documents') includes a sibling leaf on 'Compositional Conflict Detection and Knowledge Debugging', which houses frameworks for diagnosing compositional failures and contradiction detection in multimodal documents. Meanwhile, neighboring branches address cross-modal knowledge conflicts in large multimodal models (e.g., parametric conflicts, temporal robustness) and application-specific mitigation (fake news detection, sentiment analysis). The scope notes clarify that peer-review-grounded benchmarking excludes synthetic conflict generation and general compositional failure diagnosis, positioning this work as focused on authentic, expert-annotated inconsistencies rather than artificially constructed errors.

Among the three contributions analyzed, the reviewer-sourced dataset was compared against ten candidates with zero refutable prior works, and the JSON-based debiasing method against seven candidates, also with zero refutations. The three-task benchmark suite was compared against ten candidates, one of which was flagged as potentially refutable; twenty-seven candidates were examined in total across the three contributions. These statistics indicate that, within the top-K semantic matches examined, the dataset and debiasing contributions appear relatively novel, whereas the task design has at least one overlapping prior effort. The analysis explicitly acknowledges its limited scope, covering a focused set of semantically similar papers rather than an exhaustive literature review.

Given the sparse taxonomy leaf (no sibling papers) and the limited refutation evidence across contributions, the work appears to occupy a relatively underexplored niche—peer-review-grounded multimodal inconsistency benchmarking in scientific documents. However, the analysis is constrained by the twenty-seven-candidate search scope and does not cover the full breadth of related work in compositional conflict detection or broader multimodal reasoning benchmarks. The findings suggest novelty in grounding and domain focus, though the task design shows some overlap with existing evaluation frameworks.

Taxonomy

32 Core-task Taxonomy Papers
3 Claimed Contributions
27 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Detecting and resolving multimodal inconsistencies in scientific papers. The field addresses situations where different modalities—text, images, tables, equations—present conflicting information, a challenge that spans multiple research communities. The taxonomy reveals several major branches: some focus on knowledge conflicts in large multimodal models (e.g., Multimodal Knowledge Conflicts[3], Vision Language Conflicts[5]), examining how pretrained systems handle contradictory cross-modal inputs; others target inconsistency detection and correction specifically within scientific documents, including peer-review-grounded benchmarking; a third set explores application-specific mitigation in domains like sentiment analysis (Conflict Aware Sentiment[17]) or medical imaging (Universal Medical Representation[11]); additional branches investigate fusion-level conflict resolution in perception tasks (BEV Cross Modal[4], Tracking Fusion Conflict[9]), noisy correspondence learning for retrieval (Noisy Correspondence Learning[10]), architectural optimization under modality imbalance (Modal Imbalance Mitigation[22]), and even cognitive models of cross-modal conflict (Crossmodal Conflict Resolution[26]).

A particularly active line of work centers on benchmarking and mitigating inconsistencies in vision-language models and scientific contexts. Crosscheck Bench[6] and PRISMM Bench[0] both provide evaluation frameworks, yet they differ in grounding: PRISMM Bench[0] leverages peer-review feedback to identify real-world inconsistencies in scientific manuscripts, offering a naturalistic testbed closely tied to document-level reasoning, whereas Crosscheck Bench[6] emphasizes broader cross-modal verification tasks. Nearby efforts like Seeing is Fixing[8] and Mitigating Multimodal Inconsistency[2] propose correction mechanisms, while Temporal Inconsistency Robustness[1] extends the problem to temporal dimensions. PRISMM Bench[0] thus occupies a niche at the intersection of scientific document analysis and multimodal conflict detection, distinguished by its use of expert reviewer annotations to ground inconsistency examples in authentic scholarly communication.

Claimed Contributions

PRISMM-Bench: A reviewer-sourced dataset of real multimodal inconsistencies in scientific papers

The authors propose a novel benchmark dataset containing 384 real-world inconsistencies across 353 scientific papers, sourced from peer reviews on OpenReview. In contrast to prior synthetic datasets, these are authentic reviewer-flagged errors spanning 15 categories of visual-textual and inter-visual mismatches.

10 retrieved papers
Three-task benchmark suite for multimodal inconsistency reasoning

The authors design three multiple-choice tasks: Inconsistency Identification (detecting what the inconsistency is), Inconsistency Remedy (determining how to fix it), and Inconsistency Pair Match (identifying which two elements conflict). These tasks form a tiered framework evaluating models' abilities to detect, propose remedies, and reason over relationships between different modality components.

10 retrieved papers
Can Refute
JSON-based debiasing method for multiple-choice evaluation

The authors introduce structured JSON-based answer representations (Evidence-Claim and Target-Action formats) that minimize linguistic biases in multiple-choice evaluation. This approach reduces models' reliance on superficial stylistic cues and suppresses choice-only shortcuts, compelling models to engage with actual multimodal content rather than exploiting surface patterns.

7 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PRISMM-Bench: A reviewer-sourced dataset of real multimodal inconsistencies in scientific papers

The authors propose a novel benchmark dataset containing 384 real-world inconsistencies across 353 scientific papers, sourced from peer reviews on OpenReview. In contrast to prior synthetic datasets, these are authentic reviewer-flagged errors spanning 15 categories of visual-textual and inter-visual mismatches.

Contribution

Three-task benchmark suite for multimodal inconsistency reasoning

The authors design three multiple-choice tasks: Inconsistency Identification (detecting what the inconsistency is), Inconsistency Remedy (determining how to fix it), and Inconsistency Pair Match (identifying which two elements conflict). These tasks form a tiered framework evaluating models' abilities to detect, propose remedies, and reason over relationships between different modality components.
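The tiered evaluation described above can be pictured as an ordinary multiple-choice scoring loop with per-task accuracy. The sketch below is an illustrative assumption, not the benchmark's actual interface: the item fields (`task`, `options`, `gold`) and the answerer callback are hypothetical names chosen for this example.

```python
import random

def evaluate(items, answer_fn):
    """Return per-task accuracy. `answer_fn(item)` is expected to
    return the index of the chosen option for that item."""
    correct, total = {}, {}
    for item in items:
        task = item["task"]
        total[task] = total.get(task, 0) + 1
        if answer_fn(item) == item["gold"]:
            correct[task] = correct.get(task, 0) + 1
    return {t: correct.get(t, 0) / n for t, n in total.items()}

# Toy items for the three tasks; four options each, so random guessing
# lands near 25% per task, far below the gap a capable model should show.
items = [
    {"task": "identification", "options": ["A", "B", "C", "D"], "gold": 2},
    {"task": "remedy", "options": ["A", "B", "C", "D"], "gold": 0},
    {"task": "pair_match", "options": ["A", "B", "C", "D"], "gold": 3},
]

guesser = lambda item: random.randrange(len(item["options"]))
oracle = lambda item: item["gold"]

print(evaluate(items, oracle))  # the oracle scores 1.0 on every task
```

Reporting accuracy separately per task (rather than one pooled number) preserves the tiered framing: a model can succeed at identification while failing at remedy or pair matching.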

Contribution

JSON-based debiasing method for multiple-choice evaluation

The authors introduce structured JSON-based answer representations (Evidence-Claim and Target-Action formats) that minimize linguistic biases in multiple-choice evaluation. This approach reduces models' reliance on superficial stylistic cues and suppresses choice-only shortcuts, compelling models to engage with actual multimodal content rather than exploiting surface patterns.
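To make the debiasing idea concrete, the sketch below shows what JSON-structured answer options in the spirit of the Evidence-Claim and Target-Action formats might look like. The field names (`evidence`, `claim`, `target`, `action`) and the example contents are assumptions for illustration; the benchmark's actual schemas are not specified in this report.

```python
import json

# Hypothetical Evidence-Claim option: the choice pairs a located piece
# of evidence with the claim it supports, instead of free-form prose.
evidence_claim_option = {
    "evidence": "Table 2, row 'Ours', column 'Accuracy': 91.3",
    "claim": "The text reports 93.1% accuracy, contradicting Table 2.",
}

# Hypothetical Target-Action option: a remedy names the element to
# change and the edit to apply, again avoiding stylistic prose cues.
target_action_option = {
    "target": "Section 4.2, paragraph 1",
    "action": "Change '93.1%' to '91.3%' to match Table 2.",
}

def serialize_options(options):
    """Render every option as canonical JSON (sorted keys) so all
    choices share one uniform surface form, leaving no length or
    style differences for a choice-only shortcut to exploit."""
    return [json.dumps(o, sort_keys=True) for o in options]

for line in serialize_options([evidence_claim_option, target_action_option]):
    print(line)
```

The intended effect is that distinguishing options requires checking the cited evidence or target against the paper's actual content, rather than picking the most fluent-sounding sentence.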