Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Models; Academic Paper Reasoning; Scan-Oriented Reasoning
Abstract:

With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on scholarly paper reasoning is largely confined to a search-oriented paradigm: reasoning is grounded in relevance retrieval against pre-specified targets, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for scholarly paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers as human researchers do, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn from 9 error families across 13 natural-science domains and 715 papers, and provides detailed annotations for evidence localization and reasoning traces, together with a unified evaluation protocol. We assess 15 models across 24 input configurations and conduct a fine-grained analysis of MLLM capabilities across error families. Across the board, retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks and underscoring the challenge posed by ScholScan. We expect ScholScan to serve as a representative benchmark for the scan-oriented task paradigm.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ScholScan, a benchmark for scan-oriented scholarly paper reasoning that asks models to read entire papers and identify consistency issues across nine error families. It resides in the 'Full-Document Scanning and Claim Verification' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader AI-based error detection landscape. This positioning suggests the work targets a specific gap: moving beyond search-oriented retrieval to researcher-style document understanding and cross-checking.

The taxonomy reveals that ScholScan's leaf sits within a larger branch of AI-based detection systems, which also includes specialized error detection tools (targeting genotyping, grammar, or database errors) and document classification systems. Neighboring branches address error taxonomies (statistical, citation, visualization errors), AI-generated content integrity, and publication processes. The scope note for the leaf explicitly excludes partial-document or abstract-only methods, clarifying that ScholScan's full-document scanning approach differentiates it from abstract-level screening work found elsewhere in the taxonomy.

Of the thirty candidate papers examined (ten per claimed contribution), the twenty compared against the benchmark itself and the scan-oriented task paradigm yielded no clear refutation (zero refutable candidates). The process-aware evaluation framework, however, encountered three refutable candidates among its ten, suggesting that fine-grained annotation and evaluation protocols have more substantial prior work. This pattern indicates that while the core task formulation appears relatively novel within the limited search scope, the evaluation methodology overlaps with existing practices in claim verification and evidence localization benchmarks.

Based on the top-thirty semantic matches examined, ScholScan appears to occupy a less crowded niche within full-document reasoning for scientific error detection. The analysis does not cover exhaustive literature beyond these candidates, and the taxonomy's sparse leaf structure (four papers) may reflect either genuine novelty or incomplete field coverage. The evaluation framework's overlap with prior work is notable but does not diminish the distinctiveness of the scan-oriented task paradigm itself.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Detecting scientific errors in academic papers through full-document scanning. The field encompasses a diverse landscape of approaches to understanding, identifying, and addressing errors in scholarly literature.

At the highest level, the taxonomy reveals several major branches: studies that catalog error types and taxonomies in scientific literature (e.g., Statistical Errors Review[3], Statistical Reporting Errors[37]); AI-based systems designed for automated error detection and verification (including full-document scanning methods like MultiVerS[31] and Advanced Evidence Retrieval[36]); investigations into AI-generated content and its implications for scientific integrity (AI Scientific Writing[1], AI Generated Text Gap[10]); analyses of research misconduct and retraction patterns (Social Science Retractions[4], Retraction Error Sources[32]); examinations of publication and peer review processes (Peer Review Crossroads[9], Rush to Publication[42]); human factors and educational interventions (Teaching Through Errors[25], Human Fallibility Scientists[46]); and domain-specific error studies across disciplines (Biomedical Manuscript Errors[35], Computational Science Error[19]). These branches collectively address both the technical challenge of error detection and the broader ecosystem of scientific quality control.

Within the AI-based detection systems branch, a particularly active line of work focuses on full-document scanning and claim verification, where methods must reason over entire papers rather than isolated snippets. Scan Oriented Paper Reasoning[0] sits squarely in this cluster, emphasizing comprehensive document analysis to identify errors that may only become apparent when considering context across sections. This contrasts with narrower approaches that target specific error types or rely on abstract-level screening (Abstract Error Analysis[15], Pediatric Abstract Quality[8]).
Nearby works like MultiVerS[31] and SemanticCite[41] similarly tackle evidence retrieval and citation verification at scale, though they may differ in whether they prioritize claim-level fact-checking or broader logical consistency. A key tension across these systems involves balancing thoroughness with computational feasibility, as full-document reasoning demands substantial resources while promising more nuanced error detection than surface-level checks.

Claimed Contributions

ScholScan benchmark for scan-oriented scholarly paper reasoning

The authors introduce ScholScan, a benchmark that evaluates multimodal large language models on scan-oriented tasks requiring full-document understanding and consistency checking across academic papers, rather than search-oriented retrieval of pre-specified targets. The benchmark comprises 1,800 questions from 715 papers spanning 13 natural-science domains and 9 error families.

10 retrieved papers
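As a rough illustration of what a ScholScan item might contain, here is a hypothetical schema based only on the description above (error family, domain, evidence localization, reasoning trace); the field names and structure are assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass, field

# Hypothetical item schema for a scan-oriented benchmark question.
# All field names are illustrative assumptions, not ScholScan's real format.
@dataclass
class ScanQuestion:
    paper_id: str                # source paper (one of the 715)
    domain: str                  # one of the 13 natural-science domains
    error_family: str            # one of the 9 error families
    question: str                # prompt asking the model to scan for an inconsistency
    evidence_spans: list = field(default_factory=list)   # annotated evidence locations
    reasoning_trace: list = field(default_factory=list)  # annotated reasoning steps
    answer: str = ""             # gold verdict / error description
```

Such a record would pair each question with the annotations needed for both outcome-based and process-aware evaluation.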
Scan-oriented task paradigm for academic paper understanding

The authors propose a new task paradigm that contrasts with existing search-oriented approaches by requiring models to proactively discover implicit problems without prespecified targets, emphasizing consistency verification through comprehensive document scanning rather than relevance-based retrieval.

10 retrieved papers
Process-aware evaluation framework with fine-grained annotations

The authors develop a structured evaluation protocol that goes beyond outcome-based metrics by providing detailed annotations for evidence localization and reasoning traces, allowing assessment of whether intermediate reasoning is evidentially grounded and logically valid.

10 retrieved papers
Can Refute (3 of the 10 retrieved papers are refutable candidates)
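To make the process-aware idea concrete, here is a toy scoring sketch that mixes outcome correctness with overlap on annotated evidence spans and reasoning steps. The weights and the Jaccard-overlap choice are illustrative assumptions, not the benchmark's actual protocol:

```python
def process_aware_score(pred_answer, pred_evidence, pred_trace,
                        gold_answer, gold_evidence, gold_trace,
                        w_outcome=0.5, w_evidence=0.25, w_trace=0.25):
    """Toy process-aware metric: a weighted mix of outcome match,
    evidence-localization overlap, and reasoning-trace overlap.
    Weights and set-based overlap are assumptions for illustration only."""
    outcome = 1.0 if pred_answer == gold_answer else 0.0

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0

    evidence = jaccard(pred_evidence, gold_evidence)  # grounded in the right spans?
    trace = jaccard(pred_trace, gold_trace)           # logically valid steps?
    return w_outcome * outcome + w_evidence * evidence + w_trace * trace
```

Under such a scheme, a model that guesses the right verdict but cites the wrong evidence scores strictly lower than one whose intermediate reasoning is also grounded, which is the distinction the process-aware framework is meant to capture.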

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ScholScan benchmark for scan-oriented scholarly paper reasoning


Contribution

Scan-oriented task paradigm for academic paper understanding


Contribution

Process-aware evaluation framework with fine-grained annotations
