Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Models; Academic Paper Reasoning; Scan-Oriented Reasoning
Abstract:

With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on scholarly paper reasoning is largely confined to a search-oriented paradigm: reasoning is grounded in relevance retrieval against pre-specified targets, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for scholarly paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers as human researchers do, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn from 9 error families across 13 natural-science domains and 715 papers, and provides detailed annotations for evidence localization and reasoning traces, together with a unified evaluation protocol. We assess 15 models across 24 input configurations and conduct a fine-grained analysis of MLLM capabilities across error families. Across the board, retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks and underscoring the challenge posed by ScholScan. We expect ScholScan to serve as a representative benchmark for the scan-oriented task paradigm.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ScholScan, a benchmark for scan-oriented scholarly paper reasoning that asks models to read entire papers and identify consistency issues across nine error families. It resides in the 'Full-Document Scanning and Claim Verification' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader AI-based error detection landscape. This positioning suggests the work targets a specific gap: moving beyond search-oriented retrieval to researcher-style document understanding and cross-checking.

The taxonomy reveals that ScholScan's leaf sits within a larger branch of AI-based detection systems, which also includes specialized error detection tools (targeting genotyping, grammar, or database errors) and document classification systems. Neighboring branches address error taxonomies (statistical, citation, visualization errors), AI-generated content integrity, and publication processes. The scope note for the leaf explicitly excludes partial-document or abstract-only methods, clarifying that ScholScan's full-document scanning approach differentiates it from abstract-level screening work found elsewhere in the taxonomy.

Of the thirty candidate papers examined (ten per claimed contribution), the twenty compared against the benchmark itself and the scan-oriented task paradigm yielded no clear refutation (zero refutable candidates). The process-aware evaluation framework, however, encountered three refutable candidates among its ten, suggesting that fine-grained annotation and evaluation protocols have more substantial prior work. This pattern indicates that while the core task formulation appears relatively novel within the limited search scope, the evaluation methodology overlaps with existing practices in claim verification and evidence localization benchmarks.

Based on the top-thirty semantic matches examined, ScholScan appears to occupy a less crowded niche within full-document reasoning for scientific error detection. The analysis does not cover exhaustive literature beyond these candidates, and the taxonomy's sparse leaf structure (four papers) may reflect either genuine novelty or incomplete field coverage. The evaluation framework's overlap with prior work is notable but does not diminish the distinctiveness of the scan-oriented task paradigm itself.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Detecting scientific errors in academic papers through full-document scanning. The field encompasses a diverse landscape of approaches to understanding, identifying, and addressing errors in scholarly literature.

At the highest level, the taxonomy reveals several major branches: studies that catalog error types and taxonomies in scientific literature (e.g., Statistical Errors Review[3], Statistical Reporting Errors[37]); AI-based systems designed for automated error detection and verification (including full-document scanning methods like MultiVerS[31] and Advanced Evidence Retrieval[36]); investigations into AI-generated content and its implications for scientific integrity (AI Scientific Writing[1], AI Generated Text Gap[10]); analyses of research misconduct and retraction patterns (Social Science Retractions[4], Retraction Error Sources[32]); examinations of publication and peer review processes (Peer Review Crossroads[9], Rush to Publication[42]); human factors and educational interventions (Teaching Through Errors[25], Human Fallibility Scientists[46]); and domain-specific error studies across disciplines (Biomedical Manuscript Errors[35], Computational Science Error[19]). These branches collectively address both the technical challenge of error detection and the broader ecosystem of scientific quality control.

Within the AI-based detection systems branch, a particularly active line of work focuses on full-document scanning and claim verification, where methods must reason over entire papers rather than isolated snippets. Scan Oriented Paper Reasoning[0] sits squarely in this cluster, emphasizing comprehensive document analysis to identify errors that may only become apparent when considering context across sections. This contrasts with narrower approaches that target specific error types or rely on abstract-level screening (Abstract Error Analysis[15], Pediatric Abstract Quality[8]).
Nearby works like MultiVerS[31] and SemanticCite[41] similarly tackle evidence retrieval and citation verification at scale, though they may differ in whether they prioritize claim-level fact-checking or broader logical consistency. A key tension across these systems involves balancing thoroughness with computational feasibility, as full-document reasoning demands substantial resources while promising more nuanced error detection than surface-level checks.

Claimed Contributions

ScholScan benchmark for scan-oriented scholarly paper reasoning

The authors introduce ScholScan, a benchmark that evaluates multimodal large language models on scan-oriented tasks requiring full-document understanding and consistency checking across academic papers, rather than search-oriented retrieval of pre-specified targets. The benchmark comprises 1,800 questions from 715 papers spanning 13 natural-science domains and 9 error families.

10 retrieved papers
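As a rough illustration of what a ScholScan item might contain, here is a hypothetical schema based only on the description above (error family, domain, evidence localization, reasoning trace); the field names and structure are assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass, field

# Hypothetical item schema for a scan-oriented benchmark question.
# All field names are illustrative assumptions, not ScholScan's real format.
@dataclass
class ScanQuestion:
    paper_id: str                # source paper (one of the 715)
    domain: str                  # one of the 13 natural-science domains
    error_family: str            # one of the 9 error families
    question: str                # prompt asking the model to scan for an inconsistency
    evidence_spans: list = field(default_factory=list)   # annotated evidence locations
    reasoning_trace: list = field(default_factory=list)  # annotated reasoning steps
    answer: str = ""             # gold verdict / error description
```

Such a record would pair each question with the annotations needed for both outcome-based and process-aware evaluation.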
Scan-oriented task paradigm for academic paper understanding

The authors propose a new task paradigm that contrasts with existing search-oriented approaches by requiring models to proactively discover implicit problems without prespecified targets, emphasizing consistency verification through comprehensive document scanning rather than relevance-based retrieval.

10 retrieved papers
Process-aware evaluation framework with fine-grained annotations

The authors develop a structured evaluation protocol that goes beyond outcome-based metrics by providing detailed annotations for evidence localization and reasoning traces, allowing assessment of whether intermediate reasoning is evidentially grounded and logically valid.

10 retrieved papers
Can Refute (3 of the 10 retrieved papers are refutable candidates)
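To make the process-aware idea concrete, here is a toy scoring sketch that mixes outcome correctness with overlap on annotated evidence spans and reasoning steps. The weights and the Jaccard-overlap choice are illustrative assumptions, not the benchmark's actual protocol:

```python
def process_aware_score(pred_answer, pred_evidence, pred_trace,
                        gold_answer, gold_evidence, gold_trace,
                        w_outcome=0.5, w_evidence=0.25, w_trace=0.25):
    """Toy process-aware metric: a weighted mix of outcome match,
    evidence-localization overlap, and reasoning-trace overlap.
    Weights and set-based overlap are assumptions for illustration only."""
    outcome = 1.0 if pred_answer == gold_answer else 0.0

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0

    evidence = jaccard(pred_evidence, gold_evidence)  # grounded in the right spans?
    trace = jaccard(pred_trace, gold_trace)           # logically valid steps?
    return w_outcome * outcome + w_evidence * evidence + w_trace * trace
```

Under such a scheme, a model that guesses the right verdict but cites the wrong evidence scores strictly lower than one whose intermediate reasoning is also grounded, which is the distinction the process-aware framework is meant to capture.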

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ScholScan benchmark for scan-oriented scholarly paper reasoning


Contribution

Scan-oriented task paradigm for academic paper understanding


Contribution

Process-aware evaluation framework with fine-grained annotations
