OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multimodal slow-thinking systems; text-rich image understanding; reasoning models
Abstract:

Recent advances in multimodal slow-thinking systems have demonstrated remarkable performance across a range of visual reasoning tasks. However, their capabilities on text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a benchmark designed to systematically assess Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks. OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that annotate only a final answer, OCR-Reasoning additionally provides a detailed step-by-step reasoning process for each example. This dual annotation enables evaluation of both a model's final answers and its reasoning process, thereby offering a holistic assessment of text-rich reasoning capabilities. Leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results show that even the most advanced MLLMs struggle with text-rich image reasoning: none achieves an accuracy above 50% on our benchmark, indicating that text-rich image reasoning remains an urgent open challenge. The dataset and evaluation scripts will be made publicly available.
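As a rough illustration of how a dual-annotated example might be scored, the sketch below checks a model's final answer by normalized exact match and its rationale by recall of the annotated reasoning steps. The record fields (question, answer, reasoning_steps) and both scoring rules are assumptions made for this sketch only; the paper's actual data schema and evaluation scripts are not described in this report and may differ substantially (e.g., they may use an LLM judge rather than string matching).

```python
# Minimal sketch of evaluating a dual-annotated benchmark entry.
# The record layout and scoring rules are illustrative assumptions,
# not the OCR-Reasoning benchmark's actual schema or metrics.
from typing import TypedDict


class Example(TypedDict):
    question: str               # question about a text-rich image
    answer: str                 # human-annotated final answer
    reasoning_steps: list[str]  # human-annotated step-by-step rationale


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparison.
    return " ".join(text.lower().split())


def answer_correct(prediction: str, example: Example) -> bool:
    # Exact match after light normalization (a simple, common criterion).
    return normalize(prediction) == normalize(example["answer"])


def step_recall(model_rationale: str, example: Example) -> float:
    # Fraction of annotated reasoning steps that appear (normalized, verbatim)
    # in the model's rationale; a crude stand-in for reasoning-process scoring.
    rationale = normalize(model_rationale)
    steps = example["reasoning_steps"]
    hits = sum(normalize(step) in rationale for step in steps)
    return hits / len(steps) if steps else 0.0


if __name__ == "__main__":
    ex: Example = {
        "question": "What is the total on the receipt?",
        "answer": "$42.50",
        "reasoning_steps": ["subtotal is $40.00", "tax is $2.50", "40.00 + 2.50 = 42.50"],
    }
    pred_answer = "$42.50"
    pred_rationale = "Subtotal is $40.00, tax is $2.50, so 40.00 + 2.50 = 42.50."
    print(answer_correct(pred_answer, ex), round(step_recall(pred_rationale, ex), 2))
```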

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OCR-Reasoning, a benchmark comprising 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical tasks in text-rich visual scenarios. It resides in the 'Specialized Task Benchmarks' leaf alongside two sibling papers (Contextual and Mctbench), indicating a focused but not overcrowded research direction. The taxonomy shows 50 papers across the entire field, with this leaf containing only three works, suggesting that specialized benchmarks for text-rich reasoning remain relatively sparse compared to broader multimodal evaluation efforts.

The taxonomy reveals that OCR-Reasoning sits within the 'Evaluation Benchmarks and Datasets' branch, distinct from the 'Comprehensive Multimodal Understanding Benchmarks' leaf (which houses general-purpose suites like MMMU-Pro). Neighboring branches include 'Task-Specific Applications' (covering OCR, text-based VQA, and domain-specific reasoning) and 'Reasoning Mechanisms' (addressing chain-of-thought and multi-stage inference). The benchmark's emphasis on text-rich scenarios connects it to application-focused work in OCR and text localization, yet its evaluation-centric design keeps it separate from those implementation-oriented papers.

Among the 30 candidates examined, 2 were flagged as potentially refuting the dual annotation scheme (reasoning processes plus final answers) and 1 as potentially refuting the systematic definition of text-rich reasoning abilities. The core benchmark contribution itself showed no clear refutations across the 10 papers examined for it. These statistics suggest that while the overall benchmark concept appears relatively novel within the limited search scope, the dual annotation approach and the systematic ability taxonomy have more substantial prior work. The analysis does not claim exhaustive coverage; it reflects patterns among the top-30 semantic matches and their citations.

Based on the limited literature search, the benchmark appears to occupy a moderately novel position in specialized text-rich evaluation. The taxonomy structure indicates this is an emerging rather than saturated area, though the dual annotation and systematic ability frameworks show partial overlap with existing work. The analysis covers top-30 candidates and does not account for potentially relevant papers outside this scope or in adjacent subfields.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: text-rich image reasoning. This field addresses the challenge of understanding and reasoning over images that contain substantial textual content—such as documents, charts, infographics, web pages, and scene text—where models must integrate visual perception with language comprehension. The taxonomy organizes research into four main branches: Model Architecture and Training Paradigms explores foundational designs and learning strategies (e.g., Llavar[1], Scaling Text-Rich[4]); Reasoning Mechanisms and Cognitive Strategies examines how models perform multi-step inference and leverage structured knowledge (e.g., Textual-Visual Logic[5], Textcot[6]); Task-Specific Applications and Domains targets specialized use cases like medical imaging (Medisee[15]), mathematical reasoning (MathReal[36]), and web navigation (Webwatcher[38]); and Evaluation Benchmarks and Datasets provides resources to measure progress, including general-purpose suites (MMMU-Pro[13]) and specialized task benchmarks that assess targeted capabilities.

Within the evaluation landscape, a particularly active line of work focuses on specialized task benchmarks that probe specific reasoning skills beyond generic visual question answering. These benchmarks often emphasize the interplay between OCR accuracy and higher-level inference, testing whether models can extract text and then reason about relationships, logic, or context. OCR-Reasoning Benchmark[0] sits squarely in this cluster, designed to evaluate models on tasks that require both precise text recognition and subsequent reasoning steps. It shares thematic ground with Contextual[17] and Mctbench[49], which similarly target nuanced comprehension of text-rich content, though each emphasizes different facets—contextual understanding versus multi-choice reasoning formats. This specialization reflects a broader trend: as general-purpose models improve, the community increasingly values fine-grained diagnostics that reveal where text-image integration still falls short.

Claimed Contributions

OCR-Reasoning benchmark for text-rich image reasoning

The authors introduce OCR-Reasoning, a benchmark containing 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing benchmarks that only provide final answers, this benchmark additionally provides detailed step-by-step reasoning processes for holistic assessment.

10 retrieved papers
Dual annotation scheme with reasoning processes and final answers

The benchmark provides annotations for both final answers and step-by-step reasoning processes, enabling comprehensive evaluation of MLLMs' reasoning capabilities rather than just answer accuracy. This distinguishes it from existing text-rich image understanding benchmarks that only annotate final answers.

10 retrieved papers
Can Refute
Systematic definition and evaluation of text-rich image reasoning abilities

The authors claim to be the first to concretely define various core sub-abilities (6 core reasoning abilities across 18 tasks) for text-rich image reasoning and provide a systematic evaluation framework. This addresses the gap in existing benchmarks that lack systematic assessment of reasoning capabilities in text-rich visual contexts.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

OCR-Reasoning benchmark for text-rich image reasoning

The authors introduce OCR-Reasoning, a benchmark containing 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing benchmarks that only provide final answers, this benchmark additionally provides detailed step-by-step reasoning processes for holistic assessment.

Contribution

Dual annotation scheme with reasoning processes and final answers

The benchmark provides annotations for both final answers and step-by-step reasoning processes, enabling comprehensive evaluation of MLLMs' reasoning capabilities rather than just answer accuracy. This distinguishes it from existing text-rich image understanding benchmarks that only annotate final answers.

Contribution

Systematic definition and evaluation of text-rich image reasoning abilities

The authors claim to be the first to concretely define various core sub-abilities (6 core reasoning abilities across 18 tasks) for text-rich image reasoning and provide a systematic evaluation framework. This addresses the gap in existing benchmarks that lack systematic assessment of reasoning capabilities in text-rich visual contexts.
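To make the ability-level framing above concrete, the minimal sketch below aggregates per-ability accuracy, assuming each scored example carries tags for the ability and task it probes. The ability and task names used here are invented placeholders; this report does not enumerate the benchmark's actual 6 abilities or 18 tasks, and the real evaluation scripts may aggregate scores differently.

```python
# Sketch of per-ability score aggregation over tagged evaluation results.
# Tag values below are hypothetical placeholders, not the benchmark's taxonomy.
from collections import defaultdict


def aggregate_by_ability(results: list[dict]) -> dict[str, float]:
    """results: [{"ability": str, "task": str, "correct": bool}, ...]"""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["ability"]] += 1
        hits[r["ability"]] += int(r["correct"])
    # Per-ability accuracy = correct examples / total examples for that ability.
    return {ability: hits[ability] / totals[ability] for ability in totals}


results = [
    {"ability": "numerical_reasoning", "task": "receipt_math", "correct": True},
    {"ability": "numerical_reasoning", "task": "chart_trend", "correct": False},
    {"ability": "spatial_reasoning", "task": "layout_lookup", "correct": True},
]
print(aggregate_by_ability(results))
# e.g. {'numerical_reasoning': 0.5, 'spatial_reasoning': 1.0}
```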