OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Overview
Overall Novelty Assessment
The paper introduces OCR-Reasoning, a benchmark comprising 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical tasks in text-rich visual scenarios. It resides in the 'Specialized Task Benchmarks' leaf alongside two sibling papers (ConTextual and MCTBench), indicating a focused but not overcrowded research direction. The taxonomy spans 50 papers across the entire field, with this leaf containing only three works, suggesting that specialized benchmarks for text-rich reasoning remain relatively sparse compared to broader multimodal evaluation efforts.
The taxonomy reveals that OCR-Reasoning sits within the 'Evaluation Benchmarks and Datasets' branch, distinct from the 'Comprehensive Multimodal Understanding Benchmarks' leaf (which houses general-purpose suites like MMMU-Pro). Neighboring branches include 'Task-Specific Applications' (covering OCR, text-based VQA, and domain-specific reasoning) and 'Reasoning Mechanisms' (addressing chain-of-thought and multi-stage inference). The benchmark's emphasis on text-rich scenarios connects it to application-focused work in OCR and text localization, yet its evaluation-centric design keeps it separate from those implementation-oriented papers.
Among the top-30 candidates examined, the analysis identified 2 potentially refuting papers for the dual annotation scheme (reasoning processes plus final answers) and 1 for the systematic definition of text-rich reasoning abilities; the core benchmark contribution itself showed no clear refutations across the 10 papers examined for it. These statistics suggest that while the overall benchmark concept appears relatively novel within the limited search scope, the dual annotation approach and systematic ability taxonomy have more substantial prior work. The analysis does not claim exhaustive coverage; it reflects patterns among top-30 semantic matches and their citations.
Based on the limited literature search, the benchmark appears to occupy a moderately novel position in specialized text-rich evaluation. The taxonomy structure indicates this is an emerging rather than saturated area, though the dual annotation and systematic ability frameworks show partial overlap with existing work. The analysis covers top-30 candidates and does not account for potentially relevant papers outside this scope or in adjacent subfields.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce OCR-Reasoning, a benchmark containing 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing benchmarks that only provide final answers, this benchmark additionally provides detailed step-by-step reasoning processes for holistic assessment.
The benchmark provides annotations for both final answers and step-by-step reasoning processes, enabling comprehensive evaluation of MLLMs' reasoning capabilities rather than just answer accuracy. This distinguishes it from existing text-rich image understanding benchmarks that only annotate final answers; a schematic example record is sketched after this list.
The authors claim to be the first to concretely define various core sub-abilities (6 core reasoning abilities across 18 tasks) for text-rich image reasoning and provide a systematic evaluation framework. This addresses the gap in existing benchmarks that lack systematic assessment of reasoning capabilities in text-rich visual contexts.
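To make the dual annotation concrete, below is a minimal sketch of what a single benchmark record might look like. The field names (`question`, `final_answer`, `reasoning_steps`, `ability`, `task`), the file name, and the ability/task labels are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record structure for a dual-annotated OCR-Reasoning example.
# Field names and values are illustrative, not the dataset's actual schema.
example = {
    "image": "receipt_0421.png",       # text-rich input image (placeholder name)
    "question": "How much would two of the most expensive items cost?",
    "final_answer": "$25.98",          # answer-only annotation
    "reasoning_steps": [               # step-by-step process annotation
        "Locate the prices printed on the receipt.",
        "Identify the most expensive item: $12.99.",
        "Multiply by two: 2 x 12.99 = 25.98.",
    ],
    "ability": "numerical reasoning",  # one of the 6 core abilities (placeholder label)
    "task": "receipt understanding",   # one of the 18 tasks (placeholder label)
}
```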
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[17] ConTextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models
[49] MCTBench: Multimodal cognition towards text-rich visual scenes benchmark
Contribution Analysis
Detailed comparisons for each claimed contribution
OCR-Reasoning benchmark for text-rich image reasoning
The authors introduce OCR-Reasoning, a benchmark containing 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing benchmarks that only provide final answers, this benchmark additionally provides detailed step-by-step reasoning processes for holistic assessment. A minimal per-ability scoring sketch follows the reference list below.
[51] A survey on benchmarks of multimodal large language models
[52] A token-level text image foundation model for document understanding
[53] mPLUG-DocOwl: Modularized multimodal large language model for document understanding
[54] OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning
[55] ColPali: Efficient document retrieval with vision language models
[56] BLIVA: A simple multimodal LLM for better handling of text-rich visual questions
[57] VisuRiddles: Fine-grained perception is a primary bottleneck for multimodal large language models in abstract visual reasoning
[58] MedXpertQA: Benchmarking expert-level medical reasoning and understanding
[59] Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning
[60] OCRBench: On the hidden mystery of OCR in large multimodal models
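As a concrete illustration of how per-ability evaluation might work under such a benchmark, here is a minimal sketch, assuming the hypothetical record schema above and simple exact-match scoring; the benchmark's actual metric and answer normalization may differ.

```python
from collections import defaultdict

def per_ability_accuracy(records, predictions):
    """Answer-only accuracy, grouped by reasoning ability.

    `records` and `predictions` are parallel lists; the `ability` and
    `final_answer` fields follow the hypothetical schema sketched above.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["ability"]] += 1
        # Exact match after light normalization is a simplification;
        # the benchmark's actual scoring may be more forgiving.
        if pred.strip().lower() == rec["final_answer"].strip().lower():
            correct[rec["ability"]] += 1
    return {a: correct[a] / total[a] for a in total}
```

Grouping by the `ability` field is what turns a flat accuracy number into the per-ability breakdown the benchmark is designed to support.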
Dual annotation scheme with reasoning processes and final answers
The benchmark provides annotations for both final answers and step-by-step reasoning processes, enabling comprehensive evaluation of MLLMs' reasoning capabilities rather than just answer accuracy. This distinguishes it from existing text-rich image understanding benchmarks that only annotate final answers. An illustrative reasoning-step scoring sketch follows the reference list below.
[69] Measuring and improving chain-of-thought reasoning in vision-language models
[75] VisuLogic: A benchmark for evaluating visual reasoning in multi-modal large language models
[66] Insight-V: Exploring long-chain visual reasoning with multimodal large language models
[70] Vision-R1: Incentivizing reasoning capability in multimodal large language models
[71] MLLM-as-a-Judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark
[72] Visual cognition in multimodal large language models
[73] Commonsense reasoning for legged robot adaptation with vision-language models
[74] End-to-end chart summarization via visual chain-of-thought in vision-language models
[76] LLaVA-CoT: Let vision language models reason step-by-step
[77] Reasoning grasping via multimodal large language model
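Reasoning-process annotations only pay off if the evaluation can score a model's intermediate steps against the reference. The section does not specify the scorer; as one illustrative stand-in, the sketch below matches predicted steps to reference steps with bag-of-words Jaccard overlap. A real evaluation might use an LLM judge or semantic similarity instead; the function name and threshold are assumptions.

```python
def reasoning_step_recall(pred_steps, ref_steps, threshold=0.5):
    """Fraction of reference reasoning steps matched by some predicted step.

    Bag-of-words Jaccard overlap is a deliberately simple stand-in
    similarity; it is not the benchmark's actual scoring protocol.
    """
    def jaccard(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # A reference step counts as recalled if any predicted step
    # overlaps with it above the threshold.
    matched = sum(
        1 for ref in ref_steps
        if any(jaccard(ref, pred) >= threshold for pred in pred_steps)
    )
    return matched / len(ref_steps) if ref_steps else 0.0
```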
Systematic definition and evaluation of text-rich image reasoning abilities
The authors claim to be the first to concretely define various core sub-abilities (6 core reasoning abilities across 18 tasks) for text-rich image reasoning and provide a systematic evaluation framework. This addresses the gap in existing benchmarks that lack systematic assessment of reasoning capabilities in text-rich visual contexts.
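Putting the two hypothetical metrics together, a systematic framework of this kind could report answer accuracy and reasoning-step recall side by side for each ability. The sketch below reuses the `per_ability_accuracy` and `reasoning_step_recall` helpers assumed earlier; the benchmark's actual protocol may differ.

```python
def evaluate(records, pred_answers, pred_steps):
    """Per-ability report combining answer accuracy and step recall.

    Reuses the hypothetical schema and the per_ability_accuracy /
    reasoning_step_recall helpers sketched earlier; illustrative only.
    """
    report = {}
    for ability, acc in per_ability_accuracy(records, pred_answers).items():
        # Collect step-level recall for the records of this ability.
        recalls = [
            reasoning_step_recall(steps, rec["reasoning_steps"])
            for rec, steps in zip(records, pred_steps)
            if rec["ability"] == ability
        ]
        report[ability] = {
            "answer_accuracy": acc,
            "reasoning_recall": sum(recalls) / len(recalls),
        }
    return report
```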