How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: vision benchmark, multimodal foundation models, vision language models, standard computer vision tasks
Abstract:

Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants).

The main challenges in doing this are: 1) most models are trained to output text and cannot natively express diverse output domains, such as segmentation masks or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable, API-compatible tasks via prompt chaining, creating a standardized benchmarking framework.

We observe that 1) the models are not close to state-of-the-art specialist models on any task, and 2) they perform semantic tasks notably better than geometric ones. However, 3) they are respectable generalists; this is remarkable given that they are presumably trained primarily on image-text tasks. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs best among non-reasoning models, securing the top position in 4 out of 6 tasks, and 6) reasoning models, e.g., o3, show improvements on geometric tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper benchmarks popular multimodal foundation models (GPT-4o, Gemini, Claude, etc.) on standard computer vision tasks using a prompt chaining framework to translate vision outputs into text-compatible formats. It resides in the 'Standard Computer Vision Task Evaluation' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The sibling papers focus on detection/segmentation surveys and feature upsampling methods, suggesting this leaf emphasizes systematic performance measurement on canonical vision benchmarks rather than architectural innovation or domain specialization.

The taxonomy reveals a rich ecosystem with parallel branches: architecture design (InternVL, CoCa), domain-specific models (medical imaging, document understanding), and task-specific adaptations (open-vocabulary detection, in-context learning). This paper's leaf sits within 'Evaluation Benchmarks and Performance Analysis,' adjacent to leaves examining perception-demanding tasks (depth, correspondence) and multimodal reasoning benchmarks. The scope note explicitly excludes specialized or low-level vision tasks, positioning this work as focused on established high-level vision problems (segmentation, detection, classification) rather than fine-grained perceptual abilities or expert-domain applications.

Among thirty candidates examined, none clearly refute any of the three contributions: the prompt chaining framework (ten candidates, zero refutable), the standardized benchmarking methodology (ten candidates, zero refutable), and the empirical evaluation revealing performance gaps (ten candidates, zero refutable). This suggests that within the limited search scope, the specific combination of prompt chaining for API-only models and systematic comparison against vision specialists on standard benchmarks appears relatively underexplored. However, the search examined only top-K semantic matches plus citation expansion, not an exhaustive literature review.

Given the sparse leaf (three papers) and zero refutable candidates across thirty examined, the work appears to occupy a distinct niche: adapting proprietary text-output models to standard vision tasks via prompting. The taxonomy shows crowded activity in architecture design and domain applications, but less saturation in systematic evaluation of general-purpose models on canonical benchmarks. The analysis is limited to the examined candidate set and does not cover all possible prior work in vision-language evaluation or prompting strategies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating multimodal foundation models on standard computer vision tasks.

The field has evolved into a rich ecosystem organized around several complementary dimensions. At the highest level, one branch focuses on Multimodal Foundation Model Architectures and Pre-training, exploring how models like InternVL[1], mPLUG-2[2], FLAVA[9], and CoCa[10] integrate vision and language through diverse pre-training strategies. A second major branch, Evaluation Benchmarks and Performance Analysis, develops systematic protocols and datasets, such as Q-bench[4] and Blink[5], to rigorously assess model capabilities on standard vision tasks. Meanwhile, Domain-Specific Multimodal Foundation Models adapt these architectures to specialized areas like biomedicine (Biomedical Foundation[8], EyeFound[25]) and dermatology (Dermatology Foundation[15]), and Task-Specific Applications and Adaptations address challenges in areas ranging from video understanding (VideoLLaMA[6]) to detection and segmentation (Detection Segmentation Review[23]). Additional branches examine Security, Robustness, and Model Comparison, including adversarial concerns (TrojVLM[39]) and fairness (Fairness Evaluation[13]), as well as Surveys and Literature Reviews that synthesize progress (VLM Survey[7]).

Within this landscape, a particularly active line of work centers on establishing robust evaluation protocols for standard computer vision benchmarks, where GPT-4o Vision[0] situates itself. This cluster emphasizes systematic performance measurement across canonical tasks such as image classification, object detection, and segmentation, often comparing newer multimodal models against traditional vision-only baselines. GPT-4o Vision[0] aligns closely with efforts like Detection Segmentation Review[23], which surveys detection and segmentation methods, and Feature Upsampling[16], which explores architectural refinements for dense prediction tasks.
In contrast to domain-specialized models (e.g., Biomedical Foundation[8]) or those targeting novel reasoning capabilities (Thinking in Space[3]), this evaluation-focused thread prioritizes breadth and reproducibility on well-established benchmarks, helping the community understand how general-purpose multimodal architectures perform when applied to core vision problems without extensive task-specific tuning.

Claimed Contributions

Prompt chaining framework for benchmarking MFMs on standard vision tasks

The authors develop a prompt chaining framework that translates standard computer vision tasks into text-promptable, API-compatible formats. This enables benchmarking of multimodal foundation models on classical vision tasks such as semantic segmentation, object detection, classification, depth prediction, and surface normal estimation using established datasets like COCO and ImageNet.
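The chaining idea can be sketched in a few lines. The snippet below is a minimal illustration under our own assumptions, not the paper's actual pipeline: `query_model` is a hypothetical stand-in for an API call that already carries the image, and the two-stage coarse-to-fine prompt is one plausible way to fit a large label space (e.g., ImageNet's 1000 classes) into text-only exchanges.

```python
from typing import Callable

# Hypothetical sketch of prompt chaining for image classification.
# `query_model` and `label_groups` are illustrative names, not from the paper.
def classify_by_chaining(
    query_model: Callable[[str], str],
    label_groups: dict[str, list[str]],
) -> str:
    """Two-stage classification: pick a coarse group, then a fine label."""
    # Stage 1: ask the model to choose among coarse category groups.
    groups = list(label_groups)
    coarse = query_model(
        f"Which category best describes the image? Options: {', '.join(groups)}. "
        "Answer with exactly one option."
    ).strip()
    if coarse not in label_groups:
        coarse = groups[0]  # fall back if the reply is not a valid option
    # Stage 2: ask again, restricted to the chosen group's fine labels.
    fine_labels = label_groups[coarse]
    fine = query_model(
        f"Which label best describes the image? Options: {', '.join(fine_labels)}. "
        "Answer with exactly one option."
    ).strip()
    return fine if fine in fine_labels else fine_labels[0]
```

With a real service, `query_model` would wrap an API call with the image attached; the fallback branches guard against replies that fall outside the offered options, a failure mode any text-output benchmark harness has to handle.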

10 retrieved papers
Standardized benchmarking methodology enabling direct comparison with vision specialists

The framework provides a standardized method to measure and benchmark any MFM that can input images and output text. Crucially, this enables quantifiable and holistic understanding of MFMs' vision capabilities on various established vision tasks, as well as direct comparison with vision-only specialist models using standard task-specific metrics.
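As one concrete example of such a task-specific metric, mean IoU for semantic segmentation can be computed identically for an MFM's decoded predictions and a specialist model's outputs, which is what makes the comparison direct. This is a generic textbook implementation, not code from the paper:

```python
def mean_iou(pred: list[int], gt: list[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes.

    pred, gt: flat per-pixel class-id lists of equal length
    (e.g., a label map decoded from the model's text replies).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Scoring an MFM this way puts it on exactly the same scale as vision-only baselines, so the gap to specialists can be stated in standard units rather than via a bespoke rubric.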

10 retrieved papers
Comprehensive empirical evaluation revealing performance gaps and task-specific strengths

The authors conduct extensive experiments showing that current MFMs lag behind specialist models across all tasks, yet demonstrate respectable generalist capabilities. They reveal that MFMs perform significantly better on semantic tasks compared to geometric ones, with GPT-4o achieving top performance in 4 out of 6 tasks among non-reasoning models.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Prompt chaining framework for benchmarking MFMs on standard vision tasks

Contribution 2: Standardized benchmarking methodology enabling direct comparison with vision specialists

Contribution 3: Comprehensive empirical evaluation revealing performance gaps and task-specific strengths