How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: vision benchmark, multimodal foundation models, vision language models, standard computer vision tasks
Abstract:

Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants).

The main challenges in doing this are: 1) most models are trained to output text and cannot natively express diverse output domains, such as segmentation masks or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable, API-compatible tasks via prompt chaining, creating a standardized benchmarking framework.

We observe that 1) the models are not close to state-of-the-art specialist models on any task, and 2) they perform semantic tasks notably better than geometric ones. However, 3) they are respectable generalists; this is remarkable given that they are presumably trained primarily on image-text tasks. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs best among non-reasoning models, securing the top position in 4 out of 6 tasks, and 6) reasoning models, e.g., o3, show improvements on geometric tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper benchmarks popular multimodal foundation models (GPT-4o, Gemini, Claude, etc.) on standard computer vision tasks using a prompt chaining framework to translate vision outputs into text-compatible formats. It resides in the 'Standard Computer Vision Task Evaluation' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The sibling papers focus on detection/segmentation surveys and feature upsampling methods, suggesting this leaf emphasizes systematic performance measurement on canonical vision benchmarks rather than architectural innovation or domain specialization.

The taxonomy reveals a rich ecosystem with parallel branches: architecture design (InternVL, CoCa), domain-specific models (medical imaging, document understanding), and task-specific adaptations (open-vocabulary detection, in-context learning). This paper's leaf sits within 'Evaluation Benchmarks and Performance Analysis,' adjacent to leaves examining perception-demanding tasks (depth, correspondence) and multimodal reasoning benchmarks. The scope note explicitly excludes specialized or low-level vision tasks, positioning this work as focused on established high-level vision problems (segmentation, detection, classification) rather than fine-grained perceptual abilities or expert-domain applications.

Among thirty candidates examined, none clearly refute any of the three contributions: the prompt chaining framework (ten candidates, zero refutable), the standardized benchmarking methodology (ten candidates, zero refutable), and the empirical evaluation revealing performance gaps (ten candidates, zero refutable). This suggests that within the limited search scope, the specific combination of prompt chaining for API-only models and systematic comparison against vision specialists on standard benchmarks appears relatively underexplored. However, the search examined only top-K semantic matches plus citation expansion, not an exhaustive literature review.

Given the sparse leaf (three papers) and zero refutable candidates across thirty examined, the work appears to occupy a distinct niche: adapting proprietary text-output models to standard vision tasks via prompting. The taxonomy shows crowded activity in architecture design and domain applications, but less saturation in systematic evaluation of general-purpose models on canonical benchmarks. The analysis is limited to the examined candidate set and does not cover all possible prior work in vision-language evaluation or prompting strategies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating multimodal foundation models on standard computer vision tasks.

The field has evolved into a rich ecosystem organized around several complementary dimensions. At the highest level, one branch focuses on Multimodal Foundation Model Architectures and Pre-training, exploring how models like InternVL[1], mPLUG-2[2], FLAVA[9], and CoCa[10] integrate vision and language through diverse pre-training strategies. A second major branch, Evaluation Benchmarks and Performance Analysis, develops systematic protocols and datasets, such as Q-bench[4] and Blink[5], to rigorously assess model capabilities on standard vision tasks. Meanwhile, Domain-Specific Multimodal Foundation Models adapt these architectures to specialized areas like biomedicine (Biomedical Foundation[8], EyeFound[25]) and dermatology (Dermatology Foundation[15]), and Task-Specific Applications and Adaptations address challenges in areas ranging from video understanding (VideoLLaMA[6]) to detection and segmentation (Detection Segmentation Review[23]). Additional branches examine Security, Robustness, and Model Comparison, including adversarial concerns (TrojVLM[39]) and fairness (Fairness Evaluation[13]), as well as Surveys and Literature Reviews that synthesize progress (VLM Survey[7]).

Within this landscape, a particularly active line of work centers on establishing robust evaluation protocols for standard computer vision benchmarks, where GPT-4o Vision[0] situates itself. This cluster emphasizes systematic performance measurement across canonical tasks such as image classification, object detection, and segmentation, often comparing newer multimodal models against traditional vision-only baselines. GPT-4o Vision[0] aligns closely with efforts like Detection Segmentation Review[23], which surveys detection and segmentation methods, and Feature Upsampling[16], which explores architectural refinements for dense prediction tasks.
In contrast to domain-specialized models (e.g., Biomedical Foundation[8]) or those targeting novel reasoning capabilities (Thinking in Space[3]), this evaluation-focused thread prioritizes breadth and reproducibility on well-established benchmarks, helping the community understand how general-purpose multimodal architectures perform when applied to core vision problems without extensive task-specific tuning.

Claimed Contributions

Prompt chaining framework for benchmarking MFMs on standard vision tasks

The authors develop a prompt chaining framework that translates standard computer vision tasks into text-promptable, API-compatible formats. This enables benchmarking of multimodal foundation models on classical vision tasks such as semantic segmentation, object detection, classification, depth prediction, and surface normal estimation using established datasets like COCO and ImageNet.
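The chaining idea can be sketched in a few lines. The snippet below is a minimal illustration under our own assumptions, not the paper's actual pipeline: `query_model` is a hypothetical stand-in for an API call that already carries the image, and the two-stage coarse-to-fine prompt is one plausible way to fit a large label space (e.g., ImageNet's 1000 classes) into text-only exchanges.

```python
from typing import Callable

# Hypothetical sketch of prompt chaining for image classification.
# `query_model` and `label_groups` are illustrative names, not from the paper.
def classify_by_chaining(
    query_model: Callable[[str], str],
    label_groups: dict[str, list[str]],
) -> str:
    """Two-stage classification: pick a coarse group, then a fine label."""
    # Stage 1: ask the model to choose among coarse category groups.
    groups = list(label_groups)
    coarse = query_model(
        f"Which category best describes the image? Options: {', '.join(groups)}. "
        "Answer with exactly one option."
    ).strip()
    if coarse not in label_groups:
        coarse = groups[0]  # fall back if the reply is not a valid option
    # Stage 2: ask again, restricted to the chosen group's fine labels.
    fine_labels = label_groups[coarse]
    fine = query_model(
        f"Which label best describes the image? Options: {', '.join(fine_labels)}. "
        "Answer with exactly one option."
    ).strip()
    return fine if fine in fine_labels else fine_labels[0]
```

With a real service, `query_model` would wrap an API call with the image attached; the fallback branches guard against replies that fall outside the offered options, a failure mode any text-output benchmark harness has to handle.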

10 retrieved papers
Standardized benchmarking methodology enabling direct comparison with vision specialists

The framework provides a standardized method to measure and benchmark any MFM that can input images and output text. Crucially, this enables quantifiable and holistic understanding of MFMs' vision capabilities on various established vision tasks, as well as direct comparison with vision-only specialist models using standard task-specific metrics.
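As one concrete example of such a task-specific metric, mean IoU for semantic segmentation can be computed identically for an MFM's decoded predictions and a specialist model's outputs, which is what makes the comparison direct. This is a generic textbook implementation, not code from the paper:

```python
def mean_iou(pred: list[int], gt: list[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes.

    pred, gt: flat per-pixel class-id lists of equal length
    (e.g., a label map decoded from the model's text replies).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Scoring an MFM this way puts it on exactly the same scale as vision-only baselines, so the gap to specialists can be stated in standard units rather than via a bespoke rubric.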

10 retrieved papers
Comprehensive empirical evaluation revealing performance gaps and task-specific strengths

The authors conduct extensive experiments showing that current MFMs lag behind specialist models across all tasks, yet demonstrate respectable generalist capabilities. They reveal that MFMs perform significantly better on semantic tasks compared to geometric ones, with GPT-4o achieving top performance in 4 out of 6 tasks among non-reasoning models.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Prompt chaining framework for benchmarking MFMs on standard vision tasks

Contribution 2: Standardized benchmarking methodology enabling direct comparison with vision specialists

Contribution 3: Comprehensive empirical evaluation revealing performance gaps and task-specific strengths