Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Benchmark, Multimodal Large Language Models
Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that account for both human-oriented granular perception and higher-dimensional causal reasoning. Building such high-quality benchmarks is difficult, given the physical complexity of the human body and the effort required to annotate granular structures. In this paper, we propose Human-MME, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with existing benchmarks, our work provides three key features: (1) Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. (2) Progressive and diverse evaluation dimensions, assessing human-centered activities progressively from human-oriented granular perception to higher-dimensional multi-target and causal reasoning, across eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. (3) High-quality annotations with rich data paradigms, combining an automated annotation pipeline with a human-annotation platform that supports rigorous manual labeling by expert annotators, enabling precise and reliable model assessment. Our benchmark extends single-person, single-image understanding to multi-person, multi-image mutual understanding through choice, short-answer, grounding, ranking, and judgment question components, as well as complex question-answer pairs that combine them. Extensive experiments on 20 state-of-the-art MLLMs expose their limitations and guide future MLLM research toward better human-centric image understanding and reasoning. Data and code are available at https://anonymous.4open.science/r/Human-MME-FDE7.
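To make the benchmark's composition concrete, the following is a minimal sketch of what a single Human-MME evaluation item might look like. It is an assumption-based illustration inferred from the abstract: the `HumanMMEItem` dataclass, its field names, and the dimension labels are hypothetical and are not taken from the benchmark's released data format.

```python
# Hypothetical record layout for one Human-MME image-question pair.
# Field names and label values are assumptions inferred from the abstract,
# not the benchmark's actual release format.
from dataclasses import dataclass, field
from typing import Optional

QUESTION_TYPES = {"choice", "short_answer", "grounding", "ranking", "judgment"}

@dataclass
class HumanMMEItem:
    item_id: str                    # unique identifier for the image-question pair
    image_paths: list[str]          # one path for single-image items, several for multi-image items
    dimension: str                  # e.g. "face_perception", "causal_reasoning" (assumed labels)
    question_type: str              # one of QUESTION_TYPES
    question: str                   # the natural-language question shown to the model
    options: Optional[list[str]] = None                      # present only for choice questions
    answer: str = ""                # gold answer (option letter, phrase, box string, ranking, or yes/no)
    boxes: list[list[float]] = field(default_factory=list)   # [x1, y1, x2, y2] boxes for grounding items

    def __post_init__(self) -> None:
        # reject malformed records early so the evaluation suite only sees valid items
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.question_type}")
```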

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Human-MME, a benchmark for evaluating multimodal large language models on human-centric scene understanding, spanning diverse visual domains and progressive evaluation dimensions from granular perception to causal reasoning. It resides in the Human-Centric Evaluation Benchmarks leaf, which contains seven papers total, indicating a moderately populated research direction. This leaf sits within the broader Benchmark Design and Evaluation Frameworks branch, suggesting the work contributes to an active area focused on systematic MLLM assessment rather than architectural innovation or domain-specific applications.

The taxonomy reveals neighboring leaves addressing General Multimodal Evaluation (seven papers on broad scene understanding) and Domain-Specific Evaluation (six papers on specialized tasks like autonomous driving). Human-MME diverges from these by concentrating exclusively on human-related perception and reasoning, excluding general compositional tasks or non-human domains. The scope note clarifies that human-centric benchmarks probe fine-grained attributes, social interactions, and contextual reasoning, distinguishing them from broader multimodal testbeds. This positioning suggests the work occupies a well-defined niche within the larger evaluation landscape.

Among the 28 candidates examined, the analysis found limited overlap with prior work. For the benchmark contribution, ten candidates were retrieved and one refutable match was identified; for the annotation pipeline contribution, ten candidates likewise yielded one refutable match. For the progressive evaluation dimensions contribution, eight candidates were retrieved with no refutations. These statistics indicate that, within the top-K semantic search scope, most contributions appear relatively distinct, though the small candidate pool means the search was not exhaustive. The benchmark and annotation pipeline contributions show slightly more overlapping prior work than the progressive dimension framework.

Based on the limited search scope of 28 candidates, the work appears to occupy a moderately novel position within human-centric MLLM evaluation. The analysis covers top semantic matches and does not claim exhaustive coverage of all relevant prior work. The taxonomy structure suggests the paper contributes to an established but not overcrowded research direction, with sibling papers addressing complementary aspects of human-centric understanding rather than directly overlapping contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating human-centric scene understanding in multimodal large language models.

The field has organized itself around several complementary directions. Benchmark Design and Evaluation Frameworks focuses on creating rigorous testbeds that probe models' abilities to interpret human-centered visual content, ranging from general human-centric benchmarks like HumanVBench[7] and HumaniBench[14] to specialized evaluations such as FaceXBench[35] and ActionArt[47]. Model Architecture and Visual Representation explores how different encoder designs and fusion strategies, exemplified by works like Cambrian[2] and DreamLLM[5], affect the quality of visual grounding. Application Domains and Task-Specific Methods targets concrete use cases, including autonomous driving (Traffic-IT[12], NuPlanQA[17]), robotics (Mobile Robot Navigation[45]), and accessibility (MAIDR AI[49]). Attention Mechanisms and Prompt Engineering investigates how guided visual search (Guided Visual Search[4]) and prompt-aware adapters (Prompt-Aware Adapter[29]) can steer model focus, while Video Understanding and Temporal Reasoning addresses dynamic scenes through methods like Video-of-Thought[22] and MME-VideoOCR[28].

Within the Benchmark Design branch, a particularly active line of work centers on human-centric evaluation benchmarks that stress-test models on fine-grained human attributes, social interactions, and contextual reasoning. Human-MME[0] sits squarely in this cluster, emphasizing comprehensive evaluation of human-centric scene understanding alongside neighbors like HumanVBench[7], which targets video-based human behavior, and HumanPCR[21], which focuses on person-centric reasoning. Compared to more general-purpose benchmarks such as SEED-Bench[23], these human-centric testbeds probe deeper into nuanced aspects of human appearance, activity, and intent. A key open question across these efforts is how to balance breadth, covering diverse human-centric phenomena, with depth in capturing subtle social cues and contextual dependencies, a trade-off that Human-MME[0] addresses by integrating multiple granularities of human-centered tasks into a unified evaluation framework.

Claimed Contributions

Human-MME benchmark for human-centric MLLM evaluation

The authors introduce Human-MME, a comprehensive benchmark that evaluates multimodal large language models on human-centric image understanding. It features diverse human scenes across 43 sub-fields, progressive evaluation dimensions from granular perception to higher-level reasoning, and 19,945 real-world image-question pairs with rich question formats.

10 retrieved papers (can refute)
Automated annotation pipeline and manual adjustment platform

The authors develop an automated annotation pipeline that extracts fine-grained human features (bounding boxes, facial attributes, body parts, human-object interactions) and a Gradio-based manual adjustment platform that enables expert annotators to refine and verify annotations efficiently, ensuring high-quality data.

10 retrieved papers (can refute)
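As a rough illustration of what the described manual adjustment platform could look like, the sketch below shows a minimal Gradio interface for loading, editing, and saving one annotation record at a time. It is an assumption-driven mock-up: the `ANNOTATION_FILE` path, the JSON layout, and the `load_annotation`/`save_annotation` helpers are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a Gradio-based annotation adjustment tool.
# File paths, field names, and helper functions are hypothetical;
# this is not the authors' released platform.
import json
import gradio as gr

ANNOTATION_FILE = "annotations.json"  # assumed flat JSON list of annotation records

def load_annotation(index: int):
    """Load one record so an annotator can inspect the image and its labels."""
    with open(ANNOTATION_FILE, "r", encoding="utf-8") as f:
        records = json.load(f)
    record = records[int(index)]
    return record["image_path"], json.dumps(record["labels"], indent=2)

def save_annotation(index: int, edited_labels: str):
    """Write the annotator's corrected labels back to the JSON file."""
    with open(ANNOTATION_FILE, "r", encoding="utf-8") as f:
        records = json.load(f)
    records[int(index)]["labels"] = json.loads(edited_labels)
    with open(ANNOTATION_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return "saved"

with gr.Blocks() as demo:
    index = gr.Number(value=0, precision=0, label="Record index")
    image = gr.Image(type="filepath", label="Image")
    labels = gr.Textbox(lines=12, label="Labels (editable JSON)")
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Load").click(load_annotation, inputs=index, outputs=[image, labels])
    gr.Button("Save").click(save_annotation, inputs=[index, labels], outputs=status)

if __name__ == "__main__":
    demo.launch()
```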
Progressive evaluation dimensions and diverse question paradigms

The authors design eight evaluation dimensions that progressively assess models from fine-grained perception (face, body, human-object interaction) to complex reasoning (multi-person, multi-image, intention, emotion, causal discrimination). They introduce diverse question formats including choice, short-answer, grounding, ranking, and judgment components.

8 retrieved papers (no refutation)
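The mix of question formats implies per-format scoring rules. The snippet below sketches one plausible way an evaluation suite could dispatch scoring by question type: exact match for choice and judgment, token overlap for short answers, IoU for grounding, and pairwise order agreement for ranking. The thresholds and metrics are assumptions for illustration, not the paper's official protocol.

```python
# Illustrative per-question-type scoring; thresholds and metrics are assumptions,
# not Human-MME's official evaluation protocol.
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def score(question_type, prediction, reference):
    """Return a score in [0, 1] for a single item, keyed by question type."""
    if question_type in ("choice", "judgment"):
        # exact match on the option letter or yes/no label
        return float(str(prediction).strip().lower() == str(reference).strip().lower())
    if question_type == "short_answer":
        # simple token-overlap recall as a stand-in for a fuzzy-match metric
        pred_tokens = set(str(prediction).lower().split())
        ref_tokens = set(str(reference).lower().split())
        return len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)
    if question_type == "grounding":
        # count the prediction as correct when the predicted box overlaps enough
        return float(iou(prediction, reference) >= 0.5)
    if question_type == "ranking":
        # fraction of concordant pairs; assumes prediction is a permutation of reference
        pairs = [(i, j) for i in range(len(reference)) for j in range(i + 1, len(reference))]
        pos = {item: k for k, item in enumerate(prediction)}
        agree = sum(pos[reference[i]] < pos[reference[j]] for i, j in pairs)
        return agree / max(len(pairs), 1)
    raise ValueError(f"unknown question type: {question_type}")
```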

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Human-MME benchmark for human-centric MLLM evaluation
Contribution 2: Automated annotation pipeline and manual adjustment platform
Contribution 3: Progressive evaluation dimensions and diverse question paradigms