Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Benchmark, Multimodal Large Language Models
Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that account for both human-oriented granular perception and higher-dimensional causal reasoning. Building such high-quality benchmarks is difficult, given the physical complexity of the human body and the effort required to annotate granular structures. In this paper, we propose Human-MME, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with existing benchmarks, our work provides three key features: (1) Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. (2) Progressive and diverse evaluation dimensions, assessing human-centered activities progressively from human-oriented granular perception to higher-dimensional multi-target and causal reasoning, across eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. (3) High-quality annotations with rich data paradigms, combining an automated annotation pipeline with a human-annotation platform that supports rigorous manual labeling by expert annotators, enabling precise and reliable model assessment. Our benchmark extends single-person, single-image understanding to multi-person, multi-image mutual understanding through choice, short-answer, grounding, ranking, and judgment question components, as well as complex question-answer pairs that combine them. Extensive experiments on 20 state-of-the-art MLLMs expose their limitations and guide future MLLM research toward better human-centric image understanding and reasoning. Data and code are available at https://anonymous.4open.science/r/Human-MME-FDE7.
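To make the benchmark's composition concrete, the following is a minimal sketch of what a single Human-MME evaluation item might look like. It is an assumption-based illustration inferred from the abstract: the `HumanMMEItem` dataclass, its field names, and the dimension labels are hypothetical and are not taken from the benchmark's released data format.

```python
# Hypothetical record layout for one Human-MME image-question pair.
# Field names and label values are assumptions inferred from the abstract,
# not the benchmark's actual release format.
from dataclasses import dataclass, field
from typing import Optional

QUESTION_TYPES = {"choice", "short_answer", "grounding", "ranking", "judgment"}

@dataclass
class HumanMMEItem:
    item_id: str                    # unique identifier for the image-question pair
    image_paths: list[str]          # one path for single-image items, several for multi-image items
    dimension: str                  # e.g. "face_perception", "causal_reasoning" (assumed labels)
    question_type: str              # one of QUESTION_TYPES
    question: str                   # the natural-language question shown to the model
    options: Optional[list[str]] = None                      # present only for choice questions
    answer: str = ""                # gold answer (option letter, phrase, box string, ranking, or yes/no)
    boxes: list[list[float]] = field(default_factory=list)   # [x1, y1, x2, y2] boxes for grounding items

    def __post_init__(self) -> None:
        # reject malformed records early so the evaluation suite only sees valid items
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.question_type}")
```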

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Human-MME, a benchmark for evaluating multimodal large language models on human-centric scene understanding, spanning diverse visual domains and progressive evaluation dimensions from granular perception to causal reasoning. It resides in the Human-Centric Evaluation Benchmarks leaf, which contains seven papers total, indicating a moderately populated research direction. This leaf sits within the broader Benchmark Design and Evaluation Frameworks branch, suggesting the work contributes to an active area focused on systematic MLLM assessment rather than architectural innovation or domain-specific applications.

The taxonomy reveals neighboring leaves addressing General Multimodal Evaluation (seven papers on broad scene understanding) and Domain-Specific Evaluation (six papers on specialized tasks like autonomous driving). Human-MME diverges from these by concentrating exclusively on human-related perception and reasoning, excluding general compositional tasks or non-human domains. The scope note clarifies that human-centric benchmarks probe fine-grained attributes, social interactions, and contextual reasoning, distinguishing them from broader multimodal testbeds. This positioning suggests the work occupies a well-defined niche within the larger evaluation landscape.

Among the 28 candidates examined, the analysis found limited overlap with prior work. For the benchmark contribution, ten candidates were retrieved and one refutable match was identified; for the annotation pipeline contribution, ten candidates likewise yielded one refutable match. For the progressive evaluation dimensions contribution, eight candidates were retrieved with no refutations. These statistics indicate that, within the top-K semantic search scope, most contributions appear relatively distinct, though the small candidate pool means the search was not exhaustive. The benchmark and annotation pipeline contributions show slightly more overlapping prior work than the progressive dimension framework.

Based on the limited search scope of 28 candidates, the work appears to occupy a moderately novel position within human-centric MLLM evaluation. The analysis covers top semantic matches and does not claim exhaustive coverage of all relevant prior work. The taxonomy structure suggests the paper contributes to an established but not overcrowded research direction, with sibling papers addressing complementary aspects of human-centric understanding rather than directly overlapping contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating human-centric scene understanding in multimodal large language models.

The field has organized itself around several complementary directions. Benchmark Design and Evaluation Frameworks focuses on creating rigorous testbeds that probe models' abilities to interpret human-centered visual content, ranging from general human-centric benchmarks like HumanVBench[7] and HumaniBench[14] to specialized evaluations such as FaceXBench[35] and ActionArt[47]. Model Architecture and Visual Representation explores how different encoder designs and fusion strategies, exemplified by works like Cambrian[2] and DreamLLM[5], affect the quality of visual grounding. Application Domains and Task-Specific Methods targets concrete use cases, including autonomous driving (Traffic-IT[12], NuPlanQA[17]), robotics (Mobile Robot Navigation[45]), and accessibility (MAIDR AI[49]). Attention Mechanisms and Prompt Engineering investigates how guided visual search (Guided Visual Search[4]) and prompt-aware adapters (Prompt-Aware Adapter[29]) can steer model focus, while Video Understanding and Temporal Reasoning addresses dynamic scenes through methods like Video-of-Thought[22] and MME-VideoOCR[28].

Within the Benchmark Design branch, a particularly active line of work centers on human-centric evaluation benchmarks that stress-test models on fine-grained human attributes, social interactions, and contextual reasoning. Human-MME[0] sits squarely in this cluster, emphasizing comprehensive evaluation of human-centric scene understanding alongside neighbors like HumanVBench[7], which targets video-based human behavior, and HumanPCR[21], which focuses on person-centric reasoning. Compared to more general-purpose benchmarks such as SEED-Bench[23], these human-centric testbeds probe deeper into nuanced aspects of human appearance, activity, and intent. A key open question across these efforts is how to balance breadth, covering diverse human-centric phenomena, with depth in capturing subtle social cues and contextual dependencies, a trade-off that Human-MME[0] addresses by integrating multiple granularities of human-centered tasks into a unified evaluation framework.

Claimed Contributions

Human-MME benchmark for human-centric MLLM evaluation

The authors introduce Human-MME, a comprehensive benchmark that evaluates multimodal large language models on human-centric image understanding. It features diverse human scenes across 43 sub-fields, progressive evaluation dimensions from granular perception to higher-level reasoning, and 19,945 real-world image-question pairs with rich question formats.

10 retrieved papers (can refute)
Automated annotation pipeline and manual adjustment platform

The authors develop an automated annotation pipeline that extracts fine-grained human features (bounding boxes, facial attributes, body parts, human-object interactions) and a Gradio-based manual adjustment platform that enables expert annotators to refine and verify annotations efficiently, ensuring high-quality data.

10 retrieved papers (can refute)
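As a rough illustration of what the described manual adjustment platform could look like, the sketch below shows a minimal Gradio interface for loading, editing, and saving one annotation record at a time. It is an assumption-driven mock-up: the `ANNOTATION_FILE` path, the JSON layout, and the `load_annotation`/`save_annotation` helpers are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a Gradio-based annotation adjustment tool.
# File paths, field names, and helper functions are hypothetical;
# this is not the authors' released platform.
import json
import gradio as gr

ANNOTATION_FILE = "annotations.json"  # assumed flat JSON list of annotation records

def load_annotation(index: int):
    """Load one record so an annotator can inspect the image and its labels."""
    with open(ANNOTATION_FILE, "r", encoding="utf-8") as f:
        records = json.load(f)
    record = records[int(index)]
    return record["image_path"], json.dumps(record["labels"], indent=2)

def save_annotation(index: int, edited_labels: str):
    """Write the annotator's corrected labels back to the JSON file."""
    with open(ANNOTATION_FILE, "r", encoding="utf-8") as f:
        records = json.load(f)
    records[int(index)]["labels"] = json.loads(edited_labels)
    with open(ANNOTATION_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return "saved"

with gr.Blocks() as demo:
    index = gr.Number(value=0, precision=0, label="Record index")
    image = gr.Image(type="filepath", label="Image")
    labels = gr.Textbox(lines=12, label="Labels (editable JSON)")
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Load").click(load_annotation, inputs=index, outputs=[image, labels])
    gr.Button("Save").click(save_annotation, inputs=[index, labels], outputs=status)

if __name__ == "__main__":
    demo.launch()
```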
Progressive evaluation dimensions and diverse question paradigms

The authors design eight evaluation dimensions that progressively assess models from fine-grained perception (face, body, human-object interaction) to complex reasoning (multi-person, multi-image, intention, emotion, causal discrimination). They introduce diverse question formats including choice, short-answer, grounding, ranking, and judgment components.

8 retrieved papers (no refutation)
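The mix of question formats implies per-format scoring rules. The snippet below sketches one plausible way an evaluation suite could dispatch scoring by question type: exact match for choice and judgment, token overlap for short answers, IoU for grounding, and pairwise order agreement for ranking. The thresholds and metrics are assumptions for illustration, not the paper's official protocol.

```python
# Illustrative per-question-type scoring; thresholds and metrics are assumptions,
# not Human-MME's official evaluation protocol.
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def score(question_type, prediction, reference):
    """Return a score in [0, 1] for a single item, keyed by question type."""
    if question_type in ("choice", "judgment"):
        # exact match on the option letter or yes/no label
        return float(str(prediction).strip().lower() == str(reference).strip().lower())
    if question_type == "short_answer":
        # simple token-overlap recall as a stand-in for a fuzzy-match metric
        pred_tokens = set(str(prediction).lower().split())
        ref_tokens = set(str(reference).lower().split())
        return len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)
    if question_type == "grounding":
        # count the prediction as correct when the predicted box overlaps enough
        return float(iou(prediction, reference) >= 0.5)
    if question_type == "ranking":
        # fraction of concordant pairs; assumes prediction is a permutation of reference
        pairs = [(i, j) for i in range(len(reference)) for j in range(i + 1, len(reference))]
        pos = {item: k for k, item in enumerate(prediction)}
        agree = sum(pos[reference[i]] < pos[reference[j]] for i, j in pairs)
        return agree / max(len(pairs), 1)
    raise ValueError(f"unknown question type: {question_type}")
```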

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Human-MME benchmark for human-centric MLLM evaluation
Contribution 2: Automated annotation pipeline and manual adjustment platform
Contribution 3: Progressive evaluation dimensions and diverse question paradigms