Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Overview
Overall Novelty Assessment
The paper introduces Human-MME, a benchmark for evaluating multimodal large language models on human-centric scene understanding, spanning diverse visual domains and progressive evaluation dimensions from granular perception to causal reasoning. It resides in the Human-Centric Evaluation Benchmarks leaf, which contains seven papers total, indicating a moderately populated research direction. This leaf sits within the broader Benchmark Design and Evaluation Frameworks branch, suggesting the work contributes to an active area focused on systematic MLLM assessment rather than architectural innovation or domain-specific applications.
The taxonomy reveals neighboring leaves addressing General Multimodal Evaluation (seven papers on broad scene understanding) and Domain-Specific Evaluation (six papers on specialized tasks like autonomous driving). Human-MME diverges from these by concentrating exclusively on human-related perception and reasoning, excluding general compositional tasks or non-human domains. The scope note clarifies that human-centric benchmarks probe fine-grained attributes, social interactions, and contextual reasoning, distinguishing them from broader multimodal testbeds. This positioning suggests the work occupies a well-defined niche within the larger evaluation landscape.
Across the 28 candidates examined, the analysis found limited overlap with prior work. For the benchmark contribution, ten candidates were examined and one was flagged as a potential refutation; for the annotation pipeline contribution, ten candidates were likewise examined with one potential refutation. For the progressive evaluation dimensions, eight candidates were examined with no refutations. These figures indicate that, within the top-K semantic search scope, the contributions appear relatively distinct, though the small candidate pool means the search was not exhaustive. The benchmark and annotation-pipeline contributions show slightly more adjacent prior work than the progressive dimension framework.
Based on the limited search scope of 28 candidates, the work appears to occupy a moderately novel position within human-centric MLLM evaluation. The analysis covers top semantic matches and does not claim exhaustive coverage of all relevant prior work. The taxonomy structure suggests the paper contributes to an established but not overcrowded research direction, with sibling papers addressing complementary aspects of human-centric understanding rather than directly overlapping contributions.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Human-MME, a comprehensive benchmark that evaluates multimodal large language models on human-centric image understanding. It features diverse human scenes across 43 sub-fields, progressive evaluation dimensions from granular perception to higher-level reasoning, and 19,945 real-world image-question pairs with rich question formats.
The authors develop an automated annotation pipeline that extracts fine-grained human features (bounding boxes, facial attributes, body parts, human-object interactions) and a Gradio-based manual adjustment platform that enables expert annotators to refine and verify annotations efficiently, ensuring high-quality data.
The authors design eight evaluation dimensions that progressively assess models from fine-grained perception (face, body, human-object interaction) to complex reasoning (multi-person, multi-image, intention, emotion, causal discrimination). They introduce diverse question formats, including choice, short-answer, grounding, ranking, and judgment questions.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data
[14] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
[21] HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
[24] A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era
[35] FaceXBench: Evaluating Multimodal LLMs on Face Understanding
[47] ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Contribution Analysis
Detailed comparisons for each claimed contribution
Human-MME benchmark for human-centric MLLM evaluation
The authors introduce Human-MME, a comprehensive benchmark that evaluates multimodal large language models on human-centric image understanding. It features diverse human scenes across 43 sub-fields, progressive evaluation dimensions from granular perception to higher-level reasoning, and 19,945 real-world image-question pairs with rich question formats.
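To make the benchmark's structure concrete, the sketch below shows one plausible way a single image-question pair could be represented. The HumanMMEItem class, its field names, and the example values are illustrative assumptions rather than the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for one Human-MME image-question pair. The paper only
# states that items span 43 sub-fields, eight evaluation dimensions, and
# several question formats; everything below is an illustrative assumption.
@dataclass
class HumanMMEItem:
    image_path: str                      # real-world image containing people
    sub_field: str                       # one of the 43 visual sub-fields, e.g. "sports"
    dimension: str                       # evaluation dimension, e.g. "face", "intention"
    question_format: str                 # "choice" | "short_answer" | "grounding" | "ranking" | "judgment"
    question: str
    options: Optional[list[str]] = None  # populated for choice/ranking questions
    answer: str = ""                     # gold answer (letter, phrase, box, order, or yes/no)
    person_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

# Example instance for a multiple-choice perception question.
item = HumanMMEItem(
    image_path="images/sports/000123.jpg",
    sub_field="sports",
    dimension="human-object interaction",
    question_format="choice",
    question="What is the person in the red jersey holding?",
    options=["A. a basketball", "B. a racket", "C. a water bottle", "D. nothing"],
    answer="A",
    person_boxes=[(120.0, 44.0, 310.0, 420.0)],
)
```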
[61] HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
[14] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
[31] HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
[44] HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
[62] SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
[63] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
[64] OpenScene: 3D Scene Understanding with Open Vocabularies
[65] Situational Scene Graph for Structured Human-Centric Situation Understanding
[66] UniHCP: A Unified Model for Human-Centric Perceptions
[67] @BENCH: Benchmarking Vision-Language Models for Human-Centered Assistive Technology
Automated annotation pipeline and manual adjustment platform
The authors develop an automated annotation pipeline that extracts fine-grained human features (bounding boxes, facial attributes, body parts, human-object interactions) and a Gradio-based manual adjustment platform that enables expert annotators to refine and verify annotations efficiently, ensuring high-quality data.
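As a rough illustration of how such a manual adjustment step might look, the snippet below sketches a minimal Gradio review interface: it shows an image next to its auto-extracted annotation and lets an annotator edit and save the JSON. The layout, default annotation fields, and the save_annotation helper are assumptions; the source only states that the platform is Gradio-based.

```python
import json
import gradio as gr

# Minimal sketch of a review tool in the spirit of a Gradio-based manual
# adjustment platform. The exact interface is not described in the source;
# the layout, default fields, and save_annotation are assumptions.

def save_annotation(image_path: str, annotation_json: str) -> str:
    """Parse the (possibly edited) annotation and write it next to the image."""
    if not image_path:
        return "No image loaded; nothing saved."
    try:
        annotation = json.loads(annotation_json)
    except json.JSONDecodeError as err:
        return f"Invalid JSON, nothing saved: {err}"
    out_path = image_path.rsplit(".", 1)[0] + ".json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(annotation, f, indent=2, ensure_ascii=False)
    return f"Saved {out_path}"

DEFAULT = json.dumps(
    {"person_boxes": [], "facial_attributes": {}, "body_parts": {}, "interactions": []},
    indent=2,
)

with gr.Blocks() as demo:
    gr.Markdown("## Human annotation review (illustrative sketch)")
    with gr.Row():
        image = gr.Image(type="filepath", label="Image under review")
        annotation_box = gr.Textbox(
            value=DEFAULT, lines=14, label="Auto-extracted annotation (editable JSON)"
        )
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Save verified annotation").click(
        save_annotation, inputs=[image, annotation_box], outputs=status
    )

if __name__ == "__main__":
    demo.launch()
```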
[52] Semi-Automation of Gesture Annotation by Machine Learning and Human Collaboration
[51] MaintIE: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Maintenance Short Texts
[53] Discovering Localized Attributes for Fine-Grained Recognition
[54] Human-Machine Collaboration on Data Annotation of Images by Semi-Automatic Labeling
[55] Capturing Fine-Grained Details for Video-Based Automation of Suturing Skills Assessment
[56] A Semi-Automatic Annotation Framework for Neutrophil Ultrastructure from TEM Images
[57] A Scoping Review of Automatic and Semi-Automatic MRI Segmentation in Human Brain Imaging
[58] Surveillance Video Querying with a Human-in-the-Loop
[59] Computer-Aided Cephalometric Landmark Annotation for CBCT Data
[60] of Non-manual and Spatial Features in Pidgin Sign Japanese for SLR
Progressive evaluation dimensions and diverse question paradigms
The authors design eight evaluation dimensions that progressively assess models from fine-grained perception (face, body, human-object interaction) to complex reasoning (multi-person, multi-image, intention, emotion, causal discrimination). They introduce diverse question formats, including choice, short-answer, grounding, ranking, and judgment questions.
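Because each evaluation dimension is paired with several question formats, each format naturally needs its own scoring rule. The sketch below shows one plausible per-format scorer; the binary scoring, the 0.5 IoU threshold for grounding, and the exact string-matching rules are assumptions, since the benchmark's metrics are not specified here.

```python
# Illustrative per-format scoring; every rule below is an assumption.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def score(question_format: str, prediction, gold) -> float:
    """Return 1.0 for a correct answer and 0.0 otherwise (assumed binary scoring)."""
    if question_format == "choice":
        return float(prediction.strip().upper()[:1] == gold.strip().upper()[:1])
    if question_format == "short_answer":
        return float(prediction.strip().lower() == gold.strip().lower())
    if question_format == "grounding":
        return float(iou(prediction, gold) >= 0.5)  # assumed IoU threshold
    if question_format == "ranking":
        return float(list(prediction) == list(gold))
    if question_format == "judgment":
        return float(prediction.strip().lower() == gold.strip().lower())  # "yes"/"no"
    raise ValueError(f"unknown question format: {question_format}")

# Example: a grounding prediction that overlaps the gold box well enough.
print(score("grounding", (100, 40, 300, 420), (110, 50, 310, 430)))  # 1.0
```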