HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: MLLMs, Benchmark, Dataset, Humanities and Social Sciences
Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HSSBench, a benchmark evaluating multimodal large language models on humanities and social sciences tasks across multiple languages. It resides in the 'Humanities and Social Sciences Task Benchmarks' leaf, which contains five papers in total, including the original work. This leaf sits within the broader 'Domain-Specific Benchmark Development' branch, indicating a moderately populated research direction focused on creating specialized evaluation suites. The taxonomy suggests that although domain-specific benchmarking is active, this particular leaf is a focused effort to address HSS-specific evaluation needs rather than a crowded subfield.

The taxonomy structure shows that HSSBench's leaf neighbors include 'Cultural and Historical Knowledge Assessment' (five papers) and 'Social and Behavioral Understanding Benchmarks' (three papers), both emphasizing narrower aspects of humanistic evaluation. Nearby branches like 'Cross-Disciplinary Evaluation' contain multi-domain benchmarks (e.g., MMMU, CMMMU) that span STEM and humanities but lack HSS-specific depth. The taxonomy's scope notes clarify that HSSBench's comprehensive HSS focus distinguishes it from purely cultural heritage benchmarks or general expert-level assessments, positioning it at the intersection of breadth and domain specialization within the humanities evaluation landscape.

Across the thirty candidate papers examined (ten per claimed contribution), the 'HSSBench benchmark' contribution has one refutable candidate, suggesting that some prior work on comprehensive HSS benchmarking exists but is limited. The 'VQA Generation Pipeline for HSS scenarios' contribution found no refutable candidates among its ten, indicating potential novelty in the data generation methodology; the 'multilingual evaluation' contribution likewise found none. These statistics reflect a focused search scope rather than exhaustive coverage, and the single refutable instance likely represents overlap with existing multi-domain benchmarks that include HSS components.

Based on the limited search of thirty candidates, the work appears to occupy a relatively underexplored niche combining comprehensive HSS coverage with multilingual evaluation. The taxonomy context suggests that while related benchmarks exist, few target the specific intersection of broad humanities disciplines, social sciences reasoning, and multilingual assessment. The analysis does not capture potential work outside the top-thirty semantic matches or recent preprints, so the novelty assessment remains provisional pending broader literature review.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: Evaluating multimodal large language models on humanities and social sciences tasks. The field has organized itself around several complementary directions. Domain-Specific Benchmark Development focuses on creating targeted evaluation suites for humanities and social sciences, such as MMMU[5], Exams-v[16], and MMVU[17], which test models on discipline-specific knowledge and reasoning. Cross-Disciplinary Evaluation examines how models perform across multiple domains simultaneously, often revealing gaps in cultural or contextual understanding. Applied Domain Analysis investigates real-world applications in areas like urban studies, historical document processing, and cultural heritage, with works such as Urban Inequality Imagery[15] and HistBench HistAgent[7] demonstrating practical use cases. Methodological Frameworks and Approaches develop systematic ways to probe model capabilities, while Model Behavior and Bias Analysis scrutinizes fairness and representation issues through benchmarks like VLBiasBench[41]. Theoretical and Interdisciplinary Perspectives bridge computational methods with humanistic inquiry, as seen in Bridging Technology Humanities[3].

A particularly active line of work centers on comprehensive benchmarking that spans multiple humanities and social sciences disciplines, balancing breadth with depth of evaluation. HSSBench[0] exemplifies this approach by providing a broad assessment framework across diverse humanistic tasks, positioning itself alongside other general-purpose benchmarks like MMMU[5] and CMMMU[38] but with a stronger emphasis on social sciences and humanities-specific reasoning. In contrast, works such as Christian Iconography Benchmark[4] or HistBench HistAgent[7] pursue narrower, domain-expert-level evaluations within specialized subfields. The tension between creating widely applicable benchmarks versus deeply specialized assessments remains a central question, as does the challenge of capturing culturally situated knowledge that varies across linguistic and geographic contexts, a concern highlighted by efforts like Pangea[27] and Cultural Understanding Benchmark[25].

Claimed Contributions

HSSBench benchmark for Humanities and Social Sciences

The authors introduce HSSBench, a large-scale multilingual benchmark containing over 13,000 samples across six categories and 45 types, specifically designed to evaluate multimodal large language models on Humanities and Social Sciences tasks that require horizontal, interdisciplinary reasoning rather than vertical STEM-style reasoning.

Retrieved papers compared: 10. Refutation status: Can Refute (one refutable candidate identified). An illustrative sample-schema sketch follows below.
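To make the shape of such a benchmark concrete, the following is a minimal sketch of what a single HSSBench-style sample record might look like, assuming a standard multiple-choice VQA schema. The class name, field names, and example values are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HSSSample:
    """Hypothetical schema for one multiple-choice VQA sample;
    field names are illustrative, not the authors' actual format."""
    sample_id: str      # unique identifier
    image_path: str     # visual input (artwork, map, artifact, chart, ...)
    question: str       # question requiring HSS knowledge about the image
    options: List[str]  # candidate answers
    answer: str         # gold answer label, e.g. "B"
    category: str       # one of the six top-level HSS categories
    subtype: str        # one of the 45 finer-grained types
    language: str       # one of the six UN official languages, e.g. "zh"

# A placeholder instance; the concrete category and subtype labels here
# are invented for illustration only.
example = HSSSample(
    sample_id="hss-000001",
    image_path="images/example_artifact.jpg",
    question="Which historical tradition does the depicted artifact belong to?",
    options=["A. Tradition one", "B. Tradition two", "C. Tradition three", "D. Tradition four"],
    answer="B",
    category="History",
    subtype="Material culture",
    language="en",
)
```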
VQA Generation Pipeline for HSS scenarios

The authors develop a three-stage data construction pipeline (Dataset Preparation, Dataset Construction, and Validation) that combines domain expert annotation with a multi-agent framework to efficiently generate high-quality visual question answering data tailored to the unique requirements of Humanities and Social Sciences domains.

Retrieved papers compared: 10. Refutation status: no refutable candidates found. An illustrative pipeline sketch follows below.
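As a rough illustration of how a three-stage construction pipeline of this kind could be orchestrated, the sketch below assumes a drafting agent, a critique agent, and a final expert check. The function names, agent interfaces, and revision loop are hypothetical and stand in for whatever the authors' multi-agent framework actually does.

```python
from typing import Callable, Dict, List, Optional

def prepare_sources(raw_items: List[Dict]) -> List[Dict]:
    """Stage 1 (Dataset Preparation): keep only expert-curated items that
    carry both an image and topic metadata. Field names are assumptions."""
    return [item for item in raw_items if item.get("image") and item.get("topic")]

def construct_samples(
    sources: List[Dict],
    draft_agent: Callable[[Dict], Dict],
    critique_agent: Callable[[Dict], Optional[str]],
    max_rounds: int = 3,
) -> List[Dict]:
    """Stage 2 (Dataset Construction): a drafting agent proposes a QA pair
    and a critique agent requests revisions until it raises no objection
    or the round budget is exhausted."""
    samples = []
    for src in sources:
        sample = draft_agent(src)
        for _ in range(max_rounds):
            objection = critique_agent(sample)
            if objection is None:
                break
            # Feed the objection back to the drafting agent for revision.
            sample = draft_agent({**src, "feedback": objection})
        samples.append(sample)
    return samples

def validate_samples(samples: List[Dict], expert_check: Callable[[Dict], bool]) -> List[Dict]:
    """Stage 3 (Validation): keep only samples that pass a final expert review."""
    return [s for s in samples if expert_check(s)]
```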
Comprehensive multilingual evaluation of MLLMs on HSS tasks

The authors conduct extensive evaluations of over 20 mainstream multimodal large language models across six languages, revealing that current state-of-the-art models struggle with HSS tasks and demonstrating the benchmark's effectiveness in identifying limitations in cross-disciplinary reasoning and cross-modal knowledge transfer.

Retrieved papers compared: 10. Refutation status: no refutable candidates found. An illustrative evaluation sketch follows below.
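The following is a schematic sketch of how per-language and per-category accuracy could be aggregated when scoring many MLLMs on such a benchmark. The evaluate function, the answer-callable interface, and the sample field names are assumptions for illustration, not the authors' evaluation harness.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

def evaluate(
    models: Dict[str, Callable[[dict], str]],  # model name -> function returning an answer letter
    samples: Iterable[dict],                   # each dict has "language", "category", "answer"
) -> Dict[str, Dict[str, float]]:
    """Aggregate multiple-choice accuracy per (model, language, category)."""
    samples = list(samples)
    correct: Dict[Tuple[str, str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for name, answer_fn in models.items():
        for s in samples:
            key = (name, s["language"], s["category"])
            total[key] += 1
            if answer_fn(s).strip().upper().startswith(s["answer"].upper()):
                correct[key] += 1
    # Collapse counts into accuracy tables keyed by model, then "language/category".
    results: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (name, lang, cat), n in total.items():
        results[name][f"{lang}/{cat}"] = correct[(name, lang, cat)] / n
    return dict(results)
```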

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
