HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: MLLMs, Benchmark, Dataset, Humanities and Social Sciences
Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HSSBench, a benchmark evaluating multimodal large language models on humanities and social sciences tasks across multiple languages. It resides in the 'Humanities and Social Sciences Task Benchmarks' leaf, which contains five papers in total, including the original work. This leaf sits within the broader 'Domain-Specific Benchmark Development' branch, indicating a moderately populated research direction focused on creating specialized evaluation suites. The taxonomy suggests that although domain-specific benchmarking is active, this particular leaf is a focused effort to address HSS-specific evaluation needs rather than a crowded subfield.

The taxonomy structure shows that HSSBench's leaf neighbors include 'Cultural and Historical Knowledge Assessment' (five papers) and 'Social and Behavioral Understanding Benchmarks' (three papers), both emphasizing narrower aspects of humanistic evaluation. Nearby branches like 'Cross-Disciplinary Evaluation' contain multi-domain benchmarks (e.g., MMMU, CMMMU) that span STEM and humanities but lack HSS-specific depth. The taxonomy's scope notes clarify that HSSBench's comprehensive HSS focus distinguishes it from purely cultural heritage benchmarks or general expert-level assessments, positioning it at the intersection of breadth and domain specialization within the humanities evaluation landscape.

Across the thirty candidate papers examined (ten per claimed contribution), the 'HSSBench benchmark' contribution has one refutable candidate, suggesting that some prior work on comprehensive HSS benchmarking exists but is limited. The 'VQA Generation Pipeline for HSS scenarios' contribution found no refutable candidates among its ten, indicating potential novelty in the data generation methodology; the 'multilingual evaluation' contribution likewise found none. These statistics reflect a focused search scope rather than exhaustive coverage, and the single refutable instance likely represents overlap with existing multi-domain benchmarks that include HSS components.

Based on the limited search of thirty candidates, the work appears to occupy a relatively underexplored niche combining comprehensive HSS coverage with multilingual evaluation. The taxonomy context suggests that while related benchmarks exist, few target the specific intersection of broad humanities disciplines, social sciences reasoning, and multilingual assessment. The analysis does not capture potential work outside the top-thirty semantic matches or recent preprints, so the novelty assessment remains provisional pending broader literature review.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: Evaluating multimodal large language models on humanities and social sciences tasks. The field has organized itself around several complementary directions. Domain-Specific Benchmark Development focuses on creating targeted evaluation suites for humanities and social sciences, such as MMMU[5], Exams-v[16], and MMVU[17], which test models on discipline-specific knowledge and reasoning. Cross-Disciplinary Evaluation examines how models perform across multiple domains simultaneously, often revealing gaps in cultural or contextual understanding. Applied Domain Analysis investigates real-world applications in areas like urban studies, historical document processing, and cultural heritage, with works such as Urban Inequality Imagery[15] and HistBench HistAgent[7] demonstrating practical use cases. Methodological Frameworks and Approaches develop systematic ways to probe model capabilities, while Model Behavior and Bias Analysis scrutinizes fairness and representation issues through benchmarks like VLBiasBench[41]. Theoretical and Interdisciplinary Perspectives bridge computational methods with humanistic inquiry, as seen in Bridging Technology Humanities[3].

A particularly active line of work centers on comprehensive benchmarking that spans multiple humanities and social sciences disciplines, balancing breadth with depth of evaluation. HSSBench[0] exemplifies this approach by providing a broad assessment framework across diverse humanistic tasks, positioning itself alongside other general-purpose benchmarks like MMMU[5] and CMMMU[38] but with a stronger emphasis on social sciences and humanities-specific reasoning. In contrast, works such as Christian Iconography Benchmark[4] or HistBench HistAgent[7] pursue narrower, domain-expert-level evaluations within specialized subfields. The tension between creating widely applicable benchmarks versus deeply specialized assessments remains a central question, as does the challenge of capturing culturally situated knowledge that varies across linguistic and geographic contexts, a concern highlighted by efforts like Pangea[27] and Cultural Understanding Benchmark[25].

Claimed Contributions

HSSBench benchmark for Humanities and Social Sciences

The authors introduce HSSBench, a large-scale multilingual benchmark containing over 13,000 samples across six categories and 45 types, specifically designed to evaluate multimodal large language models on Humanities and Social Sciences tasks that require horizontal, interdisciplinary reasoning rather than vertical STEM-style reasoning.

Retrieved papers compared: 10. Refutation status: Can Refute (one refutable candidate identified). An illustrative sample-schema sketch follows below.
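To make the shape of such a benchmark concrete, the following is a minimal sketch of what a single HSSBench-style sample record might look like, assuming a standard multiple-choice VQA schema. The class name, field names, and example values are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HSSSample:
    """Hypothetical schema for one multiple-choice VQA sample;
    field names are illustrative, not the authors' actual format."""
    sample_id: str      # unique identifier
    image_path: str     # visual input (artwork, map, artifact, chart, ...)
    question: str       # question requiring HSS knowledge about the image
    options: List[str]  # candidate answers
    answer: str         # gold answer label, e.g. "B"
    category: str       # one of the six top-level HSS categories
    subtype: str        # one of the 45 finer-grained types
    language: str       # one of the six UN official languages, e.g. "zh"

# A placeholder instance; the concrete category and subtype labels here
# are invented for illustration only.
example = HSSSample(
    sample_id="hss-000001",
    image_path="images/example_artifact.jpg",
    question="Which historical tradition does the depicted artifact belong to?",
    options=["A. Tradition one", "B. Tradition two", "C. Tradition three", "D. Tradition four"],
    answer="B",
    category="History",
    subtype="Material culture",
    language="en",
)
```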
VQA Generation Pipeline for HSS scenarios

The authors develop a three-stage data construction pipeline (Dataset Preparation, Dataset Construction, and Validation) that combines domain expert annotation with a multi-agent framework to efficiently generate high-quality visual question answering data tailored to the unique requirements of Humanities and Social Sciences domains.

Retrieved papers compared: 10. Refutation status: no refutable candidates found. An illustrative pipeline sketch follows below.
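As a rough illustration of how a three-stage construction pipeline of this kind could be orchestrated, the sketch below assumes a drafting agent, a critique agent, and a final expert check. The function names, agent interfaces, and revision loop are hypothetical and stand in for whatever the authors' multi-agent framework actually does.

```python
from typing import Callable, Dict, List, Optional

def prepare_sources(raw_items: List[Dict]) -> List[Dict]:
    """Stage 1 (Dataset Preparation): keep only expert-curated items that
    carry both an image and topic metadata. Field names are assumptions."""
    return [item for item in raw_items if item.get("image") and item.get("topic")]

def construct_samples(
    sources: List[Dict],
    draft_agent: Callable[[Dict], Dict],
    critique_agent: Callable[[Dict], Optional[str]],
    max_rounds: int = 3,
) -> List[Dict]:
    """Stage 2 (Dataset Construction): a drafting agent proposes a QA pair
    and a critique agent requests revisions until it raises no objection
    or the round budget is exhausted."""
    samples = []
    for src in sources:
        sample = draft_agent(src)
        for _ in range(max_rounds):
            objection = critique_agent(sample)
            if objection is None:
                break
            # Feed the objection back to the drafting agent for revision.
            sample = draft_agent({**src, "feedback": objection})
        samples.append(sample)
    return samples

def validate_samples(samples: List[Dict], expert_check: Callable[[Dict], bool]) -> List[Dict]:
    """Stage 3 (Validation): keep only samples that pass a final expert review."""
    return [s for s in samples if expert_check(s)]
```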
Comprehensive multilingual evaluation of MLLMs on HSS tasks

The authors conduct extensive evaluations of over 20 mainstream multimodal large language models across six languages, revealing that current state-of-the-art models struggle with HSS tasks and demonstrating the benchmark's effectiveness in identifying limitations in cross-disciplinary reasoning and cross-modal knowledge transfer.

Retrieved papers compared: 10. Refutation status: no refutable candidates found. An illustrative evaluation sketch follows below.
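The following is a schematic sketch of how per-language and per-category accuracy could be aggregated when scoring many MLLMs on such a benchmark. The evaluate function, the answer-callable interface, and the sample field names are assumptions for illustration, not the authors' evaluation harness.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

def evaluate(
    models: Dict[str, Callable[[dict], str]],  # model name -> function returning an answer letter
    samples: Iterable[dict],                   # each dict has "language", "category", "answer"
) -> Dict[str, Dict[str, float]]:
    """Aggregate multiple-choice accuracy per (model, language, category)."""
    samples = list(samples)
    correct: Dict[Tuple[str, str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for name, answer_fn in models.items():
        for s in samples:
            key = (name, s["language"], s["category"])
            total[key] += 1
            if answer_fn(s).strip().upper().startswith(s["answer"].upper()):
                correct[key] += 1
    # Collapse counts into accuracy tables keyed by model, then "language/category".
    results: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (name, lang, cat), n in total.items():
        results[name][f"{lang}/{cat}"] = correct[(name, lang, cat)] / n
    return dict(results)
```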

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
