IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision Language Models, VLMs, Multimodal Models, Cultural VLMs, Multimodal Evaluation, OCR, Cultural VQA, Multimodal Machine Translation, MMT
Abstract:

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 question types. The final benchmark comprises ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to medium- and large-scale open-weight models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces IndicVisionBench, a large-scale benchmark for evaluating vision-language models on Indian subcontinent content across English and ten Indic languages. It resides in the 'Indian Subcontinent Benchmarks' leaf, which contains only two papers total: this work and Drishtikon. This represents a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that culturally grounded evaluation resources for the Indian subcontinent remain underdeveloped despite the region's linguistic diversity and population scale.

The taxonomy reveals that region-specific benchmarks form a distinct branch alongside broader multilingual efforts. Neighboring leaves include Southeast Asian benchmarks (one paper) and geographically diverse cultural benchmarks (three papers), while the parent category 'Region-Specific and Cultural Benchmarks' contrasts with 'Comprehensive Multilingual Multimodal Benchmarks', which contains exam-based and general frameworks. The paper's focus on culturally grounded topics and parallel annotations distinguishes it from general-purpose multilingual datasets like Global MMLU or M3exam, which prioritize breadth over regional depth. This positioning reflects a tension in the field between universal coverage and culturally nuanced evaluation.

Among thirty candidates examined, the contribution-level analysis shows varied novelty profiles. The benchmark itself (Contribution A: ten candidates, zero refutations) and the parallel corpus (Contribution B: ten candidates, zero refutations) appear relatively novel within the limited search scope. However, the evaluation revealing performance gaps (Contribution C: ten candidates, two refutations) encounters more substantial prior work, likely because documenting VLM limitations in diverse settings has been explored in related cultural and multilingual evaluation studies. The small number of refutations suggests the specific combination of tasks, languages, and cultural grounding may still offer distinctive insights.

Based on the limited search of thirty semantically similar papers, the work appears to occupy a genuinely sparse research area. The Indian subcontinent leaf's minimal population and the absence of refutations for the core benchmark contributions suggest meaningful novelty, though the evaluation findings align with broader patterns documented in cultural bias and multilingual capability studies. The analysis is not an exhaustive literature review, and it does not cover domain-specific publication venues that might reveal additional related work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Evaluating vision-language models on culturally diverse and multilingual content. The field has organized itself around several major branches that reflect both methodological and application-driven concerns. Benchmark Development and Dataset Construction focuses on creating evaluation resources that span diverse languages and cultural contexts, including region-specific benchmarks such as those targeting the Indian subcontinent (IndicVisionBench[0], Drishtikon[23]) or Arabic traditions (AraTraditions10k[32]), as well as broader multilingual datasets (Global MMLU[2], M3exam[8]). Model Development and Training Approaches addresses architectural innovations and training strategies for multilingual vision-language systems (SigLIP 2[1], mblip[25]), while Evaluation Methodologies and Empirical Analysis examines how to rigorously assess cultural understanding and linguistic equity (Cultural Understanding Benchmark[4], All Languages Matter[3]). Survey and Theoretical Frameworks provide conceptual grounding (Multilingual VLM Survey[10], Cultural Awareness Survey[30]), and the remaining branches explore Application-Oriented tasks, Domain-Specific uses (WorldMedQA-V[12]), Educational contexts (Pedagogical Inclusion[26]), and Cross-Cultural Communication Systems.

Particularly active lines of work reveal tensions between broad multilingual coverage and deep cultural grounding. Some efforts prioritize scaling to many languages with general-purpose architectures (PaLI-X[20], Pangea[27]), while others emphasize culturally nuanced understanding through specialized benchmarks that capture region-specific knowledge and visual traditions (Cultural Inclusive VLMs[6], CultureVLM[15]).

IndicVisionBench[0] sits squarely within the region-specific strand, joining a small cluster of Indian subcontinent benchmarks like Drishtikon[23] that probe whether models can handle culturally grounded visual reasoning in Indic languages. Compared to broader multilingual evaluations such as Global MMLU[2] or cross-cultural frameworks like Kaleidoscope[11], IndicVisionBench[0] offers deeper regional focus, trading breadth for the ability to surface culture-specific gaps that general benchmarks might overlook. This positioning reflects an ongoing question in the field: whether universal multilingual models can adequately serve diverse communities or whether region-tailored evaluation and development remain essential.

Claimed Contributions

IndicVisionBench benchmark for culturally grounded multimodal evaluation

The authors introduce IndicVisionBench, the first large-scale benchmark explicitly designed to evaluate vision-language models on culturally grounded understanding in the Indian subcontinent context. It comprises 5K images and 37K+ question-answer pairs across 13 cultural topics, covering English and 10 Indic languages, and spans three multimodal tasks: Visual Question Answering, Optical Character Recognition, and Multimodal Machine Translation.

10 retrieved papers
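To make the benchmark's composition concrete, the sketch below shows one plausible way a single IndicVisionBench item could be represented, inferred from the description above (tasks, question types, topics, languages). The field names and class are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndicVisionBenchItem:
    """Hypothetical record layout for one QA pair; field names are
    illustrative assumptions, not the benchmark's released format."""
    image_id: str                      # one of the ~5K images
    task: str                          # "vqa", "ocr", or "mmt"
    question_type: str                 # one of the 6 question types
    topic: str                         # one of the 13 culturally grounded topics
    language: str                      # "en" or one of the 10 Indic languages
    question: str                      # prompt shown alongside the image
    answer: str                        # gold reference answer
    parallel_id: Optional[str] = None  # links translations of the same item

# Example: an English VQA item; a Hindi translation of the same question
# would share the same parallel_id.
en_item = IndicVisionBenchItem("img_0001", "vqa", "multiple_choice",
                               "festivals", "en",
                               "Which festival is being celebrated?",
                               "Pongal", parallel_id="q_0001")
```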
Paired parallel corpus across 10 Indic languages

The authors release a paired parallel corpus of annotations spanning 10 Indic languages, enabling systematic analysis of cultural and linguistic biases in vision-language models. This resource supports cross-lingual evaluation and comparison of model performance across diverse linguistic contexts.

10 retrieved papers
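A minimal sketch of the kind of analysis such a parallel corpus enables: because annotations are aligned across languages, per-language scores are computed over identical underlying content, so any gap is attributable to the language rather than to differing item difficulty. The triple format and `parallel_id` linkage here are assumptions for illustration, not the authors' released tooling.

```python
from collections import defaultdict

def per_language_accuracy(results):
    """results: iterable of (parallel_id, language, is_correct) triples.
    With a parallel corpus, each parallel_id appears once per language,
    so the per-language accuracies below compare like with like."""
    correct, total = defaultdict(int), defaultdict(int)
    for _, language, is_correct in results:
        total[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy example: the same three items answered in English and Hindi.
demo = [("q1", "en", True), ("q1", "hi", False),
        ("q2", "en", True), ("q2", "hi", True),
        ("q3", "en", True), ("q3", "hi", False)]
print(per_language_accuracy(demo))  # {'en': 1.0, 'hi': 0.333...}
```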
Comprehensive evaluation revealing performance gaps in culturally diverse settings

The authors conduct a comprehensive evaluation of 8 state-of-the-art vision-language models, including both proprietary and open-weight systems, across all three benchmark tracks. Their experiments reveal substantial performance gaps and systematic limitations of current models in culturally diverse and multilingual contexts, particularly for low-resource languages.

10 retrieved papers; 2 of them can refute this contribution.
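As a sketch of how an evaluation like this might be aggregated, the snippet below summarizes per-item scores by model and language and reports each model's cross-language spread, which is one way the performance gaps described above would surface. The score layout and the max-min gap statistic are assumptions for illustration, not the paper's actual protocol.

```python
import statistics
from collections import defaultdict

def summarize(scores):
    """scores: dict mapping (model, language) -> list of 0/1 per-item scores.
    Returns, per model, the mean accuracy across languages, the weakest
    language, and the max-min accuracy gap across languages."""
    by_model = defaultdict(dict)
    for (model, language), items in scores.items():
        by_model[model][language] = sum(items) / len(items)
    report = {}
    for model, langs in by_model.items():
        vals = list(langs.values())
        report[model] = {"mean_acc": statistics.mean(vals),
                         "worst_language": min(langs, key=langs.get),
                         "language_gap": max(vals) - min(vals)}
    return report

# Toy example with two hypothetical models and two languages.
print(summarize({("model_a", "en"): [1, 1, 0], ("model_a", "hi"): [1, 0, 0],
                 ("model_b", "en"): [1, 1, 1], ("model_b", "hi"): [1, 1, 0]}))
```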

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions, described in full under Claimed Contributions above, are compared individually:

Contribution A: IndicVisionBench benchmark for culturally grounded multimodal evaluation

Contribution B: Paired parallel corpus across 10 Indic languages

Contribution C: Comprehensive evaluation revealing performance gaps in culturally diverse settings