IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision Language Models, VLMs, Multimodal Models, Cultural VLMs, Multimodal Evaluation, OCR, Cultural VQA, Multimodal Machine Translation, MMT
Abstract:

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 question types. The final benchmark comprises ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to medium- and large-scale open-weight models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces IndicVisionBench, a large-scale benchmark for evaluating vision-language models on Indian subcontinent content across English and ten Indic languages. It resides in the 'Indian Subcontinent Benchmarks' leaf, which contains only two papers total: this work and Drishtikon. This represents a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that culturally grounded evaluation resources for the Indian subcontinent remain underdeveloped despite the region's linguistic diversity and population scale.

The taxonomy reveals that region-specific benchmarks form a distinct branch alongside broader multilingual efforts. Neighboring leaves include Southeast Asian benchmarks (one paper) and geographically diverse cultural benchmarks (three papers), while the parent category 'Region-Specific and Cultural Benchmarks' contrasts with 'Comprehensive Multilingual Multimodal Benchmarks', which contains exam-based and general frameworks. The paper's focus on culturally grounded topics and parallel annotations distinguishes it from general-purpose multilingual datasets like Global MMLU or M3exam, which prioritize breadth over regional depth. This positioning reflects a tension in the field between universal coverage and culturally nuanced evaluation.

Among thirty candidates examined, the contribution-level analysis shows varied novelty profiles. The benchmark itself (Contribution A: ten candidates, zero refutations) and the parallel corpus (Contribution B: ten candidates, zero refutations) appear relatively novel within the limited search scope. However, the evaluation revealing performance gaps (Contribution C: ten candidates, two refutations) encounters more substantial prior work, likely because documenting VLM limitations in diverse settings has been explored in related cultural and multilingual evaluation studies. The small number of refutations suggests the specific combination of tasks, languages, and cultural grounding may still offer distinctive insights.

Based on the limited search of thirty semantically similar papers, the work appears to occupy a genuinely sparse research area. The Indian subcontinent leaf's minimal population and the absence of refutations for the core benchmark contributions suggest meaningful novelty, though the evaluation findings align with broader patterns documented in cultural bias and multilingual capability studies. The analysis is not an exhaustive literature review, and it does not cover domain-specific publication venues that might reveal additional related work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Evaluating vision-language models on culturally diverse and multilingual content. The field has organized itself around several major branches that reflect both methodological and application-driven concerns. Benchmark Development and Dataset Construction focuses on creating evaluation resources that span diverse languages and cultural contexts, including region-specific benchmarks such as those targeting the Indian subcontinent (IndicVisionBench[0], Drishtikon[23]) or Arabic traditions (AraTraditions10k[32]), as well as broader multilingual datasets (Global MMLU[2], M3exam[8]). Model Development and Training Approaches addresses architectural innovations and training strategies for multilingual vision-language systems (SigLIP 2[1], mblip[25]), while Evaluation Methodologies and Empirical Analysis examines how to rigorously assess cultural understanding and linguistic equity (Cultural Understanding Benchmark[4], All Languages Matter[3]). Survey and Theoretical Frameworks provide conceptual grounding (Multilingual VLM Survey[10], Cultural Awareness Survey[30]), and the remaining branches explore Application-Oriented tasks, Domain-Specific uses (WorldMedQA-V[12]), Educational contexts (Pedagogical Inclusion[26]), and Cross-Cultural Communication Systems.

Particularly active lines of work reveal tensions between broad multilingual coverage and deep cultural grounding. Some efforts prioritize scaling to many languages with general-purpose architectures (PaLI-X[20], Pangea[27]), while others emphasize culturally nuanced understanding through specialized benchmarks that capture region-specific knowledge and visual traditions (Cultural Inclusive VLMs[6], CultureVLM[15]).

IndicVisionBench[0] sits squarely within the region-specific strand, joining a small cluster of Indian subcontinent benchmarks like Drishtikon[23] that probe whether models can handle culturally grounded visual reasoning in Indic languages. Compared to broader multilingual evaluations such as Global MMLU[2] or cross-cultural frameworks like Kaleidoscope[11], IndicVisionBench[0] offers deeper regional focus, trading breadth for the ability to surface culture-specific gaps that general benchmarks might overlook. This positioning reflects an ongoing question in the field: whether universal multilingual models can adequately serve diverse communities or whether region-tailored evaluation and development remain essential.

Claimed Contributions

IndicVisionBench benchmark for culturally grounded multimodal evaluation

The authors introduce IndicVisionBench, the first large-scale benchmark explicitly designed to evaluate vision-language models on culturally grounded understanding in the Indian subcontinent context. It comprises 5K images and 37K+ question-answer pairs across 13 cultural topics, covering English and 10 Indic languages, and spans three multimodal tasks: Visual Question Answering, Optical Character Recognition, and Multimodal Machine Translation.

10 retrieved papers
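To make the benchmark's composition concrete, the sketch below shows one plausible way a single IndicVisionBench item could be represented, inferred from the description above (tasks, question types, topics, languages). The field names and class are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndicVisionBenchItem:
    """Hypothetical record layout for one QA pair; field names are
    illustrative assumptions, not the benchmark's released format."""
    image_id: str                      # one of the ~5K images
    task: str                          # "vqa", "ocr", or "mmt"
    question_type: str                 # one of the 6 question types
    topic: str                         # one of the 13 culturally grounded topics
    language: str                      # "en" or one of the 10 Indic languages
    question: str                      # prompt shown alongside the image
    answer: str                        # gold reference answer
    parallel_id: Optional[str] = None  # links translations of the same item

# Example: an English VQA item; a Hindi translation of the same question
# would share the same parallel_id.
en_item = IndicVisionBenchItem("img_0001", "vqa", "multiple_choice",
                               "festivals", "en",
                               "Which festival is being celebrated?",
                               "Pongal", parallel_id="q_0001")
```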
Paired parallel corpus across 10 Indic languages

The authors release a paired parallel corpus of annotations spanning 10 Indic languages, enabling systematic analysis of cultural and linguistic biases in vision-language models. This resource supports cross-lingual evaluation and comparison of model performance across diverse linguistic contexts.

10 retrieved papers
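A minimal sketch of the kind of analysis such a parallel corpus enables: because annotations are aligned across languages, per-language scores are computed over identical underlying content, so any gap is attributable to the language rather than to differing item difficulty. The triple format and `parallel_id` linkage here are assumptions for illustration, not the authors' released tooling.

```python
from collections import defaultdict

def per_language_accuracy(results):
    """results: iterable of (parallel_id, language, is_correct) triples.
    With a parallel corpus, each parallel_id appears once per language,
    so the per-language accuracies below compare like with like."""
    correct, total = defaultdict(int), defaultdict(int)
    for _, language, is_correct in results:
        total[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy example: the same three items answered in English and Hindi.
demo = [("q1", "en", True), ("q1", "hi", False),
        ("q2", "en", True), ("q2", "hi", True),
        ("q3", "en", True), ("q3", "hi", False)]
print(per_language_accuracy(demo))  # {'en': 1.0, 'hi': 0.333...}
```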
Comprehensive evaluation revealing performance gaps in culturally diverse settings

The authors conduct a comprehensive evaluation of 8 state-of-the-art vision-language models, including both proprietary and open-weight systems, across all three benchmark tracks. Their experiments reveal substantial performance gaps and systematic limitations of current models in culturally diverse and multilingual contexts, particularly for low-resource languages.

10 retrieved papers; 2 of them can refute this contribution.
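As a sketch of how an evaluation like this might be aggregated, the snippet below summarizes per-item scores by model and language and reports each model's cross-language spread, which is one way the performance gaps described above would surface. The score layout and the max-min gap statistic are assumptions for illustration, not the paper's actual protocol.

```python
import statistics
from collections import defaultdict

def summarize(scores):
    """scores: dict mapping (model, language) -> list of 0/1 per-item scores.
    Returns, per model, the mean accuracy across languages, the weakest
    language, and the max-min accuracy gap across languages."""
    by_model = defaultdict(dict)
    for (model, language), items in scores.items():
        by_model[model][language] = sum(items) / len(items)
    report = {}
    for model, langs in by_model.items():
        vals = list(langs.values())
        report[model] = {"mean_acc": statistics.mean(vals),
                         "worst_language": min(langs, key=langs.get),
                         "language_gap": max(vals) - min(vals)}
    return report

# Toy example with two hypothetical models and two languages.
print(summarize({("model_a", "en"): [1, 1, 0], ("model_a", "hi"): [1, 0, 0],
                 ("model_b", "en"): [1, 1, 1], ("model_b", "hi"): [1, 1, 0]}))
```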

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions, described in full under Claimed Contributions above, are compared individually:

Contribution A: IndicVisionBench benchmark for culturally grounded multimodal evaluation

Contribution B: Paired parallel corpus across 10 Indic languages

Contribution C: Comprehensive evaluation revealing performance gaps in culturally diverse settings