Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: multilingual benchmarks, vision-language models, multimodal evaluation, cultural diversity, low-resource languages, machine learning evaluation
Abstract:

The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and language coverage, many rely on translations of English datasets and fail to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. It covers 18 languages and 14 subjects, for a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Kaleidoscope, a large-scale multilingual multimodal benchmark spanning 18 languages and 14 subjects with over 20,000 multiple-choice questions. It resides in the Comprehensive Multilingual Multimodal Benchmarks leaf, which contains six papers including EXAMS-V, M3Exam, and M4U. This leaf represents one of the most active research directions in the taxonomy, reflecting sustained community interest in holistic evaluation frameworks that stress-test vision-language models across diverse linguistic and task contexts rather than in narrow domain-specific settings.

The taxonomy reveals that Kaleidoscope sits within the broader Evaluation Benchmarks and Datasets branch, which also includes Task-Specific Evaluation Benchmarks (focused on VQA, retrieval, or document comprehension) and Cultural and Linguistic Diversity Benchmarks (emphasizing region-specific visual contexts). Neighboring branches cover Model Architecture and Training Approaches as well as Cross-Lingual Adaptation Methods, indicating that the field balances benchmark creation with model development. Kaleidoscope's emphasis on in-language data collection and cultural authenticity aligns it more closely with the Cultural and Linguistic Diversity leaf than with translation-based benchmarks, though it remains classified under comprehensive evaluation because of its multi-subject scope.

Among the 30 candidates examined, the KALEIDOSCOPE benchmark contribution shows one refutable candidate out of its ten retrieved papers, suggesting some, but limited, directly refutable prior work in comprehensive multilingual evaluation. The open science collaboration contribution faces stronger overlap, with six refutable candidates among its ten, indicating that collaborative data collection methods are well established in the field. The evaluation revealing performance disparities likewise shows one refutable candidate out of ten, implying that while empirical findings on cross-lingual gaps are documented, the specific modality-language interaction patterns may offer incremental insights. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage.

Given that the search examined 30 candidates and found eight refutable pairs across the three contributions, the work appears to build on a moderately crowded research area. The benchmark's scale and language coverage may differentiate it from siblings like EXAMS-V or M3Exam, but this analysis cannot confirm whether those differences constitute substantial novelty without deeper comparison. The collaborative methodology and the performance findings align with established patterns in multilingual evaluation research, though the specific combination of scale, authenticity, and task diversity may offer value to practitioners seeking comprehensive assessment tools.
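As a quick sanity check on the counts above, the totals follow directly from the per-contribution figures. The snippet below simply restates that arithmetic; the dictionary keys are shorthand labels, not identifiers from the report's pipeline:

```python
# Refutable candidates per claimed contribution, each out of
# 10 retrieved papers, as reported in this assessment.
refutable = {
    "kaleidoscope_benchmark": 1,
    "open_science_collaboration": 6,
    "performance_disparity_evaluation": 1,
}

candidates_examined = 10 * len(refutable)  # 3 contributions x 10 = 30
refutable_pairs = sum(refutable.values())  # 1 + 6 + 1 = 8

print(candidates_examined, refutable_pairs)  # -> 30 8
```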

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 8

Research Landscape Overview

Core task: multilingual multimodal vision-language model evaluation.

The field has evolved around several interconnected branches that together address how vision-language models perform across diverse languages and modalities. Model Architecture and Training Approaches explores foundational designs (ranging from early multilingual pretraining frameworks like UC2[5] and PaLI[11] to more recent large-scale efforts such as PaLI-X[1] and SigLIP 2[2]) that aim to build robust cross-lingual representations from scratch or via distillation methods like Cross-Lingual Multimodal Distillation[6]. Cross-Lingual Adaptation Methods focuses on transfer techniques, including pivoting strategies and weakly supervised alignment, to extend English-centric models to lower-resource languages. Evaluation Benchmarks and Datasets forms a dense branch, introducing comprehensive test suites like EXAMS-V[14], M3Exam[16], and M4U[27] that span multiple languages and task types, while Empirical Analysis and Model Behavior Studies investigates phenomena such as hallucination mitigation and semantic alignment. Application-Oriented Studies applies these models to domains like medical QA, artwork explanation, and navigation, and Survey and Review Literature synthesizes progress across the taxonomy.

Within the evaluation landscape, a particularly active line of work centers on comprehensive multilingual multimodal benchmarks that stress-test models on diverse reasoning and perception tasks. Kaleidoscope[0] sits squarely in this cluster, offering a broad assessment framework that complements neighboring efforts like EXAMS-V[14] and M3Exam[16], which emphasize academic exam-style questions, and M4U[27], which targets understanding across varied modalities. While EXAMS-V[14] and M3Exam[16] prioritize structured knowledge evaluation in educational contexts, Kaleidoscope[0] appears to adopt a wider lens, potentially incorporating richer task diversity or cultural variation akin to what the Culturally-Diverse Video Benchmark[10] explores for video data. This positioning reflects an ongoing tension in the field: balancing depth in specific reasoning domains against breadth in language and task coverage, a trade-off that remains central as researchers seek benchmarks capable of revealing both cross-lingual transfer gaps and fine-grained model behaviors.

Claimed Contributions

KALEIDOSCOPE benchmark for multilingual multimodal evaluation

The authors introduce KALEIDOSCOPE, a large-scale benchmark containing 20,911 multiple-choice questions across 18 languages and 14 subjects. The benchmark is designed to evaluate vision-language models using in-language, culturally authentic exam questions; 55% of the questions require image understanding to be answered correctly (a hypothetical record schema is sketched after this entry).

Retrieved papers: 10 · Status: Can Refute
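
As a concrete illustration of what an item in such a benchmark might look like, here is a minimal sketch of a plausible record schema together with a check of the reported multimodal share. The field names and types are assumptions for illustration, not KALEIDOSCOPE's actual release format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExamQuestion:
    """One KALEIDOSCOPE-style multiple-choice item (hypothetical schema)."""
    language: str              # one of the 18 covered languages, e.g. "bn"
    subject: str               # one of the 14 subjects, e.g. "physics"
    question: str              # in-language question text, never translated
    options: list[str]         # answer choices as originally written
    answer_index: int          # index of the correct option
    image_path: Optional[str]  # None for text-only questions

def multimodal_share(items: list[ExamQuestion]) -> float:
    """Fraction of items that include an image (reported as ~55%)."""
    return sum(q.image_path is not None for q in items) / len(items)
```

Under such a schema, filtering on whether image_path is None would also separate the text-only and multimodal subsets compared in the evaluation below.
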
Open science collaboration for authentic data collection

The authors conduct a large-scale open science effort involving contributors from 20 nations across four continents to manually collect exam questions in their original languages. This participatory approach ensures that the benchmark captures genuine linguistic and cultural nuances rather than relying on translations.

Retrieved papers: 10 · Status: Can Refute

Comprehensive evaluation revealing modality-specific and cross-lingual performance disparities

The authors evaluate state-of-the-art vision-language models on KALEIDOSCOPE and identify significant performance gaps: models achieve substantially higher accuracy on text-only than on multimodal questions, struggle more with STEM subjects than with the humanities, and perform worse on low-resource and non-Latin-script languages (an illustrative aggregation sketch follows this entry).

Retrieved papers: 10 · Status: Can Refute
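
To make these breakdowns concrete, the sketch below shows one straightforward way to aggregate per-question accuracy along the three axes discussed (modality, subject area, language). The record format and grouping keys are assumptions for illustration, not the authors' evaluation code:

```python
from collections import defaultdict

def accuracy_by(results, key):
    """Mean accuracy per group, where `results` is a list of dicts
    holding a boolean 'correct' flag plus per-question metadata."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r[key]].append(r["correct"])
    return {group: sum(flags) / len(flags) for group, flags in buckets.items()}

# Illustrative usage with hypothetical per-question records:
results = [
    {"correct": True,  "modality": "text-only",  "subject_area": "humanities", "language": "es"},
    {"correct": False, "modality": "multimodal", "subject_area": "STEM",       "language": "bn"},
]
print(accuracy_by(results, "modality"))      # text-only vs multimodal gap
print(accuracy_by(results, "subject_area"))  # STEM vs humanities gap
print(accuracy_by(results, "language"))      # low- vs high-resource gap
```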

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three contributions analyzed are those described under Claimed Contributions above:

Contribution: KALEIDOSCOPE benchmark for multilingual multimodal evaluation

Contribution: Open science collaboration for authentic data collection

Contribution: Comprehensive evaluation revealing modality-specific and cross-lingual performance disparities