Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: multilingual benchmarks, vision-language models, multimodal evaluation, cultural diversity, low-resource languages, machine learning evaluation
Abstract:

The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and language coverage, many rely on translations of English datasets and fail to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. It covers 18 languages and 14 subjects, for a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Kaleidoscope, a large-scale multilingual multimodal benchmark spanning 18 languages and 14 subjects with over 20,000 multiple-choice questions. It resides in the Comprehensive Multilingual Multimodal Benchmarks leaf, which contains six papers including EXAMS-V, M3Exam, and M4U. This leaf represents one of the most active research directions in the taxonomy, reflecting sustained community interest in holistic evaluation frameworks that stress-test vision-language models across diverse linguistic and task contexts rather than in narrow domain-specific settings.

The taxonomy reveals that Kaleidoscope sits within the broader Evaluation Benchmarks and Datasets branch, which also includes Task-Specific Evaluation Benchmarks (focused on VQA, retrieval, or document comprehension) and Cultural and Linguistic Diversity Benchmarks (emphasizing region-specific visual contexts). Neighboring branches cover Model Architecture and Training Approaches as well as Cross-Lingual Adaptation Methods, indicating that the field balances benchmark creation with model development. Kaleidoscope's emphasis on in-language data collection and cultural authenticity aligns it more closely with the Cultural and Linguistic Diversity leaf than with translation-based benchmarks, though it remains classified under comprehensive evaluation because of its multi-subject scope.

Among the 30 candidates examined, the KALEIDOSCOPE benchmark contribution shows one refutable candidate out of its ten retrieved papers, suggesting some, but limited, directly refutable prior work in comprehensive multilingual evaluation. The open science collaboration contribution faces stronger overlap, with six refutable candidates among its ten, indicating that collaborative data collection methods are well established in the field. The evaluation revealing performance disparities likewise shows one refutable candidate out of ten, implying that while empirical findings on cross-lingual gaps are documented, the specific modality-language interaction patterns may offer incremental insights. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage.

Given that the search examined 30 candidates and found eight refutable pairs across the three contributions, the work appears to build on a moderately crowded research area. The benchmark's scale and language coverage may differentiate it from siblings like EXAMS-V or M3Exam, but this analysis cannot confirm whether those differences constitute substantial novelty without deeper comparison. The collaborative methodology and the performance findings align with established patterns in multilingual evaluation research, though the specific combination of scale, authenticity, and task diversity may offer value to practitioners seeking comprehensive assessment tools.
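As a quick sanity check on the counts above, the totals follow directly from the per-contribution figures. The snippet below simply restates that arithmetic; the dictionary keys are shorthand labels, not identifiers from the report's pipeline:

```python
# Refutable candidates per claimed contribution, each out of
# 10 retrieved papers, as reported in this assessment.
refutable = {
    "kaleidoscope_benchmark": 1,
    "open_science_collaboration": 6,
    "performance_disparity_evaluation": 1,
}

candidates_examined = 10 * len(refutable)  # 3 contributions x 10 = 30
refutable_pairs = sum(refutable.values())  # 1 + 6 + 1 = 8

print(candidates_examined, refutable_pairs)  # -> 30 8
```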

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 8

Research Landscape Overview

Core task: multilingual multimodal vision-language model evaluation.

The field has evolved around several interconnected branches that together address how vision-language models perform across diverse languages and modalities. Model Architecture and Training Approaches explores foundational designs (ranging from early multilingual pretraining frameworks like UC2[5] and PaLI[11] to more recent large-scale efforts such as PaLI-X[1] and SigLIP 2[2]) that aim to build robust cross-lingual representations from scratch or via distillation methods like Cross-Lingual Multimodal Distillation[6]. Cross-Lingual Adaptation Methods focuses on transfer techniques, including pivoting strategies and weakly supervised alignment, to extend English-centric models to lower-resource languages. Evaluation Benchmarks and Datasets forms a dense branch, introducing comprehensive test suites like EXAMS-V[14], M3Exam[16], and M4U[27] that span multiple languages and task types, while Empirical Analysis and Model Behavior Studies investigates phenomena such as hallucination mitigation and semantic alignment. Application-Oriented Studies applies these models to domains like medical QA, artwork explanation, and navigation, and Survey and Review Literature synthesizes progress across the taxonomy.

Within the evaluation landscape, a particularly active line of work centers on comprehensive multilingual multimodal benchmarks that stress-test models on diverse reasoning and perception tasks. Kaleidoscope[0] sits squarely in this cluster, offering a broad assessment framework that complements neighboring efforts like EXAMS-V[14] and M3Exam[16], which emphasize academic exam-style questions, and M4U[27], which targets understanding across varied modalities. While EXAMS-V[14] and M3Exam[16] prioritize structured knowledge evaluation in educational contexts, Kaleidoscope[0] appears to adopt a wider lens, potentially incorporating richer task diversity or cultural variation akin to what the Culturally-Diverse Video Benchmark[10] explores for video data. This positioning reflects an ongoing tension in the field: balancing depth in specific reasoning domains against breadth in language and task coverage, a trade-off that remains central as researchers seek benchmarks capable of revealing both cross-lingual transfer gaps and fine-grained model behaviors.

Claimed Contributions

KALEIDOSCOPE benchmark for multilingual multimodal evaluation

The authors introduce KALEIDOSCOPE, a large-scale benchmark containing 20,911 multiple-choice questions across 18 languages and 14 subjects. The benchmark is designed to evaluate vision-language models using in-language, culturally authentic exam questions; 55% of the questions require image understanding to be answered correctly (a hypothetical record schema is sketched after this entry).

Retrieved papers: 10 · Status: Can Refute
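
As a concrete illustration of what an item in such a benchmark might look like, here is a minimal sketch of a plausible record schema together with a check of the reported multimodal share. The field names and types are assumptions for illustration, not KALEIDOSCOPE's actual release format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExamQuestion:
    """One KALEIDOSCOPE-style multiple-choice item (hypothetical schema)."""
    language: str              # one of the 18 covered languages, e.g. "bn"
    subject: str               # one of the 14 subjects, e.g. "physics"
    question: str              # in-language question text, never translated
    options: list[str]         # answer choices as originally written
    answer_index: int          # index of the correct option
    image_path: Optional[str]  # None for text-only questions

def multimodal_share(items: list[ExamQuestion]) -> float:
    """Fraction of items that include an image (reported as ~55%)."""
    return sum(q.image_path is not None for q in items) / len(items)
```

Under such a schema, filtering on whether image_path is None would also separate the text-only and multimodal subsets compared in the evaluation below.
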
Open science collaboration for authentic data collection

The authors conduct a large-scale open science effort involving contributors from 20 nations across four continents to manually collect exam questions in their original languages. This participatory approach ensures that the benchmark captures genuine linguistic and cultural nuances rather than relying on translations.

Retrieved papers: 10 · Status: Can Refute

Comprehensive evaluation revealing modality-specific and cross-lingual performance disparities

The authors evaluate state-of-the-art vision-language models on KALEIDOSCOPE and identify significant performance gaps: models achieve substantially higher accuracy on text-only than on multimodal questions, struggle more with STEM subjects than with the humanities, and perform worse on low-resource and non-Latin-script languages (an illustrative aggregation sketch follows this entry).

Retrieved papers: 10 · Status: Can Refute
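
To make these breakdowns concrete, the sketch below shows one straightforward way to aggregate per-question accuracy along the three axes discussed (modality, subject area, language). The record format and grouping keys are assumptions for illustration, not the authors' evaluation code:

```python
from collections import defaultdict

def accuracy_by(results, key):
    """Mean accuracy per group, where `results` is a list of dicts
    holding a boolean 'correct' flag plus per-question metadata."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r[key]].append(r["correct"])
    return {group: sum(flags) / len(flags) for group, flags in buckets.items()}

# Illustrative usage with hypothetical per-question records:
results = [
    {"correct": True,  "modality": "text-only",  "subject_area": "humanities", "language": "es"},
    {"correct": False, "modality": "multimodal", "subject_area": "STEM",       "language": "bn"},
]
print(accuracy_by(results, "modality"))      # text-only vs multimodal gap
print(accuracy_by(results, "subject_area"))  # STEM vs humanities gap
print(accuracy_by(results, "language"))      # low- vs high-resource gap
```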

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three contributions analyzed are those described under Claimed Contributions above:

Contribution: KALEIDOSCOPE benchmark for multilingual multimodal evaluation

Contribution: Open science collaboration for authentic data collection

Contribution: Comprehensive evaluation revealing modality-specific and cross-lingual performance disparities