MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: benchmark, crosslingual, multimodal, instruction-following, speech, video
Abstract:

Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs' abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLM development. MCIF is released under the CC-BY 4.0 license to promote open research.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MCIF, a human-annotated benchmark for evaluating multimodal crosslingual instruction following across speech, vision, and text modalities in four languages. Within the taxonomy, MCIF resides in the 'Instruction Following Assessment Frameworks' leaf alongside one sibling paper (MaXIFE). This leaf contains only two papers total, suggesting a relatively sparse research direction focused specifically on instruction-following evaluation rather than general multimodal multilingual benchmarking. The taxonomy shows sixteen papers across the entire field, with MCIF occupying a specialized niche that emphasizes structured assessment of instruction execution rather than model development or domain-specific applications.

The taxonomy reveals that MCIF's immediate neighbors include 'General Multimodal Multilingual Benchmarks' (Pangea, CCFQA) and 'Domain-Specific Evaluation Benchmarks' covering web understanding, factuality detection, and cultural knowledge. While general benchmarks evaluate diverse vision-language tasks without instruction-following constraints, MCIF explicitly targets how models interpret and execute instructions across modalities and languages. Domain-specific branches address specialized contexts like web comprehension or cultural grounding, whereas MCIF focuses on scientific talks spanning NLP and related fields. The taxonomy's scope notes clarify that instruction-following frameworks differ from general task benchmarks by requiring explicit instruction adherence rather than open-ended multimodal reasoning.

Thirty candidate papers were examined in total, ten per claimed contribution. For the core MCIF benchmark contribution, none of the ten papers reviewed offers a clear refutation, suggesting potential novelty in its specific combination of crosslingual instruction following with scientific-talk content. However, the parallel design enabling systematic evaluation across dimensions encountered one refutable candidate among its ten, and the two prompt variants for robustness evaluation encountered four among theirs. These statistics indicate that while the overall benchmark concept may be distinctive, certain methodological elements, particularly prompt variation strategies, have substantial precedent within the limited search scope. The analysis does not claim exhaustive coverage but reflects patterns among top-ranked semantic matches.

Based on the limited literature search of thirty candidates, MCIF appears to occupy a specialized position combining instruction-following assessment with scientific domain content and human annotation across four languages. The taxonomy context suggests this direction remains relatively unexplored compared to broader multimodal benchmarking efforts. However, the contribution-level statistics reveal that specific design choices around parallel evaluation and prompt robustness have more established precedents, warranting careful positioning relative to existing methodological frameworks within the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: multimodal crosslingual instruction following evaluation. This field examines how vision-language models handle instructions across diverse languages, combining visual understanding with broad language coverage. The taxonomy divides into two main branches: Benchmark Design and Construction, which focuses on creating evaluation frameworks and datasets that probe crosslingual multimodal capabilities, and Model Development Approaches, which explores architectural innovations and training strategies to improve multilingual vision-language performance. Works in the benchmark branch, such as Pangea[1], CCFQA[2], and WebMMU[3], emphasize rigorous assessment protocols that span multiple languages and modalities, while the model development side includes efforts like xCoT[4] and Cross-lingual Visual Comprehension[5] that tackle reasoning and representation learning. Together, these branches reflect a maturing effort to move beyond English-centric evaluation toward systems that generalize across the world's languages.

Recent activity highlights contrasting priorities: some studies concentrate on broad language coverage and cultural diversity, as seen in the M5 Benchmark[7] and M4U[8], while others target specific reasoning challenges or domain applications such as In-Vehicle Voice Control[12]. MCIF[0] sits within the Instruction Following Assessment Frameworks cluster, sharing close ties with MaXIFE[9]; both emphasize structured evaluation of how models interpret and execute multimodal instructions in non-English contexts. Compared to neighbors like MaXIFE[9], MCIF[0] appears to prioritize systematic crosslingual benchmarking rather than purely model-centric innovation.

Open questions persist around balancing language diversity with annotation quality, handling low-resource languages, and ensuring that evaluation metrics capture culturally grounded visual reasoning beyond surface-level translation.

Claimed Contributions

MCIF benchmark for multimodal crosslingual instruction following

The authors introduce MCIF, a novel benchmark designed to evaluate instruction-following capabilities of multimodal LLMs across languages and modalities. It covers three modalities (text, speech, video), four languages (English, German, Italian, Chinese), and 13 tasks organized into four macro-tasks, with both short- and long-form contexts.

Retrieved papers: 10 · Verdict: no clear refutation found

Parallel design enabling systematic evaluation across dimensions

The benchmark is fully parallel across modalities and languages, allowing controlled ablation studies and systematic evaluation of how models handle different input modalities, languages, and context lengths. This design supports direct comparison of model performance across these dimensions; the sketch after this entry illustrates the idea.

Retrieved papers: 10 · Verdict: Can Refute
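To make the payoff of this parallel design concrete, below is a minimal sketch of grid-style evaluation over aligned (modality, language) cells. The data layout, field names, and scoring stub are hypothetical illustrations, not MCIF's actual release format or evaluation harness.

```python
from itertools import product

MODALITIES = ("text", "speech", "video")
LANGUAGES = ("en", "de", "it", "zh")

# Hypothetical parallel benchmark: every (modality, language) cell holds
# the same underlying items, identified by a shared item_id.
parallel_benchmark = {
    (m, l): [{"item_id": i, "input": f"{m}/{l}/item{i}"} for i in range(3)]
    for m, l in product(MODALITIES, LANGUAGES)
}

def score_item(item):
    # Stand-in for a real model call plus a task metric (e.g., BLEU, COMET).
    return len(item["input"]) % 2

def grid_evaluation(benchmark):
    """Score every cell of the modality x language grid.

    Because item_ids are aligned across cells, score differences between
    cells isolate the effect of modality or language rather than item
    difficulty.
    """
    return {
        cell: sum(score_item(it) for it in items) / len(items)
        for cell, items in benchmark.items()
    }

if __name__ == "__main__":
    for (modality, lang), score in sorted(grid_evaluation(parallel_benchmark).items()):
        print(f"{modality:>6} | {lang} | {score:.2f}")
```

Holding the item set fixed while varying one axis at a time is what enables the controlled ablations described above.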
Two prompt variants for robustness evaluation

The authors design two benchmark variants with different prompting strategies: MCIFfix uses fixed prompts per macro-task, while MCIFmix samples from diverse prompt pools. This enables direct measurement of model generalization and robustness to instruction reformulation under semantic equivalence; the sketch after this entry contrasts the two regimes.

Retrieved papers: 10 · Verdict: Can Refute
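As a rough illustration of the two regimes, the sketch below contrasts a fixed per-macro-task prompt with sampling from a pool of paraphrases. The templates, pool contents, and function names are invented for illustration; they are not the actual MCIF prompts.

```python
import random

# One fixed template per macro-task (the MCIFfix regime).
FIXED_PROMPTS = {
    "translation": "Translate the talk into {lang}.",
    "summarization": "Summarize the talk.",
}

# Several semantically equivalent paraphrases per macro-task (the MCIFmix regime).
PROMPT_POOLS = {
    "translation": [
        "Translate the talk into {lang}.",
        "Provide a {lang} translation of the talk.",
        "Render the content of the talk in {lang}.",
    ],
    "summarization": [
        "Summarize the talk.",
        "Give a concise summary of the talk.",
        "Write a brief overview of the talk.",
    ],
}

def get_prompt(macro_task, variant, rng, lang="German"):
    """Return a prompt for one item under the 'fix' or 'mix' regime."""
    if variant == "fix":
        template = FIXED_PROMPTS[macro_task]
    elif variant == "mix":
        template = rng.choice(PROMPT_POOLS[macro_task])
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    return template.format(lang=lang)

if __name__ == "__main__":
    rng = random.Random(0)
    # A score gap between the two regimes indicates sensitivity to
    # instruction rephrasing under semantic equivalence.
    print(get_prompt("translation", "fix", rng))
    print(get_prompt("translation", "mix", rng))
```

Scoring the same model under both regimes and comparing the results is what turns prompt variation into a robustness measurement.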

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: MCIF benchmark for multimodal crosslingual instruction following

Contribution: Parallel design enabling systematic evaluation across dimensions

Contribution: Two prompt variants for robustness evaluation
