MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: benchmark, crosslingual, multimodal, instruction-following, speech, video
Abstract:

Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs' abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLM development. MCIF is released under the CC-BY 4.0 license to promote open research.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MCIF, a human-annotated benchmark for evaluating multimodal crosslingual instruction following across speech, vision, and text modalities in four languages. Within the taxonomy, MCIF resides in the 'Instruction Following Assessment Frameworks' leaf alongside one sibling paper (MaXIFE). This leaf contains only two papers total, suggesting a relatively sparse research direction focused specifically on instruction-following evaluation rather than general multimodal multilingual benchmarking. The taxonomy shows sixteen papers across the entire field, with MCIF occupying a specialized niche that emphasizes structured assessment of instruction execution rather than model development or domain-specific applications.

The taxonomy reveals that MCIF's immediate neighbors include 'General Multimodal Multilingual Benchmarks' (Pangea, CCFQA) and 'Domain-Specific Evaluation Benchmarks' covering web understanding, factuality detection, and cultural knowledge. While general benchmarks evaluate diverse vision-language tasks without instruction-following constraints, MCIF explicitly targets how models interpret and execute instructions across modalities and languages. Domain-specific branches address specialized contexts like web comprehension or cultural grounding, whereas MCIF focuses on scientific talks spanning NLP and related fields. The taxonomy's scope notes clarify that instruction-following frameworks differ from general task benchmarks by requiring explicit instruction adherence rather than open-ended multimodal reasoning.

Thirty candidate papers were examined in total, ten per claimed contribution. For the core MCIF benchmark contribution, none of the ten papers reviewed offers a clear refutation, suggesting potential novelty in its specific combination of crosslingual instruction following with scientific-talk content. However, the parallel design enabling systematic evaluation across dimensions encountered one refutable candidate among its ten, and the two prompt variants for robustness evaluation encountered four among theirs. These statistics indicate that while the overall benchmark concept may be distinctive, certain methodological elements, particularly prompt variation strategies, have substantial precedent within the limited search scope. The analysis does not claim exhaustive coverage but reflects patterns among top-ranked semantic matches.

Based on the limited literature search of thirty candidates, MCIF appears to occupy a specialized position combining instruction-following assessment with scientific domain content and human annotation across four languages. The taxonomy context suggests this direction remains relatively unexplored compared to broader multimodal benchmarking efforts. However, the contribution-level statistics reveal that specific design choices around parallel evaluation and prompt robustness have more established precedents, warranting careful positioning relative to existing methodological frameworks within the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: multimodal crosslingual instruction following evaluation. This field examines how vision-language models handle instructions across diverse languages, combining visual understanding with broad language coverage. The taxonomy divides into two main branches: Benchmark Design and Construction, which focuses on creating evaluation frameworks and datasets that probe crosslingual multimodal capabilities, and Model Development Approaches, which explores architectural innovations and training strategies to improve multilingual vision-language performance. Works in the benchmark branch, such as Pangea[1], CCFQA[2], and WebMMU[3], emphasize rigorous assessment protocols that span multiple languages and modalities, while the model development side includes efforts like xCoT[4] and Cross-lingual Visual Comprehension[5] that tackle reasoning and representation learning. Together, these branches reflect a maturing effort to move beyond English-centric evaluation toward systems that generalize across the world's languages.

Recent activity highlights contrasting priorities: some studies concentrate on broad language coverage and cultural diversity, as seen in the M5 Benchmark[7] and M4U[8], while others target specific reasoning challenges or domain applications such as In-Vehicle Voice Control[12]. MCIF[0] sits within the Instruction Following Assessment Frameworks cluster, sharing close ties with MaXIFE[9]; both emphasize structured evaluation of how models interpret and execute multimodal instructions in non-English contexts. Compared to neighbors like MaXIFE[9], MCIF[0] appears to prioritize systematic crosslingual benchmarking rather than purely model-centric innovation.

Open questions persist around balancing language diversity with annotation quality, handling low-resource languages, and ensuring that evaluation metrics capture culturally grounded visual reasoning beyond surface-level translation.

Claimed Contributions

MCIF benchmark for multimodal crosslingual instruction following

The authors introduce MCIF, a novel benchmark designed to evaluate instruction-following capabilities of multimodal LLMs across languages and modalities. It covers three modalities (text, speech, video), four languages (English, German, Italian, Chinese), and 13 tasks organized into four macro-tasks, with both short- and long-form contexts.

Retrieved papers: 10 · Verdict: no clear refutation found

Parallel design enabling systematic evaluation across dimensions

The benchmark is fully parallel across modalities and languages, allowing controlled ablation studies and systematic evaluation of how models handle different input modalities, languages, and context lengths. This design supports direct comparison of model performance across these dimensions; the sketch after this entry illustrates the idea.

Retrieved papers: 10 · Verdict: Can Refute
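To make the payoff of this parallel design concrete, below is a minimal sketch of grid-style evaluation over aligned (modality, language) cells. The data layout, field names, and scoring stub are hypothetical illustrations, not MCIF's actual release format or evaluation harness.

```python
from itertools import product

MODALITIES = ("text", "speech", "video")
LANGUAGES = ("en", "de", "it", "zh")

# Hypothetical parallel benchmark: every (modality, language) cell holds
# the same underlying items, identified by a shared item_id.
parallel_benchmark = {
    (m, l): [{"item_id": i, "input": f"{m}/{l}/item{i}"} for i in range(3)]
    for m, l in product(MODALITIES, LANGUAGES)
}

def score_item(item):
    # Stand-in for a real model call plus a task metric (e.g., BLEU, COMET).
    return len(item["input"]) % 2

def grid_evaluation(benchmark):
    """Score every cell of the modality x language grid.

    Because item_ids are aligned across cells, score differences between
    cells isolate the effect of modality or language rather than item
    difficulty.
    """
    return {
        cell: sum(score_item(it) for it in items) / len(items)
        for cell, items in benchmark.items()
    }

if __name__ == "__main__":
    for (modality, lang), score in sorted(grid_evaluation(parallel_benchmark).items()):
        print(f"{modality:>6} | {lang} | {score:.2f}")
```

Holding the item set fixed while varying one axis at a time is what enables the controlled ablations described above.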
Two prompt variants for robustness evaluation

The authors design two benchmark variants with different prompting strategies: MCIFfix uses fixed prompts per macro-task, while MCIFmix samples from diverse prompt pools. This enables direct measurement of model generalization and robustness to instruction reformulation under semantic equivalence; the sketch after this entry contrasts the two regimes.

Retrieved papers: 10 · Verdict: Can Refute
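As a rough illustration of the two regimes, the sketch below contrasts a fixed per-macro-task prompt with sampling from a pool of paraphrases. The templates, pool contents, and function names are invented for illustration; they are not the actual MCIF prompts.

```python
import random

# One fixed template per macro-task (the MCIFfix regime).
FIXED_PROMPTS = {
    "translation": "Translate the talk into {lang}.",
    "summarization": "Summarize the talk.",
}

# Several semantically equivalent paraphrases per macro-task (the MCIFmix regime).
PROMPT_POOLS = {
    "translation": [
        "Translate the talk into {lang}.",
        "Provide a {lang} translation of the talk.",
        "Render the content of the talk in {lang}.",
    ],
    "summarization": [
        "Summarize the talk.",
        "Give a concise summary of the talk.",
        "Write a brief overview of the talk.",
    ],
}

def get_prompt(macro_task, variant, rng, lang="German"):
    """Return a prompt for one item under the 'fix' or 'mix' regime."""
    if variant == "fix":
        template = FIXED_PROMPTS[macro_task]
    elif variant == "mix":
        template = rng.choice(PROMPT_POOLS[macro_task])
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    return template.format(lang=lang)

if __name__ == "__main__":
    rng = random.Random(0)
    # A score gap between the two regimes indicates sensitivity to
    # instruction rephrasing under semantic equivalence.
    print(get_prompt("translation", "fix", rng))
    print(get_prompt("translation", "mix", rng))
```

Scoring the same model under both regimes and comparing the results is what turns prompt variation into a robustness measurement.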

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: MCIF benchmark for multimodal crosslingual instruction following

Contribution: Parallel design enabling systematic evaluation across dimensions

Contribution: Two prompt variants for robustness evaluation
