RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: visual culture understanding, cultural benchmark, multimodal retrieval-augmented generation
Abstract:

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 unique Wikipedia documents curated and ranked by human annotators. Through extensive evaluation of seven multimodal retrievers and fifteen VLMs, RAVENEA yields several previously unreported findings: (i) in general, cultural grounding annotations can enhance multimodal retrieval and the corresponding downstream tasks; (ii) lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% on cVQA and 6.2% on cIC); (iii) performance varies widely across countries, with culture-aware retrieval-augmented VLMs showing more stable results in Korean and Chinese contexts than in others. These findings highlight critical limitations of current multimodal retrievers and VLMs, and underscore the need for stronger retrieval-augmented visual culture understanding. RAVENEA can serve as a foundational tool for advancing the study of retrieval-augmented visual culture understanding in multimodal AI.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RAVENEA, a benchmark for retrieval-augmented visual culture understanding, comprising culture-focused visual question answering and culture-informed image captioning tasks with over 10,000 curated Wikipedia documents. Within the taxonomy, it resides in the 'Visual Culture Understanding Benchmarks' leaf under 'Multimodal Retrieval and Benchmarking'. This leaf contains only two papers in total: RAVENEA itself and one sibling (Event Image Challenge), indicating a relatively sparse research direction. The broader taxonomy encompasses 40 papers across multiple branches, suggesting that while cultural visual understanding is an active area, dedicated benchmarking efforts remain limited.

The taxonomy reveals that RAVENEA's parent branch (Multimodal Retrieval and Benchmarking) is distinct from neighboring areas like 'Cultural Heritage Information Systems' (20 papers) and 'Retrieval-Augmented Generation Frameworks for Vision' (6 papers). While heritage systems focus on organizing artifacts and RAG frameworks develop general architectures, RAVENEA bridges these by providing evaluation infrastructure specifically for retrieval-augmented cultural interpretation. The scope note clarifies that general visual benchmarks without cultural focus belong elsewhere, positioning RAVENEA at the intersection of cultural grounding and multimodal retrieval assessment. This placement suggests the work addresses an underserved niche between application-focused heritage systems and evaluation-focused general benchmarks.

Among 30 candidates examined across three contributions, none were found to clearly refute any component. The RAVENEA benchmark itself (10 candidates examined, 0 refutable) appears novel within this limited search scope, as does the Culture-Aware Contrastive learning framework (10 candidates, 0 refutable) and RegionScore metric (10 candidates, 0 refutable). The absence of refutable prior work across all contributions, combined with the sparse benchmark leaf containing only one sibling paper, suggests these contributions occupy relatively unexplored territory. However, this assessment is constrained by the top-30 semantic search scope and does not constitute exhaustive coverage of all potentially relevant prior work.

Based on the limited literature search, RAVENEA appears to introduce distinct evaluation infrastructure for a nascent research direction. The sparse benchmark category and zero refutable candidates across contributions indicate novelty within the examined scope, though the small search scale (30 papers) means potentially relevant work outside top semantic matches may exist. The taxonomy structure confirms that dedicated benchmarks for retrieval-augmented cultural understanding remain rare compared to broader heritage or RAG research.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: retrieval-augmented visual culture understanding. This emerging field combines multimodal retrieval with cultural knowledge to interpret images, artworks, and heritage artifacts in context. The taxonomy reveals five main branches:

- Retrieval-Augmented Generation Frameworks for Vision develops general-purpose architectures that integrate external knowledge into vision-language models, exemplified by surveys like RAG Vision Survey[1] and systems such as Video-RAG[9] and RegionRAG[10].
- Cultural Heritage and Artistic Understanding focuses on domain-specific applications for museums, historical artifacts, and traditional art forms, with works like ArtRAG[2] and Chinese Heritage Retrieval[12] addressing specialized cultural content.
- Cross-Cultural and Multilingual Visual Understanding tackles diversity and representation challenges through datasets like AraTraditions10k[13] and systems such as Hakka Chatbots[6].
- Multimodal Retrieval and Benchmarking establishes evaluation frameworks and datasets to measure cultural comprehension capabilities.
- Specialized Visual Generation and Retrieval Applications explores targeted use cases, from tombstone parsing to sign language recognition.

Recent activity highlights tensions between general-purpose retrieval frameworks and culturally grounded approaches. While broad systems like DIR Captioning[3] and Evcap[11] aim for scalable image understanding, works such as FolkRAG[4] and ValuesRAG[5] emphasize culturally specific knowledge bases that capture nuanced traditions and values. RAVENEA[0] sits within the benchmarking cluster alongside Event Image Challenge[16], contributing evaluation resources for assessing how well models handle culturally rich visual content. Compared to neighboring benchmarks, RAVENEA emphasizes retrieval-augmented approaches in which external cultural knowledge enhances interpretation, contrasting with purely end-to-end vision-language evaluation.
Key open questions include how to balance retrieval efficiency with cultural depth, whether general frameworks can adequately capture regional specificity, and how to construct representative benchmarks that avoid perpetuating cultural biases while enabling meaningful progress measurement.

Claimed Contributions

RAVENEA benchmark for multimodal retrieval-augmented visual culture understanding

The authors introduce RAVENEA, the first benchmark explicitly designed to evaluate how well vision-language models use external knowledge for visual culture understanding. It spans eight countries and eleven categories, linking images to human-ranked Wikipedia documents for two tasks: culture-focused visual question answering and culture-informed image captioning.

10 retrieved papers
Culture-Aware Contrastive (CAC) learning framework

The authors propose Culture-Aware Contrastive learning, a supervised learning framework that enhances cultural awareness in multimodal retrieval by incorporating culture-targeted annotations. This framework is compatible with both CLIP and SigLIP architectures and demonstrates marked gains in retrieval accuracy.

10 retrieved papers
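As an illustration of the culture-aware contrastive objective described above, the sketch below shows one plausible reading, written in NumPy for clarity. The assumptions here are not taken from the paper's text: image and document features are presumed to come from a CLIP- or SigLIP-style encoder, and the human relevance rankings are presumed to supply, for each image, the culturally relevant document in the batch as its positive. The loss itself is then a standard symmetric InfoNCE; the "culture-aware" part lies in choosing positives from culture-targeted annotations rather than generic web pairings.

```python
import numpy as np

def cac_infonce(img_emb, doc_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is a positive pair."""
    # L2-normalise embeddings, as in CLIP-style training.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    doc = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = img @ doc.T / temperature  # (B, B) image-document similarities

    def cross_entropy(l):
        # Row-wise log-softmax; the diagonal entries are the positives.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-document and document-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
doc = img + 0.01 * rng.normal(size=(4, 8))  # near-identical positive pairs
print(cac_infonce(img, doc))                # small loss: positives dominate
```

Because the positives are nearly identical to their images, the diagonal similarities dominate the softmax and the loss is close to zero; mismatched pairings would drive it toward log(B).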
RegionScore metric for evaluating cultural relevance in image captions

The authors introduce RegionScore, a novel evaluation metric that quantifies the extent to which generated captions reference specific geopolitical regions. This metric addresses the mismatch between automatic metrics and human judgments of cultural appropriateness in image captioning tasks.

10 retrieved papers
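The RegionScore description above suggests a simple shape for the metric, sketched below under an assumption not confirmed by the text here: that the score is the fraction of generated captions mentioning the image's ground-truth region (or an alias of it). The function name, the `aliases` parameter, and the substring-matching strategy are all illustrative choices, not the paper's actual formulation.

```python
def region_score(captions, regions, aliases=None):
    """Fraction of captions that mention their image's target region.

    captions: list of generated caption strings
    regions:  ground-truth region names, aligned with captions
    aliases:  optional dict mapping a region to extra surface forms,
              e.g. {"Korea": ["Korean"]}
    """
    aliases = aliases or {}
    hits = 0
    for caption, region in zip(captions, regions):
        surface_forms = [region] + aliases.get(region, [])
        text = caption.lower()
        # Count a hit if any surface form of the region appears.
        if any(form.lower() in text for form in surface_forms):
            hits += 1
    return hits / len(captions) if captions else 0.0


captions = [
    "A traditional Korean hanbok worn during a festival.",
    "A bowl of noodle soup on a wooden table.",
]
regions = ["Korea", "China"]
print(region_score(captions, regions, aliases={"Korea": ["Korean"]}))  # 0.5
```

The second caption scores zero because it never names its region, which is exactly the kind of culturally unanchored output such a metric would penalize.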

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RAVENEA benchmark for multimodal retrieval-augmented visual culture understanding


Contribution

Culture-Aware Contrastive (CAC) learning framework


Contribution

RegionScore metric for evaluating cultural relevance in image captions
