RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: visual culture understanding, cultural benchmark, multimodal retrieval-augmented generation
Abstract:

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 unique Wikipedia documents curated and ranked by human annotators. Through extensive evaluation of seven multimodal retrievers and fifteen VLMs, RAVENEA yields several previously unreported findings: (i) in general, cultural grounding annotations can enhance multimodal retrieval and the corresponding downstream tasks; (ii) lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% on cVQA and 6.2% on cIC); (iii) performance varies widely across countries, with culture-aware retrieval-augmented VLMs showing more stable results in Korean and Chinese contexts than in others. These findings highlight critical limitations of current multimodal retrievers and VLMs, and underscore the need for stronger retrieval-augmented visual culture understanding. RAVENEA can serve as a foundational tool for advancing the study of retrieval-augmented visual culture understanding in multimodal AI.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RAVENEA, a benchmark for retrieval-augmented visual culture understanding, comprising culture-focused visual question answering and culture-informed image captioning tasks with over 10,000 curated Wikipedia documents. Within the taxonomy, it resides in the 'Visual Culture Understanding Benchmarks' leaf under 'Multimodal Retrieval and Benchmarking'. This leaf contains only two papers in total: RAVENEA itself and one sibling (Event Image Challenge), indicating a relatively sparse research direction. The broader taxonomy encompasses 40 papers across multiple branches, suggesting that while cultural visual understanding is an active area, dedicated benchmarking efforts remain limited.

The taxonomy reveals that RAVENEA's parent branch (Multimodal Retrieval and Benchmarking) is distinct from neighboring areas like 'Cultural Heritage Information Systems' (20 papers) and 'Retrieval-Augmented Generation Frameworks for Vision' (6 papers). While heritage systems focus on organizing artifacts and RAG frameworks develop general architectures, RAVENEA bridges these by providing evaluation infrastructure specifically for retrieval-augmented cultural interpretation. The scope note clarifies that general visual benchmarks without cultural focus belong elsewhere, positioning RAVENEA at the intersection of cultural grounding and multimodal retrieval assessment. This placement suggests the work addresses an underserved niche between application-focused heritage systems and evaluation-focused general benchmarks.

Among 30 candidates examined across three contributions, none were found to clearly refute any component. The RAVENEA benchmark itself (10 candidates examined, 0 refutable) appears novel within this limited search scope, as does the Culture-Aware Contrastive learning framework (10 candidates, 0 refutable) and RegionScore metric (10 candidates, 0 refutable). The absence of refutable prior work across all contributions, combined with the sparse benchmark leaf containing only one sibling paper, suggests these contributions occupy relatively unexplored territory. However, this assessment is constrained by the top-30 semantic search scope and does not constitute exhaustive coverage of all potentially relevant prior work.

Based on the limited literature search, RAVENEA appears to introduce distinct evaluation infrastructure for a nascent research direction. The sparse benchmark category and zero refutable candidates across contributions indicate novelty within the examined scope, though the small search scale (30 papers) means potentially relevant work outside top semantic matches may exist. The taxonomy structure confirms that dedicated benchmarks for retrieval-augmented cultural understanding remain rare compared to broader heritage or RAG research.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: retrieval-augmented visual culture understanding. This emerging field combines multimodal retrieval with cultural knowledge to interpret images, artworks, and heritage artifacts in context. The taxonomy reveals five main branches:

- Retrieval-Augmented Generation Frameworks for Vision develops general-purpose architectures that integrate external knowledge into vision-language models, exemplified by surveys like RAG Vision Survey[1] and systems such as Video-RAG[9] and RegionRAG[10].
- Cultural Heritage and Artistic Understanding focuses on domain-specific applications for museums, historical artifacts, and traditional art forms, with works like ArtRAG[2] and Chinese Heritage Retrieval[12] addressing specialized cultural content.
- Cross-Cultural and Multilingual Visual Understanding tackles diversity and representation challenges through datasets like AraTraditions10k[13] and systems such as Hakka Chatbots[6].
- Multimodal Retrieval and Benchmarking establishes evaluation frameworks and datasets to measure cultural comprehension capabilities.
- Specialized Visual Generation and Retrieval Applications explores targeted use cases, from tombstone parsing to sign language recognition.

Recent activity highlights tensions between general-purpose retrieval frameworks and culturally grounded approaches. While broad systems like DIR Captioning[3] and Evcap[11] aim for scalable image understanding, works such as FolkRAG[4] and ValuesRAG[5] emphasize culturally specific knowledge bases that capture nuanced traditions and values. RAVENEA[0] sits within the benchmarking cluster alongside Event Image Challenge[16], contributing evaluation resources for assessing how well models handle culturally rich visual content. Compared to neighboring benchmarks, RAVENEA emphasizes retrieval-augmented approaches in which external cultural knowledge enhances interpretation, contrasting with purely end-to-end vision-language evaluation.
Key open questions include how to balance retrieval efficiency with cultural depth, whether general frameworks can adequately capture regional specificity, and how to construct representative benchmarks that avoid perpetuating cultural biases while enabling meaningful progress measurement.

Claimed Contributions

RAVENEA benchmark for multimodal retrieval-augmented visual culture understanding

The authors introduce RAVENEA, the first benchmark explicitly designed to evaluate how well vision-language models use external knowledge for visual culture understanding. It spans eight countries and eleven categories, linking images to human-ranked Wikipedia documents for two tasks: culture-focused visual question answering and culture-informed image captioning.

10 retrieved papers
Culture-Aware Contrastive (CAC) learning framework

The authors propose Culture-Aware Contrastive learning, a supervised learning framework that enhances cultural awareness in multimodal retrieval by incorporating culture-targeted annotations. This framework is compatible with both CLIP and SigLIP architectures and demonstrates marked gains in retrieval accuracy.

10 retrieved papers
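As an illustration of the culture-aware contrastive objective described above, the sketch below shows one plausible reading, written in NumPy for clarity. The assumptions here are not taken from the paper's text: image and document features are presumed to come from a CLIP- or SigLIP-style encoder, and the human relevance rankings are presumed to supply, for each image, the culturally relevant document in the batch as its positive. The loss itself is then a standard symmetric InfoNCE; the "culture-aware" part lies in choosing positives from culture-targeted annotations rather than generic web pairings.

```python
import numpy as np

def cac_infonce(img_emb, doc_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is a positive pair."""
    # L2-normalise embeddings, as in CLIP-style training.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    doc = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = img @ doc.T / temperature  # (B, B) image-document similarities

    def cross_entropy(l):
        # Row-wise log-softmax; the diagonal entries are the positives.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-document and document-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
doc = img + 0.01 * rng.normal(size=(4, 8))  # near-identical positive pairs
print(cac_infonce(img, doc))                # small loss: positives dominate
```

Because the positives are nearly identical to their images, the diagonal similarities dominate the softmax and the loss is close to zero; mismatched pairings would drive it toward log(B).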
RegionScore metric for evaluating cultural relevance in image captions

The authors introduce RegionScore, a novel evaluation metric that quantifies the extent to which generated captions reference specific geopolitical regions. This metric addresses the mismatch between automatic metrics and human judgments of cultural appropriateness in image captioning tasks.

10 retrieved papers
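The RegionScore description above suggests a simple shape for the metric, sketched below under an assumption not confirmed by the text here: that the score is the fraction of generated captions mentioning the image's ground-truth region (or an alias of it). The function name, the `aliases` parameter, and the substring-matching strategy are all illustrative choices, not the paper's actual formulation.

```python
def region_score(captions, regions, aliases=None):
    """Fraction of captions that mention their image's target region.

    captions: list of generated caption strings
    regions:  ground-truth region names, aligned with captions
    aliases:  optional dict mapping a region to extra surface forms,
              e.g. {"Korea": ["Korean"]}
    """
    aliases = aliases or {}
    hits = 0
    for caption, region in zip(captions, regions):
        surface_forms = [region] + aliases.get(region, [])
        text = caption.lower()
        # Count a hit if any surface form of the region appears.
        if any(form.lower() in text for form in surface_forms):
            hits += 1
    return hits / len(captions) if captions else 0.0


captions = [
    "A traditional Korean hanbok worn during a festival.",
    "A bowl of noodle soup on a wooden table.",
]
regions = ["Korea", "China"]
print(region_score(captions, regions, aliases={"Korea": ["Korean"]}))  # 0.5
```

The second caption scores zero because it never names its region, which is exactly the kind of culturally unanchored output such a metric would penalize.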

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RAVENEA benchmark for multimodal retrieval-augmented visual culture understanding


Contribution

Culture-Aware Contrastive (CAC) learning framework


Contribution

RegionScore metric for evaluating cultural relevance in image captions
