Abstract:

Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure among concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions under which GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at https://anonymous.4open.science/r/GraphRAG-Benchmark-CE8D/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes GraphRAG-Bench, a comprehensive benchmark for evaluating graph retrieval-augmented generation across multiple task types and difficulty levels. It sits within the 'Comprehensive Multi-Dimensional Benchmarks' leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of 26 papers across the field, suggesting that systematic, multi-dimensional benchmarking of GraphRAG remains an emerging area. The sibling papers in this leaf include works examining when to use graphs and providing in-depth analysis of GraphRAG performance, indicating a shared focus on understanding GraphRAG effectiveness rather than proposing new architectures.

The taxonomy reveals neighboring research directions in domain-specific evaluation, question generation for difficulty calibration, and various retrieval optimization strategies. The paper's position in benchmark design distinguishes it from adjacent branches focused on graph construction methods, adaptive retrieval techniques, and reasoning architectures. While the field shows substantial activity in retrieval strategies (with adaptive, multi-hop, and query processing subcategories) and domain applications, the comprehensive benchmarking cluster remains small. This positioning suggests the work addresses a recognized gap: the need for standardized evaluation frameworks that can systematically compare GraphRAG against traditional RAG across varying task complexities.

Among 28 candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. For the GraphRAG-Bench benchmark itself, 10 candidates were examined with no refutable prior work identified. Similarly, the systematic investigation of when GraphRAG outperforms traditional RAG examined 9 candidates without finding overlapping work, and the multi-stage evaluation framework examined 9 candidates with the same result. These statistics suggest that within the limited search scope, the specific combination of comprehensive benchmarking, task complexity analysis, and pipeline-level evaluation appears relatively novel, though the search scale of 28 papers means substantial prior work outside this scope cannot be ruled out.

Based on the limited literature search of 28 candidates, the work appears to occupy a relatively underexplored niche within GraphRAG evaluation. The sparse population of its taxonomy leaf and absence of clearly overlapping work among examined candidates suggest potential novelty, though this assessment is constrained by the top-K semantic search methodology. A more exhaustive review of the broader RAG benchmarking literature would be needed to definitively assess originality.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating graph retrieval-augmented generation effectiveness across task complexity levels.

The field has organized itself around five main branches that reflect the lifecycle of graph-based RAG systems. Benchmark Design and Evaluation Frameworks establish standardized testbeds for measuring performance, with works like When to use Graphs[0], In-depth Analysis[2], and Graphrag-bench[7] providing comprehensive multi-dimensional assessments. Graph Construction and Knowledge Representation addresses how structured knowledge is extracted and organized, while Retrieval Strategies and Optimization explores methods for efficiently navigating these structures, including approaches like Hyper-RAG[9] and Hierarchical Lexical Graph[10]. Reasoning and Generation Architectures focuses on how retrieved graph information is integrated into language model outputs, and Domain-Specific Applications demonstrates practical deployments in areas such as medical question answering (Medical Hallucinations[13]), education (Exam Question Creation[11]), and industrial settings (Electric Power Support[17]).

A particularly active tension exists between general-purpose benchmarking efforts and specialized retrieval techniques. Works like Benchmarking RAG Pipelines[16] and Beyond Vector Retrieval[15] examine fundamental trade-offs in retrieval paradigms, while adaptive methods such as AdaGCRAG[14] and Adaptive Schema[6] explore dynamic graph construction strategies.

The original paper When to use Graphs[0] sits squarely within the comprehensive benchmarking cluster alongside In-depth Analysis[2] and Graphrag-bench[7], but distinguishes itself by systematically investigating when graph-based approaches outperform simpler alternatives across varying task complexity.
Where Graphrag-bench[7] emphasizes breadth of evaluation scenarios and In-depth Analysis[2] provides detailed performance breakdowns, When to use Graphs[0] focuses on the decision boundary itself—helping practitioners understand the conditions under which the added complexity of graph structures yields measurable benefits over traditional retrieval methods.

Claimed Contributions

GraphRAG-Bench benchmark for evaluating graph retrieval-augmented generation

The authors introduce GraphRAG-Bench, a novel benchmark that systematically evaluates GraphRAG systems through tasks of increasing difficulty (fact retrieval, complex reasoning, contextual summarization, creative generation), comprehensive corpora with varying information density, and systematic evaluation across the entire pipeline from graph construction to generation.

10 retrieved papers
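To make the "tasks of increasing difficulty" design concrete, the four task levels can be scored and aggregated per level. The sketch below is illustrative only, not GraphRAG-Bench's actual scoring code; the `Task` record, the exact-match metric, and all names are hypothetical assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical benchmark record; the real GraphRAG-Bench schema may differ."""
    question: str
    answer: str
    level: str  # one of: "fact", "reasoning", "summarization", "creative"

def exact_match(pred: str, gold: str) -> float:
    """Binary exact-match score after light normalization (a stand-in metric)."""
    return float(pred.strip().lower() == gold.strip().lower())

def score_by_level(tasks, predictions):
    """Aggregate accuracy separately for each difficulty level."""
    totals, hits = {}, {}
    for task, pred in zip(tasks, predictions):
        totals[task.level] = totals.get(task.level, 0) + 1
        hits[task.level] = hits.get(task.level, 0.0) + exact_match(pred, task.answer)
    return {lvl: hits[lvl] / totals[lvl] for lvl in totals}

# Toy usage: two fact-retrieval tasks, one answered correctly.
tasks = [
    Task("Who wrote Hamlet?", "Shakespeare", "fact"),
    Task("Who wrote Hamlet?", "Shakespeare", "fact"),
]
preds = ["Shakespeare", "Marlowe"]
print(score_by_level(tasks, preds))  # {'fact': 0.5}
```

Reporting per-level rather than pooled accuracy is what lets a benchmark like this expose where graph structure starts to pay off (e.g., reasoning vs. simple fact lookup).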
Systematic investigation of when GraphRAG outperforms traditional RAG

Using the GraphRAG-Bench benchmark, the authors conduct a comprehensive analysis to identify specific scenarios and conditions under which GraphRAG provides measurable benefits over vanilla RAG systems, providing practical guidelines for applying GraphRAG effectively.

9 retrieved papers
Multi-stage evaluation framework for GraphRAG pipeline

The authors develop a holistic evaluation methodology that assesses GraphRAG systems at each stage of the pipeline, including graph quality metrics, retrieval performance measures, and generation accuracy, rather than treating the system as a black box focused only on final outputs.

9 retrieved papers
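A stage-wise evaluation of this kind can be pictured as one metric per pipeline stage rather than a single end-to-end score. The following is a minimal sketch under assumed metric choices (entity coverage for graph quality, recall@k for retrieval, answer accuracy for generation); the metric names and function signatures are illustrative, not the paper's actual framework.

```python
def entity_coverage(graph_entities, gold_entities):
    """Graph-quality proxy: fraction of gold entities present in the built graph."""
    if not gold_entities:
        return 1.0
    return len(set(graph_entities) & set(gold_entities)) / len(set(gold_entities))

def recall_at_k(retrieved, relevant, k):
    """Retrieval metric: fraction of relevant items appearing in the top-k list."""
    if not relevant:
        return 1.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))

def pipeline_report(graph_entities, gold_entities, retrieved, relevant, answers_correct):
    """Collect one metric per stage instead of a single black-box score."""
    return {
        "graph/entity_coverage": entity_coverage(graph_entities, gold_entities),
        "retrieval/recall@5": recall_at_k(retrieved, relevant, k=5),
        "generation/accuracy": sum(answers_correct) / len(answers_correct),
    }

# Toy usage with made-up IDs.
report = pipeline_report(
    graph_entities=["einstein", "relativity"],
    gold_entities=["einstein", "photoelectric_effect"],
    retrieved=["doc1", "doc2", "doc3"],
    relevant=["doc2", "doc9"],
    answers_correct=[True, False, True, True],
)
print(report)
```

Separating the stages like this is what allows a failure to be attributed to a poorly built graph versus a weak retriever versus the generator itself.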

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GraphRAG-Bench benchmark for evaluating graph retrieval-augmented generation

The authors introduce GraphRAG-Bench, a novel benchmark that systematically evaluates GraphRAG systems through tasks of increasing difficulty (fact retrieval, complex reasoning, contextual summarization, creative generation), comprehensive corpora with varying information density, and systematic evaluation across the entire pipeline from graph construction to generation.

Contribution

Systematic investigation of when GraphRAG outperforms traditional RAG

Using the GraphRAG-Bench benchmark, the authors conduct a comprehensive analysis to identify specific scenarios and conditions under which GraphRAG provides measurable benefits over vanilla RAG systems, providing practical guidelines for applying GraphRAG effectively.

Contribution

Multi-stage evaluation framework for GraphRAG pipeline

The authors develop a holistic evaluation methodology that assesses GraphRAG systems at each stage of the pipeline, including graph quality metrics, retrieval performance measures, and generation accuracy, rather than treating the system as a black box focused only on final outputs.

When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation | Novelty Validation