Search Arena: Analyzing Search-Augmented LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models, Web Search, Human-AI Interaction
Abstract:

Search-augmented language models combine web search with Large Language Models (LLMs) to improve the groundedness and freshness of responses. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations and the types of cited sources, even when the cited content does not directly support the associated claims, uncovering a gap between perceived and actual credibility. To assess cross-setting performance, we conduct cross-arena analyses, testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade, and may even improve, performance in non-search settings; in contrast, quality in search-intensive settings degrades significantly when models rely solely on parametric knowledge. We open-source the dataset to support future research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Search Arena, a large-scale crowd-sourced dataset of over 24,000 paired multi-turn interactions with search-augmented LLMs, accompanied by around 12,000 human preference votes. It resides in the Human Preference Collection and Analysis leaf, which contains only three papers total. This leaf sits within the broader Evaluation and Benchmarking of Search-Augmented LLMs branch, indicating a moderately sparse research direction focused specifically on collecting and analyzing human judgments rather than automated evaluation or reward modeling.

The taxonomy reveals that neighboring leaves address complementary evaluation concerns: Automated Evaluation with LLMs uses models as judges, while Reward Modeling and Benchmarking develops preference-based optimization frameworks. The broader Evaluation and Benchmarking branch is distinct from Preference Alignment Methods, which focuses on training-time or decoding-time optimization rather than measurement. Search Arena's emphasis on multi-turn, diverse-intent interactions differentiates it from single-turn factual QA systems in the Search-Augmented QA branch, though both share an interest in grounding and freshness.

Among the 30 candidates examined, none clearly refutes the three core contributions. For the Search Arena dataset contribution, 10 candidates were examined with zero refutable overlaps, suggesting limited prior work on large-scale, multi-turn preference datasets for search-augmented LLMs. Similarly, the preference-analysis and cross-arena-evaluation contributions were each compared against 10 candidates without finding substantial overlap. Within this limited scope, no existing work among the top-30 semantic matches appears to provide the same combination of scale, multi-turn structure, and cross-setting analysis.

Based on the limited literature search, the work appears to occupy a relatively underexplored niche within human preference collection for search-augmented systems. The taxonomy structure confirms that this leaf is sparse compared to more crowded areas like training-time alignment or domain-specific applications. However, the analysis covers only top-30 semantic candidates and does not exhaustively survey all related preference datasets or evaluation frameworks, leaving open the possibility of relevant work outside this scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: analyzing search-augmented large language models with human preferences. The field has coalesced around several major branches that reflect both technical and application-oriented concerns. Preference Alignment Methods for LLMs explores how to steer model behavior using human feedback, often through techniques such as pairwise comparisons (Pairwise Preference Alignment[2]) or dual optimization strategies (Dual Preference Alignment[6]). Evaluation and Benchmarking of Search-Augmented LLMs focuses on measuring factuality, freshness, and user satisfaction, with works such as FreshLLMs[1] and Long-form Factuality[4] establishing rigorous testbeds. Search-Augmented QA and Information Retrieval Systems examines how retrieval components integrate with generation, exemplified by WebGLM[9] and MindSearch[16], while Domain-Specific Applications adapts these architectures to specialized contexts such as emergency ICD coding (Emergency ICD Coding[20]) and manufacturing knowledge sharing (Manufacturing Knowledge Sharing[21]). Retrieval and Generation Enhancement Techniques investigates indexing, query rewriting, and grounding strategies, and Comparative Studies and Methodological Evaluations provides cross-cutting analyses of trade-offs and best practices.

A particularly active line of work centers on collecting and leveraging human preference signals to guide both retrieval and generation. Search Arena[0] sits squarely within the Human Preference Collection and Analysis cluster, emphasizing large-scale preference elicitation to understand what users value in search-augmented responses. This contrasts with neighboring efforts such as Predict Searcher Preferences[11], which models user intent from behavioral cues, and Review Systems Evaluation[24], which assesses how well existing platforms capture nuanced quality judgments.

Meanwhile, alignment-driven approaches such as Alignment Reward-Guided Search[12] and Agentic Reward Modeling[13] use preference data to optimize retrieval policies directly, highlighting an ongoing tension between offline preference collection and online policy refinement. Across these branches, open questions persist around the scalability of human annotation, the generalization of learned preferences, and the interplay between retrieval accuracy and generation fluency.

Claimed Contributions

Search Arena dataset

The authors release the first large-scale human-preference dataset containing 24,000 conversations with search-augmented LLMs, including 12,000 preference votes, system metadata, user intents, and prompt topics. The dataset spans diverse intents, over 70 languages, and multi-turn interactions, addressing limitations of prior static, single-turn, fact-checking benchmarks.

10 retrieved papers
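For intuition, each battle in an arena-style preference dataset pairs two anonymized responses with one human vote. The sketch below uses a hypothetical record layout (the field names model_a, model_b, and winner are illustrative, not the released schema) to show how per-model win rates could be tallied from such votes.

```python
from collections import Counter

# Hypothetical battle records: two anonymized models, one human vote.
# Field names are illustrative, not the released dataset's schema.
battles = [
    {"model_a": "search-llm-1", "model_b": "llm-2", "winner": "model_a"},
    {"model_a": "llm-2", "model_b": "search-llm-1", "winner": "model_b"},
    {"model_a": "search-llm-1", "model_b": "llm-2", "winner": "tie"},
]

def win_rates(battles):
    """Fraction of battles each model wins (ties count as non-wins)."""
    wins, games = Counter(), Counter()
    for b in battles:
        games[b["model_a"]] += 1
        games[b["model_b"]] += 1
        if b["winner"] in ("model_a", "model_b"):
            wins[b[b["winner"]]] += 1
    return {m: wins[m] / games[m] for m in games}

print(win_rates(battles))  # search-llm-1 wins 2 of its 3 battles
```

Real arena pipelines additionally handle both-bad votes, multi-turn context, and per-language or per-intent slices, but the pairing-plus-vote structure is the core unit.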
Analysis of human preferences for search-augmented LLMs

The authors conduct the first systematic analysis of how response characteristics such as reasoning, citation count, citation sources, and citation attribution interact with user preferences in search-augmented settings. They reveal that users are influenced by citation presence even when citations do not support claims, uncovering a gap between perceived and actual credibility.

10 retrieved papers
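The citation effect described above can be probed with a simple conditional statistic: among battles where the two responses cite different numbers of sources, how often does the more-cited side win? A toy sketch over invented rows (the paper's actual analysis presumably controls for many more response features):

```python
# Invented (citation count, vote) rows; purely illustrative data,
# not drawn from the Search Arena release.
rows = [
    {"cites_a": 5, "cites_b": 1, "winner": "model_a"},
    {"cites_a": 0, "cites_b": 3, "winner": "model_b"},
    {"cites_a": 4, "cites_b": 2, "winner": "model_b"},
    {"cites_a": 2, "cites_b": 6, "winner": "model_b"},
]

def more_citations_win_rate(rows):
    """Among battles with unequal citation counts, the fraction won
    by the side that cited more sources."""
    wins = total = 0
    for r in rows:
        if r["cites_a"] == r["cites_b"]:
            continue  # no citation-count signal in this battle
        total += 1
        more_cited = "model_a" if r["cites_a"] > r["cites_b"] else "model_b"
        wins += r["winner"] == more_cited
    return wins / total

print(more_citations_win_rate(rows))  # 3 of 4 unequal battles -> 0.75
```

A rate well above 0.5 on such a statistic would be consistent with the perceived-credibility gap the authors report, though a causal claim requires controlling for confounds like response length and style.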
Cross-arena evaluation of search and non-search models

The authors perform the first cross-setting evaluation by deploying search-augmented and non-search models in both search-intensive and general-purpose chat environments. They find that web search augmentation does not degrade and may improve performance in non-search settings, while relying solely on parametric knowledge significantly hurts performance in search-intensive settings.

10 retrieved papers
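Arena-style preference votes are commonly summarized with Bradley-Terry ratings fit to the pairwise outcomes, one plausible way to compute per-setting leaderboards for a cross-arena comparison like this. A minimal numpy sketch with invented win counts (not results from the paper):

```python
import numpy as np

# Invented pairwise win counts among three systems: wins[i, j] is how
# often model i beat model j. These numbers are NOT from the paper.
models = ["search-llm-a", "search-llm-b", "no-search-llm"]
wins = np.array([[0.0, 6.0, 9.0],
                 [4.0, 0.0, 8.0],
                 [1.0, 2.0, 0.0]])

def bradley_terry(wins, iters=500, lr=0.05):
    """Fit Bradley-Terry strengths by gradient ascent on the
    log-likelihood of the observed pairwise outcomes."""
    n = wins.shape[0]
    theta = np.zeros(n)
    for _ in range(iters):
        # p[i, j] = P(model i beats model j) = sigmoid(theta_i - theta_j)
        p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))
        grad = (wins * (1.0 - p)).sum(axis=1) - (wins.T * p).sum(axis=1)
        theta += lr * grad
        theta -= theta.mean()  # strengths are identified only up to a shift
    return theta

for name, score in zip(models, bradley_terry(wins)):
    print(f"{name}: {score:+.2f}")
```

Fitting the same model separately on search-intensive and general-chat battles would yield one rating per (model, setting) cell, which is the comparison the cross-arena finding summarizes.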

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution
Search Arena dataset

Contribution
Analysis of human preferences for search-augmented LLMs

Contribution
Cross-arena evaluation of search and non-search models