Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Machine Learning Evaluation, Benchmark Datasets, Robustness in NLP, Large Language Models (LLMs), Generative AI, Human–AI Alignment, Ethical Considerations in ML
Abstract:

Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. However, existing resources for benchmarking the detectability of AI text in the domain of peer review are lacking. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written fully by humans and those written by different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. To support future research and reproducibility, we will publicly release our dataset upon publication.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a large-scale dataset of 788,984 AI-written peer reviews paired with human reviews from ICLR and NeurIPS, alongside benchmark evaluations of 18 detection algorithms and a novel context-aware method called Anchor. It resides in the Peer Review Process Detection leaf, which contains eight papers—a moderately populated cluster within the broader Application Contexts branch. This leaf specifically targets AI-generated content in reviewer feedback, distinguishing it from manuscript detection or educational plagiarism contexts. The concentration of sibling papers suggests active research interest in safeguarding the peer review process against AI misuse.

The taxonomy reveals neighboring leaves addressing Manuscript and Publication Detection (five papers) and Educational and Plagiarism Detection (five papers), indicating that peer review detection is part of a larger ecosystem examining AI text across academic workflows. The Detection Methods and Tools branch, with four sub-leaves totaling 18 papers, provides the algorithmic foundation that application-focused studies like this one build upon. The scope_note for Peer Review Process Detection explicitly excludes manuscript content and general publishing workflows, positioning this work at the intersection of detection methodology and a specific high-stakes academic context where reviewer integrity is paramount.

Of the 24 candidate papers examined in total, the dataset contribution faced one refuting candidate among its 10 matches, and the benchmark evaluation likewise found one overlapping work among its 10. The context-aware Anchor method, examined against four candidates, showed no clear prior refutation. These statistics suggest that while the dataset and benchmark contributions face some precedent within the limited search scope, the context-aware detection approach appears more distinctive. The analysis does not claim exhaustive coverage; rather, it reflects top-K semantic matches and citation expansion, leaving open the possibility of additional relevant work beyond this sample.

Given the limited search scope of 24 candidates, the work appears to occupy a moderately explored niche within peer review integrity. The dataset scale and multi-conference coverage may differentiate it from prior efforts, though the benchmark evaluation aligns with established practices in the Detection Methods branch. The Anchor method's manuscript-aware design represents a plausible incremental advance, contingent on how substantially it diverges from existing context-sensitive approaches not captured in this search.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
2 Refutable Papers

Research Landscape Overview

Core task: AI text detection in peer review. The field has coalesced around three main branches that reflect both technical and societal dimensions. Detection Methods and Tools encompasses the development and evaluation of algorithms—ranging from statistical classifiers to deep learning approaches—that aim to distinguish human-written from machine-generated text, with works like Testing of detection tools[1] and Detection of GPT-4 generated[2] examining performance across diverse datasets. Application Contexts and Use Cases focuses on deploying these detectors in real-world scenarios, particularly within academic publishing and peer review workflows, where studies such as Monitoring ai-modified content at[5] and MixRevDetect[7] explore how detection systems can be integrated into editorial pipelines. Integrity and Ethics addresses the broader implications of AI-generated content for scholarly communication, including questions of authorship, trust, and the potential for misuse, as seen in discussions around tortured phrases[22] and global retractions[33].

Recent work has highlighted tensions between detection accuracy and fairness, with some studies noting that detectors may exhibit biases against non-native English speakers or struggle with mixed human-AI text. Within the Application Contexts branch, a dense cluster of papers examines peer review specifically, investigating both the prevalence of AI-generated reviews and the feasibility of automated screening. Is Your Paper Being[0] sits squarely in this cluster, focusing on detecting AI-generated content within the peer review process itself. It shares thematic ground with Detecting LLM-generated peer reviews[18] and ReviewGuard[28], which similarly target review text, yet differs in emphasis from broader editorial monitoring efforts like Monitoring ai-modified content at[5].
Meanwhile, works such as Personal experience with AI-generated[12] and "Quis custodiet ipsos custodes"[30] raise critical questions about who oversees the detectors and how their deployment might reshape academic gatekeeping, underscoring that technical solutions alone cannot resolve the evolving challenges of AI in scholarly communication.

Claimed Contributions

Large-scale dataset of AI-written and human-written peer reviews

The authors introduce the largest dataset to date for studying AI text detection in peer review, containing 788,984 reviews covering 8 years of submissions to ICLR and NeurIPS conferences. The dataset pairs human-written reviews with AI-generated reviews from five state-of-the-art LLMs for the same papers.

10 retrieved papers (Can Refute)
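As a concrete picture of how such paper-level pairing might be organized, here is a minimal sketch; the class name, field names, and the "gpt-4o" source label are illustrative assumptions, not the authors' released schema:

```python
from dataclasses import dataclass

@dataclass
class ReviewRecord:
    paper_id: str   # submission identifier (hypothetical format)
    venue: str      # "ICLR" or "NeurIPS"
    year: int       # submission year (8 years covered per venue)
    source: str     # "human" or the name of the generating LLM
    text: str       # full review text

# One paper yields a human-written review plus one AI-generated
# review per LLM, all sharing the same paper_id:
records = [
    ReviewRecord("paper-001", "ICLR", 2024, "human", "The method is sound but..."),
    ReviewRecord("paper-001", "ICLR", 2024, "gpt-4o", "This paper presents..."),
]
human = [r for r in records if r.source == "human"]
ai = [r for r in records if r.source != "human"]
```

Grouping by `paper_id` lets a detector be evaluated on matched human/AI pairs for the same manuscript rather than on unrelated texts.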
Benchmark evaluation of AI text detection methods for peer review

The authors conduct the first comprehensive benchmark of 18 existing AI text detection methods on the task of identifying LLM-generated peer reviews at the individual review level, revealing that most methods struggle to achieve robust detection while maintaining low false positive rates.

10 retrieved papers (Can Refute)
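Evaluating detectors under a low false-positive constraint, as the benchmark above does, amounts to measuring the true-positive rate at a fixed false-positive rate. A minimal sketch with synthetic detector scores (the score distributions are illustrative, not results from the paper):

```python
import numpy as np

def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """True-positive rate on AI-written reviews while at most
    max_fpr of human-written reviews are falsely flagged.
    Scores are 'AI-likeness' values: higher = more likely AI."""
    human_scores = np.asarray(human_scores)
    ai_scores = np.asarray(ai_scores)
    # Threshold chosen so that only the top max_fpr fraction of
    # human reviews would score above it.
    threshold = np.quantile(human_scores, 1.0 - max_fpr)
    return float(np.mean(ai_scores > threshold))

# Illustrative, synthetic scores (not real detector output):
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)  # human-written reviews
ai = rng.normal(2.0, 1.0, 10_000)     # AI-generated reviews
print(f"TPR at 1% FPR: {tpr_at_fpr(human, ai, 0.01):.3f}")
```

Even with well-separated score distributions, the achievable TPR drops sharply as the allowed FPR tightens, which is why individual-review detection is hard under realistic operating constraints.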
Context-aware detection method leveraging manuscript content

The authors introduce Anchor, a novel detection approach specifically designed for peer review that leverages the manuscript being reviewed as additional context. The method compares semantic similarity between a test review and synthetically generated anchor reviews for the same paper, achieving superior performance under strict false positive rate constraints.

4 retrieved papers
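The Anchor idea described above can be sketched as follows. Bag-of-words cosine similarity stands in for whatever semantic embedding the method actually uses, and all reviews below are made up for illustration:

```python
import math
import re
from collections import Counter

def _tokens(text):
    """Lowercase word tokens; punctuation stripped."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor_score(test_review, anchor_reviews):
    """Maximum similarity between a test review and LLM-generated
    'anchor' reviews of the same paper; a higher score suggests the
    test review is itself AI-generated."""
    test = _tokens(test_review)
    return max(_cosine(test, _tokens(a)) for a in anchor_reviews)

# Made-up reviews for illustration only.
anchors = [
    "The paper proposes a large dataset of AI reviews and benchmarks detectors.",
    "This work benchmarks detection methods on a new peer-review dataset.",
]
ai_like = "The paper introduces a large dataset and benchmarks detection methods."
human_like = "Interesting idea, but the ablations are thin and the baselines seem unfair."
```

The key design choice is that the anchors are generated for the same manuscript, so an AI-written test review tends to echo their content, while a human review with idiosyncratic concerns scores lower.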

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale dataset of AI-written and human-written peer reviews


Contribution

Benchmark evaluation of AI text detection methods for peer review


Contribution

Context-aware detection method leveraging manuscript content
