Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
Overview
Overall Novelty Assessment
The paper contributes a large-scale dataset of 788,984 AI-written peer reviews paired with human reviews from ICLR and NeurIPS, alongside benchmark evaluations of 18 detection algorithms and a novel context-aware method called Anchor. It resides in the Peer Review Process Detection leaf, which contains eight papers—a moderately populated cluster within the broader Application Contexts branch. This leaf specifically targets AI-generated content in reviewer feedback, distinguishing it from manuscript detection or educational plagiarism contexts. The concentration of sibling papers suggests active research interest in safeguarding the peer review process against AI misuse.
The taxonomy reveals neighboring leaves addressing Manuscript and Publication Detection (five papers) and Educational and Plagiarism Detection (five papers), indicating that peer review detection is part of a larger ecosystem examining AI text across academic workflows. The Detection Methods and Tools branch, with four sub-leaves totaling 18 papers, provides the algorithmic foundation that application-focused studies like this one build upon. The scope_note for Peer Review Process Detection explicitly excludes manuscript content and general publishing workflows, positioning this work at the intersection of detection methodology and a specific high-stakes academic context where reviewer integrity is paramount.
Of the 24 candidates examined in total, the dataset contribution faced one potentially refuting work among its 10 matches, and the benchmark evaluation likewise found one overlapping work among its 10. The context-aware Anchor method, examined against four candidates, showed no clear prior refutation. These statistics suggest that while the dataset and benchmark contributions face some precedent within the limited search scope, the context-aware detection approach appears more distinctive. The analysis does not claim exhaustive coverage; rather, it reflects top-K semantic matches and citation expansion, leaving open the possibility of additional relevant work beyond this sample.
Given the limited search scope of 24 candidates, the work appears to occupy a moderately explored niche within peer review integrity. The dataset scale and multi-conference coverage may differentiate it from prior efforts, though the benchmark evaluation aligns with established practices in the Detection Methods branch. The Anchor method's manuscript-aware design represents a plausible incremental advance, contingent on how substantially it diverges from existing context-sensitive approaches not captured in this search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce the largest dataset to date for studying AI text detection in peer review, containing 788,984 reviews covering 8 years of submissions to ICLR and NeurIPS conferences. The dataset pairs human-written reviews with AI-generated reviews from five state-of-the-art LLMs for the same papers.
The authors conduct the first comprehensive benchmark of 18 existing AI text detection methods on the task of identifying LLM-generated peer reviews at the individual review level, revealing that most methods struggle to achieve robust detection while maintaining low false positive rates.
The authors introduce Anchor, a novel detection approach specifically designed for peer review that leverages the manuscript being reviewed as additional context. The method compares semantic similarity between a test review and synthetically generated anchor reviews for the same paper, achieving superior performance under strict false positive rate constraints.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Monitoring AI-modified content at scale: A case study on the impact of ChatGPT on AI conference peer reviews
[7] MixRevDetect: Towards Detecting AI-Generated Content in Hybrid Peer Reviews
[12] Personal experience with AI-generated peer reviews: a case study
[15] Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
[18] Detecting LLM-generated peer reviews
[28] ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation
[30] “Quis custodiet ipsos custodes?” Who will watch the watchmen? On Detecting AI-generated peer-reviews
Contribution Analysis
Detailed comparisons for each claimed contribution
Large-scale dataset of AI-written and human-written peer reviews
The authors introduce the largest dataset to date for studying AI text detection in peer review, containing 788,984 reviews covering 8 years of submissions to ICLR and NeurIPS conferences. The dataset pairs human-written reviews with AI-generated reviews from five state-of-the-art LLMs for the same papers.
[55] Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews
[30] “Quis custodiet ipsos custodes?” Who will watch the watchmen? On Detecting AI-generated peer-reviews
[56] Is This Abstract Generated by AI? A Research for the Gap between AI-generated Scientific Text and Human-written Scientific Text
[57] Detection of AI-generated essays in writing assessments
[58] Gen-Review: A Dataset and Large-scale Study of AI-Generated and Human-Authored Peer Reviews
[59] How Artificial Intelligence Differs From Humans in Peer Review
[60] Evaluating science: A comparison of human and AI reviewers
[61] Revolutionizing Peer Review: A Comparative Analysis of ChatGPT and Human Review Reports in Scientific Publishing
[62] Performance of artificial intelligence content detectors using human and artificial intelligence-generated scientific writing
[63] Failure to apply standard limit-of-detection or limit-of-quantitation criteria to specialized pro-resolving mediator analysis incorrectly characterizes their presence in …
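The dataset contribution described above pairs a human-written review with AI-generated reviews from several LLMs for the same paper. A minimal sketch of how such paired records might be structured follows; the field names, the example model name "gpt-4o", and the schema itself are hypothetical illustrations, not the paper's actual format.

```python
# Hypothetical record type for a human/AI paired-review dataset.
# Field names and the example model name are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ReviewPair:
    paper_id: str                # submission identifier (e.g. an OpenReview ID)
    venue: str                   # conference and year, e.g. "ICLR 2024"
    human_review: str            # review written by an assigned human reviewer
    ai_reviews: dict = field(default_factory=dict)  # model name -> generated review

# Example usage with placeholder content.
pair = ReviewPair(
    paper_id="abc123",
    venue="ICLR 2024",
    human_review="The method is sound, but the baselines are weak.",
)
pair.ai_reviews["gpt-4o"] = "This paper presents a novel approach to ..."
```

Keying the AI reviews by model name makes it easy to evaluate detectors per generator, which a multi-LLM dataset of this kind would require.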
Benchmark evaluation of AI text detection methods for peer review
The authors conduct the first comprehensive benchmark of 18 existing AI text detection methods on the task of identifying LLM-generated peer reviews at the individual review level, revealing that most methods struggle to achieve robust detection while maintaining low false positive rates.
[15] Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
[1] Testing of detection tools for AI-generated text
[3] AI tool detects LLM-generated text in research papers and peer reviews
[5] Monitoring AI-modified content at scale: A case study on the impact of ChatGPT on AI conference peer reviews
[7] MixRevDetect: Towards Detecting AI-Generated Content in Hybrid Peer Reviews
[14] An empirical study of AI generated text detection tools
[64] Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text
[65] A survey on detection of LLMs-generated content
[66] Accuracy and Reliability of AI-Generated Text Detection Tools: A Literature Review
[67] Artificial writing and automated detection
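The benchmark's emphasis on robust detection at low false positive rates corresponds to the standard TPR-at-fixed-FPR evaluation: calibrate a threshold on human-review scores so that at most a small fraction are flagged, then measure how many AI reviews exceed it. The sketch below illustrates this metric on synthetic detector scores; the score distributions and the 1% FPR budget are assumptions for illustration, not the paper's results.

```python
# Sketch of TPR-at-fixed-FPR evaluation on synthetic detector scores.
# Higher scores mean "more likely AI-generated"; distributions are assumed.
import numpy as np

def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """TPR achievable while the FPR on human reviews stays at or below max_fpr."""
    human = np.sort(np.asarray(human_scores))
    # Pick the threshold so that at most max_fpr of human scores lie above it.
    k = int(np.ceil(len(human) * (1 - max_fpr))) - 1
    threshold = human[min(k, len(human) - 1)]
    tpr = float(np.mean(np.asarray(ai_scores) > threshold))
    return tpr, float(threshold)

rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)  # synthetic scores on human reviews
ai = rng.normal(2.0, 1.0, 10_000)     # synthetic scores on AI reviews
tpr, thr = tpr_at_fpr(human, ai, max_fpr=0.01)
```

Even with well-separated score distributions, a strict 1% FPR budget leaves a large fraction of AI reviews undetected, which is consistent with the benchmark's finding that most detectors struggle under such constraints.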
Context-aware detection method leveraging manuscript content
The authors introduce Anchor, a novel detection approach specifically designed for peer review that leverages the manuscript being reviewed as additional context. The method compares semantic similarity between a test review and synthetically generated anchor reviews for the same paper, achieving superior performance under strict false positive rate constraints.
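The decision rule described above can be sketched as follows: generate anchor reviews for the manuscript, embed them alongside the test review, and flag reviews that sit unusually close to the anchors. This is a minimal stand-in, not the authors' implementation: a real system would prompt an LLM with the manuscript to produce the anchors and use a learned sentence encoder, whereas here `embed` is a toy bag-of-words vectorizer and the anchor texts are hypothetical placeholders.

```python
# Toy sketch of an anchor-similarity score; the embedding and anchor texts
# are illustrative stand-ins for an LLM-generated, encoder-based pipeline.
from collections import Counter
import math

def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor_score(test_review, anchor_reviews):
    """Mean similarity of the test review to the synthetic anchor reviews.
    High similarity suggests the review may itself be LLM-generated."""
    t = embed(test_review)
    return sum(cosine(t, embed(a)) for a in anchor_reviews) / len(anchor_reviews)

# Hypothetical anchors an LLM might produce for one manuscript.
anchors = [
    "The paper proposes a novel method and the experiments are thorough.",
    "The paper proposes an interesting method but evaluation is limited.",
]
llm_like = "The paper proposes a novel method but the evaluation is limited."
human_like = "Figure 3 contradicts Table 2; please rerun the ablation on CIFAR."
```

Under this rule, a generic LLM-style review scores closer to the anchors than a specific human critique, and thresholding the score (chosen to respect an FPR budget) yields the detection decision.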