Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Machine Learning Evaluation, Benchmark Datasets, Robustness in NLP, Large Language Models (LLMs), Generative AI, Human–AI Alignment, Ethical Considerations in ML
Abstract:

Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. However, existing resources for benchmarking the detectability of AI text in the domain of peer review are lacking. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written fully by humans and those written by different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. To support future research and reproducibility, we will publicly release our dataset upon publication.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a large-scale dataset of 788,984 AI-written peer reviews paired with human reviews from ICLR and NeurIPS, alongside benchmark evaluations of 18 detection algorithms and a novel context-aware method called Anchor. It resides in the Peer Review Process Detection leaf, which contains eight papers—a moderately populated cluster within the broader Application Contexts branch. This leaf specifically targets AI-generated content in reviewer feedback, distinguishing it from manuscript detection or educational plagiarism contexts. The concentration of sibling papers suggests active research interest in safeguarding the peer review process against AI misuse.

The taxonomy reveals neighboring leaves addressing Manuscript and Publication Detection (five papers) and Educational and Plagiarism Detection (five papers), indicating that peer review detection is part of a larger ecosystem examining AI text across academic workflows. The Detection Methods and Tools branch, with four sub-leaves totaling 18 papers, provides the algorithmic foundation that application-focused studies like this one build upon. The scope_note for Peer Review Process Detection explicitly excludes manuscript content and general publishing workflows, positioning this work at the intersection of detection methodology and a specific high-stakes academic context where reviewer integrity is paramount.

Of the 24 candidate papers examined in total, the dataset contribution faced one refuting candidate among its 10 matches, and the benchmark evaluation likewise found one overlapping work among its 10. The context-aware Anchor method, examined against four candidates, showed no clear prior refutation. These statistics suggest that while the dataset and benchmark contributions face some precedent within the limited search scope, the context-aware detection approach appears more distinctive. The analysis does not claim exhaustive coverage; rather, it reflects top-K semantic matches and citation expansion, leaving open the possibility of additional relevant work beyond this sample.

Given the limited search scope of 24 candidates, the work appears to occupy a moderately explored niche within peer review integrity. The dataset scale and multi-conference coverage may differentiate it from prior efforts, though the benchmark evaluation aligns with established practices in the Detection Methods branch. The Anchor method's manuscript-aware design represents a plausible incremental advance, contingent on how substantially it diverges from existing context-sensitive approaches not captured in this search.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
2 Refutable Papers

Research Landscape Overview

Core task: AI text detection in peer review. The field has coalesced around three main branches that reflect both technical and societal dimensions. Detection Methods and Tools encompasses the development and evaluation of algorithms—ranging from statistical classifiers to deep learning approaches—that aim to distinguish human-written from machine-generated text, with works like Testing of detection tools[1] and Detection of GPT-4 generated[2] examining performance across diverse datasets. Application Contexts and Use Cases focuses on deploying these detectors in real-world scenarios, particularly within academic publishing and peer review workflows, where studies such as Monitoring ai-modified content at[5] and MixRevDetect[7] explore how detection systems can be integrated into editorial pipelines. Integrity and Ethics addresses the broader implications of AI-generated content for scholarly communication, including questions of authorship, trust, and the potential for misuse, as seen in discussions around tortured phrases[22] and global retractions[33].

Recent work has highlighted tensions between detection accuracy and fairness, with some studies noting that detectors may exhibit biases against non-native English speakers or struggle with mixed human-AI text. Within the Application Contexts branch, a dense cluster of papers examines peer review specifically, investigating both the prevalence of AI-generated reviews and the feasibility of automated screening. Is Your Paper Being[0] sits squarely in this cluster, focusing on detecting AI-generated content within the peer review process itself. It shares thematic ground with Detecting LLM-generated peer reviews[18] and ReviewGuard[28], which similarly target review text, yet differs in emphasis from broader editorial monitoring efforts like Monitoring ai-modified content at[5].
Meanwhile, works such as Personal experience with AI-generated[12] and "Quis custodiet ipsos custodes"[30] raise critical questions about who oversees the detectors and how their deployment might reshape academic gatekeeping, underscoring that technical solutions alone cannot resolve the evolving challenges of AI in scholarly communication.

Claimed Contributions

Large-scale dataset of AI-written and human-written peer reviews

The authors introduce the largest dataset to date for studying AI text detection in peer review, containing 788,984 reviews covering 8 years of submissions to ICLR and NeurIPS conferences. The dataset pairs human-written reviews with AI-generated reviews from five state-of-the-art LLMs for the same papers.

10 retrieved papers (Can Refute)
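As a concrete picture of how such paper-level pairing might be organized, here is a minimal sketch; the class name, field names, and the "gpt-4o" source label are illustrative assumptions, not the authors' released schema:

```python
from dataclasses import dataclass

@dataclass
class ReviewRecord:
    paper_id: str   # submission identifier (hypothetical format)
    venue: str      # "ICLR" or "NeurIPS"
    year: int       # submission year (8 years covered per venue)
    source: str     # "human" or the name of the generating LLM
    text: str       # full review text

# One paper yields a human-written review plus one AI-generated
# review per LLM, all sharing the same paper_id:
records = [
    ReviewRecord("paper-001", "ICLR", 2024, "human", "The method is sound but..."),
    ReviewRecord("paper-001", "ICLR", 2024, "gpt-4o", "This paper presents..."),
]
human = [r for r in records if r.source == "human"]
ai = [r for r in records if r.source != "human"]
```

Grouping by `paper_id` lets a detector be evaluated on matched human/AI pairs for the same manuscript rather than on unrelated texts.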
Benchmark evaluation of AI text detection methods for peer review

The authors conduct the first comprehensive benchmark of 18 existing AI text detection methods on the task of identifying LLM-generated peer reviews at the individual review level, revealing that most methods struggle to achieve robust detection while maintaining low false positive rates.

10 retrieved papers (Can Refute)
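Evaluating detectors under a low false-positive constraint, as the benchmark above does, amounts to measuring the true-positive rate at a fixed false-positive rate. A minimal sketch with synthetic detector scores (the score distributions are illustrative, not results from the paper):

```python
import numpy as np

def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """True-positive rate on AI-written reviews while at most
    max_fpr of human-written reviews are falsely flagged.
    Scores are 'AI-likeness' values: higher = more likely AI."""
    human_scores = np.asarray(human_scores)
    ai_scores = np.asarray(ai_scores)
    # Threshold chosen so that only the top max_fpr fraction of
    # human reviews would score above it.
    threshold = np.quantile(human_scores, 1.0 - max_fpr)
    return float(np.mean(ai_scores > threshold))

# Illustrative, synthetic scores (not real detector output):
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)  # human-written reviews
ai = rng.normal(2.0, 1.0, 10_000)     # AI-generated reviews
print(f"TPR at 1% FPR: {tpr_at_fpr(human, ai, 0.01):.3f}")
```

Even with well-separated score distributions, the achievable TPR drops sharply as the allowed FPR tightens, which is why individual-review detection is hard under realistic operating constraints.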
Context-aware detection method leveraging manuscript content

The authors introduce Anchor, a novel detection approach specifically designed for peer review that leverages the manuscript being reviewed as additional context. The method compares semantic similarity between a test review and synthetically generated anchor reviews for the same paper, achieving superior performance under strict false positive rate constraints.

4 retrieved papers
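The Anchor idea described above can be sketched as follows. Bag-of-words cosine similarity stands in for whatever semantic embedding the method actually uses, and all reviews below are made up for illustration:

```python
import math
import re
from collections import Counter

def _tokens(text):
    """Lowercase word tokens; punctuation stripped."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor_score(test_review, anchor_reviews):
    """Maximum similarity between a test review and LLM-generated
    'anchor' reviews of the same paper; a higher score suggests the
    test review is itself AI-generated."""
    test = _tokens(test_review)
    return max(_cosine(test, _tokens(a)) for a in anchor_reviews)

# Made-up reviews for illustration only.
anchors = [
    "The paper proposes a large dataset of AI reviews and benchmarks detectors.",
    "This work benchmarks detection methods on a new peer-review dataset.",
]
ai_like = "The paper introduces a large dataset and benchmarks detection methods."
human_like = "Interesting idea, but the ablations are thin and the baselines seem unfair."
```

The key design choice is that the anchors are generated for the same manuscript, so an AI-written test review tends to echo their content, while a human review with idiosyncratic concerns scores lower.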

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale dataset of AI-written and human-written peer reviews


Contribution

Benchmark evaluation of AI text detection methods for peer review


Contribution

Context-aware detection method leveraging manuscript content
