GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Jailbreak Attacks, Evaluation System, Benchmark
Abstract:

Despite the growing interest in jailbreaks as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. Through a systematic measurement study of 37 jailbreak studies published since 2022, we find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about jailbreak effectiveness and safety implications. In this paper, we introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset and GuidedEval, an evaluation system integrated with detailed case-by-case evaluation guidelines. Experiments demonstrate that GuidedBench offers more accurate evaluations of jailbreak performance, enabling meaningful comparisons across methods. GuidedEval reduces inter-evaluator variance by at least 76.03%, ensuring reliable and reproducible evaluations. We reveal why existing jailbreak benchmarks fail to evaluate accurately and suggest better evaluation practices.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GuidedBench, a benchmark comprising a curated harmful question dataset and GuidedEval, a guideline-based evaluation system for assessing jailbreak attacks. It resides in the 'Jailbreak Evaluation Frameworks' leaf alongside four sibling papers: JailbreakEval, Rethinking Jailbreak Evaluation, AttackEval, and GuardVal. This leaf is part of the broader 'Evaluation Methodologies and Benchmarks' branch, which contains three leaves totaling eleven papers. The evaluation frameworks cluster is moderately populated, indicating active but not overcrowded research interest in standardizing jailbreak assessment protocols.

The taxonomy reveals that evaluation methodologies sit between attack techniques (twenty-nine papers across four leaves) and defense strategies (six papers across two leaves). GuidedBench's neighboring leaves include 'Benchmark Datasets and Taxonomies' (two papers) and 'General Adversarial Robustness Assessment' (five papers). The scope notes clarify that evaluation frameworks focus on measurement protocols and reliability, while benchmark datasets emphasize standardized testbeds. GuidedBench bridges these by combining dataset curation with evaluation guidelines, connecting to concerns about reproducibility raised in the general robustness assessment cluster.

Among the thirty candidates examined, ten were compared against each claimed contribution. The curated harmful question dataset shows two refutable candidates out of its ten, the guideline-based evaluation system shows zero, and the systematic measurement study shows two, for a total of four refutable papers (matching the taxonomy summary below). The limited search scope means these statistics reflect top-thirty semantic matches, not exhaustive coverage. Within this constrained search, the evaluation system component appears more novel, while the dataset and measurement study face more substantial overlap with prior work among the examined candidates.

Based on the limited thirty-candidate search, GuidedBench appears to occupy established evaluation territory with incremental refinements. The guideline-based evaluation system shows stronger novelty signals than the dataset or measurement components within examined candidates. However, the moderate density of the evaluation frameworks leaf and the presence of methodologically similar siblings suggest the work extends rather than redefines existing assessment paradigms. These observations are bounded by the search scope and may not capture all relevant prior work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: evaluating the effectiveness of large language model jailbreak attacks. The field has crystallized around three major branches that together capture the adversarial lifecycle of LLM safety. The first branch, Jailbreak Attack Methods and Techniques, encompasses diverse strategies ranging from automated prompt optimization approaches like GPTFuzzer[17] and AutoDAN[36] to multilingual exploits such as Multilingual Jailbreak[8] and cross-modal attacks including JailbreakV[22]. The second branch, Evaluation Methodologies and Benchmarks, addresses the critical need for standardized assessment frameworks, with works like JailbreakBench[20] and AttackEval[12] establishing systematic protocols for measuring attack success rates and model vulnerabilities. The third branch, Defense Strategies and Mitigation, explores countermeasures such as Goal Prioritization Defense[13] and Chain of Thought Defense[46], creating a dynamic interplay between offensive and defensive research directions.

Recent work reveals a tension between attack sophistication and evaluation rigor. While many studies develop increasingly subtle techniques, from indirect prompt injection methods like Indirect Prompt Injection[1] to multi-round conversational exploits such as Multi-round Jailbreak[26] and Crescendo Attack[49], a growing cluster questions whether existing benchmarks adequately capture real-world threat models.

GuidedBench[0] situates itself within the evaluation frameworks cluster alongside JailbreakEval[25] and Rethinking Jailbreak Evaluation[33], emphasizing the need for more nuanced assessment beyond binary success metrics. Where AttackEval[12] focuses on standardizing attack measurement and GuardVal[30] examines defense validation, GuidedBench[0] appears to bridge methodological gaps by providing guided evaluation protocols that account for contextual factors often overlooked in simpler benchmarks. In doing so, it addresses concerns raised by works like Rethinking Jailbreak Evaluation[33] about the ecological validity of current testing paradigms.

Claimed Contributions

GuidedBench benchmark with curated harmful question dataset

The authors construct a benchmark containing 200 carefully curated harmful questions (180 core, 20 additional) organized into 20 topic categories. Each question is filtered to ensure that it is refused by LLMs when asked without a jailbreak, is directly malicious, and is structurally answerable, addressing defects in existing datasets. A minimal sketch of these three filters appears after this entry.

10 retrieved papers · Can Refute
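To make the three curation filters concrete, the following Python sketch shows them as composable gates. This is an illustration under stated assumptions, not the authors' pipeline: the `ask_model`, `is_malicious`, and `is_answerable` callables and the refusal markers are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical refusal markers; a crude stand-in for a real refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Return True if the model response looks like a refusal (illustrative)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def passes_curation(question: str,
                    ask_model: Callable[[str], str],
                    is_malicious: Callable[[str], bool],
                    is_answerable: Callable[[str], bool]) -> bool:
    """Keep a candidate question only if it clears all three filters:
    (1) the target LLM refuses it when asked directly, with no jailbreak;
    (2) it is directly malicious, not merely sensitive;
    (3) it is structurally answerable, i.e., a concrete answer could exist.
    """
    refused_without_jailbreak = is_refusal(ask_model(question))
    return (refused_without_jailbreak
            and is_malicious(question)
            and is_answerable(question))
```

In practice the maliciousness and answerability checks would themselves be human or LLM judgments; the point of the sketch is that a candidate enters the benchmark only after clearing all three gates.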
GuidedEval guideline-based evaluation system

The authors develop an evaluation framework that provides case-specific guidelines identifying the key entities and actions a successful jailbreak response should contain. This shifts evaluation from subjective judgment to objective verification of guideline-defined scoring points, reducing evaluator variance by at least 76.03%. An illustrative scoring sketch follows this entry.

10 retrieved papers · No refutable candidates among those examined
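The following sketch illustrates guideline-based scoring under stated assumptions: each question carries a list of case-specific scoring points (key entities and actions), and a response is scored by the fraction of points it covers. The `GuidelineCase` dataclass and the substring-based `point_is_covered` matcher are hypothetical stand-ins; the paper's actual verification of scoring points may rely on an LLM judge rather than string matching.

```python
from dataclasses import dataclass

@dataclass
class GuidelineCase:
    question: str
    scoring_points: list[str]  # entities/actions a successful answer must contain

def point_is_covered(response: str, point: str) -> bool:
    """Placeholder check: substring matching stands in for a semantic judge."""
    return point.lower() in response.lower()

def guided_score(response: str, case: GuidelineCase) -> float:
    """Fraction of guideline-defined scoring points present in the response.

    Turns the subjective question of whether the jailbreak succeeded into an
    objective per-point verification against the case's guideline.
    """
    if not case.scoring_points:
        return 0.0
    hits = sum(point_is_covered(response, p) for p in case.scoring_points)
    return hits / len(case.scoring_points)

# Illustrative usage with made-up scoring points (question text redacted).
case = GuidelineCase(
    question="(redacted harmful question)",
    scoring_points=["precursor chemical", "synthesis step", "required equipment"],
)
print(guided_score("...obtain the precursor chemical and required equipment...", case))
# 2 of 3 points covered -> 0.666...
```

Because every point is checked independently, evaluators can disagree only on individual, narrowly scoped checks rather than on a holistic success verdict, which is one plausible mechanism behind the reported variance reduction.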
Systematic measurement study of jailbreak evaluation discrepancies

The authors analyze 37 jailbreak papers to identify significant discrepancies among their evaluation systems. They demonstrate that existing keyword-based and general LLM-as-a-judge approaches produce misleading assessments and, through controlled experiments on misjudged cases, reveal why these benchmarks fail. A back-of-envelope sketch of the variance comparison follows this entry.

10 retrieved papers · Can Refute
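To ground the variance claim numerically, here is a back-of-envelope sketch of one way inter-evaluator variance reduction could be computed: score the same responses with several evaluators under each evaluation system, average the per-item variance across evaluators, and compare. All scores below are fabricated for illustration and do not reproduce the paper's 76.03% figure.

```python
from statistics import mean, pvariance

def mean_inter_evaluator_variance(scores_by_evaluator: list[list[float]]) -> float:
    """Average, over items, of the variance of scores across evaluators."""
    per_item = zip(*scores_by_evaluator)  # regroup: one tuple of scores per item
    return mean(pvariance(item_scores) for item_scores in per_item)

# Three evaluators scoring the same four responses (made-up values).
# Binary verdicts disagree often; guideline-based scores cluster tightly.
binary_judge = [[1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 1.0, 1.0],
                [1.0, 1.0, 0.0, 0.0]]
guided_eval  = [[0.6, 0.2, 0.8, 0.1],
                [0.6, 0.3, 0.8, 0.1],
                [0.7, 0.2, 0.8, 0.2]]

v_binary = mean_inter_evaluator_variance(binary_judge)
v_guided = mean_inter_evaluator_variance(guided_eval)
print(f"variance reduction: {100 * (1 - v_guided / v_binary):.1f}%")
```

The measurement study's claim is the analogous comparison run at scale: under keyword-based or generic LLM-as-a-judge systems, evaluators frequently flip between success and failure on the same response, inflating the variance that guideline-anchored scoring removes.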

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: GuidedBench benchmark with curated harmful question dataset

The authors construct a benchmark containing 200 carefully curated harmful questions (180 core, 20 additional) organized into 20 topic categories. Each question is filtered to ensure that it is refused by LLMs when asked without a jailbreak, is directly malicious, and is structurally answerable, addressing defects in existing datasets.

Contribution 2: GuidedEval guideline-based evaluation system

The authors develop an evaluation framework that provides case-specific guidelines identifying the key entities and actions a successful jailbreak response should contain. This shifts evaluation from subjective judgment to objective verification of guideline-defined scoring points, reducing evaluator variance by at least 76.03%.

Contribution 3: Systematic measurement study of jailbreak evaluation discrepancies

The authors analyze 37 jailbreak papers to identify significant discrepancies among their evaluation systems. They demonstrate that existing keyword-based and general LLM-as-a-judge approaches produce misleading assessments and, through controlled experiments on misjudged cases, reveal why these benchmarks fail.