GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Jailbreak Attacks, Evaluation System, Benchmark
Abstract:

Despite the growing interest in jailbreaks as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. Through a systematic measurement study of 37 jailbreak studies published since 2022, we find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about jailbreak effectiveness and safety implications. In this paper, we introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset and GuidedEval, an evaluation system integrated with detailed case-by-case evaluation guidelines. Experiments demonstrate that GuidedBench offers more accurate evaluations of jailbreak performance, enabling meaningful comparisons across methods. GuidedEval reduces inter-evaluator variance by at least 76.03%, ensuring reliable and reproducible evaluations. We reveal why existing jailbreak benchmarks fail to evaluate accurately and suggest better evaluation practices.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GuidedBench, a benchmark comprising a curated harmful question dataset and GuidedEval, a guideline-based evaluation system for assessing jailbreak attacks. It resides in the 'Jailbreak Evaluation Frameworks' leaf alongside four sibling papers: JailbreakEval, Rethinking Jailbreak Evaluation, AttackEval, and GuardVal. This leaf is part of the broader 'Evaluation Methodologies and Benchmarks' branch, which contains three leaves totaling eleven papers. The evaluation frameworks cluster is moderately populated, indicating active but not overcrowded research interest in standardizing jailbreak assessment protocols.

The taxonomy reveals that evaluation methodologies sit between attack techniques (twenty-nine papers across four leaves) and defense strategies (six papers across two leaves). GuidedBench's neighboring leaves include 'Benchmark Datasets and Taxonomies' (two papers) and 'General Adversarial Robustness Assessment' (five papers). The scope notes clarify that evaluation frameworks focus on measurement protocols and reliability, while benchmark datasets emphasize standardized testbeds. GuidedBench bridges these by combining dataset curation with evaluation guidelines, connecting to concerns about reproducibility raised in the general robustness assessment cluster.

Among the thirty candidates examined, ten were compared against each claimed contribution. The curated harmful question dataset shows two refutable candidates out of its ten, the guideline-based evaluation system shows zero, and the systematic measurement study shows two, for a total of four refutable papers (matching the taxonomy summary below). The limited search scope means these statistics reflect top-thirty semantic matches, not exhaustive coverage. Within this constrained search, the evaluation system component appears more novel, while the dataset and measurement study face more substantial overlap with prior work among the examined candidates.

Based on the limited thirty-candidate search, GuidedBench appears to occupy established evaluation territory with incremental refinements. The guideline-based evaluation system shows stronger novelty signals than the dataset or measurement components within examined candidates. However, the moderate density of the evaluation frameworks leaf and the presence of methodologically similar siblings suggest the work extends rather than redefines existing assessment paradigms. These observations are bounded by the search scope and may not capture all relevant prior work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: evaluating the effectiveness of large language model jailbreak attacks. The field has crystallized around three major branches that together capture the adversarial lifecycle of LLM safety. The first branch, Jailbreak Attack Methods and Techniques, encompasses diverse strategies ranging from automated prompt optimization approaches like GPTFuzzer[17] and AutoDAN[36] to multilingual exploits such as Multilingual Jailbreak[8] and cross-modal attacks including JailbreakV[22]. The second branch, Evaluation Methodologies and Benchmarks, addresses the critical need for standardized assessment frameworks, with works like JailbreakBench[20] and AttackEval[12] establishing systematic protocols for measuring attack success rates and model vulnerabilities. The third branch, Defense Strategies and Mitigation, explores countermeasures such as Goal Prioritization Defense[13] and Chain of Thought Defense[46], creating a dynamic interplay between offensive and defensive research directions.

Recent work reveals a tension between attack sophistication and evaluation rigor. While many studies develop increasingly subtle techniques, from indirect prompt injection methods like Indirect Prompt Injection[1] to multi-round conversational exploits such as Multi-round Jailbreak[26] and Crescendo Attack[49], a growing cluster questions whether existing benchmarks adequately capture real-world threat models.

GuidedBench[0] situates itself within the evaluation frameworks cluster alongside JailbreakEval[25] and Rethinking Jailbreak Evaluation[33], emphasizing the need for more nuanced assessment beyond binary success metrics. Where AttackEval[12] focuses on standardizing attack measurement and GuardVal[30] examines defense validation, GuidedBench[0] appears to bridge methodological gaps by providing guided evaluation protocols that account for contextual factors often overlooked in simpler benchmarks. In doing so, it addresses concerns raised by works like Rethinking Jailbreak Evaluation[33] about the ecological validity of current testing paradigms.

Claimed Contributions

GuidedBench benchmark with curated harmful question dataset

The authors construct a benchmark containing 200 carefully curated harmful questions (180 core, 20 additional) organized into 20 topic categories. Each question is filtered to ensure that it is refused by LLMs when asked without a jailbreak, is directly malicious, and is structurally answerable, addressing defects in existing datasets. A minimal sketch of these three filters appears after this entry.

10 retrieved papers · Can Refute
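To make the three curation filters concrete, the following Python sketch shows them as composable gates. This is an illustration under stated assumptions, not the authors' pipeline: the `ask_model`, `is_malicious`, and `is_answerable` callables and the refusal markers are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical refusal markers; a crude stand-in for a real refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Return True if the model response looks like a refusal (illustrative)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def passes_curation(question: str,
                    ask_model: Callable[[str], str],
                    is_malicious: Callable[[str], bool],
                    is_answerable: Callable[[str], bool]) -> bool:
    """Keep a candidate question only if it clears all three filters:
    (1) the target LLM refuses it when asked directly, with no jailbreak;
    (2) it is directly malicious, not merely sensitive;
    (3) it is structurally answerable, i.e., a concrete answer could exist.
    """
    refused_without_jailbreak = is_refusal(ask_model(question))
    return (refused_without_jailbreak
            and is_malicious(question)
            and is_answerable(question))
```

In practice the maliciousness and answerability checks would themselves be human or LLM judgments; the point of the sketch is that a candidate enters the benchmark only after clearing all three gates.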
GuidedEval guideline-based evaluation system

The authors develop an evaluation framework that provides case-specific guidelines identifying the key entities and actions a successful jailbreak response should contain. This shifts evaluation from subjective judgment to objective verification of guideline-defined scoring points, reducing evaluator variance by at least 76.03%. An illustrative scoring sketch follows this entry.

10 retrieved papers · No refutable candidates among those examined
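The following sketch illustrates guideline-based scoring under stated assumptions: each question carries a list of case-specific scoring points (key entities and actions), and a response is scored by the fraction of points it covers. The `GuidelineCase` dataclass and the substring-based `point_is_covered` matcher are hypothetical stand-ins; the paper's actual verification of scoring points may rely on an LLM judge rather than string matching.

```python
from dataclasses import dataclass

@dataclass
class GuidelineCase:
    question: str
    scoring_points: list[str]  # entities/actions a successful answer must contain

def point_is_covered(response: str, point: str) -> bool:
    """Placeholder check: substring matching stands in for a semantic judge."""
    return point.lower() in response.lower()

def guided_score(response: str, case: GuidelineCase) -> float:
    """Fraction of guideline-defined scoring points present in the response.

    Turns the subjective question of whether the jailbreak succeeded into an
    objective per-point verification against the case's guideline.
    """
    if not case.scoring_points:
        return 0.0
    hits = sum(point_is_covered(response, p) for p in case.scoring_points)
    return hits / len(case.scoring_points)

# Illustrative usage with made-up scoring points (question text redacted).
case = GuidelineCase(
    question="(redacted harmful question)",
    scoring_points=["precursor chemical", "synthesis step", "required equipment"],
)
print(guided_score("...obtain the precursor chemical and required equipment...", case))
# 2 of 3 points covered -> 0.666...
```

Because every point is checked independently, evaluators can disagree only on individual, narrowly scoped checks rather than on a holistic success verdict, which is one plausible mechanism behind the reported variance reduction.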
Systematic measurement study of jailbreak evaluation discrepancies

The authors analyze 37 jailbreak papers to identify significant discrepancies among their evaluation systems. They demonstrate that existing keyword-based and general LLM-as-a-judge approaches produce misleading assessments and, through controlled experiments on misjudged cases, reveal why these benchmarks fail. A back-of-envelope sketch of the variance comparison follows this entry.

10 retrieved papers · Can Refute
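To ground the variance claim numerically, here is a back-of-envelope sketch of one way inter-evaluator variance reduction could be computed: score the same responses with several evaluators under each evaluation system, average the per-item variance across evaluators, and compare. All scores below are fabricated for illustration and do not reproduce the paper's 76.03% figure.

```python
from statistics import mean, pvariance

def mean_inter_evaluator_variance(scores_by_evaluator: list[list[float]]) -> float:
    """Average, over items, of the variance of scores across evaluators."""
    per_item = zip(*scores_by_evaluator)  # regroup: one tuple of scores per item
    return mean(pvariance(item_scores) for item_scores in per_item)

# Three evaluators scoring the same four responses (made-up values).
# Binary verdicts disagree often; guideline-based scores cluster tightly.
binary_judge = [[1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 1.0, 1.0],
                [1.0, 1.0, 0.0, 0.0]]
guided_eval  = [[0.6, 0.2, 0.8, 0.1],
                [0.6, 0.3, 0.8, 0.1],
                [0.7, 0.2, 0.8, 0.2]]

v_binary = mean_inter_evaluator_variance(binary_judge)
v_guided = mean_inter_evaluator_variance(guided_eval)
print(f"variance reduction: {100 * (1 - v_guided / v_binary):.1f}%")
```

The measurement study's claim is the analogous comparison run at scale: under keyword-based or generic LLM-as-a-judge systems, evaluators frequently flip between success and failure on the same response, inflating the variance that guideline-anchored scoring removes.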

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: GuidedBench benchmark with curated harmful question dataset

The authors construct a benchmark containing 200 carefully curated harmful questions (180 core, 20 additional) organized into 20 topic categories. Each question is filtered to ensure that it is refused by LLMs when asked without a jailbreak, is directly malicious, and is structurally answerable, addressing defects in existing datasets.

Contribution 2: GuidedEval guideline-based evaluation system

The authors develop an evaluation framework that provides case-specific guidelines identifying the key entities and actions a successful jailbreak response should contain. This shifts evaluation from subjective judgment to objective verification of guideline-defined scoring points, reducing evaluator variance by at least 76.03%.

Contribution 3: Systematic measurement study of jailbreak evaluation discrepancies

The authors analyze 37 jailbreak papers to identify significant discrepancies among their evaluation systems. They demonstrate that existing keyword-based and general LLM-as-a-judge approaches produce misleading assessments and, through controlled experiments on misjudged cases, reveal why these benchmarks fail.