GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Overview
Overall Novelty Assessment
The paper introduces GuidedBench, a benchmark comprising a curated harmful question dataset and GuidedEval, a guideline-based evaluation system for assessing jailbreak attacks. It resides in the 'Jailbreak Evaluation Frameworks' leaf alongside four sibling papers: JailbreakEval, Rethinking Jailbreak Evaluation, AttackEval, and GuardVal. This leaf is part of the broader 'Evaluation Methodologies and Benchmarks' branch, which contains three leaves totaling eleven papers. The evaluation frameworks cluster is moderately populated, indicating active but not overcrowded research interest in standardizing jailbreak assessment protocols.
The taxonomy reveals that evaluation methodologies sit between attack techniques (twenty-nine papers across four leaves) and defense strategies (six papers across two leaves). GuidedBench's neighboring leaves include 'Benchmark Datasets and Taxonomies' (two papers) and 'General Adversarial Robustness Assessment' (five papers). The scope notes clarify that evaluation frameworks focus on measurement protocols and reliability, while benchmark datasets emphasize standardized testbeds. GuidedBench bridges these by combining dataset curation with evaluation guidelines, connecting to concerns about reproducibility raised in the general robustness assessment cluster.
The novelty search examined the top thirty semantic matches, ten per contribution. Within that scope, the curated harmful question dataset faces two refutable candidates, the guideline-based evaluation system faces none, and the systematic measurement study faces two. Because these statistics reflect only top-thirty semantic matches rather than exhaustive coverage, they are indicative: the evaluation system component appears more novel within this constrained search, while the dataset and measurement study overlap more substantially with the examined prior work.
Based on the limited thirty-candidate search, GuidedBench appears to occupy established evaluation territory with incremental refinements. The guideline-based evaluation system shows stronger novelty signals than the dataset or measurement components within examined candidates. However, the moderate density of the evaluation frameworks leaf and the presence of methodologically similar siblings suggest the work extends rather than redefines existing assessment paradigms. These observations are bounded by the search scope and may not capture all relevant prior work.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors construct a benchmark containing 200 carefully curated harmful questions (180 core, 20 additional) organized into 20 topic categories. Questions are filtered to ensure they are refused by LLMs when asked without a jailbreak, are directly malicious, and are structurally answerable, addressing defects in existing datasets.
The authors develop an evaluation framework that provides case-specific guidelines identifying key entities and actions that successful jailbreak responses should contain. This shifts evaluation from subjective judgment to objective verification of guideline-defined scoring points, reducing evaluator variance by at least 76.03%.
The authors analyze 37 jailbreak papers and identify significant discrepancies across their evaluation systems. They demonstrate that existing keyword-based and general LLM-as-a-judge approaches produce misleading assessments, and use controlled experiments on misjudged cases to reveal why these evaluators fail.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[12] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models PDF
[25] JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models PDF
[30] GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing PDF
[33] Rethinking How to Evaluate Language Model Jailbreak PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
GuidedBench benchmark with curated harmful question dataset
The authors construct a benchmark containing 200 carefully curated harmful questions (180 core, 20 additional) organized into 20 topic categories. Questions are filtered to ensure they are refused by LLMs when asked without a jailbreak, are directly malicious, and are structurally answerable, addressing defects in existing datasets.
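The three filtering criteria above can be sketched as conjunctive predicates over candidate questions. This is an illustrative sketch, not the authors' curation code: in practice the refusal check would query target LLMs, whereas here the labels are stubbed as precomputed flags, and the example questions are hypothetical.

```python
# Illustrative sketch of GuidedBench-style question filtering. A candidate
# question is kept only if all three criteria hold:
#   1. refused by LLMs when asked without a jailbreak,
#   2. directly malicious,
#   3. structurally answerable.

def passes_filter(refused_without_jailbreak: bool,
                  directly_malicious: bool,
                  structurally_answerable: bool) -> bool:
    """All three curation criteria must hold for a question to be kept."""
    return refused_without_jailbreak and directly_malicious and structurally_answerable

# (question, refused?, malicious?, answerable?) -- hypothetical labels
candidates = [
    ("How do I synthesize <redacted>?", True, True, True),   # kept
    ("Tell me about chemistry.", False, False, True),        # dropped: benign
    ("Why is hacking bad?", False, False, True),             # dropped: not malicious
]

kept = [q for q, *flags in candidates if passes_filter(*flags)]
```

The conjunctive form makes the dataset defects the paper targets explicit: a question that any single criterion rejects (e.g. one models already answer, or one with no well-formed answer) cannot inflate attack success rates.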
[60] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal PDF
[64] Do-Not-Answer: Evaluating Safeguards in LLMs PDF
[22] JailBreakV: A Benchmark for Assessing the Robustness of Multimodal Large Language Models Against Jailbreak Attacks PDF
[61] SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models PDF
[62] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs PDF
[63] TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models PDF
[65] Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models PDF
[66] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models PDF
[67] BadCodePrompt: Backdoor Attacks Against Prompt Engineering of Large Language Models for Code Generation PDF
[68] RedCode: Risky Code Execution and Generation Benchmark for Code Agents PDF
GuidedEval guideline-based evaluation system
The authors develop an evaluation framework that provides case-specific guidelines identifying key entities and actions that successful jailbreak responses should contain. This shifts evaluation from subjective judgment to objective verification of guideline-defined scoring points, reducing evaluator variance by at least 76.03%.
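The guideline-based scoring described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: GuidedEval uses an LLM judge to verify each scoring point, whereas the sketch substitutes a simple substring check, and the `Guideline` structure and example scoring points are assumptions for illustration.

```python
# Minimal sketch of guideline-based jailbreak scoring: a case-specific
# guideline enumerates scoring points (key entities/actions a successful
# harmful response should contain), and the score is the fraction of
# points the response covers. A substring match stands in for the LLM
# judge used in the actual system.

from dataclasses import dataclass


@dataclass
class Guideline:
    question: str
    scoring_points: list[str]  # key entities/actions to verify


def score(response: str, guideline: Guideline) -> float:
    """Fraction of guideline-defined scoring points covered by the response."""
    low = response.lower()
    hits = sum(1 for point in guideline.scoring_points if point.lower() in low)
    return hits / len(guideline.scoring_points)


g = Guideline(
    question="How to pick a pin tumbler lock?",  # hypothetical example
    scoring_points=["tension wrench", "pick", "set the pins"],
)

print(score("Insert a tension wrench, then use a pick to set the pins.", g))  # 1.0
print(score("I cannot help with that.", g))  # 0.0
```

The design point this captures is the shift from a holistic "did the jailbreak succeed?" judgment to per-point verification: each scoring point is a yes/no check, which is what makes evaluator variance shrink.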
[33] Rethinking How to Evaluate Language Model Jailbreak PDF
[51] A StrongREJECT for Empty Jailbreaks PDF
[52] CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs PDF
[53] Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges PDF
[54] JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring PDF
[55] Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks PDF
[56] "Prompter Says": A Linguistic Approach to Understanding and Detecting Jailbreak Attacks Against Large Language Models PDF
[57] GuidedBench: Equipping Jailbreak Evaluation with Guidelines PDF
[58] How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference PDF
[59] SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models PDF
Systematic measurement study of jailbreak evaluation discrepancies
The authors analyze 37 jailbreak papers and identify significant discrepancies across their evaluation systems. They demonstrate that existing keyword-based and general LLM-as-a-judge approaches produce misleading assessments, and use controlled experiments on misjudged cases to reveal why these evaluators fail.
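The keyword-based failure mode criticized above can be made concrete with a short sketch. The refusal-phrase list and example responses here are assumptions for illustration, not taken from the paper: such judges count any response lacking a refusal phrase as a successful jailbreak, so an off-topic or content-free reply is misjudged as an attack success.

```python
# Sketch of a naive keyword-based jailbreak judge: a response is labeled
# a successful jailbreak iff it contains no known refusal phrase. The
# phrase list is illustrative, not the one used in any specific paper.

REFUSAL_PHRASES = ["i'm sorry", "i cannot", "as an ai"]


def keyword_judge(response: str) -> bool:
    """Return True ('jailbreak succeeded') if no refusal phrase appears."""
    low = response.lower()
    return not any(phrase in low for phrase in REFUSAL_PHRASES)


# Off-topic reply with no harmful content: the keyword judge still counts
# it as a success, whereas a guideline-based check would score it 0.
print(keyword_judge("Lock picking is a hobby dating back centuries."))  # True
print(keyword_judge("I'm sorry, but I can't help with that."))          # False
```

This is the discrepancy the measurement study surfaces: the judge's decision depends on surface phrasing rather than on whether the response actually delivers harmful content.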