WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: large language models, evaluation, LLM-as-a-judge, benchmark
Abstract:

The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WebDevJudge, a benchmark for evaluating LLM-as-a-judge performance in web development quality assessment, including both static and interactive evaluation modes. It resides in the 'Web Development and Interactive Artifact Evaluation' leaf, which contains four papers total. This represents a relatively sparse research direction within the broader taxonomy of 25 papers across the field, suggesting that systematic evaluation of LLM judges specifically for web development tasks remains an emerging area with limited prior benchmarking efforts.

The taxonomy reveals that WebDevJudge sits at the intersection of evaluation frameworks and software engineering applications. Its sibling papers include ArtifactsBench (general interactive artifacts), IWR-Bench (web reasoning), and one other web-focused work. Neighboring leaves address algorithmic code evaluation (three papers) and various application domains like code review and agentic workflows. The scope notes clarify that this leaf specifically targets visual rendering, interactivity, and dynamic behavior assessment—distinguishing it from purely algorithmic correctness evaluation. This positioning suggests the work addresses a gap between general code evaluation and the specialized requirements of web artifact assessment.

Among the 29 candidates examined across three contributions, none were identified as clearly refuting the paper's claims. The WebDevJudge benchmark contribution examined nine candidates with zero refutable matches; the query-grounded rubric methodology examined ten candidates with zero refutations; and the WebDevJudge-Unit diagnostic dataset examined ten candidates, also with zero refutations. This pattern suggests that within the limited search scope, the specific combination of web development focus, interactive evaluation support, and structured rubric annotation appears relatively novel, though the analysis does not claim exhaustive coverage of all potentially relevant prior work.

Based on the top-29 semantic matches examined, the work appears to occupy a distinct niche within LLM-as-a-judge research. The sparse population of its taxonomy leaf and absence of direct refutations across contributions indicate potential novelty, though this assessment is constrained by the search methodology. The gap identified between LLM judges and human experts in web development evaluation may represent a substantive contribution to understanding judge reliability in visually-oriented, interactive domains.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating LLM-as-a-judge performance in web development quality assessment. The field organizes around four main branches that reflect both methodological and application-oriented concerns. The first branch, LLM-as-a-Judge Evaluation Frameworks and Benchmarks, develops systematic testbeds and metrics to measure how well language models can serve as evaluators, with works like ArtifactsBench[2] and IWR-Bench[21] providing structured environments for assessing interactive artifacts and web-based outputs. The second branch, LLM-as-a-Judge Reliability and Bias Analysis, investigates the robustness and fairness of these automated judges, examining issues such as position bias and consistency across diverse evaluation scenarios. The third branch, LLM-as-a-Judge Applications in Software Engineering, explores practical deployment in code review, accessibility auditing, and quality assurance tasks, while the fourth branch, LLM-Based Code Generation and Web Development Tools, focuses on the generation side, producing the artifacts that judges must evaluate. Together, these branches capture the dual challenge of building reliable automated evaluators and applying them to increasingly complex software artifacts.

Several active lines of work highlight key trade-offs and open questions. One cluster examines the granularity and domain specificity of evaluation: some studies develop fine-grained rubrics for code correctness and style, while others like CodeJudge[7] and CodeJudgeBench[8] target broader functional assessments. Another theme concerns the interplay between automated judgment and human oversight, with works exploring when LLM judges align with expert reviewers and when they introduce systematic errors.

WebDevJudge[0] sits within the Web Development and Interactive Artifact Evaluation cluster, emphasizing the challenge of assessing visual and interactive quality in web outputs, a setting where traditional code metrics fall short.
Compared to neighbors like ArtifactsBench[2], which provides a general-purpose benchmark for interactive artifacts, and IWR-Bench[21], which focuses on web reasoning tasks, WebDevJudge[0] zeroes in on the specific reliability and validity of LLM judges when evaluating web development quality, bridging evaluation framework design with practical software engineering concerns.

Claimed Contributions

WebDevJudge benchmark for evaluating LLM-as-a-judge in web development

The authors present WebDevJudge, a meta-evaluation benchmark designed to assess how well LLMs can judge web development quality. The benchmark supports both static code-based evaluation and interactive assessment within live web environments, addressing the gap in evaluating LLM judges on complex, dynamic tasks.

9 retrieved papers
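The benchmark's core unit, as described above, is a human preference label over a pair of web implementations for the same query. A minimal sketch of that pairwise setup follows; the record fields, prompt wording, and `agreement_rate` helper are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class PairwiseItem:
    """One benchmark item: a task query, two implementations, a human label."""
    query: str        # the web development task description
    impl_a: str       # first implementation (e.g., HTML/JS source)
    impl_b: str       # second implementation
    human_label: str  # ground-truth preference: "A", "B", or "tie"

def build_judge_prompt(item: PairwiseItem) -> str:
    """Assemble a pairwise comparison prompt for an LLM judge (illustrative)."""
    return (
        f"Task: {item.query}\n\n"
        f"Implementation A:\n{item.impl_a}\n\n"
        f"Implementation B:\n{item.impl_b}\n\n"
        "Which implementation better satisfies the task? Answer A, B, or tie."
    )

def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge's verdict matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

Under this framing, the paper's reported gap between LLM judges and human experts corresponds to `agreement_rate` falling well short of the human inter-annotator baseline.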
Query-grounded rubric tree annotation methodology

The authors develop a structured annotation approach using rubric trees that break down web development requirements into hierarchical, verifiable criteria. This methodology achieves high inter-annotator agreement (89.7%) and provides reliable ground-truth preference labels for the benchmark.

10 retrieved papers
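The rubric-tree idea, decomposing a query's requirements into hierarchical criteria whose leaves are individually verifiable, can be sketched as a small recursive structure. The node fields and the toy to-do-list rubric below are hypothetical illustrations of the described methodology, not the paper's actual annotation schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One criterion in a query-grounded rubric tree (illustrative structure)."""
    criterion: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Yield the verifiable leaf criteria under this node, depth-first."""
        if not self.children:
            yield self.criterion
        else:
            for child in self.children:
                yield from child.leaves()

# A toy rubric for a hypothetical "build a to-do list page" query.
rubric = RubricNode("To-do list page", [
    RubricNode("Functionality", [
        RubricNode("Adding an item appends it to the list"),
        RubricNode("Clicking an item marks it as done"),
    ]),
    RubricNode("Rendering", [
        RubricNode("List is visible on page load"),
    ]),
])
```

Annotators would check each leaf independently; the reported 89.7% inter-annotator agreement is then a statistic over such per-leaf (or per-pair) verdicts.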
WebDevJudge-Unit diagnostic dataset for feasibility verification

The authors create WebDevJudge-Unit, a targeted dataset of 502 test cases designed to diagnose and evaluate the ability of LLM-based and agent-based evaluators to verify whether specific web development tasks are feasible and correctly implemented.

10 retrieved papers
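A diagnostic case in the style described above pairs a single checkable requirement with an artifact and asks whether the requirement can be verified and is satisfied. The field names and scoring helper below are assumptions for illustration; the actual WebDevJudge-Unit schema may differ.

```python
from dataclasses import dataclass

@dataclass
class UnitCase:
    """One diagnostic case in the style of WebDevJudge-Unit (illustrative fields)."""
    requirement: str   # a single, checkable requirement
    artifact: str      # web implementation source under test
    feasible: bool     # can the requirement be verified against this artifact?
    implemented: bool  # is it actually satisfied? (meaningful only if feasible)

def score_verifier(predictions, cases):
    """Accuracy of an evaluator's feasibility verdicts over the diagnostic set."""
    correct = sum(p == c.feasible for p, c in zip(predictions, cases))
    return correct / len(cases)
```

Running the 502 cases through an LLM- or agent-based evaluator and scoring its feasibility verdicts this way isolates the verification failures the paper identifies, separately from overall preference accuracy.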

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
