ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: expert-annotated, professional knowledge, LLM judge, rubric evaluation
Abstract:

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs on processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. We build robust and affordable LLM-Judges for evaluating ProfBench rubrics, mitigating self-enhancement bias and reducing evaluation cost by 2-3 orders of magnitude to make the benchmark fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models such as GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ProfBench introduces a multi-domain professional knowledge benchmark spanning Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks, with over 7000 response-criterion pairs validated by human experts. The paper resides in the Multi-Domain Professional Knowledge Benchmarks leaf, which contains only three papers: ProfBench itself, ExpertLongBench, and SuperGPQA. This sparse leaf within the Cross-Domain and Multi-Disciplinary Evaluation Frameworks branch suggests the work addresses a relatively underexplored research direction: comprehensive multi-domain professional evaluation remains less crowded than the single-domain benchmarks found in the Healthcare or Scientific evaluation branches.

The taxonomy reveals substantial activity in adjacent single-domain evaluation branches: Healthcare and Clinical Medicine Evaluation contains nine papers across three sub-categories, while Scientific and Technical Domain Evaluation spans six papers covering chemistry, genomics, and engineering. Business and Legal Domain Evaluation includes three papers focused on legal reasoning and financial analysis. ProfBench's multi-domain approach contrasts with these specialized branches by attempting to capture transferable reasoning across professional fields rather than drilling into discipline-specific nuances. The Cross-Domain parent branch also houses evaluation methodology frameworks and cross-lingual assessments, indicating growing interest in general-purpose evaluation paradigms beyond domain-specific test suites.

Among 29 candidates examined through limited semantic search, the contribution-level analysis reveals varied novelty signals. For the core ProfBench benchmark contribution, 9 candidates were examined with no clear refutations, suggesting that the specific combination of expert-created rubrics across these four professional domains covers relatively novel ground. For the performance measurement of 40+ models, 10 candidates were examined without refutation, indicating that this systematic comparison may offer new empirical insights. However, for the methods to reduce LLM-Judge bias, 10 candidates were examined and 1 refutable match was found, suggesting that prior work exists on bias mitigation and cost reduction in LLM-based evaluation, though the specific techniques applied to professional-domain rubrics may still contribute incremental value.

Based on this limited search scope covering 29 semantically similar papers, ProfBench appears to occupy a moderately novel position within a sparse research direction. The multi-domain professional benchmark itself shows stronger novelty signals than the LLM-Judge methodology components. The analysis does not cover exhaustive literature on evaluation frameworks or domain-specific benchmarks outside the top-K semantic matches, so definitive claims about absolute novelty remain constrained by search boundaries.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating large language models on professional domain tasks requiring expert knowledge. The field has organized itself into several major branches that reflect both the diversity of professional domains and the methodological challenges of rigorous evaluation. Healthcare and Clinical Medicine Evaluation (e.g., HealthBench[4], Clinical LLM Review[10]) focuses on diagnostic reasoning and medical decision-making, while Scientific and Technical Domain Evaluation spans chemistry, engineering, and other STEM fields where specialized terminology and problem-solving are paramount. Business and Legal Domain Evaluation (e.g., LawBench[24], LegalBench[26]) addresses regulatory compliance and contract analysis, and Software Engineering and Code Generation Evaluation examines programming tasks. Cross-Domain and Multi-Disciplinary Evaluation Frameworks aim to assess models across multiple professional areas simultaneously, complemented by branches on Domain Specialization and Adaptation Methods that explore fine-tuning and knowledge injection strategies, Specialized Domain Benchmarks that provide targeted test suites, and Expert Annotation and Human-AI Collaboration Evaluation that investigates how domain experts interact with and validate model outputs.

A central tension runs through these branches: whether to build narrow, deeply specialized benchmarks for individual professions or to create broader frameworks that capture transferable reasoning skills across domains. Works like TaskBench[3] and LLM Evaluation Survey[8] emphasize general-purpose evaluation paradigms, while others such as Chemistry Benchmark[7] and ElecBench[15] drill into discipline-specific nuances.

ProfBench[0] sits within the Cross-Domain and Multi-Disciplinary Evaluation Frameworks branch, positioning itself alongside ExpertLongBench[18] and SuperGPQA[44] as a multi-domain professional knowledge benchmark. Compared to ExpertLongBench[18], which emphasizes long-context reasoning across expert fields, ProfBench[0] appears to prioritize breadth of professional coverage and the integration of expert-level task diversity, reflecting ongoing debates about whether comprehensive multi-domain assessments can meaningfully capture the depth that single-domain benchmarks provide.

Claimed Contributions

ProfBench benchmark with expert-created rubrics across multiple professional domains

The authors present ProfBench, a new benchmark containing over 7000 response-criterion pairs evaluated by human experts across four professional domains: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. This benchmark enables evaluation of LLMs on challenging, real-world professional tasks requiring domain expertise; a minimal sketch of how such response-criterion pairs can be scored by an LLM judge follows this list of contributions.

9 retrieved papers

Performance measurement of over 40 models as report-generators and LLM-Judges

The authors evaluate more than 40 language models both as generators of professional reports and as judges that assess whether responses meet expert-defined criteria. They analyze trends across open/closed-source models, reasoning/instruct models, and model sizes.

10 retrieved papers

Methods to reduce LLM-Judge bias and evaluation cost

The authors develop techniques to mitigate self-enhancement bias in LLM-Judges and reduce evaluation costs by 2-3 orders of magnitude. Their approach achieves no more than 1% bias across three models from different providers while costing only $12 using the o3 model.

10 retrieved papers
Can Refute (1 refutable match found)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

ProfBench benchmark with expert-created rubrics across multiple professional domains

The authors present ProfBench, a new benchmark containing over 7000 response-criterion pairs evaluated by human experts across four professional domains: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. This benchmark enables evaluation of LLMs on challenging, real-world professional tasks requiring domain expertise.

Contribution 2

Performance measurement of over 40 models as report-generators and LLM-Judges

The authors evaluate more than 40 language models both as generators of professional reports and as judges that assess whether responses meet expert-defined criteria. They analyze trends across open/closed-source models, reasoning/instruct models, and model sizes.

Contribution 3

Methods to reduce LLM-Judge bias and evaluation cost

The authors develop techniques to mitigate self-enhancement bias in LLM-Judges and reduce evaluation costs by 2-3 orders of magnitude. Their approach achieves no more than 1% bias across three models from different providers while costing only $12 using the o3 model.