ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: expert-annotated, professional knowledge, LLM judge, rubric evaluation
Abstract:

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs on processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. We build robust and affordable LLM-Judges for evaluating ProfBench rubrics, mitigating self-enhancement bias and reducing evaluation cost by 2-3 orders of magnitude to make the benchmark fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models such as GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ProfBench introduces a multi-domain professional knowledge benchmark spanning Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks, with over 7000 response-criterion pairs validated by human experts. The paper resides in the Multi-Domain Professional Knowledge Benchmarks leaf, which contains only three papers: ProfBench itself, ExpertLongBench, and SuperGPQA. This sparse leaf within the Cross-Domain and Multi-Disciplinary Evaluation Frameworks branch suggests the work addresses a relatively underexplored research direction: comprehensive multi-domain professional evaluation remains less crowded than the single-domain benchmarks found in the Healthcare or Scientific evaluation branches.

The taxonomy reveals substantial activity in adjacent single-domain evaluation branches: Healthcare and Clinical Medicine Evaluation contains nine papers across three sub-categories, while Scientific and Technical Domain Evaluation spans six papers covering chemistry, genomics, and engineering. Business and Legal Domain Evaluation includes three papers focused on legal reasoning and financial analysis. ProfBench's multi-domain approach contrasts with these specialized branches by attempting to capture transferable reasoning across professional fields rather than drilling into discipline-specific nuances. The Cross-Domain parent branch also houses evaluation methodology frameworks and cross-lingual assessments, indicating growing interest in general-purpose evaluation paradigms beyond domain-specific test suites.

Among 29 candidates examined through limited semantic search, the contribution-level analysis reveals varied novelty signals. For the core ProfBench benchmark contribution, 9 candidates were examined with no clear refutations, suggesting that the specific combination of expert-created rubrics across these four professional domains covers relatively novel ground. For the performance measurement of 40+ models, 10 candidates were examined without refutation, indicating that this systematic comparison may offer new empirical insights. However, for the methods to reduce LLM-Judge bias, 10 candidates were examined and 1 refutable match was found, suggesting that prior work exists on bias mitigation and cost reduction in LLM-based evaluation, though the specific techniques applied to professional-domain rubrics may still contribute incremental value.

Based on this limited search scope covering 29 semantically similar papers, ProfBench appears to occupy a moderately novel position within a sparse research direction. The multi-domain professional benchmark itself shows stronger novelty signals than the LLM-Judge methodology components. The analysis does not cover exhaustive literature on evaluation frameworks or domain-specific benchmarks outside the top-K semantic matches, so definitive claims about absolute novelty remain constrained by search boundaries.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating large language models on professional domain tasks requiring expert knowledge. The field has organized itself into several major branches that reflect both the diversity of professional domains and the methodological challenges of rigorous evaluation. Healthcare and Clinical Medicine Evaluation (e.g., HealthBench[4], Clinical LLM Review[10]) focuses on diagnostic reasoning and medical decision-making, while Scientific and Technical Domain Evaluation spans chemistry, engineering, and other STEM fields where specialized terminology and problem-solving are paramount. Business and Legal Domain Evaluation (e.g., LawBench[24], LegalBench[26]) addresses regulatory compliance and contract analysis, and Software Engineering and Code Generation Evaluation examines programming tasks. Cross-Domain and Multi-Disciplinary Evaluation Frameworks aim to assess models across multiple professional areas simultaneously, complemented by branches on Domain Specialization and Adaptation Methods that explore fine-tuning and knowledge injection strategies, Specialized Domain Benchmarks that provide targeted test suites, and Expert Annotation and Human-AI Collaboration Evaluation that investigates how domain experts interact with and validate model outputs.

A central tension runs through these branches: whether to build narrow, deeply specialized benchmarks for individual professions or to create broader frameworks that capture transferable reasoning skills across domains. Works like TaskBench[3] and LLM Evaluation Survey[8] emphasize general-purpose evaluation paradigms, while others such as Chemistry Benchmark[7] and ElecBench[15] drill into discipline-specific nuances.

ProfBench[0] sits within the Cross-Domain and Multi-Disciplinary Evaluation Frameworks branch, positioning itself alongside ExpertLongBench[18] and SuperGPQA[44] as a multi-domain professional knowledge benchmark. Compared to ExpertLongBench[18], which emphasizes long-context reasoning across expert fields, ProfBench[0] appears to prioritize breadth of professional coverage and the integration of expert-level task diversity, reflecting ongoing debates about whether comprehensive multi-domain assessments can meaningfully capture the depth that single-domain benchmarks provide.

Claimed Contributions

ProfBench benchmark with expert-created rubrics across multiple professional domains

The authors present ProfBench, a new benchmark containing over 7000 response-criterion pairs evaluated by human experts across four professional domains: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. This benchmark enables evaluation of LLMs on challenging, real-world professional tasks requiring domain expertise; a minimal sketch of how such response-criterion pairs can be scored by an LLM judge follows this list of contributions.

9 retrieved papers

Performance measurement of over 40 models as report-generators and LLM-Judges

The authors evaluate more than 40 language models both as generators of professional reports and as judges that assess whether responses meet expert-defined criteria. They analyze trends across open/closed-source models, reasoning/instruct models, and model sizes.

10 retrieved papers

Methods to reduce LLM-Judge bias and evaluation cost

The authors develop techniques to mitigate self-enhancement bias in LLM-Judges and reduce evaluation costs by 2-3 orders of magnitude. Their approach achieves no more than 1% bias across three models from different providers while costing only $12 using the o3 model.

10 retrieved papers
Can Refute (1 refutable match found)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

ProfBench benchmark with expert-created rubrics across multiple professional domains

The authors present ProfBench, a new benchmark containing over 7000 response-criterion pairs evaluated by human experts across four professional domains: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. This benchmark enables evaluation of LLMs on challenging, real-world professional tasks requiring domain expertise.

Contribution 2

Performance measurement of over 40 models as report-generators and LLM-Judges

The authors evaluate more than 40 language models both as generators of professional reports and as judges that assess whether responses meet expert-defined criteria. They analyze trends across open/closed-source models, reasoning/instruct models, and model sizes.

Contribution 3

Methods to reduce LLM-Judge bias and evaluation cost

The authors develop techniques to mitigate self-enhancement bias in LLM-Judges and reduce evaluation costs by 2-3 orders of magnitude. Their approach achieves no more than 1% bias across three models from different providers while costing only $12 using the o3 model.