ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Overview
Overall Novelty Assessment
ProfBench introduces a multi-domain professional knowledge benchmark spanning Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks, with over 7000 response-criterion pairs validated by human experts. The paper resides in the Multi-Domain Professional Knowledge Benchmarks leaf, which contains only three papers: ProfBench itself, ExpertLongBench, and SuperGPQA. This sparse leaf within the Cross-Domain and Multi-Disciplinary Evaluation Frameworks branch suggests the work addresses a relatively underexplored research direction; comprehensive multi-domain professional evaluation remains less crowded than the single-domain benchmarks found in the Healthcare or Scientific evaluation branches.
The taxonomy reveals substantial activity in adjacent single-domain evaluation branches: Healthcare and Clinical Medicine Evaluation contains nine papers across three sub-categories, while Scientific and Technical Domain Evaluation spans six papers covering chemistry, genomics, and engineering. Business and Legal Domain Evaluation includes three papers focused on legal reasoning and financial analysis. ProfBench's multi-domain approach contrasts with these specialized branches by attempting to capture transferable reasoning across professional fields rather than drilling into discipline-specific nuances. The Cross-Domain parent branch also houses evaluation methodology frameworks and cross-lingual assessments, indicating growing interest in general-purpose evaluation paradigms beyond domain-specific test suites.
Among the 29 candidates examined through a limited semantic search, the contribution-level analysis reveals varied novelty signals. For the core ProfBench benchmark contribution, 9 candidates were examined with no clear refutations, suggesting that the specific combination of expert-created rubrics across these four professional domains covers relatively novel ground. For the performance measurement of 40+ models, 10 candidates were examined without refutation, indicating that this systematic comparison may offer new empirical insights. For the methods to reduce LLM-Judge bias, however, 10 candidates were examined and 1 refutable match was found, suggesting prior work exists on bias mitigation and cost reduction in LLM-based evaluation, though the specific techniques applied to professional-domain rubrics may still contribute incremental value.
Based on this limited search scope of 29 semantically similar papers, ProfBench appears to occupy a moderately novel position within a sparse research direction. The multi-domain professional benchmark itself shows stronger novelty signals than the LLM-Judge methodology components. The analysis does not exhaustively cover the literature on evaluation frameworks or domain-specific benchmarks beyond the top-K semantic matches, so definitive claims about absolute novelty remain constrained by the search boundaries.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present ProfBench, a new benchmark containing over 7000 response-criterion pairs evaluated by human experts across four professional domains: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. This benchmark enables evaluation of LLMs on challenging, real-world professional tasks requiring domain expertise.
The authors evaluate more than 40 language models both as generators of professional reports and as judges that assess whether responses meet expert-defined criteria. They analyze trends across open/closed-source models, reasoning/instruct models, and model sizes.
The authors develop techniques to mitigate self-enhancement bias in LLM-Judges and reduce evaluation costs by 2-3 orders of magnitude. Their approach achieves no more than 1% bias across three models from different providers while costing only $12 using the o3 model.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[18] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists PDF
[44] SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
ProfBench benchmark with expert-created rubrics across multiple professional domains
The authors present ProfBench, a new benchmark containing over 7000 response-criterion pairs evaluated by human experts across four professional domains: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. This benchmark enables evaluation of LLMs on challenging, real-world professional tasks requiring domain expertise.
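To make the benchmark's unit of evaluation concrete, the sketch below shows one plausible way a response-criterion pair and its rubric-weighted score could be represented. The field names, the per-criterion weights, and the judge_fn callable are illustrative assumptions, not the authors' released schema.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    """One expert-written rubric criterion (hypothetical schema)."""
    criterion_id: str
    description: str   # e.g. "Report derives the correct rate constant"
    weight: float      # relative importance assigned by the domain expert

@dataclass
class ResponseCriterionPair:
    """A model response paired with a single criterion to be judged."""
    domain: str        # e.g. "Chemistry PhD" or "Finance MBA"
    prompt_id: str
    response: str      # model-generated professional report
    criterion: Criterion
    human_label: bool  # expert judgment: criterion satisfied or not

def rubric_score(pairs: List[ResponseCriterionPair],
                 judge_fn: Callable[[str, str], bool]) -> float:
    """Weighted fraction of criteria that judge_fn marks as satisfied."""
    total = sum(p.criterion.weight for p in pairs)
    earned = sum(p.criterion.weight for p in pairs
                 if judge_fn(p.response, p.criterion.description))
    return earned / total if total else 0.0
```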
[71] ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents PDF
[72] AECBench: A hierarchical benchmark for knowledge evaluation of large language models in the AEC field PDF
[73] Expert evaluation of large language models for clinical dialogue summarization PDF
[74] Scalable evaluation framework for retrieval augmented generation in tobacco research using large language models PDF
[75] UCFE: A user-centric financial expertise benchmark for large language models PDF
[76] A Scalable Framework for Evaluating Health Language Models PDF
[77] Evaluation of Reliability Criteria for News Publishers with Large Language Models PDF
[79] Rubrics as rewards: Reinforcement learning beyond verifiable domains PDF
[80] Towards a personal health large language model PDF
Performance measurement of over 40 models as report-generators and LLM-Judges
The authors evaluate more than 40 language models both as generators of professional reports and as judges that assess whether responses meet expert-defined criteria. They analyze trends across open/closed-source models, reasoning/instruct models, and model sizes.
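A minimal sketch of the judging half of this evaluation, assuming that candidate LLM-Judges are ranked by macro-F1 agreement with the expert criterion labels; the metric choice, function names, and data layout are assumptions for illustration rather than the paper's exact protocol.

```python
from typing import Callable, Dict, List, Tuple

# A judge maps (response_text, criterion_text) -> True/False (criterion satisfied).
Judge = Callable[[str, str], bool]

def macro_f1(preds: List[bool], labels: List[bool]) -> float:
    """Macro-F1 over the 'satisfied' and 'not satisfied' classes."""
    def f1(pos: bool) -> float:
        tp = sum(p == pos and l == pos for p, l in zip(preds, labels))
        fp = sum(p == pos and l != pos for p, l in zip(preds, labels))
        fn = sum(p != pos and l == pos for p, l in zip(preds, labels))
        if tp == 0:
            return 0.0
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)
    return (f1(True) + f1(False)) / 2

def rank_judges(judges: Dict[str, Judge],
                pairs: List[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Rank candidate LLM-Judges by agreement with expert labels.
    Each pair is (response_text, criterion_text, human_label)."""
    labels = [label for _, _, label in pairs]
    scores = {}
    for name, judge in judges.items():
        preds = [judge(resp, crit) for resp, crit, _ in pairs]
        scores[name] = macro_f1(preds, labels)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```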
[61] Transforming science with large language models: A survey on AI-assisted scientific discovery, experimentation, content generation, and evaluation PDF
[62] From generation to judgment: Opportunities and challenges of LLM-as-a-judge PDF
[63] A survey on the use of large language models (LLMs) in fake news PDF
[64] Generative AI and misinformation: a scoping review of the role of generative AI in the generation, detection, mitigation, and impact of misinformation PDF
[65] Automated test creation using large language models: A practical application PDF
[66] Fighting fire with fire: The dual role of LLMs in crafting and detecting elusive disinformation PDF
[67] A survey of textual cyber abuse detection using cutting-edge language models and large language models PDF
[68] EduQuick: A dataset toward evaluating summarization of informal educational content for social media PDF
[69] LLMs for Customized Marketing Content Generation and Evaluation at Scale PDF
[70] The Dual Threat of Large Language Models: Addressing Plagiarism and Deepfake Generation PDF
Methods to reduce LLM-Judge bias and evaluation cost
The authors develop techniques to mitigate self-enhancement bias in LLM-Judges and reduce evaluation costs by 2-3 orders of magnitude. Their approach achieves no more than 1% bias across three models from different providers while costing only $12 using the o3 model.
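One simple way to operationalize self-enhancement bias is to compare the scores a judge assigns to reports from its own model family against the scores the other judges assign to those same reports. The sketch below follows that definition as a working assumption; the paper's exact bias formula and mitigation techniques are not reproduced here.

```python
from statistics import mean
from typing import Dict, List

def self_enhancement_bias(scores: Dict[str, Dict[str, List[float]]],
                          family: Dict[str, str]) -> Dict[str, float]:
    """Per-judge bias: how much higher a judge scores reports from its own
    model family than the other judges score those same reports.

    scores[judge][generator] -> list of per-report rubric scores (e.g. 0-100).
    family[model] -> provider/family name used to detect 'self' pairings.
    """
    bias = {}
    for judge, by_gen in scores.items():
        # Scores this judge gave to reports from its own family.
        own = [s for gen, vals in by_gen.items()
               if family[gen] == family[judge] for s in vals]
        # Scores the remaining judges gave to reports from that same family.
        peers = [s for other, other_by_gen in scores.items() if other != judge
                 for gen, vals in other_by_gen.items()
                 if family[gen] == family[judge] for s in vals]
        if own and peers:
            bias[judge] = mean(own) - mean(peers)
    return bias
```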