MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Overview
Overall Novelty Assessment
The paper introduces MoReBench, a benchmark of 1,000 moral scenarios paired with expert-defined rubric criteria for evaluating procedural moral reasoning in language models. It resides in the 'Procedural Dilemma Generation and Scenario-Based Testing' leaf, which contains four papers total, indicating a moderately populated research direction within the broader Benchmark Development and Evaluation Frameworks branch. This leaf focuses specifically on systematic scenario generation to test multi-step reasoning processes, distinguishing it from abstract theoretical assessments or everyday contextual dilemmas found in sibling leaves.
The taxonomy reveals several neighboring research directions that contextualize this work. Adjacent leaves include 'Abstract and Theoretical Moral Assessment' (testing normative ethical theories) and 'Everyday and Contextual Moral Dilemmas' (real-world nuanced situations), both emphasizing different evaluation philosophies. The broader Benchmark Development branch encompasses domain-specific evaluation and causal judgment tasks, while parallel branches address moral alignment analysis and robustness testing. MoReBench's process-focused methodology bridges procedural scenario generation with the auditing frameworks found in the 'Auditing and Methodological Frameworks' branch, suggesting cross-cutting relevance beyond its immediate taxonomic position.
Among 30 candidates examined, the MoReBench benchmark itself (Contribution 1) shows no clear refutation across its 10 closest matches, suggesting relative novelty in its specific formulation. However, MoReBench-Theory (Contribution 2) and the process-focused rubric methodology (Contribution 3) each encountered one potentially overlapping prior work among their respective 10 candidates. Because the search covered only the top semantic matches rather than the exhaustive field, these statistics should be read as indicative rather than conclusive. The benchmark's emphasis on expert rubrics for procedural reasoning appears less explored than general scenario-based testing, though the theory-grounded component has closer precedents.
Based on the limited literature search of 30 candidates, the work appears to occupy a meaningful but not entirely uncharted position. The procedural focus and rubric-based evaluation represent a specific methodological angle within an active research area. The analysis cannot rule out that larger-scale searches or domain-specific venues would reveal additional overlapping efforts, particularly for the theory-grounded evaluation component, where one refutable candidate was identified.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce MoReBench, a benchmark containing 1,000 moral dilemma scenarios with over 23,000 expert-written rubric criteria. Unlike outcome-focused evaluations, this benchmark assesses structural elements of AI reasoning processes including identifying moral considerations, weighing trade-offs, and providing actionable recommendations.
The authors curate MoReBench-Theory, a dataset of 150 scenarios annotated under five major moral frameworks (Kantian Deontology, Benthamite Act Utilitarianism, Aristotelian Virtue Ethics, Scanlonian Contractualism, and Gauthierian Contractarianism) to evaluate whether AI models can reason according to diverse moral standards.
The authors propose a novel evaluation methodology that assesses AI reasoning processes rather than final decisions. This approach uses expert-developed rubric-based scoring to evaluate multiple dimensions of moral reasoning, including the identification of relevant considerations, the logic of the reasoning process, and the quality of the outcome, enabling systematic evaluation at scale.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models PDF
[13] Procedural Dilemma Generation for Moral Reasoning in Humans and Language Models PDF
[37] Off the rails: Procedural dilemma generation for moral reasoning PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
MoReBench benchmark for evaluating procedural moral reasoning
The authors introduce MoReBench, a benchmark containing 1,000 moral dilemma scenarios with over 23,000 expert-written rubric criteria. Unlike outcome-focused evaluations, this benchmark assesses structural elements of AI reasoning processes including identifying moral considerations, weighing trade-offs, and providing actionable recommendations.
[19] A Methodological Framework for Auditing Norm-Sensitive Behaviour in Large Language Models: Research Design for Employment Contexts PDF
[61] SPRI: Aligning Large Language Models with Context-Situated Principles PDF
[62] A CSCL script for supporting moral reasoning in the ethics classroom PDF
[63] Developing a behaviour rubric for the practical model of ethical behaviour for clinical nursing PDF
[64] Measuring Student Success Skills: A Review of the Literature on Ethical Thinking. PDF
[65] The role of moral reasoning in argumentation: Conscience, character, and care PDF
[66] Impact and persistence of ethical reasoning education on student learning: results from a module-based ethical reasoning educational program PDF
[67] An ethics transfer case assessment tool for measuring ethical reasoning abilities of engineering students using reflexive principlism approach PDF
[68] Integrating instruction in ethical reasoning into undergraduate business courses PDF
[69] Developing and validating a tool to assess ethical decision-making ability of nursing students, using rubrics PDF
MoReBench-Theory dataset for theory-grounded moral reasoning
The authors curate MoReBench-Theory, a dataset of 150 scenarios annotated under five major moral frameworks (Kantian Deontology, Benthamite Act Utilitarianism, Aristotelian Virtue Ethics, Scanlonian Contractualism, and Gauthierian Contractarianism) to evaluate whether AI models can reason according to diverse moral standards.
[15] Rethinking Machine Ethics – Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? PDF
[12] Structured Moral Reasoning in Language Models: A Value-Grounded Evaluation Framework PDF
[40] Are Language Models Consequentialist or Deontological Moral Reasoners? PDF
[70] Evaluation of Ethical Decision Making in Large Language Models Across Classical Moral Frameworks PDF
[71] MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models PDF
[72] "To Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas PDF
[73] Assessing moral decision making in large language models PDF
[74] Aligning AI with Shared Human Values PDF
[75] Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs PDF
[76] The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models PDF
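To make the MoReBench-Theory design concrete, the sketch below models a scenario annotated under the five moral frameworks. This is a hypothetical schema: the paper does not publish its data format, so every field and method name here (TheoryAnnotatedScenario, framework_annotations, missing_frameworks) is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# The five frameworks named in the contribution description.
FRAMEWORKS = [
    "Kantian Deontology",
    "Benthamite Act Utilitarianism",
    "Aristotelian Virtue Ethics",
    "Scanlonian Contractualism",
    "Gauthierian Contractarianism",
]

@dataclass
class TheoryAnnotatedScenario:
    """One MoReBench-Theory record (hypothetical schema)."""
    scenario_id: str
    dilemma_text: str
    # One expert-written reasoning trace per framework, keyed by framework name.
    framework_annotations: dict = field(default_factory=dict)

    def missing_frameworks(self):
        """Frameworks for which no annotation has been provided yet."""
        return [f for f in FRAMEWORKS if f not in self.framework_annotations]

scenario = TheoryAnnotatedScenario(
    scenario_id="mbt-001",
    dilemma_text="A nurse must decide whether to disclose a colleague's error.",
    framework_annotations={
        "Kantian Deontology": "A duty of honesty requires disclosure..."
    },
)
print(scenario.missing_frameworks())  # the four frameworks still unannotated
```

Keying annotations by framework name makes it easy to check coverage, which matters for a dataset whose point is that every scenario is reasoned through under all five standards.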
Process-focused evaluation methodology using expert rubrics
The authors propose a novel evaluation methodology that assesses AI reasoning processes rather than final decisions. This approach uses expert-developed rubric-based scoring to evaluate multiple dimensions of moral reasoning, including the identification of relevant considerations, the logic of the reasoning process, and the quality of the outcome, enabling systematic evaluation at scale.
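The process-focused scoring described above can be sketched as weighted rubric matching. The criterion texts, weights, and keyword matcher below are illustrative assumptions, not the paper's actual rubrics (which are expert-written and far richer than keyword checks); the sketch only shows the shape of scoring a reasoning trace rather than a final verdict.

```python
def score_response(response, rubric):
    """Score a model's reasoning trace against weighted rubric criteria.

    Each criterion is a dict with 'keywords' (textual evidence that the
    criterion is met) and 'weight'. Returns the fraction of total weight
    earned, in [0, 1].
    """
    text = response.lower()
    earned = sum(c["weight"] for c in rubric
                 if any(k in text for k in c["keywords"]))
    total = sum(c["weight"] for c in rubric)
    return earned / total if total else 0.0

rubric = [
    # Did the response identify the competing moral considerations?
    {"keywords": ["confidentiality", "privacy"], "weight": 2},
    # Did it weigh trade-offs explicitly?
    {"keywords": ["trade-off", "on the other hand"], "weight": 1},
    # Did it end with an actionable recommendation?
    {"keywords": ["recommend", "should"], "weight": 1},
]

response = ("Patient privacy matters here; however, the duty to warn "
            "creates a trade-off. I recommend notifying the authorities.")
print(score_response(response, rubric))  # 1.0: all three criteria matched
```

In practice such matching would be done by expert raters or an LLM judge rather than keyword search; the point of the sketch is that credit attaches to steps of the reasoning process, not to which final decision the model picks.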