Abstract:

As AI systems progress, we rely on them more to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To support such evaluation, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenario. MoReBench contains over 23,000 criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples that test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks (fail to) predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which may be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MoReBench, a benchmark of 1,000 moral scenarios paired with expert-defined rubric criteria for evaluating procedural moral reasoning in language models. It resides in the 'Procedural Dilemma Generation and Scenario-Based Testing' leaf, which contains four papers total, indicating a moderately populated research direction within the broader Benchmark Development and Evaluation Frameworks branch. This leaf focuses specifically on systematic scenario generation to test multi-step reasoning processes, distinguishing it from abstract theoretical assessments or everyday contextual dilemmas found in sibling leaves.

The taxonomy reveals several neighboring research directions that contextualize this work. Adjacent leaves include 'Abstract and Theoretical Moral Assessment' (testing normative ethical theories) and 'Everyday and Contextual Moral Dilemmas' (real-world nuanced situations), both emphasizing different evaluation philosophies. The broader Benchmark Development branch encompasses domain-specific evaluation and causal judgment tasks, while parallel branches address moral alignment analysis and robustness testing. MoReBench's process-focused methodology bridges procedural scenario generation with the auditing frameworks found in the 'Auditing and Methodological Frameworks' branch, suggesting cross-cutting relevance beyond its immediate taxonomic position.

Among 30 candidates examined, the MoReBench benchmark itself (Contribution 1) shows no clear refutation across 10 examined papers, suggesting relative novelty in its specific formulation. However, MoReBench-Theory (Contribution 2) and the process-focused rubric methodology (Contribution 3) each encountered one potentially overlapping prior work among 10 candidates examined. The limited search scope means these statistics reflect top-semantic-match coverage rather than exhaustive field analysis. The benchmark's emphasis on expert rubrics for procedural reasoning appears less explored than general scenario-based testing, though the theory-grounded component has closer precedents.

Based on the limited literature search of 30 candidates, the work appears to occupy a meaningful but not entirely uncharted position. The procedural focus and rubric-based evaluation represent a specific methodological angle within an active research area. The analysis cannot confirm whether larger-scale searches or domain-specific venues might reveal additional overlapping efforts, particularly for the theory-grounded evaluation component where one refutable candidate was identified.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating procedural moral reasoning in language models. The field has organized itself around several complementary branches. Benchmark Development and Evaluation Frameworks focuses on creating standardized tests and scenario-based assessments, including procedural dilemma generation and structured evaluation protocols. Moral Alignment and Value Analysis examines how models encode and express ethical principles, often drawing on moral foundations theory or comparing model outputs to human moral intuitions. Robustness and Sensitivity Analysis investigates how stable moral judgments remain under perturbation or across different formulations, while Modeling Approaches and Architectures explores technical methods, ranging from neuro-symbolic integration to multi-model dialectical systems, for improving ethical reasoning. Prediction and Modeling of Human Moral Judgment seeks to capture the nuances of crowd or expert moral assessments, and Auditing and Methodological Frameworks provides systematic tools for transparency and norm-sensitivity checks. Specialized Applications and Contexts addresses domain-specific challenges such as legal reasoning or cross-cultural moral norms, and General Reasoning and Ethical Considerations covers broader philosophical or meta-ethical questions.

A particularly active line of work centers on procedural dilemma generation and scenario-based testing, where researchers design complex moral situations to probe whether models can follow multi-step reasoning or apply ethical principles consistently. MoReBench[0] sits squarely in this cluster, emphasizing structured procedural scenarios that test step-by-step moral deliberation. Nearby efforts such as Procedural Dilemma Generation[3] and Procedural Dilemma Moral[13] similarly focus on constructing rich, multi-stage dilemmas, though they may differ in the granularity of reasoning steps or the diversity of ethical frameworks invoked. In contrast, works such as Off the Rails[37] explore edge cases and adversarial scenarios, highlighting robustness concerns that complement the procedural focus.

Across these branches, a central tension emerges between achieving high coverage of moral theories and maintaining practical evaluation efficiency, with ongoing questions about how to balance depth of reasoning assessment against the need for scalable, reproducible benchmarks.

Claimed Contributions

MoReBench benchmark for evaluating procedural moral reasoning

The authors introduce MoReBench, a benchmark containing 1,000 moral dilemma scenarios with over 23,000 expert-written rubric criteria. Unlike outcome-focused evaluations, this benchmark assesses structural elements of AI reasoning processes including identifying moral considerations, weighing trade-offs, and providing actionable recommendations.

10 retrieved papers

MoReBench-Theory dataset for theory-grounded moral reasoning

The authors curate MoReBench-Theory, a dataset of 150 scenarios annotated under five major moral frameworks (Kantian Deontology, Benthamite Act Utilitarianism, Aristotelian Virtue Ethics, Scanlonian Contractualism, and Gauthierian Contractarianism) to evaluate whether AI models can reason according to diverse moral standards.

10 retrieved papers
Can Refute

Process-focused evaluation methodology using expert rubrics

The authors propose a novel evaluation methodology that assesses AI reasoning processes rather than final decisions. This approach uses expert-developed rubric-based scoring to evaluate multiple dimensions of moral reasoning including identifying considerations, logical process, and outcome quality, enabling systematic evaluation at scale.

10 retrieved papers
Can Refute
