Abstract:

As AI systems progress, we rely on them more to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To support such evaluation, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenario. MoReBench contains over 23,000 criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples that test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks (fail to) predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which may be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MoReBench, a benchmark of 1,000 moral scenarios paired with expert-defined rubric criteria for evaluating procedural moral reasoning in language models. It resides in the 'Procedural Dilemma Generation and Scenario-Based Testing' leaf, which contains four papers total, indicating a moderately populated research direction within the broader Benchmark Development and Evaluation Frameworks branch. This leaf focuses specifically on systematic scenario generation to test multi-step reasoning processes, distinguishing it from abstract theoretical assessments or everyday contextual dilemmas found in sibling leaves.

The taxonomy reveals several neighboring research directions that contextualize this work. Adjacent leaves include 'Abstract and Theoretical Moral Assessment' (testing normative ethical theories) and 'Everyday and Contextual Moral Dilemmas' (real-world nuanced situations), both emphasizing different evaluation philosophies. The broader Benchmark Development branch encompasses domain-specific evaluation and causal judgment tasks, while parallel branches address moral alignment analysis and robustness testing. MoReBench's process-focused methodology bridges procedural scenario generation with the auditing frameworks found in the 'Auditing and Methodological Frameworks' branch, suggesting cross-cutting relevance beyond its immediate taxonomic position.

Among 30 candidates examined, the MoReBench benchmark itself (Contribution 1) shows no clear refutation across 10 examined papers, suggesting relative novelty in its specific formulation. However, MoReBench-Theory (Contribution 2) and the process-focused rubric methodology (Contribution 3) each encountered one potentially overlapping prior work among 10 candidates examined. The limited search scope means these statistics reflect top-semantic-match coverage rather than exhaustive field analysis. The benchmark's emphasis on expert rubrics for procedural reasoning appears less explored than general scenario-based testing, though the theory-grounded component has closer precedents.

Based on the limited literature search of 30 candidates, the work appears to occupy a meaningful but not entirely uncharted position. The procedural focus and rubric-based evaluation represent a specific methodological angle within an active research area. The analysis cannot confirm whether larger-scale searches or domain-specific venues might reveal additional overlapping efforts, particularly for the theory-grounded evaluation component where one refutable candidate was identified.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating procedural moral reasoning in language models. The field has organized itself around several complementary branches. Benchmark Development and Evaluation Frameworks focuses on creating standardized tests and scenario-based assessments, including procedural dilemma generation and structured evaluation protocols. Moral Alignment and Value Analysis examines how models encode and express ethical principles, often drawing on moral foundations theory or comparing model outputs to human moral intuitions. Robustness and Sensitivity Analysis investigates how stable moral judgments remain under perturbation or across different formulations, while Modeling Approaches and Architectures explores technical methods, ranging from neuro-symbolic integration to multi-model dialectical systems, for improving ethical reasoning. Prediction and Modeling of Human Moral Judgment seeks to capture the nuances of crowd or expert moral assessments, and Auditing and Methodological Frameworks provides systematic tools for transparency and norm-sensitivity checks. Specialized Applications and Contexts addresses domain-specific challenges such as legal reasoning or cross-cultural moral norms, and General Reasoning and Ethical Considerations covers broader philosophical or meta-ethical questions.

A particularly active line of work centers on procedural dilemma generation and scenario-based testing, where researchers design complex moral situations to probe whether models can follow multi-step reasoning or apply ethical principles consistently. MoReBench[0] sits squarely in this cluster, emphasizing structured procedural scenarios that test step-by-step moral deliberation. Nearby efforts such as Procedural Dilemma Generation[3] and Procedural Dilemma Moral[13] similarly focus on constructing rich, multi-stage dilemmas, though they may differ in the granularity of reasoning steps or the diversity of ethical frameworks invoked. In contrast, works such as Off the Rails[37] explore edge cases and adversarial scenarios, highlighting robustness concerns that complement the procedural focus.

Across these branches, a central tension emerges between achieving high coverage of moral theories and maintaining practical evaluation efficiency, with ongoing questions about how to balance depth of reasoning assessment against the need for scalable, reproducible benchmarks.

Claimed Contributions

MoReBench benchmark for evaluating procedural moral reasoning

The authors introduce MoReBench, a benchmark containing 1,000 moral dilemma scenarios with over 23,000 expert-written rubric criteria. Unlike outcome-focused evaluations, this benchmark assesses structural elements of AI reasoning processes including identifying moral considerations, weighing trade-offs, and providing actionable recommendations.

10 retrieved papers

MoReBench-Theory dataset for theory-grounded moral reasoning

The authors curate MoReBench-Theory, a dataset of 150 scenarios annotated under five major moral frameworks (Kantian Deontology, Benthamite Act Utilitarianism, Aristotelian Virtue Ethics, Scanlonian Contractualism, and Gauthierian Contractarianism) to evaluate whether AI models can reason according to diverse moral standards.

10 retrieved papers
Can Refute

Process-focused evaluation methodology using expert rubrics

The authors propose a novel evaluation methodology that assesses AI reasoning processes rather than final decisions. This approach uses expert-developed rubric-based scoring to evaluate multiple dimensions of moral reasoning including identifying considerations, logical process, and outcome quality, enabling systematic evaluation at scale.

10 retrieved papers
Can Refute
