Choices Speak Louder than Questions

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: large language model, evaluation methodologies, multiple-choice question
Abstract:

Recent findings raise concerns about whether Multiple-Choice Question Answering (MCQA) evaluation accurately reflects the comprehension abilities of large language models. This paper explores the concept of choice sensitivity: the tendency for model decisions to be influenced more by the answer options than by genuine understanding of the question. We introduce a new scoring method, Normalized Probability Shift by the Question (NPSQ), designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbol, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when the answer options are modified.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the concept of choice sensitivity and proposes NPSQ, a scoring method designed to isolate question comprehension from superficial answer-choice cues. It resides in the 'Choices-Only Shortcuts and Artifacts' leaf, which contains three papers total, including this one. This leaf sits within the broader 'Robustness and Bias in MCQ Evaluation' branch, indicating a moderately active research direction focused on exposing systematic flaws in MCQ-based LLM assessment. The taxonomy shows this is a recognized problem area but not yet saturated with solutions.

The taxonomy reveals closely related work in sibling leaves: 'Option Order and Position Sensitivity' examines ordering biases, while 'Consistency and Reliability Issues' addresses response variability across formulations. The parent branch excludes mitigation techniques, which are housed separately under 'Mitigation Strategies and Methodological Improvements.' This structural separation suggests the paper's diagnostic focus—quantifying choice sensitivity—aligns with identifying artifacts rather than proposing comprehensive fixes. Neighboring branches like 'MCQ Format Limitations and Alternatives' question the paradigm itself, providing broader context for why choice-level shortcuts matter.

Among twenty-two candidates examined, none clearly refute the three core contributions. The concept of choice sensitivity was assessed against five candidates with zero refutations; NPSQ scoring against seven candidates, also zero; and the robustness evaluation framework against ten candidates, again zero. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific formulation of choice sensitivity and the NPSQ metric appear distinct from prior diagnostic methods. However, the sibling papers in the same taxonomy leaf likely address overlapping phenomena, though their exact methodological approaches may differ.

The analysis covers a focused subset of the literature, not an exhaustive survey. The taxonomy structure indicates this work contributes to an active but not overcrowded diagnostic niche. While no examined candidates directly overlap with NPSQ's technical approach, the broader field already recognizes choice-level artifacts as a critical evaluation flaw. The novelty appears to lie in the specific quantification method and its claimed robustness properties, though the limited search scope means adjacent work in the same leaf warrants closer comparison to fully assess incremental versus substantive contributions.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 22
Refutable papers: 0

Research Landscape Overview

Core task: Evaluating multiple-choice question answering in large language models. The field has organized itself around six main branches that reflect both methodological concerns and application domains.

Robustness and Bias in MCQ Evaluation examines how models exploit superficial cues, such as option order sensitivity (Option Order Sensitivity[8]) or choices-only shortcuts, rather than genuine reasoning. MCQ Format Limitations and Alternatives questions whether the multiple-choice paradigm itself is adequate, exploring issues like the number of options (Too Many Options[43]) and whether models truly understand questions or merely pattern-match. Mitigation Strategies and Methodological Improvements proposes techniques to reduce biases, including order randomization (Mitigating Order Sensitivity[1]) and consistency checks (Consistency in MCQ[14]). Domain-Specific MCQ Benchmarks and Applications focuses on specialized areas like medicine (KorMedMCQA[38], Challenging Medical Questions[31]) and safety (SafetyBench[5]), while General MCQ Benchmarking and Efficiency addresses scalable evaluation frameworks (tinyBenchmarks[36]). Finally, Multimodal and Retrieval-Augmented MCQ Answering extends the paradigm to vision-language tasks (MVBench[18]) and knowledge-enhanced retrieval (G-retriever[16]).

A particularly active tension runs between works that expose artifacts versus those that propose fixes. Several studies reveal that models can achieve deceptively high scores by exploiting dataset shortcuts: Knowledgeable or Cheater[20] and Artifacts or Abduction[28] demonstrate how answer patterns leak information without requiring deep comprehension. Choices Speak Louder[0] sits squarely within this critical line, showing that models often rely on choice-level cues rather than integrating question context.
This contrasts with mitigation-focused efforts like Mitigating Order Sensitivity[1], which seeks to stabilize evaluation through procedural controls, and Ubench Uncertainty[3], which quantifies model confidence to distinguish genuine knowledge from guessing. Together, these works highlight an open question: whether refining MCQ protocols can ever fully disentangle reasoning ability from artifact exploitation, or whether alternative evaluation paradigms are ultimately necessary.

Claimed Contributions

Concept and quantification of choice sensitivity

The authors formally define choice sensitivity as the tendency for model decisions to be more influenced by answer options than by genuine question understanding. They provide a quantitative measure showing that 20-60% of model selections are determined by intrinsic differences among answer choices, independent of question context.

5 retrieved papers

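The report does not reproduce the paper's estimator, but the idea of quantifying choice sensitivity can be illustrated with a minimal sketch: score how often a model's selection under the full prompt coincides with its selection when the question is withheld and only the choices are shown. All names and the toy probabilities below are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: estimate "choice sensitivity" as the fraction of items
# whose selected option is already the argmax under a choices-only prompt
# (question withheld). Probabilities are toy numbers standing in for model
# output distributions over options.

def argmax_option(probs):
    """Return the option key with the highest probability."""
    return max(probs, key=probs.get)

def choice_sensitivity(items):
    """
    items: list of dicts, each holding two probability maps over the same options:
      'choices_only' - model probabilities with the question withheld
      'full'         - model probabilities with question + choices
    Returns the fraction of items where the full-prompt selection coincides
    with the choices-only selection.
    """
    hits = sum(
        argmax_option(it["full"]) == argmax_option(it["choices_only"])
        for it in items
    )
    return hits / len(items)

items = [
    {"choices_only": {"A": 0.5, "B": 0.3, "C": 0.2},
     "full":         {"A": 0.6, "B": 0.2, "C": 0.2}},   # same pick either way: A
    {"choices_only": {"A": 0.4, "B": 0.35, "C": 0.25},
     "full":         {"A": 0.1, "B": 0.7, "C": 0.2}},   # the question flips the pick
]
print(choice_sensitivity(items))  # 0.5
```

Under this reading, the paper's 20-60% figure would correspond to a `choice_sensitivity` value between 0.2 and 0.6 on real benchmarks.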
Normalized Probability Shift by the Question (NPSQ) scoring method

The authors propose NPSQ, a novel evaluation metric that normalizes the probability shift caused by the question's presence. This method isolates the question's impact from confounding effects of answer choices, providing a more robust assessment of true question understanding compared to traditional log-likelihood-based metrics.

7 retrieved papers

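The exact NPSQ formula is defined in the paper, not in this report. One plausible reading of "normalizing the probability shift caused by the question's presence" can be sketched as follows; the formula, function names, and numbers here are assumptions for illustration only.

```python
import math

# Hypothetical NPSQ-style sketch: take each option's log-probability shift when
# the question is prepended to the choices, then renormalize the shifts with a
# softmax. The resulting score depends on the question's effect rather than on
# choice-intrinsic likelihoods.

def npsq_scores(p_choices_only, p_full):
    """Softmax over per-option log-probability shifts caused by the question."""
    shifts = {k: math.log(p_full[k]) - math.log(p_choices_only[k])
              for k in p_full}
    z = sum(math.exp(s) for s in shifts.values())
    return {k: math.exp(s) / z for k, s in shifts.items()}

p_choices_only = {"A": 0.6, "B": 0.3, "C": 0.1}   # A is superficially likely
p_full         = {"A": 0.5, "B": 0.4, "C": 0.1}   # the question boosts B

scores = npsq_scores(p_choices_only, p_full)
# B receives the largest normalized shift even though A still has the higher
# raw probability, so a shift-based selection picks B where raw
# log-likelihood would pick A.
print(max(scores, key=scores.get))  # B
```

The design point this illustrates: an option that is intrinsically likely (fluent, long, generic) gets no credit unless the question's presence actually moves probability toward it.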
Evaluation framework demonstrating NPSQ robustness to adversarial choices

The authors develop an evaluation framework using adversarial choices (intentionally irrelevant options) to test metric robustness. They demonstrate that NPSQ remains stable when answer options are manipulated, while traditional metrics show significant vulnerability to superficial choice characteristics.

10 retrieved papers
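The robustness probe described above can be sketched in miniature: replace the original distractors with intentionally irrelevant options and check whether each scoring rule still selects the same answer. The probabilities are toy numbers and the shift-based rule stands in for NPSQ; both are illustrative assumptions, not the paper's implementation.

```python
import math

# Hypothetical sketch of the adversarial-choices probe: a metric is "robust"
# if swapping distractors for irrelevant options does not change its selection.

def pick_by_likelihood(p_full):
    """Traditional rule: pick the option with the highest raw probability."""
    return max(p_full, key=p_full.get)

def pick_by_shift(p_choices_only, p_full):
    """Shift-based rule: pick the option whose probability the question boosts most."""
    shifts = {k: math.log(p_full[k]) - math.log(p_choices_only[k]) for k in p_full}
    return max(shifts, key=shifts.get)

# Original option set: the gold answer "B" wins under both rules.
orig_co   = {"A": 0.30, "B": 0.30, "C": 0.40}
orig_full = {"A": 0.15, "B": 0.60, "C": 0.25}

# Adversarial set: distractors replaced by an irrelevant but fluent option "D"
# that the model finds intrinsically likely even without the question.
adv_co   = {"B": 0.20, "D": 0.80}
adv_full = {"B": 0.35, "D": 0.65}

# Raw likelihood flips to the irrelevant option; the shift-based rule does not.
print(pick_by_likelihood(orig_full), pick_by_likelihood(adv_full))  # B D
print(pick_by_shift(orig_co, orig_full), pick_by_shift(adv_co, adv_full))  # B B
```

This is the stability property the contribution claims for NPSQ: its selection is insensitive to which distractors surround the gold answer, because only question-induced shifts enter the score.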
