Choices Speak Louder than Questions

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: large language model, evaluation methodologies, multiple-choice question
Abstract:

Recent findings raise concerns about whether Multiple-Choice Question Answering (MCQA) evaluation accurately reflects the comprehension abilities of large language models. This paper explores the concept of choice sensitivity: the tendency for model decisions to be influenced more by the answer options than by genuine understanding of the question. We introduce a new scoring method, Normalized Probability Shift by the Question (NPSQ), designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbol, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when the answer options are modified.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the concept of choice sensitivity and proposes NPSQ, a scoring method designed to isolate question comprehension from superficial answer-choice cues. It resides in the 'Choices-Only Shortcuts and Artifacts' leaf, which contains three papers total, including this one. This leaf sits within the broader 'Robustness and Bias in MCQ Evaluation' branch, indicating a moderately active research direction focused on exposing systematic flaws in MCQ-based LLM assessment. The taxonomy shows this is a recognized problem area but not yet saturated with solutions.

The taxonomy reveals closely related work in sibling leaves: 'Option Order and Position Sensitivity' examines ordering biases, while 'Consistency and Reliability Issues' addresses response variability across formulations. The parent branch excludes mitigation techniques, which are housed separately under 'Mitigation Strategies and Methodological Improvements.' This structural separation suggests the paper's diagnostic focus—quantifying choice sensitivity—aligns with identifying artifacts rather than proposing comprehensive fixes. Neighboring branches like 'MCQ Format Limitations and Alternatives' question the paradigm itself, providing broader context for why choice-level shortcuts matter.

Among twenty-two candidates examined, none clearly refute the three core contributions. The concept of choice sensitivity was assessed against five candidates with zero refutations; NPSQ scoring against seven candidates, also zero; and the robustness evaluation framework against ten candidates, again zero. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific formulation of choice sensitivity and the NPSQ metric appear distinct from prior diagnostic methods. However, the sibling papers in the same taxonomy leaf likely address overlapping phenomena, though their exact methodological approaches may differ.

The analysis covers a focused subset of the literature, not an exhaustive survey. The taxonomy structure indicates this work contributes to an active but not overcrowded diagnostic niche. While no examined candidates directly overlap with NPSQ's technical approach, the broader field already recognizes choice-level artifacts as a critical evaluation flaw. The novelty appears to lie in the specific quantification method and its claimed robustness properties, though the limited search scope means adjacent work in the same leaf warrants closer comparison to fully assess incremental versus substantive contributions.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 22
Refutable papers: 0

Research Landscape Overview

Core task: Evaluating multiple-choice question answering in large language models. The field has organized itself around six main branches that reflect both methodological concerns and application domains.

Robustness and Bias in MCQ Evaluation examines how models exploit superficial cues, such as option order sensitivity (Option Order Sensitivity[8]) or choices-only shortcuts, rather than genuine reasoning. MCQ Format Limitations and Alternatives questions whether the multiple-choice paradigm itself is adequate, exploring issues like the number of options (Too Many Options[43]) and whether models truly understand questions or merely pattern-match. Mitigation Strategies and Methodological Improvements proposes techniques to reduce biases, including order randomization (Mitigating Order Sensitivity[1]) and consistency checks (Consistency in MCQ[14]). Domain-Specific MCQ Benchmarks and Applications focuses on specialized areas like medicine (KorMedMCQA[38], Challenging Medical Questions[31]) and safety (SafetyBench[5]), while General MCQ Benchmarking and Efficiency addresses scalable evaluation frameworks (tinyBenchmarks[36]). Finally, Multimodal and Retrieval-Augmented MCQ Answering extends the paradigm to vision-language tasks (MVBench[18]) and knowledge-enhanced retrieval (G-retriever[16]).

A particularly active tension runs between works that expose artifacts versus those that propose fixes. Several studies reveal that models can achieve deceptively high scores by exploiting dataset shortcuts: Knowledgeable or Cheater[20] and Artifacts or Abduction[28] demonstrate how answer patterns leak information without requiring deep comprehension. Choices Speak Louder[0] sits squarely within this critical line, showing that models often rely on choice-level cues rather than integrating question context.
This contrasts with mitigation-focused efforts like Mitigating Order Sensitivity[1], which seeks to stabilize evaluation through procedural controls, and Ubench Uncertainty[3], which quantifies model confidence to distinguish genuine knowledge from guessing. Together, these works highlight an open question: whether refining MCQ protocols can ever fully disentangle reasoning ability from artifact exploitation, or whether alternative evaluation paradigms are ultimately necessary.

Claimed Contributions

Concept and quantification of choice sensitivity

The authors formally define choice sensitivity as the tendency for model decisions to be more influenced by answer options than by genuine question understanding. They provide a quantitative measure showing that 20-60% of model selections are determined by intrinsic differences among answer choices, independent of question context.

5 retrieved papers

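The report does not reproduce the paper's estimator, but the idea of quantifying choice sensitivity can be illustrated with a minimal sketch: score how often a model's selection under the full prompt coincides with its selection when the question is withheld and only the choices are shown. All names and the toy probabilities below are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: estimate "choice sensitivity" as the fraction of items
# whose selected option is already the argmax under a choices-only prompt
# (question withheld). Probabilities are toy numbers standing in for model
# output distributions over options.

def argmax_option(probs):
    """Return the option key with the highest probability."""
    return max(probs, key=probs.get)

def choice_sensitivity(items):
    """
    items: list of dicts, each holding two probability maps over the same options:
      'choices_only' - model probabilities with the question withheld
      'full'         - model probabilities with question + choices
    Returns the fraction of items where the full-prompt selection coincides
    with the choices-only selection.
    """
    hits = sum(
        argmax_option(it["full"]) == argmax_option(it["choices_only"])
        for it in items
    )
    return hits / len(items)

items = [
    {"choices_only": {"A": 0.5, "B": 0.3, "C": 0.2},
     "full":         {"A": 0.6, "B": 0.2, "C": 0.2}},   # same pick either way: A
    {"choices_only": {"A": 0.4, "B": 0.35, "C": 0.25},
     "full":         {"A": 0.1, "B": 0.7, "C": 0.2}},   # the question flips the pick
]
print(choice_sensitivity(items))  # 0.5
```

Under this reading, the paper's 20-60% figure would correspond to a `choice_sensitivity` value between 0.2 and 0.6 on real benchmarks.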
Normalized Probability Shift by the Question (NPSQ) scoring method

The authors propose NPSQ, a novel evaluation metric that normalizes the probability shift caused by the question's presence. This method isolates the question's impact from confounding effects of answer choices, providing a more robust assessment of true question understanding compared to traditional log-likelihood-based metrics.

7 retrieved papers

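The exact NPSQ formula is defined in the paper, not in this report. One plausible reading of "normalizing the probability shift caused by the question's presence" can be sketched as follows; the formula, function names, and numbers here are assumptions for illustration only.

```python
import math

# Hypothetical NPSQ-style sketch: take each option's log-probability shift when
# the question is prepended to the choices, then renormalize the shifts with a
# softmax. The resulting score depends on the question's effect rather than on
# choice-intrinsic likelihoods.

def npsq_scores(p_choices_only, p_full):
    """Softmax over per-option log-probability shifts caused by the question."""
    shifts = {k: math.log(p_full[k]) - math.log(p_choices_only[k])
              for k in p_full}
    z = sum(math.exp(s) for s in shifts.values())
    return {k: math.exp(s) / z for k, s in shifts.items()}

p_choices_only = {"A": 0.6, "B": 0.3, "C": 0.1}   # A is superficially likely
p_full         = {"A": 0.5, "B": 0.4, "C": 0.1}   # the question boosts B

scores = npsq_scores(p_choices_only, p_full)
# B receives the largest normalized shift even though A still has the higher
# raw probability, so a shift-based selection picks B where raw
# log-likelihood would pick A.
print(max(scores, key=scores.get))  # B
```

The design point this illustrates: an option that is intrinsically likely (fluent, long, generic) gets no credit unless the question's presence actually moves probability toward it.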
Evaluation framework demonstrating NPSQ robustness to adversarial choices

The authors develop an evaluation framework using adversarial choices (intentionally irrelevant options) to test metric robustness. They demonstrate that NPSQ remains stable when answer options are manipulated, while traditional metrics show significant vulnerability to superficial choice characteristics.

10 retrieved papers
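The robustness probe described above can be sketched in miniature: replace the original distractors with intentionally irrelevant options and check whether each scoring rule still selects the same answer. The probabilities are toy numbers and the shift-based rule stands in for NPSQ; both are illustrative assumptions, not the paper's implementation.

```python
import math

# Hypothetical sketch of the adversarial-choices probe: a metric is "robust"
# if swapping distractors for irrelevant options does not change its selection.

def pick_by_likelihood(p_full):
    """Traditional rule: pick the option with the highest raw probability."""
    return max(p_full, key=p_full.get)

def pick_by_shift(p_choices_only, p_full):
    """Shift-based rule: pick the option whose probability the question boosts most."""
    shifts = {k: math.log(p_full[k]) - math.log(p_choices_only[k]) for k in p_full}
    return max(shifts, key=shifts.get)

# Original option set: the gold answer "B" wins under both rules.
orig_co   = {"A": 0.30, "B": 0.30, "C": 0.40}
orig_full = {"A": 0.15, "B": 0.60, "C": 0.25}

# Adversarial set: distractors replaced by an irrelevant but fluent option "D"
# that the model finds intrinsically likely even without the question.
adv_co   = {"B": 0.20, "D": 0.80}
adv_full = {"B": 0.35, "D": 0.65}

# Raw likelihood flips to the irrelevant option; the shift-based rule does not.
print(pick_by_likelihood(orig_full), pick_by_likelihood(adv_full))  # B D
print(pick_by_shift(orig_co, orig_full), pick_by_shift(adv_co, adv_full))  # B B
```

This is the stability property the contribution claims for NPSQ: its selection is insensitive to which distractors surround the gold answer, because only question-induced shifts enter the score.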
