Choices Speak Louder than Questions
Overview
Overall Novelty Assessment
The paper introduces the concept of choice sensitivity and proposes NPSQ, a scoring method designed to isolate question comprehension from superficial answer-choice cues. It resides in the 'Choices-Only Shortcuts and Artifacts' leaf, which contains three papers total, including this one. This leaf sits within the broader 'Robustness and Bias in MCQ Evaluation' branch, indicating a moderately active research direction focused on exposing systematic flaws in MCQ-based LLM assessment. The taxonomy shows this is a recognized problem area but not yet saturated with solutions.
The taxonomy reveals closely related work in sibling leaves: 'Option Order and Position Sensitivity' examines ordering biases, while 'Consistency and Reliability Issues' addresses response variability across formulations. The parent branch excludes mitigation techniques, which are housed separately under 'Mitigation Strategies and Methodological Improvements.' This structural separation suggests the paper's diagnostic focus—quantifying choice sensitivity—aligns with identifying artifacts rather than proposing comprehensive fixes. Neighboring branches like 'MCQ Format Limitations and Alternatives' question the paradigm itself, providing broader context for why choice-level shortcuts matter.
Among the twenty-two candidates examined, none clearly refutes the three core contributions: the concept of choice sensitivity was checked against five candidates, NPSQ scoring against seven, and the robustness evaluation framework against ten, with zero refutations in each case. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific formulation of choice sensitivity and the NPSQ metric appear distinct from prior diagnostic methods. However, the sibling papers in the same taxonomy leaf likely address overlapping phenomena, though their exact methodological approaches may differ.
The analysis covers a focused subset of the literature, not an exhaustive survey. The taxonomy structure indicates this work contributes to an active but not overcrowded diagnostic niche. While no examined candidates directly overlap with NPSQ's technical approach, the broader field already recognizes choice-level artifacts as a critical evaluation flaw. The novelty appears to lie in the specific quantification method and its claimed robustness properties, though the limited search scope means adjacent work in the same leaf warrants closer comparison to fully assess incremental versus substantive contributions.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors formally define choice sensitivity as the tendency for model decisions to be more influenced by answer options than by genuine question understanding. They provide a quantitative measure showing that 20-60% of model selections are determined by intrinsic differences among answer choices, independent of question context.
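One way to make this claim concrete is a small probe that asks whether a model's pick survives removal of the question. The sketch below is a hypothetical reconstruction, not the paper's procedure: `score_choices` is an assumed stand-in for any routine that returns per-option scores (e.g. log-likelihoods) given a question string and its answer choices.

```python
from typing import Callable, Sequence


def argmax(xs: Sequence[float]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)


def choice_sensitivity(
    items: Sequence[dict],
    score_choices: Callable[[str, Sequence[str]], Sequence[float]],
) -> float:
    """Fraction of items on which the model selects the same option
    with and without the question text -- a choices-only shortcut proxy."""
    agree = 0
    for item in items:
        with_q = score_choices(item["question"], item["choices"])
        no_q = score_choices("", item["choices"])  # question blanked out
        if argmax(with_q) == argmax(no_q):
            agree += 1
    return agree / len(items)
```

A score near the 20-60% range reported by the authors would indicate that a substantial share of selections can be reproduced from the choices alone.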
The authors propose NPSQ, a novel evaluation metric that normalizes the probability shift caused by the question's presence. This method isolates the question's impact from confounding effects of answer choices, providing a more robust assessment of true question understanding compared to traditional log-likelihood-based metrics.
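The paper's exact formula is not reproduced here, so the following is one plausible reading of "normalized probability shift by the question": measure how much of the correct option's remaining probability mass the question recovers over a choices-only baseline. The headroom normalization is an assumption for illustration, not the authors' definition.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def npsq(ll_with_q: Sequence[float], ll_no_q: Sequence[float], correct: int) -> float:
    """Hypothetical NPSQ: 0 means the question adds nothing beyond the
    choices-only baseline; 1 means it shifts all available mass to the
    correct option."""
    p_with_q = softmax(ll_with_q)
    p_no_q = softmax(ll_no_q)
    shift = p_with_q[correct] - p_no_q[correct]
    headroom = 1.0 - p_no_q[correct]  # maximum possible upward shift
    return shift / headroom if headroom > 0 else 0.0
```

Under this reading, a choices-only shortcut that already concentrates mass on one option yields a small shift, which is exactly the confound the metric is meant to factor out.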
The authors develop an evaluation framework using adversarial choices (intentionally irrelevant options) to test metric robustness. They demonstrate that NPSQ remains stable when answer options are manipulated, while traditional metrics show significant vulnerability to superficial choice characteristics.
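The robustness check described above can be operationalized as a simple harness: score each item before and after its distractors are swapped for irrelevant options, and report the mean absolute change. Both `metric` and `make_adversarial` below are hypothetical stand-ins (any per-item score and any distractor-replacement rule), not the paper's implementation.

```python
from typing import Callable, Sequence


def robustness_gap(
    items: Sequence[dict],
    metric: Callable[[dict], float],
    make_adversarial: Callable[[dict], dict],
) -> float:
    """Mean absolute change in a per-item metric when distractors are
    replaced by intentionally irrelevant options (gold answer kept fixed).
    A metric robust to superficial choice characteristics should yield
    a gap near zero."""
    gaps = [abs(metric(it) - metric(make_adversarial(it))) for it in items]
    return sum(gaps) / len(gaps)
```

The paper's claim amounts to NPSQ producing a small gap under such manipulation while traditional log-likelihood-based metrics produce a large one.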
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Concept and quantification of choice sensitivity
The authors formally define choice sensitivity as the tendency for model decisions to be more influenced by answer options than by genuine question understanding. They provide a quantitative measure showing that 20-60% of model selections are determined by intrinsic differences among answer choices, independent of question context.
[51] UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions
[52] AutoDrive-QA: Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models
[53] Early Prediction of Student Knowledge in Game-Based Learning with Distributed Representations of Assessment Questions
[54] Social Question Answering: Textual, User, and Network Features for Best Answer Prediction
[55] Group Discovery with Multiple-Choice Exams and Consumer Surveys: The Group-Question-Answer Model
Normalized Probability Shift by the Question (NPSQ) scoring method
The authors propose NPSQ, a novel evaluation metric that normalizes the probability shift caused by the question's presence. This method isolates the question's impact from confounding effects of answer choices, providing a more robust assessment of true question understanding compared to traditional log-likelihood-based metrics.
[66] BBQ: A Hand-Built Bias Benchmark for Question Answering
[67] ChatGPT Answers Common Patient Questions About Colonoscopy
[68] User Response Data: The Potential for Errors and Biases
[69] Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
[70] Can Response Order Bias Evaluations?
[71] A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering
[72] Looking Under the Hood: Tools for Diagnosing Your Question Answering Engine
Evaluation framework demonstrating NPSQ robustness to adversarial choices
The authors develop an evaluation framework using adversarial choices (intentionally irrelevant options) to test metric robustness. They demonstrate that NPSQ remains stable when answer options are manipulated, while traditional metrics show significant vulnerability to superficial choice characteristics.