Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Overview
Overall Novelty Assessment
The paper introduces SciReas and SciReas-Pro, holistic benchmark suites for scientific reasoning, alongside KRUX, a probing framework that disentangles knowledge retrieval from reasoning steps. It resides in the Holistic Scientific Reasoning Benchmarks leaf, which contains three papers total. This leaf sits within the broader Scientific Reasoning Evaluation and Benchmarking branch, indicating a moderately populated research direction focused on systematic assessment rather than method development. The taxonomy reveals that holistic benchmarking is less crowded than decomposition or retrieval-augmented approaches, suggesting room for comprehensive evaluation frameworks.
The paper's leaf neighbors Domain-Specific Reasoning Evaluation and Fine-Grained Reasoning Analysis, which target narrower scopes—single domains or subtask-level diagnostics. Nearby branches include Knowledge-Reasoning Separation Frameworks and Retrieval-Augmented Generation, which develop methods rather than benchmarks. The taxonomy's scope note clarifies that holistic benchmarks probe multiple tasks and dimensions simultaneously, excluding single-domain evaluations. This positioning suggests the work bridges evaluation and methodological insights, connecting to causal attribution studies in Knowledge Utilization and Reasoning Transparency that also examine knowledge-reasoning interplay.
Among the thirty candidates examined, none clearly refuted any of the three contributions. Each contribution was checked against ten candidates with zero refutations: the SciReas benchmark suite, the KRUX framework, and the empirical findings on knowledge bottlenecks. Given the limited search scope (top-K semantic matches plus citation expansion), this indicates only that, within the retrieved literature, no prior work provides an overlapping holistic benchmark or a probing framework of identical design. The absence of refutations across all contributions suggests either genuine novelty or that closely related work lies outside the search radius.
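To make the stated search procedure concrete, the sketch below shows one way top-K semantic matching with one-hop citation expansion could be implemented. It is an illustration only: the embedding vectors, the `citations` map, and all function names are assumptions, as the report does not describe the actual retrieval code.

```python
# Minimal sketch of the search step described above: top-K semantic matching
# over paper embeddings, followed by one hop of citation expansion. All names
# here (corpus_vecs, citations, k) are illustrative assumptions; the report
# does not specify the actual retrieval implementation.
import numpy as np

def top_k_semantic(query_vec, corpus_vecs, k=10):
    """Indices of the k most cosine-similar corpus papers to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

def retrieve_candidates(query_vec, corpus_vecs, citations, k=10):
    """Top-K semantic matches plus the papers each match cites (one hop)."""
    seeds = top_k_semantic(query_vec, corpus_vecs, k)
    expanded = set(int(i) for i in seeds)
    for idx in expanded.copy():
        expanded.update(citations.get(idx, []))  # citation expansion
    return sorted(expanded)
```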
Based on the thirty-candidate search, the work appears to occupy a relatively sparse niche within scientific reasoning evaluation. The taxonomy structure confirms that holistic benchmarking is less saturated than decomposition or RAG methods, and the contribution-level statistics show no immediate prior overlap. However, the limited search scope means this assessment reflects top-ranked semantic neighbors rather than exhaustive field coverage. The findings on knowledge bottlenecks may intersect with broader studies in Knowledge Utilization, though no specific refutations emerged here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors curate SciReas, a unified evaluation suite combining 10 existing scientific benchmarks across multiple domains and task formats with standardized implementation. They also construct SciReas-Pro, a compact subset containing only instances that require complex reasoning, identified by performance differences under varying inference budgets.
The authors propose KRUX (Knowledge & Reasoning Utilization eXams), a framework that extracts atomic knowledge ingredients from model reasoning traces and provides them in-context to systematically separate and analyze the distinct roles of knowledge recall versus reasoning capability in scientific problem-solving.
Through controlled experiments using KRUX, the authors demonstrate that retrieving task-relevant parametric knowledge is a critical bottleneck, that reasoning-enhanced models gain additional improvements from external knowledge, and that chain-of-thought fine-tuning helps models surface relevant knowledge they already possess.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[13] Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers
[14] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Contribution Analysis
Detailed comparisons for each claimed contribution
SciReas and SciReas-Pro benchmarks
The authors curate SciReas, a unified evaluation suite combining 10 existing scientific benchmarks across multiple domains and task formats with standardized implementation. They also construct SciReas-Pro, a compact subset containing only instances that require complex reasoning, identified by performance differences under varying inference budgets.
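A minimal sketch of the selection rule this description implies: keep an instance when a model answers it correctly under a large inference budget but not under a small one, so that extra reasoning compute flips the outcome. The two-budget setup, field names, and toy data are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of the SciReas-Pro selection rule: an instance is kept when
# correctness differs across inference budgets, i.e. extra reasoning compute
# flips the outcome. `results` maps instance id -> {budget: is_correct};
# the budget labels are illustrative assumptions.
def select_reasoning_heavy(results, low="low_budget", high="high_budget"):
    keep = []
    for inst_id, by_budget in results.items():
        if by_budget[high] and not by_budget[low]:
            keep.append(inst_id)  # solvable, but only with more reasoning
    return keep

results = {
    "q1": {"low_budget": True,  "high_budget": True},   # easy: dropped
    "q2": {"low_budget": False, "high_budget": True},   # reasoning-heavy: kept
    "q3": {"low_budget": False, "high_budget": False},  # unsolved: dropped
}
assert select_reasoning_heavy(results) == ["q2"]
```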
[42] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
[43] Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions
[44] SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
[45] Towards Generalist Biomedical AI
[46] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[47] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
[48] An MLCommons Scientific Benchmarks Ontology
[49] SCI-Verifier: Scientific Verifier with Thinking
[50] USB: A Unified Summarization Benchmark Across Tasks and Domains
[51] MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
KRUX probing framework
The authors propose KRUX (Knowledge & Reasoning Utilization eXams), a framework that extracts atomic knowledge ingredients from model reasoning traces and provides them in-context to systematically separate and analyze the distinct roles of knowledge recall versus reasoning capability in scientific problem-solving.
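The following sketch illustrates the probing loop this description suggests, under stated assumptions: extract atomic knowledge ingredients from a reasoning trace, re-ask the question with those ingredients in context, and contrast the result with a closed-book baseline. The `llm` callable and the prompt wording are hypothetical stand-ins; KRUX's actual templates and extraction model are not given here.

```python
# Illustrative sketch of a KRUX-style probe: (1) extract atomic knowledge
# ingredients from a reasoning trace, (2) re-ask the question with the
# ingredients in context, (3) compare against the no-ingredient baseline.
# `llm` is a hypothetical text-completion callable; prompts are paraphrases,
# not the paper's actual templates.
def extract_ingredients(llm, reasoning_trace):
    prompt = (
        "List the atomic, self-contained knowledge facts used in the "
        "following reasoning trace, one per line:\n\n" + reasoning_trace
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def answer(llm, question, ingredients=None):
    context = ""
    if ingredients:
        context = "Relevant facts:\n" + "\n".join(f"- {k}" for k in ingredients) + "\n\n"
    return llm(context + f"Question: {question}\nAnswer:").strip()

def krux_probe(llm, question, reasoning_trace, gold):
    """Contrast closed-book accuracy with ingredient-in-context accuracy."""
    ingredients = extract_ingredients(llm, reasoning_trace)
    return {
        "baseline": answer(llm, question) == gold,
        "with_knowledge": answer(llm, question, ingredients) == gold,
    }
```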
[1] Disentangling Memory and Reasoning Ability in Large Language Models
[2] Disentangling Reasoning and Knowledge in Medical Large Language Models
[52] Dissociating Language and Thought in Large Language Models
[53] Long-Form Factuality in Large Language Models
[54] WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models
[55] Towards Understanding Factual Knowledge of Large Language Models
[56] Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
[57] Sources of Hallucination by Large Language Models on Inference Tasks
[58] Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
[59] Co-Occurrence Is Not Factual Association in Language Models
Empirical findings on knowledge-reasoning interplay
Through controlled experiments using KRUX, the authors demonstrate that retrieving task-relevant parametric knowledge is a critical bottleneck, that reasoning-enhanced models gain additional improvements from external knowledge, and that chain-of-thought fine-tuning helps models surface relevant knowledge they already possess.
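To show how such findings could be read off KRUX-style probe outputs, here is a toy aggregation that contrasts the gain from injected knowledge for a base model and a reasoning-enhanced variant, building on the `krux_probe` sketch above. A positive gain for both would match the reported pattern: knowledge retrieval is a bottleneck, and reasoning-enhanced models still benefit from external knowledge. All model names and outcome values below are illustrative placeholders, not the paper's results.

```python
# Sketch of the controlled comparison behind these findings, with assumed
# data shapes: `probe_results[model]` is a list of outcomes like those
# returned by `krux_probe` above. The gain from in-context knowledge
# isolates knowledge retrieval as a bottleneck; comparing gains across a
# base model and its reasoning-enhanced variant tests whether the latter
# still benefits from external knowledge.
def accuracy(outcomes, key):
    return sum(o[key] for o in outcomes) / len(outcomes)

def knowledge_gain(outcomes):
    return accuracy(outcomes, "with_knowledge") - accuracy(outcomes, "baseline")

probe_results = {  # fabricated toy outcomes for illustration only
    "base-model":      [{"baseline": False, "with_knowledge": True},
                        {"baseline": True,  "with_knowledge": True}],
    "reasoning-model": [{"baseline": True,  "with_knowledge": True},
                        {"baseline": False, "with_knowledge": True}],
}
for model, outcomes in probe_results.items():
    print(model, f"gain from in-context knowledge: {knowledge_gain(outcomes):+.2f}")
```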