Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: reasoning, knowledge tracing/discovering/inducing, applications
Abstract:

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks.

To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks.

Combining the two, we conduct an in-depth analysis that yields several key findings:

(1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning;

(2) Reasoning models consistently benefit from external knowledge added in-context, on top of their reasoning enhancements;

(3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SciReas and SciReas-Pro, holistic benchmark suites for scientific reasoning, alongside KRUX, a probing framework that disentangles knowledge retrieval from reasoning steps. It resides in the Holistic Scientific Reasoning Benchmarks leaf, which contains three papers total. This leaf sits within the broader Scientific Reasoning Evaluation and Benchmarking branch, indicating a moderately populated research direction focused on systematic assessment rather than method development. The taxonomy reveals that holistic benchmarking is less crowded than decomposition or retrieval-augmented approaches, suggesting room for comprehensive evaluation frameworks.

The paper's leaf neighbors Domain-Specific Reasoning Evaluation and Fine-Grained Reasoning Analysis, which target narrower scopes—single domains or subtask-level diagnostics. Nearby branches include Knowledge-Reasoning Separation Frameworks and Retrieval-Augmented Generation, which develop methods rather than benchmarks. The taxonomy's scope note clarifies that holistic benchmarks probe multiple tasks and dimensions simultaneously, excluding single-domain evaluations. This positioning suggests the work bridges evaluation and methodological insights, connecting to causal attribution studies in Knowledge Utilization and Reasoning Transparency that also examine knowledge-reasoning interplay.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Ten candidates were compared against the SciReas benchmark suite with zero refutations, and the same held for the ten candidates each compared against the KRUX framework and the empirical findings on knowledge bottlenecks. Given the limited search scope (top-K semantic matches plus citation expansion), this indicates only that, within the retrieved literature, no prior work provides an overlapping holistic benchmark or a probing framework of identical design. The absence of refutations across all contributions suggests either genuine novelty or that closely related work lies outside the search radius.

Based on the thirty-candidate search, the work appears to occupy a relatively sparse niche within scientific reasoning evaluation. The taxonomy structure confirms that holistic benchmarking is less saturated than decomposition or RAG methods, and the contribution-level statistics show no immediate prior overlap. However, the limited search scope means this assessment reflects top-ranked semantic neighbors rather than exhaustive field coverage. The findings on knowledge bottlenecks may intersect with broader studies in Knowledge Utilization, though no specific refutations emerged here.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Disentangling knowledge and reasoning in scientific problem-solving with large language models. The field has organized itself around several complementary perspectives on how LLMs combine factual knowledge with multi-step inference. Knowledge-Reasoning Separation Frameworks explicitly isolate memorized facts from logical operations, often through dual-system architectures or modular pipelines that treat retrieval and inference as distinct stages. Question Decomposition approaches break complex queries into simpler sub-problems, enabling transparent reasoning chains, while Mathematical Reasoning Decomposition specializes these techniques for formal domains. Retrieval-Augmented Generation methods inject external knowledge to support reasoning steps, and Integrated Reasoning Architectures unify these components into end-to-end systems. Scientific Reasoning Evaluation and Benchmarking provides datasets and metrics to assess both knowledge recall and inferential capabilities, and branches focusing on Knowledge Utilization examine how models access and apply domain facts, alongside studies of LLM Reasoning Paradigms that explore prompting strategies and guidance mechanisms.

Recent work has intensified efforts to measure whether LLM failures stem from missing knowledge or flawed reasoning. Holistic Scientific Reasoning Benchmarks, where Probing Knowledge Reasoning[0] resides, aim to systematically probe both dimensions across scientific domains, contrasting with narrower evaluations like LLM Scientific Reasoning[13] or Knowledge or Reasoning[14] that focus on specific failure modes. Probing Knowledge Reasoning[0] emphasizes controlled experimental designs that vary knowledge availability and reasoning complexity independently, situating it alongside efforts such as Disentangling Memory Reasoning[1] and Medical Knowledge Reasoning[2], which similarly seek to attribute model behavior to distinct cognitive components.

This diagnostic perspective complements decomposition-based methods like Versatile Decomposers[5] and retrieval-augmented frameworks, offering a principled basis for understanding when scientific problem-solving breaks down and how interventions might target knowledge gaps versus inferential weaknesses.

Claimed Contributions

SciReas and SciReas-Pro benchmarks

The authors curate SciReas, a unified evaluation suite that combines 10 existing scientific benchmarks across multiple domains and task formats under a standardized implementation. They also construct SciReas-Pro, a compact subset containing only instances that require complex reasoning, identified by performance differences under varying inference budgets.
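The selection criterion described above can be sketched as a simple filter: keep only instances whose solve rate rises sharply when the model is given a larger inference budget. The function name, data layout, and threshold below are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a budget-gap filter for a "Pro" subset:
# instances solved under a large inference (thinking-token) budget but
# not under a small one are taken to require complex reasoning.

def select_pro_subset(instances, solved_low, solved_high, min_gap=0.5):
    """Return instance ids whose solve rate improves by at least min_gap.

    instances   -- list of benchmark item ids
    solved_low  -- dict: id -> solve rate with a small inference budget
    solved_high -- dict: id -> solve rate with a large inference budget
    min_gap     -- minimum improvement to count as "requires reasoning"
    """
    return [
        ex for ex in instances
        if solved_high[ex] - solved_low[ex] >= min_gap
    ]

# Toy example with made-up solve rates:
items = ["q1", "q2", "q3"]
low = {"q1": 0.9, "q2": 0.1, "q3": 0.2}
high = {"q1": 0.9, "q2": 0.8, "q3": 0.4}
print(select_pro_subset(items, low, high))  # -> ['q2']
```

Here q1 is easy at any budget and q3 barely improves, so only q2 (a 0.7-point jump) survives the filter; the threshold controls how aggressively the subset is pruned.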

10 retrieved papers
KRUX probing framework

The authors propose KRUX (Knowledge & Reasoning Utilization eXams), a framework that extracts atomic knowledge ingredients from model reasoning traces and provides them in-context to systematically separate and analyze the distinct roles of knowledge recall versus reasoning capability in scientific problem-solving.
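The probing loop described above can be sketched as: distill knowledge items from a reasoning trace, then answer the same question with and without those items in context, so any accuracy gap can be attributed to knowledge availability rather than reasoning ability. All function names and the prompt format below are illustrative assumptions, not KRUX's actual interface.

```python
# Hypothetical sketch of a knowledge-injection probe in the spirit of
# KRUX: atomic knowledge items extracted from one reasoning trace are
# supplied in-context to a solver, isolating knowledge recall from
# reasoning capability.

def build_probe_prompt(question, knowledge_items):
    """Prepend extracted knowledge facts to the task prompt."""
    facts = "\n".join(f"- {k}" for k in knowledge_items)
    return f"Relevant facts:\n{facts}\n\nQuestion: {question}"

def knowledge_probe(question, trace_extractor, solver, reasoning_trace):
    """Run the solver with and without extracted knowledge in context.

    trace_extractor -- callable: reasoning trace -> list of fact strings
    solver          -- callable: prompt -> answer (model call in practice)
    """
    # Step 1: distill atomic knowledge items from the reasoning trace.
    knowledge = trace_extractor(reasoning_trace)
    # Step 2: answer with and without the knowledge in context.
    baseline = solver(f"Question: {question}")
    augmented = solver(build_probe_prompt(question, knowledge))
    # Step 3: the baseline/augmented gap reflects knowledge recall.
    return baseline, augmented
```

In practice `solver` would wrap an LLM call and the gap would be aggregated over a benchmark; the point of the design is that the reasoning demand is held fixed while knowledge availability is varied.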

10 retrieved papers
Empirical findings on knowledge-reasoning interplay

Through controlled experiments using KRUX, the authors demonstrate that retrieving task-relevant parametric knowledge is a critical bottleneck, that reasoning-enhanced models gain additional improvements from external knowledge, and that chain-of-thought fine-tuning helps models surface relevant knowledge they already possess.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

