Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Overview
Overall Novelty Assessment
The paper introduces SciReas and SciReas-Pro, holistic benchmark suites for scientific reasoning, alongside KRUX, a probing framework that disentangles knowledge retrieval from reasoning steps. It resides in the Holistic Scientific Reasoning Benchmarks leaf, which contains three papers total. This leaf sits within the broader Scientific Reasoning Evaluation and Benchmarking branch, indicating a moderately populated research direction focused on systematic assessment rather than method development. The taxonomy reveals that holistic benchmarking is less crowded than decomposition or retrieval-augmented approaches, suggesting room for comprehensive evaluation frameworks.
The paper's leaf neighbors Domain-Specific Reasoning Evaluation and Fine-Grained Reasoning Analysis, which target narrower scopes—single domains or subtask-level diagnostics. Nearby branches include Knowledge-Reasoning Separation Frameworks and Retrieval-Augmented Generation, which develop methods rather than benchmarks. The taxonomy's scope note clarifies that holistic benchmarks probe multiple tasks and dimensions simultaneously, excluding single-domain evaluations. This positioning suggests the work bridges evaluation and methodological insights, connecting to causal attribution studies in Knowledge Utilization and Reasoning Transparency that also examine knowledge-reasoning interplay.
Among the thirty candidates examined, none clearly refuted any of the three contributions. Each contribution was checked against ten candidates with zero refutations: the SciReas benchmark suite, the KRUX framework, and the empirical findings on knowledge bottlenecks. Given the limited search scope (top-K semantic matches plus citation expansion), this indicates only that, within the retrieved literature, no prior work provides an overlapping holistic benchmark or a probing framework of identical design. The absence of refutations across all contributions suggests either genuine novelty or that closely related work lies outside the search radius.
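To make the stated search procedure concrete, the sketch below shows one way top-K semantic matching with one-hop citation expansion could be implemented. It is an illustration only: the embedding vectors, the `citations` map, and all function names are assumptions, as the report does not describe the actual retrieval code.

```python
# Minimal sketch of the search step described above: top-K semantic matching
# over paper embeddings, followed by one hop of citation expansion. All names
# here (corpus_vecs, citations, k) are illustrative assumptions; the report
# does not specify the actual retrieval implementation.
import numpy as np

def top_k_semantic(query_vec, corpus_vecs, k=10):
    """Indices of the k most cosine-similar corpus papers to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

def retrieve_candidates(query_vec, corpus_vecs, citations, k=10):
    """Top-K semantic matches plus the papers each match cites (one hop)."""
    seeds = top_k_semantic(query_vec, corpus_vecs, k)
    expanded = set(int(i) for i in seeds)
    for idx in expanded.copy():
        expanded.update(citations.get(idx, []))  # citation expansion
    return sorted(expanded)
```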
Based on the thirty-candidate search, the work appears to occupy a relatively sparse niche within scientific reasoning evaluation. The taxonomy structure confirms that holistic benchmarking is less saturated than decomposition or RAG methods, and the contribution-level statistics show no immediate prior overlap. However, the limited search scope means this assessment reflects top-ranked semantic neighbors rather than exhaustive field coverage. The findings on knowledge bottlenecks may intersect with broader studies in Knowledge Utilization, though no specific refutations emerged here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors curate SciReas, a unified evaluation suite combining 10 existing scientific benchmarks across multiple domains and task formats with standardized implementation. They also construct SciReas-Pro, a compact subset containing only instances that require complex reasoning, identified by performance differences under varying inference budgets.
The authors propose KRUX (Knowledge & Reasoning Utilization eXams), a framework that extracts atomic knowledge ingredients from model reasoning traces and provides them in-context to systematically separate and analyze the distinct roles of knowledge recall versus reasoning capability in scientific problem-solving.
Through controlled experiments using KRUX, the authors demonstrate that retrieving task-relevant parametric knowledge is a critical bottleneck, that reasoning-enhanced models gain additional improvements from external knowledge, and that chain-of-thought fine-tuning helps models surface relevant knowledge they already possess.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[13] Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers
[14] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Contribution Analysis
Detailed comparisons for each claimed contribution
SciReas and SciReas-Pro benchmarks
The authors curate SciReas, a unified evaluation suite combining 10 existing scientific benchmarks across multiple domains and task formats with standardized implementation. They also construct SciReas-Pro, a compact subset containing only instances that require complex reasoning, identified by performance differences under varying inference budgets.
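A minimal sketch of the selection rule this description implies: keep an instance when a model answers it correctly under a large inference budget but not under a small one, so that extra reasoning compute flips the outcome. The two-budget setup, field names, and toy data are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of the SciReas-Pro selection rule: an instance is kept when
# correctness differs across inference budgets, i.e. extra reasoning compute
# flips the outcome. `results` maps instance id -> {budget: is_correct};
# the budget labels are illustrative assumptions.
def select_reasoning_heavy(results, low="low_budget", high="high_budget"):
    keep = []
    for inst_id, by_budget in results.items():
        if by_budget[high] and not by_budget[low]:
            keep.append(inst_id)  # solvable, but only with more reasoning
    return keep

results = {
    "q1": {"low_budget": True,  "high_budget": True},   # easy: dropped
    "q2": {"low_budget": False, "high_budget": True},   # reasoning-heavy: kept
    "q3": {"low_budget": False, "high_budget": False},  # unsolved: dropped
}
assert select_reasoning_heavy(results) == ["q2"]
```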
[42] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
[43] Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions
[44] SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
[45] Towards Generalist Biomedical AI
[46] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[47] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
[48] An MLCommons Scientific Benchmarks Ontology
[49] SCI-Verifier: Scientific Verifier with Thinking
[50] USB: A Unified Summarization Benchmark Across Tasks and Domains
[51] MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
KRUX probing framework
The authors propose KRUX (Knowledge & Reasoning Utilization eXams), a framework that extracts atomic knowledge ingredients from model reasoning traces and provides them in-context to systematically separate and analyze the distinct roles of knowledge recall versus reasoning capability in scientific problem-solving.
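The following sketch illustrates the probing loop this description suggests, under stated assumptions: extract atomic knowledge ingredients from a reasoning trace, re-ask the question with those ingredients in context, and contrast the result with a closed-book baseline. The `llm` callable and the prompt wording are hypothetical stand-ins; KRUX's actual templates and extraction model are not given here.

```python
# Illustrative sketch of a KRUX-style probe: (1) extract atomic knowledge
# ingredients from a reasoning trace, (2) re-ask the question with the
# ingredients in context, (3) compare against the no-ingredient baseline.
# `llm` is a hypothetical text-completion callable; prompts are paraphrases,
# not the paper's actual templates.
def extract_ingredients(llm, reasoning_trace):
    prompt = (
        "List the atomic, self-contained knowledge facts used in the "
        "following reasoning trace, one per line:\n\n" + reasoning_trace
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def answer(llm, question, ingredients=None):
    context = ""
    if ingredients:
        context = "Relevant facts:\n" + "\n".join(f"- {k}" for k in ingredients) + "\n\n"
    return llm(context + f"Question: {question}\nAnswer:").strip()

def krux_probe(llm, question, reasoning_trace, gold):
    """Contrast closed-book accuracy with ingredient-in-context accuracy."""
    ingredients = extract_ingredients(llm, reasoning_trace)
    return {
        "baseline": answer(llm, question) == gold,
        "with_knowledge": answer(llm, question, ingredients) == gold,
    }
```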
[1] Disentangling Memory and Reasoning Ability in Large Language Models
[2] Disentangling Reasoning and Knowledge in Medical Large Language Models
[52] Dissociating Language and Thought in Large Language Models
[53] Long-Form Factuality in Large Language Models
[54] WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models
[55] Towards Understanding Factual Knowledge of Large Language Models
[56] Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
[57] Sources of Hallucination by Large Language Models on Inference Tasks
[58] Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
[59] Co-Occurrence Is Not Factual Association in Language Models
Empirical findings on knowledge-reasoning interplay
Through controlled experiments using KRUX, the authors demonstrate that retrieving task-relevant parametric knowledge is a critical bottleneck, that reasoning-enhanced models gain additional improvements from external knowledge, and that chain-of-thought fine-tuning helps models surface relevant knowledge they already possess.
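To show how such findings could be read off KRUX-style probe outputs, here is a toy aggregation that contrasts the gain from injected knowledge for a base model and a reasoning-enhanced variant, building on the `krux_probe` sketch above. A positive gain for both would match the reported pattern: knowledge retrieval is a bottleneck, and reasoning-enhanced models still benefit from external knowledge. All model names and outcome values below are illustrative placeholders, not the paper's results.

```python
# Sketch of the controlled comparison behind these findings, with assumed
# data shapes: `probe_results[model]` is a list of outcomes like those
# returned by `krux_probe` above. The gain from in-context knowledge
# isolates knowledge retrieval as a bottleneck; comparing gains across a
# base model and its reasoning-enhanced variant tests whether the latter
# still benefits from external knowledge.
def accuracy(outcomes, key):
    return sum(o[key] for o in outcomes) / len(outcomes)

def knowledge_gain(outcomes):
    return accuracy(outcomes, "with_knowledge") - accuracy(outcomes, "baseline")

probe_results = {  # fabricated toy outcomes for illustration only
    "base-model":      [{"baseline": False, "with_knowledge": True},
                        {"baseline": True,  "with_knowledge": True}],
    "reasoning-model": [{"baseline": True,  "with_knowledge": True},
                        {"baseline": False, "with_knowledge": True}],
}
for model, outcomes in probe_results.items():
    print(model, f"gain from in-context knowledge: {knowledge_gain(outcomes):+.2f}")
```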