Abstract:

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SynthWorlds, a framework that constructs parallel corpora representing real-mapped and synthetic-mapped worlds to isolate reasoning from parametric knowledge. It resides in the 'Synthetic and Counterfactual Task Construction' leaf, which contains four papers total, indicating a moderately populated but not overcrowded research direction. This leaf focuses specifically on creating artificial environments to nullify memorized facts, positioning SynthWorlds among approaches that manipulate factual grounding to test pure reasoning capabilities. The framework's dual-world design with mirrored tasks (multi-hop QA and page navigation) represents a systematic attempt to control for task complexity while varying knowledge availability.

The taxonomy reveals that SynthWorlds sits within the broader 'Controlled Evaluation Frameworks' branch, which neighbors 'Domain-Specific Reasoning Assessment' and 'General Reasoning Benchmarks.' Adjacent branches include 'Mechanistic Analysis' (probing internal representations) and 'Knowledge Integration Mechanisms' (augmenting models with external knowledge). The leaf's scope note explicitly excludes methods that modify inference procedures or augment with external knowledge, clarifying that SynthWorlds focuses on evaluation design rather than model architecture. Nearby work in 'Inference Pipeline Decomposition' addresses modular separation of retrieval and reasoning, representing a complementary but architecturally distinct approach to the same core problem.

Of the thirty candidate papers examined (ten per contribution), the framework contribution drew one refutation, suggesting some methodological overlap with prior synthetic task construction efforts. The two remaining contributions, the parallel corpora construction and the knowledge advantage gap metric, drew zero refutations among their ten candidates each, indicating that these specific instantiations appear more novel within the limited search scope. These statistics suggest that while the general approach of synthetic environments has precedent, the particular implementation details and measurement framework may offer incremental advances. The analysis explicitly acknowledges examining only top-K semantic matches rather than exhaustive coverage, so additional related work may exist beyond this sample.

Based on the limited thirty-candidate search, SynthWorlds appears to occupy established methodological territory (synthetic task construction) while contributing specific design choices and metrics. The taxonomy structure shows this is an active but not saturated research direction, with the paper's sibling works representing close methodological neighbors. The contribution-level analysis suggests the framework concept has some prior overlap, while the specific corpora and metrics show less immediate refutation within the examined sample. A more exhaustive search would be needed to definitively assess novelty across the broader literature.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: disentangling reasoning ability from parametric factual knowledge in language models. The field has organized itself around several complementary perspectives. Controlled Evaluation Frameworks for Separating Reasoning from Knowledge emphasize synthetic and counterfactual task construction to isolate reasoning from memorized facts, while Mechanistic Analysis of Knowledge and Reasoning Representations probes internal model structures to understand how these capabilities are encoded. Knowledge Integration Mechanisms for Reasoning Enhancement and Prompting and Inference Architectures for Reasoning explore how external knowledge sources and architectural choices can augment reasoning without conflating it with parametric memory. Meanwhile, branches such as Knowledge Manipulation and Unlearning, Cognitive and Theoretical Perspectives on Reasoning, and Specialized Reasoning Paradigms and Applications address interventions, theoretical grounding, and domain-specific challenges. Inference Pipeline Decomposition and Separation focuses on modular architectures that explicitly separate retrieval from reasoning steps, as seen in works like ReAct[4].

A particularly active line of inquiry centers on constructing tasks where models cannot rely on memorized knowledge, forcing them to demonstrate genuine reasoning. SynthWorlds[0] exemplifies this approach by creating entirely synthetic environments, closely aligning with efforts like Reasoning or Reciting[5] and Counterfactual Reasoning[42], which similarly manipulate task content to prevent shortcut solutions. In contrast, works such as Context Role Reasoning[15] and Dissociating Language Thought[3] investigate how context provision versus parametric recall influences reasoning performance, revealing trade-offs between knowledge availability and reasoning transparency.
Across these branches, open questions persist about whether reasoning and knowledge can ever be fully separated, how to design evaluations that avoid confounds, and whether models truly generalize reasoning skills beyond their training distributions. SynthWorlds[0] sits squarely within the synthetic task construction cluster, sharing methodological kinship with Reasoning or Reciting[5] in its emphasis on controlled, knowledge-free environments, yet differing in the scope and complexity of the synthetic worlds it constructs.

Claimed Contributions

SynthWorlds framework for disentangling reasoning and knowledge

The authors introduce a framework that constructs parallel corpora representing two worlds with identical structure: one mapped to real-world entities where parametric knowledge is useful, and another mapped to synthetic entities where such knowledge is meaningless. This enables controlled evaluation of language models by separating reasoning ability from factual recall.
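As an illustrative sketch only (not the authors' implementation), the dual-world construction can be pictured as a one-to-one entity substitution over a structured fact corpus: the relation graph is kept identical while every real entity name is replaced by a meaningless synthetic one. All function names and the syllable-based naming scheme below are hypothetical.

```python
import random


def synthetic_name(rng: random.Random) -> str:
    """Generate a pronounceable but meaningless entity name (hypothetical scheme)."""
    syllables = ["ka", "lor", "vin", "zu", "mer", "tho", "qui", "nax"]
    return "".join(rng.choice(syllables) for _ in range(3)).capitalize()


def build_parallel_worlds(facts, rng=None):
    """Map real-world (subject, relation, object) facts to a synthetic-mapped copy.

    The substitution is consistent: an entity gets one synthetic name and keeps
    it everywhere, so the two worlds share an identical relation structure.
    """
    rng = rng or random.Random(0)
    mapping = {}

    def sub(entity):
        if entity not in mapping:
            mapping[entity] = synthetic_name(rng)
        return mapping[entity]

    synthetic = [(sub(s), r, sub(o)) for s, r, o in facts]
    return synthetic, mapping


real_facts = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
]
synth_facts, mapping = build_parallel_worlds(real_facts)
# Structure is preserved: "France" maps to the same synthetic entity in both facts.
assert synth_facts[0][2] == synth_facts[1][0]
```

The key property this toy mapping illustrates is the one the framework relies on: any multi-hop chain solvable in the real-mapped world is solvable by the identical chain in the synthetic-mapped world, but parametric knowledge about the real entities no longer helps.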

Retrieved papers compared: 10 (1 can refute)
Two parallel corpora with corresponding task datasets

The authors construct two parallel corpora (SYNTHWORLD-RM and SYNTHWORLD-SM), each containing 6,920 documents covering 161K facts, along with 1.2K multi-hop QA and 1K page navigation instances. These resources are released publicly to support future research.

Retrieved papers compared: 10 (0 can refute)
Knowledge advantage gap metric and empirical analysis

The authors define and measure the knowledge advantage gap as the performance difference between real-mapped and synthetic-mapped settings. Their analysis reveals that this gap persists even with knowledge augmentation methods like retrieval-augmented generation and chain-of-thought prompting, highlighting opportunities for system improvements.
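A minimal sketch of the metric as described here (assuming accuracy is averaged over paired instances of the mirrored tasks; the function name and 0/1 score inputs are illustrative, not the authors' code):

```python
def knowledge_advantage_gap(scores_real, scores_synth):
    """Knowledge advantage gap: mean accuracy on the real-mapped world minus
    mean accuracy on the mirrored synthetic-mapped world.

    A positive gap indicates the model benefits from memorized parametric
    knowledge; zero would mean performance is driven by reasoning alone.
    Inputs are per-instance correctness scores (0 or 1) for paired items.
    """
    if len(scores_real) != len(scores_synth):
        raise ValueError("mirrored tasks should have paired instances")
    acc_real = sum(scores_real) / len(scores_real)
    acc_synth = sum(scores_synth) / len(scores_synth)
    return acc_real - acc_synth


# Hypothetical per-question correctness for a closed-book QA run:
gap = knowledge_advantage_gap([1, 1, 0, 1], [1, 0, 0, 0])
# acc_real = 0.75, acc_synth = 0.25, so gap = 0.5
```

Because the two worlds are structurally identical by construction, any nonzero gap under this definition is attributable to knowledge rather than to differing task difficulty.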

Retrieved papers compared: 10 (0 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: SynthWorlds framework for disentangling reasoning and knowledge

Contribution: Two parallel corpora with corresponding task datasets

Contribution: Knowledge advantage gap metric and empirical analysis