Abstract:

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SynthWorlds, a framework that constructs parallel corpora representing real-mapped and synthetic-mapped worlds to isolate reasoning from parametric knowledge. It resides in the 'Synthetic and Counterfactual Task Construction' leaf, which contains four papers total, indicating a moderately populated but not overcrowded research direction. This leaf focuses specifically on creating artificial environments to nullify memorized facts, positioning SynthWorlds among approaches that manipulate factual grounding to test pure reasoning capabilities. The framework's dual-world design with mirrored tasks (multi-hop QA and page navigation) represents a systematic attempt to control for task complexity while varying knowledge availability.

The taxonomy reveals that SynthWorlds sits within the broader 'Controlled Evaluation Frameworks' branch, which neighbors 'Domain-Specific Reasoning Assessment' and 'General Reasoning Benchmarks.' Adjacent branches include 'Mechanistic Analysis' (probing internal representations) and 'Knowledge Integration Mechanisms' (augmenting models with external knowledge). The leaf's scope note explicitly excludes methods that modify inference procedures or augment with external knowledge, clarifying that SynthWorlds focuses on evaluation design rather than model architecture. Nearby work in 'Inference Pipeline Decomposition' addresses modular separation of retrieval and reasoning, representing a complementary but architecturally distinct approach to the same core problem.

Of the thirty candidate papers examined (ten per contribution), the framework contribution drew one refutation, suggesting some methodological overlap with prior synthetic task construction efforts. The two remaining contributions, the parallel corpora construction and the knowledge advantage gap metric, drew zero refutations among their ten candidates each, indicating that these specific instantiations appear more novel within the limited search scope. These statistics suggest that while the general approach of synthetic environments has precedent, the particular implementation details and measurement framework may offer incremental advances. The analysis explicitly acknowledges examining only top-K semantic matches rather than exhaustive coverage, so additional related work may exist beyond this sample.

Based on the limited thirty-candidate search, SynthWorlds appears to occupy established methodological territory (synthetic task construction) while contributing specific design choices and metrics. The taxonomy structure shows this is an active but not saturated research direction, with the paper's sibling works representing close methodological neighbors. The contribution-level analysis suggests the framework concept has some prior overlap, while the specific corpora and metrics show less immediate refutation within the examined sample. A more exhaustive search would be needed to definitively assess novelty across the broader literature.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: disentangling reasoning ability from parametric factual knowledge in language models. The field has organized itself around several complementary perspectives. Controlled Evaluation Frameworks for Separating Reasoning from Knowledge emphasize synthetic and counterfactual task construction to isolate reasoning from memorized facts, while Mechanistic Analysis of Knowledge and Reasoning Representations probes internal model structures to understand how these capabilities are encoded. Knowledge Integration Mechanisms for Reasoning Enhancement and Prompting and Inference Architectures for Reasoning explore how external knowledge sources and architectural choices can augment reasoning without conflating it with parametric memory. Meanwhile, branches such as Knowledge Manipulation and Unlearning, Cognitive and Theoretical Perspectives on Reasoning, and Specialized Reasoning Paradigms and Applications address interventions, theoretical grounding, and domain-specific challenges. Inference Pipeline Decomposition and Separation focuses on modular architectures that explicitly separate retrieval from reasoning steps, as seen in works like ReAct[4].

A particularly active line of inquiry centers on constructing tasks where models cannot rely on memorized knowledge, forcing them to demonstrate genuine reasoning. SynthWorlds[0] exemplifies this approach by creating entirely synthetic environments, closely aligning with efforts like Reasoning or Reciting[5] and Counterfactual Reasoning[42], which similarly manipulate task content to prevent shortcut solutions. In contrast, works such as Context Role Reasoning[15] and Dissociating Language Thought[3] investigate how context provision versus parametric recall influences reasoning performance, revealing trade-offs between knowledge availability and reasoning transparency.
Across these branches, open questions persist about whether reasoning and knowledge can ever be fully separated, how to design evaluations that avoid confounds, and whether models truly generalize reasoning skills beyond their training distributions. SynthWorlds[0] sits squarely within the synthetic task construction cluster, sharing methodological kinship with Reasoning or Reciting[5] in its emphasis on controlled, knowledge-free environments, yet differing in the scope and complexity of the synthetic worlds it constructs.

Claimed Contributions

SynthWorlds framework for disentangling reasoning and knowledge

The authors introduce a framework that constructs parallel corpora representing two worlds with identical structure: one mapped to real-world entities where parametric knowledge is useful, and another mapped to synthetic entities where such knowledge is meaningless. This enables controlled evaluation of language models by separating reasoning ability from factual recall.
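As an illustrative sketch only (not the authors' implementation), the dual-world construction can be pictured as a one-to-one entity substitution over a structured fact corpus: the relation graph is kept identical while every real entity name is replaced by a meaningless synthetic one. All function names and the syllable-based naming scheme below are hypothetical.

```python
import random


def synthetic_name(rng: random.Random) -> str:
    """Generate a pronounceable but meaningless entity name (hypothetical scheme)."""
    syllables = ["ka", "lor", "vin", "zu", "mer", "tho", "qui", "nax"]
    return "".join(rng.choice(syllables) for _ in range(3)).capitalize()


def build_parallel_worlds(facts, rng=None):
    """Map real-world (subject, relation, object) facts to a synthetic-mapped copy.

    The substitution is consistent: an entity gets one synthetic name and keeps
    it everywhere, so the two worlds share an identical relation structure.
    """
    rng = rng or random.Random(0)
    mapping = {}

    def sub(entity):
        if entity not in mapping:
            mapping[entity] = synthetic_name(rng)
        return mapping[entity]

    synthetic = [(sub(s), r, sub(o)) for s, r, o in facts]
    return synthetic, mapping


real_facts = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
]
synth_facts, mapping = build_parallel_worlds(real_facts)
# Structure is preserved: "France" maps to the same synthetic entity in both facts.
assert synth_facts[0][2] == synth_facts[1][0]
```

The key property this toy mapping illustrates is the one the framework relies on: any multi-hop chain solvable in the real-mapped world is solvable by the identical chain in the synthetic-mapped world, but parametric knowledge about the real entities no longer helps.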

Retrieved papers compared: 10 (1 can refute)
Two parallel corpora with corresponding task datasets

The authors construct two parallel corpora (SYNTHWORLD-RM and SYNTHWORLD-SM), each containing 6,920 documents covering 161K facts, along with 1.2K multi-hop QA and 1K page navigation instances. These resources are released publicly to support future research.

Retrieved papers compared: 10 (0 can refute)
Knowledge advantage gap metric and empirical analysis

The authors define and measure the knowledge advantage gap as the performance difference between real-mapped and synthetic-mapped settings. Their analysis reveals that this gap persists even with knowledge augmentation methods like retrieval-augmented generation and chain-of-thought prompting, highlighting opportunities for system improvements.
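A minimal sketch of the metric as described here (assuming accuracy is averaged over paired instances of the mirrored tasks; the function name and 0/1 score inputs are illustrative, not the authors' code):

```python
def knowledge_advantage_gap(scores_real, scores_synth):
    """Knowledge advantage gap: mean accuracy on the real-mapped world minus
    mean accuracy on the mirrored synthetic-mapped world.

    A positive gap indicates the model benefits from memorized parametric
    knowledge; zero would mean performance is driven by reasoning alone.
    Inputs are per-instance correctness scores (0 or 1) for paired items.
    """
    if len(scores_real) != len(scores_synth):
        raise ValueError("mirrored tasks should have paired instances")
    acc_real = sum(scores_real) / len(scores_real)
    acc_synth = sum(scores_synth) / len(scores_synth)
    return acc_real - acc_synth


# Hypothetical per-question correctness for a closed-book QA run:
gap = knowledge_advantage_gap([1, 1, 0, 1], [1, 0, 0, 0])
# acc_real = 0.75, acc_synth = 0.25, so gap = 0.5
```

Because the two worlds are structurally identical by construction, any nonzero gap under this definition is attributable to knowledge rather than to differing task difficulty.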

Retrieved papers compared: 10 (0 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: SynthWorlds framework for disentangling reasoning and knowledge

Contribution: Two parallel corpora with corresponding task datasets

Contribution: Knowledge advantage gap metric and empirical analysis