A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition
Overview
Overall Novelty Assessment
The paper introduces a synthetic dataset of fictional events and question-answer pairs to study how language models memorize facts versus verbatim sequences. In the taxonomy, this work occupies the 'Synthetic Data for Knowledge Studies' leaf under 'Knowledge Acquisition and Learning Dynamics'. Notably, this leaf contains only one paper, the original submission itself, indicating a sparse research direction. The broader parent branch includes three leaves covering pretraining knowledge acquisition, synthetic data methods, and generalization from learning, with approximately ten papers in total, suggesting moderate activity in knowledge acquisition research but minimal prior focus on synthetic fictional data approaches.
The taxonomy reveals that neighboring research directions concentrate on pretraining dynamics with real-world corpora, continual learning frameworks for knowledge updates, and memorization phenomena studies that distinguish verbatim from factual retention. The 'Memorization Phenomena and Dynamics' branch contains four leaves with roughly fifteen papers examining verbatim memorization, factual retention patterns, and memorization-generalization trade-offs. The 'Knowledge Update and Injection Methods' branch explores post-training modifications through continual learning and editing techniques. The original paper diverges from these directions by constructing controlled fictional environments rather than analyzing real training data or updating existing models, positioning it at the intersection of memorization studies and knowledge acquisition research.
Among the twenty-eight candidates examined through semantic search and citation expansion, the contribution-level analysis reveals mixed novelty signals. For the FictionalQA dataset contribution, ten candidates were examined and one appears to provide overlapping prior work, suggesting some precedent for synthetic knowledge datasets. For the dataset generation pipeline, ten candidates were examined and none clearly refutes it, indicating potential novelty in the construction methodology. For the demonstration that verbatim and factual memorization conditions differ, eight candidates were examined and one constitutes a refutable match, implying that distinguishing these memorization types has received prior attention. Overall, two of the three contributions show at least one refutable candidate within the limited search scope.
Based on the limited literature search covering top-thirty semantic matches, the work appears to occupy a relatively unexplored niche within knowledge acquisition research. The taxonomy structure confirms sparse prior activity in synthetic fictional data methods, though neighboring areas like memorization phenomena and continual learning are more populated. The contribution-level statistics suggest that while individual aspects have some precedent, the integrated approach may offer distinctive value. However, the analysis does not cover exhaustive citation networks or domain-specific venues that might reveal additional related work.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce FictionalQA, a synthetically-generated dataset consisting of webtext-like documents about fictional events and associated question-answer pairs. The dataset is designed to enable controlled studies of fact memorization versus verbatim sequence memorization by combining realistic surface forms with fictional content that is factually disjoint from real-world knowledge.
The authors develop a hierarchical pipeline that generates fictional data through multiple stages: seed events, structured fictsheets, diverse document styles (news, social media, corporate, blog), and question-answer pairs. The pipeline is designed as a living asset that can be regenerated and adapted by other researchers for different experimental needs.
Through controlled experiments, the authors show that rapid verbatim memorization (training loss approaching zero with increasing validation loss) does not necessarily align with conditions that promote factual generalization (both training and validation loss decreasing). This finding highlights fundamental differences in how overfitting and generalization occur during language model training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
FictionalQA dataset for studying memorization and knowledge acquisition
The authors introduce FictionalQA, a synthetically-generated dataset consisting of webtext-like documents about fictional events and associated question-answer pairs. The dataset is designed to enable controlled studies of fact memorization versus verbatim sequence memorization by combining realistic surface forms with fictional content that is factually disjoint from real-world knowledge.
[55] Physics of language models: Part 3.1, knowledge storage and extraction
[9] Continual memorization of factoids in large language models
[49] Data mixing can induce phase transitions in knowledge acquisition
[51] Pretraining with artificial language: Studying transferable knowledge in language models
[52] Memorybank: Enhancing large language models with long-term memory
[53] CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
[54] Counterfactual Memorization in Neural Language Models
[56] Evaluating the long-term memory of large language models
[57] Demystifying Verbatim Memorization in Large Language Models
[58] Beyond memorization: The challenge of random memory access in language models
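The dataset structure described above can be sketched as a minimal record schema. The field names below (event, fictsheet, documents, qa_pairs) and the sample values are illustrative assumptions for exposition, not the released FictionalQA schema.

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class FictionalRecord:
    # Seed event the fictional world is built around
    event: str
    # Structured "fictsheet" of invented facts about the event
    fictsheet: dict
    # Webtext-like documents keyed by style (news, blog, ...)
    documents: dict
    # Question-answer pairs probing the invented facts
    qa_pairs: list = field(default_factory=list)

# Hypothetical example record; every value here is invented.
record = FictionalRecord(
    event="The Meridian Bridge reopening",
    fictsheet={"location": "Port Halloway", "year": "2041"},
    documents={"news": "PORT HALLOWAY -- The Meridian Bridge ..."},
    qa_pairs=[QAPair("Where did the reopening take place?", "Port Halloway")],
)
print(record.qa_pairs[0].answer)  # Port Halloway
```

Because the fictsheet facts are disjoint from real-world knowledge, any model that answers such questions correctly must have acquired them from the fictional documents themselves.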
Dataset generation pipeline for producing realistic fictional corpora
The authors develop a hierarchical pipeline that generates fictional data through multiple stages: seed events, structured fictsheets, diverse document styles (news, social media, corporate, blog), and question-answer pairs. The pipeline is designed as a living asset that can be regenerated and adapted by other researchers for different experimental needs.
[66] DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
[67] Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition
[68] Latent cascade synthesis: Investigating iterative pseudo-contextual scaffold formation in contemporary large language models
[69] Enhancing temporal commonsense understanding using disentangled attention-based method with a hybrid data framework
[70] Krikri: Advancing Open Large Language Models for Greek
[71] MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio
[72] KM-Chat: A Large-Scale Synthetic Question-Answer Dataset for Open-Domain Conversational AI
[73] StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation
[74] Multi-stage pre-training for low-resource domain adaptation
[75] Multi-stage training with improved negative contrast for neural passage retrieval
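The staged pipeline described above (seed events, fictsheets, styled documents, QA pairs) can be sketched structurally as follows. Each stage function is a deterministic stub standing in for an LLM call; the actual prompts, models, and outputs of the authors' pipeline are not reproduced here.

```python
# Structural sketch of a hierarchical fictional-data pipeline.
# All generator bodies are placeholders for LLM calls.

STYLES = ["news", "social", "corporate", "blog"]

def seed_events(n):
    # Stage 1: invent seed events
    return [f"fictional-event-{i}" for i in range(n)]

def make_fictsheet(event):
    # Stage 2: expand each event into a structured fact sheet
    return {"event": event, "who": f"{event}-actor", "where": f"{event}-place"}

def render_documents(sheet):
    # Stage 3: render the fact sheet in several document styles
    return {s: f"[{s}] {sheet['who']} at {sheet['where']}" for s in STYLES}

def make_qa(sheet):
    # Stage 4: derive QA pairs from the fact sheet
    return [("Who was involved?", sheet["who"]),
            ("Where did it happen?", sheet["where"])]

def build_corpus(n):
    corpus = []
    for event in seed_events(n):
        sheet = make_fictsheet(event)
        corpus.append({"fictsheet": sheet,
                       "documents": render_documents(sheet),
                       "qa": make_qa(sheet)})
    return corpus

corpus = build_corpus(2)
print(len(corpus), len(corpus[0]["documents"]))  # 2 4
```

Because every document and QA pair is derived from the same intermediate fictsheet, the design gives a single ground-truth source of facts per event, which is what makes the "living asset" regeneration claim plausible: swapping any stage's generator yields a new but structurally identical corpus.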
Demonstration that verbatim and factual memorization conditions differ
Through controlled experiments, the authors show that rapid verbatim memorization (training loss approaching zero with increasing validation loss) does not necessarily align with conditions that promote factual generalization (both training and validation loss decreasing). This finding highlights fundamental differences in how overfitting and generalization occur during language model training.
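The loss-curve conditions described above can be made concrete with a small classifier over training and validation losses. The threshold and the regime labels are illustrative assumptions, not values taken from the paper.

```python
def classify_regime(train_losses, val_losses, eps=0.05):
    """Label a training run from its loss curves.

    'verbatim_memorization': training loss driven near zero while
    validation loss rose, i.e. overfitting to exact sequences.
    'factual_generalization': both losses decreased over the run.
    The eps threshold is illustrative, not from the paper.
    """
    train_memorized = train_losses[-1] < eps
    train_fell = train_losses[-1] < train_losses[0]
    val_rose = val_losses[-1] > val_losses[0]
    val_fell = val_losses[-1] < val_losses[0]
    if train_memorized and val_rose:
        return "verbatim_memorization"
    if train_fell and val_fell:
        return "factual_generalization"
    return "mixed"

# Train loss collapses to ~0 while validation loss climbs:
print(classify_regime([2.0, 0.5, 0.01], [2.0, 2.4, 3.1]))  # verbatim_memorization
# Both losses decrease together:
print(classify_regime([2.0, 1.2, 0.8], [2.1, 1.6, 1.3]))   # factual_generalization
```

The point of the finding is that the first regime can be reached quickly without ever entering the second, so verbatim memorization is a poor proxy for factual knowledge acquisition.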