A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, memorization, knowledge acquisition, datasets
Abstract:

When language models are trained on textual data, they acquire both knowledge about the structure of language and knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can memorize long sequences from their training data verbatim. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset that specifically empowers researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically generated, webtext-like documents about fictional events, as well as question-answer pairs about those events. We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization. We also document the challenges of building realistic, fictional synthetic data.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a synthetic dataset of fictional events and question-answer pairs to study how language models memorize facts versus verbatim sequences. According to the taxonomy, this work occupies the 'Synthetic Data for Knowledge Studies' leaf under 'Knowledge Acquisition and Learning Dynamics'. Notably, this leaf contains only one paper—the original submission itself—indicating a sparse research direction. The broader parent branch includes three leaves covering pretraining knowledge acquisition, synthetic data methods, and generalization from learning, with approximately ten papers total, suggesting moderate activity in knowledge acquisition research but minimal prior focus on synthetic fictional data approaches.

The taxonomy reveals that neighboring research directions concentrate on pretraining dynamics with real-world corpora, continual learning frameworks for knowledge updates, and memorization phenomena studies that distinguish verbatim from factual retention. The 'Memorization Phenomena and Dynamics' branch contains four leaves with roughly fifteen papers examining verbatim memorization, factual retention patterns, and memorization-generalization trade-offs. The 'Knowledge Update and Injection Methods' branch explores post-training modifications through continual learning and editing techniques. The original paper diverges from these directions by constructing controlled fictional environments rather than analyzing real training data or updating existing models, positioning it at the intersection of memorization studies and knowledge acquisition research.

Among the twenty-eight candidates examined through semantic search and citation expansion, the contribution-level analysis reveals mixed novelty signals. For the FictionalQA dataset contribution, ten candidates were examined and one appears to provide overlapping prior work, suggesting some precedent for synthetic knowledge datasets. For the dataset generation pipeline contribution, ten candidates were examined and none clearly refutes it, indicating potential novelty in the construction methodology. For the demonstration that verbatim and factual memorization differ, eight candidates were examined and one is a refutable match, implying that distinguishing these memorization types has received prior attention. Overall, two of the three contributions show at least one refutable candidate within the limited search scope.

Based on the limited literature search covering top-thirty semantic matches, the work appears to occupy a relatively unexplored niche within knowledge acquisition research. The taxonomy structure confirms sparse prior activity in synthetic fictional data methods, though neighboring areas like memorization phenomena and continual learning are more populated. The contribution-level statistics suggest that while individual aspects have some precedent, the integrated approach may offer distinctive value. However, the analysis does not cover exhaustive citation networks or domain-specific venues that might reveal additional related work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: fact memorization and knowledge acquisition in language models. The field encompasses how language models store, learn, and update factual information, organized into six main branches. Knowledge Storage and Representation Mechanisms examines the internal structures and neural substrates that encode facts, including work on knowledge graphs and memory architectures. Memorization Phenomena and Dynamics investigates when and how models retain training data, spanning studies like Memorization SoK[15] and explorations of in-context memorization. Knowledge Acquisition and Learning Dynamics focuses on how models absorb new information during training, including continual learning approaches such as Continual Knowledge Learning[3] and synthetic data methods like Fictional QA Dataset[0]. Knowledge Update and Injection Methods addresses post-training modifications to model knowledge, contrasting fine-tuning versus retrieval strategies. Factuality Evaluation and Boundaries probes the limits of what models know, with surveys like Factuality Survey[2] and boundary studies such as Factual Knowledge Boundary[4]. Domain-Specific Knowledge Applications explores specialized contexts from clinical to cross-modal settings.

Recent work reveals tensions between memorization and generalization, with studies like Generalization vs Memorization[42] and Generalization Controlled Study[41] examining this trade-off. Another active line investigates continual knowledge updates without catastrophic forgetting, explored in works such as Continual Factoid Memorization[9] and Mind the Interference[13].

Fictional QA Dataset[0] sits within the Knowledge Acquisition and Learning Dynamics branch, specifically addressing synthetic data for knowledge studies. Unlike continual learning approaches that update existing models with new facts, this work constructs controlled fictional datasets to isolate knowledge acquisition mechanisms. It contrasts with retrieval-augmented methods like Fine-Tuning vs Retrieval[5] by focusing on how models internalize knowledge during training rather than accessing external sources, offering a complementary perspective on the fundamental processes underlying factual learning in language models.

Claimed Contributions

FictionalQA dataset for studying memorization and knowledge acquisition

The authors introduce FictionalQA, a synthetically-generated dataset consisting of webtext-like documents about fictional events and associated question-answer pairs. The dataset is designed to enable controlled studies of fact memorization versus verbatim sequence memorization by combining realistic surface forms with fictional content that is factually disjoint from real-world knowledge.

10 retrieved papers
Can Refute
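As a rough illustration of what one unit of such a dataset might look like, here is a minimal Python sketch of an assumed record schema. The class and field names (FictionalDocument, QAPair, event_id, and so on) are hypothetical and not taken from the paper; the paper describes only documents about fictional events paired with question-answer pairs.

```python
from dataclasses import dataclass, field

@dataclass
class FictionalDocument:
    """A webtext-like document about a fictional event (assumed schema)."""
    event_id: str  # which fictional seed event this document describes
    style: str     # surface form, e.g. "news" or "blog"
    text: str      # the synthetic document body

@dataclass
class QAPair:
    """A question-answer pair grounded in a fictional event (assumed schema)."""
    event_id: str
    question: str
    answer: str

@dataclass
class FictionalQAExample:
    """One unit of the dataset: documents plus questions about one event."""
    event_id: str
    documents: list = field(default_factory=list)  # list[FictionalDocument]
    qa_pairs: list = field(default_factory=list)   # list[QAPair]
```

Because the fictional content is disjoint from real-world knowledge, any correct answer to a QAPair at evaluation time must come from the training documents rather than from pretraining.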
Dataset generation pipeline for producing realistic fictional corpora

The authors develop a hierarchical pipeline that generates fictional data through multiple stages: seed events, structured fictsheets, diverse document styles (news, social media, corporate, blog), and question-answer pairs. The pipeline is designed as a living asset that can be regenerated and adapted by other researchers for different experimental needs.

10 retrieved papers
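The staged pipeline described above could be sketched roughly as follows. All function names here are hypothetical, and generate is a placeholder standing in for an actual language-model call rather than any real API; this is a sketch of the stage ordering, not the authors' implementation.

```python
def generate(prompt: str) -> str:
    # Placeholder for a language-model call; returns canned text here.
    return f"<generated from: {prompt[:40]}>"

def seed_events(n: int) -> list:
    """Stage 1: invent n fictional seed events."""
    return [generate(f"Invent fictional event #{i}") for i in range(n)]

def make_fictsheet(event: str) -> dict:
    """Stage 2: expand an event into a structured fact sheet ('fictsheet')."""
    return {"event": event, "facts": generate(f"List key facts about {event}")}

def write_documents(fictsheet: dict,
                    styles=("news", "social", "corporate", "blog")) -> list:
    """Stage 3: render the fictsheet into diverse document styles."""
    return [{"style": s,
             "text": generate(f"Write a {s} piece about {fictsheet['event']}")}
            for s in styles]

def make_qa_pairs(fictsheet: dict) -> list:
    """Stage 4: derive question-answer pairs from the fictsheet."""
    return [{"question": generate(f"Ask a question about {fictsheet['event']}"),
             "answer": generate(f"Answer a question about {fictsheet['event']}")}]

def run_pipeline(n_events: int) -> list:
    """Chain the four stages into a regenerable dataset-building loop."""
    dataset = []
    for event in seed_events(n_events):
        sheet = make_fictsheet(event)
        dataset.append({"fictsheet": sheet,
                        "documents": write_documents(sheet),
                        "qa_pairs": make_qa_pairs(sheet)})
    return dataset
```

The hierarchical structure is what makes the pipeline a "living asset": swapping the prompts or the style list regenerates a different corpus while preserving the event-to-document-to-QA lineage.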
Demonstration that verbatim and factual memorization conditions differ

Through controlled experiments, the authors show that rapid verbatim memorization (training loss approaching zero with increasing validation loss) does not necessarily align with conditions that promote factual generalization (both training and validation loss decreasing). This finding highlights fundamental differences in how overfitting and generalization occur during language model training.

8 retrieved papers
Can Refute
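The loss-trajectory distinction described above can be illustrated with a small heuristic sketch. The classify_regime function and its thresholds are illustrative assumptions, not the authors' actual analysis; the point is only that the two regimes are separable from training and validation loss curves.

```python
def classify_regime(train_losses, val_losses, tol=1e-3, near_zero=0.05):
    """Classify a training run from its loss trajectories (heuristic sketch).

    Verbatim memorization: training loss collapses toward zero while
    validation loss fails to improve. Factual generalization: both losses
    decrease together.
    """
    train_down = train_losses[-1] < train_losses[0] - tol
    val_down = val_losses[-1] < val_losses[0] - tol
    if train_losses[-1] < near_zero and not val_down:
        return "verbatim memorization (overfitting)"
    if train_down and val_down:
        return "factual generalization"
    return "inconclusive"
```

For example, a run whose training loss falls to near zero while validation loss rises would be flagged as overfitting, whereas a run where both curves decline would be flagged as generalizing.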

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: FictionalQA dataset for studying memorization and knowledge acquisition

Contribution: Dataset generation pipeline for producing realistic fictional corpora

Contribution: Demonstration that verbatim and factual memorization conditions differ