CIMemories: A Compositional Benchmark For Contextual Integrity In LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Contextual Integrity; Inference-time Privacy; Input-output Flow
Abstract:

Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. However, this memory creates critical risks when sensitive information is revealed in inappropriate contexts. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context. CIMemories uses synthetic user profiles with 100+ attributes per user, paired with varied task contexts in which each attribute may be essential for some tasks but inappropriate for others. For example, mental health details are necessary for booking therapy but inappropriate when requesting time off from work. This design enables two forms of compositionality: (1) flexible memory composition, varying which attributes are necessary versus inappropriate across settings, and (2) multi-task composition per user, measuring cumulative information disclosure across sessions. Our evaluation reveals that frontier models exhibit attribute-level violation rates (leaking inappropriate information) of 14% to 69%, and that higher task completeness (sharing necessary information) is accompanied by increased violations, highlighting critical gaps in integrity-aware memory systems.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CIMemories, a benchmark for evaluating whether memory-augmented LLMs appropriately control information disclosure across different task contexts. It resides in the 'Contextual Integrity and Privacy-Aware Memory Systems' leaf, which contains only three papers total, making this a relatively sparse research direction within the broader taxonomy of 50 papers. The sibling papers include work on memory operating systems and persistent memory scenarios, but none directly address the compositional evaluation of contextual information flow that CIMemories targets.

The taxonomy reveals that most memory research concentrates on architectural designs (nine papers in hierarchical systems alone) and retrieval-augmented generation (multiple subcategories with 15+ papers). The neighboring 'Knowledge Conflict Resolution and Information Flow Control' branch addresses contradictions between parametric and contextual knowledge but does not emphasize privacy-aware disclosure control. CIMemories bridges a gap between these architectural concerns and the emerging need for principled information flow evaluation, connecting privacy considerations to the broader memory-augmented LLM ecosystem.

Among the 22 candidates examined across the three claimed contributions, none clearly refuted the paper's claims: 10 candidates were compared against the core benchmark contribution, 10 against the compositional design, and 2 against the privacy-persona labeling method, with no refutable match in any case. This suggests that, within the limited search scope, the specific combination of contextual-integrity evaluation, compositional memory design, and scalable labeling is relatively unexplored in prior work.

Based on the top-22 semantic matches examined, the work appears to occupy a novel position at the intersection of memory systems and privacy-aware information flow. The sparse population of its taxonomy leaf and absence of refuting candidates within the search scope suggest substantive novelty, though a broader literature search might reveal additional related efforts in privacy-preserving NLP or access control systems not captured by this memory-focused taxonomy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating contextual information flow control in memory-augmented language models. The field has evolved into several interconnected branches that address how language models store, retrieve, and manage information. Memory Architecture and Mechanisms explores foundational designs, ranging from working memory systems like Working Memory Agents[13] to multi-tiered structures such as Multi Tiered Memory[48], that determine how models organize and access stored knowledge. Retrieval-Augmented Generation and Knowledge Integration focuses on methods that blend parametric model knowledge with external retrieval, exemplified by works like Memorag[6] and Cache Augmented Generation[18], which balance efficiency and accuracy when incorporating retrieved context. Knowledge Conflict Resolution and Information Flow Control tackles the challenge of reconciling contradictory information from different sources, as seen in Context Memory Conflicts[29] and Parametric Nonparametric Memory[3]. Meanwhile, Contextual Integrity and Privacy-Aware Memory Systems emphasizes controlling what information flows where, ensuring that sensitive data respects privacy boundaries and contextual norms. Application-Specific Memory Systems tailors memory mechanisms to domains like dialogue or code generation, and Specialized Memory Mechanisms and Theoretical Frameworks investigates novel architectures such as Recurrent Memory Transformers[37] and theoretical underpinnings of memory dynamics.

A particularly active line of work examines the trade-offs between parametric storage and dynamic retrieval: some studies like Parameters vs Context[5] and Lightmem[4] investigate when to rely on model weights versus external memory, while others such as Adaptive Semiparametric[9] propose hybrid strategies. Another emerging theme is the need for principled information flow control, especially when memory systems must respect privacy or contextual boundaries.
CIMemories[0] sits squarely within the Contextual Integrity and Privacy-Aware Memory Systems branch, addressing how to evaluate whether memory-augmented models properly enforce contextual norms when retrieving and using stored information. Its emphasis on formal evaluation of information flow distinguishes it from neighbors like Memory OS[20], which focuses more on system-level memory management, and CIMemories Persistent[40], which extends similar ideas to persistent storage scenarios. Together, these works highlight an open question: how can we rigorously verify that memory mechanisms respect the intended boundaries of information sharing in complex, multi-context environments?

Claimed Contributions

CIMemories benchmark for evaluating contextual integrity in memory-augmented LLMs

The authors introduce CIMemories, a benchmark that uses synthetic user profiles with over 100 attributes per user paired with various task contexts to evaluate whether LLMs respect contextual integrity when using persistent memory. The benchmark enables compositional evaluation through flexible memory composition and multi-task composition per user.

10 retrieved papers
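To make the benchmark's two headline metrics concrete, here is a minimal sketch of how per-task completeness (necessary attributes shared) and attribute-level violation (inappropriate attributes leaked) could be computed. The function name and attribute labels are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: completeness and violation rates for one task context.
# All names (function, attribute strings) are illustrative placeholders.

def evaluate_response(disclosed: set[str],
                      necessary: set[str],
                      inappropriate: set[str]) -> tuple[float, float]:
    """Return (completeness, violation) for a single task.

    completeness: fraction of necessary attributes actually shared.
    violation:    fraction of inappropriate attributes that leaked.
    """
    completeness = (len(disclosed & necessary) / len(necessary)
                    if necessary else 1.0)
    violation = (len(disclosed & inappropriate) / len(inappropriate)
                 if inappropriate else 0.0)
    return completeness, violation

# Example mirroring the paper's scenario: a time-off request needs the
# user's name but must not expose mental-health details.
comp, viol = evaluate_response(
    disclosed={"mental_health", "name"},
    necessary={"name"},
    inappropriate={"mental_health"},
)
# comp == 1.0 (task completed) but viol == 1.0 (the sensitive attribute leaked)
```

The example also illustrates the tension the abstract reports: a response can be fully task-complete while still committing a violation.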
Compositional design with flexible memory and multi-task composition

The benchmark features a novel compositional design that allows dynamic variation of which attributes are necessary versus inappropriate across different settings, and measures cumulative information disclosure across multiple tasks per user to study how violations accumulate over time.

10 retrieved papers
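The multi-task composition described above can be sketched as a cumulative-disclosure measurement: across a user's sessions, an attribute counts as violated once it leaks in any context where it is inappropriate. The data layout below is an assumption for illustration, not the benchmark's actual format.

```python
# Hypothetical sketch: cumulative violations across multiple tasks for one
# user. Each session records what was disclosed and which attributes were
# inappropriate for that task context (the sets differ per context).

def cumulative_violations(sessions: list[dict]) -> set[str]:
    """Union of inappropriate attributes leaked across all sessions."""
    leaked: set[str] = set()
    for s in sessions:
        leaked |= s["disclosed"] & s["inappropriate"]
    return leaked

sessions = [
    {"disclosed": {"name", "salary"}, "inappropriate": {"salary"}},
    {"disclosed": {"name"}, "inappropriate": {"mental_health"}},
    {"disclosed": {"salary", "address"}, "inappropriate": {"salary", "address"}},
]
# cumulative_violations(sessions) == {"salary", "address"}
```

Tracking the union rather than per-session rates is what lets the benchmark study how violations accumulate over time.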
Scalable contextual integrity labeling using privacy personas

The authors develop a scalable method for generating contextual-integrity ground-truth labels: multiple privacy personas drawn from established surveys each produce labels, labels are sampled several times per persona, and a final label is assigned only where all personas agree. This respects the inherent subjectivity of privacy norms while still enabling large-scale evaluation.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CIMemories benchmark for evaluating contextual integrity in memory-augmented LLMs

The authors introduce CIMemories, a benchmark that uses synthetic user profiles with over 100 attributes per user paired with various task contexts to evaluate whether LLMs respect contextual integrity when using persistent memory. The benchmark enables compositional evaluation through flexible memory composition and multi-task composition per user.

Contribution

Compositional design with flexible memory and multi-task composition

The benchmark features a novel compositional design that allows dynamic variation of which attributes are necessary versus inappropriate across different settings, and measures cumulative information disclosure across multiple tasks per user to study how violations accumulate over time.

Contribution

Scalable contextual integrity labeling using privacy personas

The authors develop a scalable method for generating contextual-integrity ground-truth labels: multiple privacy personas drawn from established surveys each produce labels, labels are sampled several times per persona, and a final label is assigned only where all personas agree. This respects the inherent subjectivity of privacy norms while still enabling large-scale evaluation.