CIMemories: A Compositional Benchmark For Contextual Integrity In LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Contextual Integrity; Inference-time Privacy; Input-output Flow
Abstract:

Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. However, this memory creates critical risks when sensitive information is revealed in inappropriate contexts. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context. CIMemories uses synthetic user profiles with 100+ attributes per user, paired with varied task contexts in which each attribute may be essential for some tasks but inappropriate for others. For example, mental health details are necessary for booking therapy but inappropriate when requesting time off from work. This design enables two forms of compositionality: (1) flexible memory composition, varying which attributes are necessary versus inappropriate across settings, and (2) multi-task composition per user, measuring cumulative information disclosure across sessions. Our evaluation reveals that frontier models exhibit attribute-level violation rates (leaking inappropriate information) of 14% to 69%, and that higher task completeness (sharing necessary information) is accompanied by increased violations, highlighting critical gaps in integrity-aware memory systems.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CIMemories, a benchmark for evaluating whether memory-augmented LLMs appropriately control information disclosure across different task contexts. It resides in the 'Contextual Integrity and Privacy-Aware Memory Systems' leaf, which contains only three papers total, making this a relatively sparse research direction within the broader taxonomy of 50 papers. The sibling papers include work on memory operating systems and persistent memory scenarios, but none directly address the compositional evaluation of contextual information flow that CIMemories targets.

The taxonomy reveals that most memory research concentrates on architectural designs (nine papers in hierarchical systems alone) and retrieval-augmented generation (multiple subcategories with 15+ papers). The neighboring 'Knowledge Conflict Resolution and Information Flow Control' branch addresses contradictions between parametric and contextual knowledge but does not emphasize privacy-aware disclosure control. CIMemories bridges a gap between these architectural concerns and the emerging need for principled information flow evaluation, connecting privacy considerations to the broader memory-augmented LLM ecosystem.

Among the 22 candidates examined across the three claimed contributions, none clearly refuted the paper's claims: 10 candidates were compared against the core benchmark contribution, 10 against the compositional design, and 2 against the privacy-persona labeling method, with no refutable match in any case. This suggests that, within the limited search scope, the specific combination of contextual-integrity evaluation, compositional memory design, and scalable labeling is relatively unexplored in prior work.

Based on the top-22 semantic matches examined, the work appears to occupy a novel position at the intersection of memory systems and privacy-aware information flow. The sparse population of its taxonomy leaf and absence of refuting candidates within the search scope suggest substantive novelty, though a broader literature search might reveal additional related efforts in privacy-preserving NLP or access control systems not captured by this memory-focused taxonomy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating contextual information flow control in memory-augmented language models. The field has evolved into several interconnected branches that address how language models store, retrieve, and manage information. Memory Architecture and Mechanisms explores foundational designs, ranging from working memory systems like Working Memory Agents[13] to multi-tiered structures such as Multi Tiered Memory[48], that determine how models organize and access stored knowledge. Retrieval-Augmented Generation and Knowledge Integration focuses on methods that blend parametric model knowledge with external retrieval, exemplified by works like Memorag[6] and Cache Augmented Generation[18], which balance efficiency and accuracy when incorporating retrieved context. Knowledge Conflict Resolution and Information Flow Control tackles the challenge of reconciling contradictory information from different sources, as seen in Context Memory Conflicts[29] and Parametric Nonparametric Memory[3]. Meanwhile, Contextual Integrity and Privacy-Aware Memory Systems emphasizes controlling what information flows where, ensuring that sensitive data respects privacy boundaries and contextual norms. Application-Specific Memory Systems tailors memory mechanisms to domains like dialogue or code generation, and Specialized Memory Mechanisms and Theoretical Frameworks investigates novel architectures such as Recurrent Memory Transformers[37] and theoretical underpinnings of memory dynamics.

A particularly active line of work examines the trade-offs between parametric storage and dynamic retrieval: some studies like Parameters vs Context[5] and Lightmem[4] investigate when to rely on model weights versus external memory, while others such as Adaptive Semiparametric[9] propose hybrid strategies. Another emerging theme is the need for principled information flow control, especially when memory systems must respect privacy or contextual boundaries.
CIMemories[0] sits squarely within the Contextual Integrity and Privacy-Aware Memory Systems branch, addressing how to evaluate whether memory-augmented models properly enforce contextual norms when retrieving and using stored information. Its emphasis on formal evaluation of information flow distinguishes it from neighbors like Memory OS[20], which focuses more on system-level memory management, and CIMemories Persistent[40], which extends similar ideas to persistent storage scenarios. Together, these works highlight an open question: how can we rigorously verify that memory mechanisms respect the intended boundaries of information sharing in complex, multi-context environments?

Claimed Contributions

CIMemories benchmark for evaluating contextual integrity in memory-augmented LLMs

The authors introduce CIMemories, a benchmark that uses synthetic user profiles with over 100 attributes per user paired with various task contexts to evaluate whether LLMs respect contextual integrity when using persistent memory. The benchmark enables compositional evaluation through flexible memory composition and multi-task composition per user.

10 retrieved papers
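To make the benchmark's two headline metrics concrete, here is a minimal sketch of how per-task completeness (necessary attributes shared) and attribute-level violation (inappropriate attributes leaked) could be computed. The function name and attribute labels are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: completeness and violation rates for one task context.
# All names (function, attribute strings) are illustrative placeholders.

def evaluate_response(disclosed: set[str],
                      necessary: set[str],
                      inappropriate: set[str]) -> tuple[float, float]:
    """Return (completeness, violation) for a single task.

    completeness: fraction of necessary attributes actually shared.
    violation:    fraction of inappropriate attributes that leaked.
    """
    completeness = (len(disclosed & necessary) / len(necessary)
                    if necessary else 1.0)
    violation = (len(disclosed & inappropriate) / len(inappropriate)
                 if inappropriate else 0.0)
    return completeness, violation

# Example mirroring the paper's scenario: a time-off request needs the
# user's name but must not expose mental-health details.
comp, viol = evaluate_response(
    disclosed={"mental_health", "name"},
    necessary={"name"},
    inappropriate={"mental_health"},
)
# comp == 1.0 (task completed) but viol == 1.0 (the sensitive attribute leaked)
```

The example also illustrates the tension the abstract reports: a response can be fully task-complete while still committing a violation.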
Compositional design with flexible memory and multi-task composition

The benchmark features a novel compositional design that allows dynamic variation of which attributes are necessary versus inappropriate across different settings, and measures cumulative information disclosure across multiple tasks per user to study how violations accumulate over time.

10 retrieved papers
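The multi-task composition described above can be sketched as a cumulative-disclosure measurement: across a user's sessions, an attribute counts as violated once it leaks in any context where it is inappropriate. The data layout below is an assumption for illustration, not the benchmark's actual format.

```python
# Hypothetical sketch: cumulative violations across multiple tasks for one
# user. Each session records what was disclosed and which attributes were
# inappropriate for that task context (the sets differ per context).

def cumulative_violations(sessions: list[dict]) -> set[str]:
    """Union of inappropriate attributes leaked across all sessions."""
    leaked: set[str] = set()
    for s in sessions:
        leaked |= s["disclosed"] & s["inappropriate"]
    return leaked

sessions = [
    {"disclosed": {"name", "salary"}, "inappropriate": {"salary"}},
    {"disclosed": {"name"}, "inappropriate": {"mental_health"}},
    {"disclosed": {"salary", "address"}, "inappropriate": {"salary", "address"}},
]
# cumulative_violations(sessions) == {"salary", "address"}
```

Tracking the union rather than per-session rates is what lets the benchmark study how violations accumulate over time.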
Scalable contextual integrity labeling using privacy personas

The authors develop a scalable method for generating contextual-integrity ground-truth labels: multiple privacy personas drawn from established surveys each produce labels, labels are sampled several times per persona, and a final label is assigned only where all personas agree. This respects the inherent subjectivity of privacy norms while still enabling large-scale evaluation.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CIMemories benchmark for evaluating contextual integrity in memory-augmented LLMs

The authors introduce CIMemories, a benchmark that uses synthetic user profiles with over 100 attributes per user paired with various task contexts to evaluate whether LLMs respect contextual integrity when using persistent memory. The benchmark enables compositional evaluation through flexible memory composition and multi-task composition per user.

Contribution

Compositional design with flexible memory and multi-task composition

The benchmark features a novel compositional design that allows dynamic variation of which attributes are necessary versus inappropriate across different settings, and measures cumulative information disclosure across multiple tasks per user to study how violations accumulate over time.

Contribution

Scalable contextual integrity labeling using privacy personas

The authors develop a scalable method for generating contextual-integrity ground-truth labels: multiple privacy personas drawn from established surveys each produce labels, labels are sampled several times per persona, and a final label is assigned only where all personas agree. This respects the inherent subjectivity of privacy norms while still enabling large-scale evaluation.