CIMemories: A Compositional Benchmark For Contextual Integrity In LLMs
Overview
Overall Novelty Assessment
The paper introduces CIMemories, a benchmark for evaluating whether memory-augmented LLMs appropriately control information disclosure across different task contexts. It resides in the 'Contextual Integrity and Privacy-Aware Memory Systems' leaf, which contains only three papers total, making this a relatively sparse research direction within the broader taxonomy of 50 papers. The sibling papers include work on memory operating systems and persistent memory scenarios, but none directly address the compositional evaluation of contextual information flow that CIMemories targets.
The taxonomy reveals that most memory research concentrates on architectural designs (nine papers in hierarchical systems alone) and retrieval-augmented generation (multiple subcategories with 15+ papers). The neighboring 'Knowledge Conflict Resolution and Information Flow Control' branch addresses contradictions between parametric and contextual knowledge but does not emphasize privacy-aware disclosure control. CIMemories bridges a gap between these architectural concerns and the emerging need for principled information flow evaluation, connecting privacy considerations to the broader memory-augmented LLM ecosystem.
Among the 22 candidate papers examined across the three contributions, none were found to clearly refute the paper's claims: the core benchmark contribution was checked against 10 candidates, the compositional design against another 10, and the privacy persona labeling method against 2, with no refutable match in any group. This suggests that, within the limited search scope, the specific combination of contextual integrity evaluation, compositional memory design, and scalable labeling appears relatively unexplored in prior work.
Based on the top-22 semantic matches examined, the work appears to occupy a novel position at the intersection of memory systems and privacy-aware information flow. The sparse population of its taxonomy leaf and absence of refuting candidates within the search scope suggest substantive novelty, though a broader literature search might reveal additional related efforts in privacy-preserving NLP or access control systems not captured by this memory-focused taxonomy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce CIMemories, a benchmark that uses synthetic user profiles with over 100 attributes per user paired with various task contexts to evaluate whether LLMs respect contextual integrity when using persistent memory. The benchmark enables compositional evaluation through flexible memory composition and multi-task composition per user.
The benchmark features a novel compositional design that allows dynamic variation of which attributes are necessary versus inappropriate across different settings, and measures cumulative information disclosure across multiple tasks per user to study how violations accumulate over time.
The authors develop a scalable method for generating contextual integrity ground truth labels by using multiple privacy personas from established surveys, sampling labels multiple times per persona, and assigning final labels only where all personas agree, thereby respecting the inherent subjectivity in privacy norms while enabling large-scale evaluation.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[20] MemOS: A Memory OS for AI System
[40] CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs
Contribution Analysis
Detailed comparisons for each claimed contribution
CIMemories benchmark for evaluating contextual integrity in memory-augmented LLMs
The authors introduce CIMemories, a benchmark that uses synthetic user profiles with over 100 attributes per user paired with various task contexts to evaluate whether LLMs respect contextual integrity when using persistent memory. The benchmark enables compositional evaluation through flexible memory composition and multi-task composition per user.
[16] Memory in Large Language Models: Mechanisms, Evaluation and Evolution
[61] Preventing generation of verbatim memorization in language models gives a false sense of privacy
[62] Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models
[63] Advancing Conversational Psychotherapy: Integrating Privacy, Dual-Memory, and Domain Expertise with Large Language Models
[64] Effectiveness of Privacy-preserving Algorithms in LLMs: A Benchmark and Empirical Analysis
[65] An LLM-enabled human demonstration-assisted hybrid robot skill synthesis approach for human-robot collaborative assembly
[66] Transformer-based generative memory embedding for adaptive contextual recall
[67] Dynamic semantic memory retention in large language models: An exploration of spontaneous retrieval mechanisms
[68] Dynamic neural alignment mechanisms in large language models to contextual integrity preservation
[69] Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models
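The benchmark structure described above, where one synthetic profile is paired with several task contexts and each context independently marks which attributes are necessary versus inappropriate, can be illustrated with a minimal sketch. The data structures, the `disclosed_fn` stand-in for the memory-augmented model, and all names below are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass
class TaskContext:
    name: str
    necessary: set       # attributes the task needs to succeed
    inappropriate: set   # attributes that must not be disclosed here

def evaluate(profile, contexts, disclosed_fn):
    """Score one user profile across task contexts.

    `disclosed_fn(profile, ctx)` stands in for running the model with the
    profile as memory and extracting the set of attributes it revealed.
    """
    results = {}
    for ctx in contexts:
        disclosed = disclosed_fn(profile, ctx)
        results[ctx.name] = {
            "violations": disclosed & ctx.inappropriate,  # leaked attributes
            "omissions": ctx.necessary - disclosed,       # needed but withheld
        }
    return results

# Toy run: a model that dumps the entire memory into every task, so the
# same attribute (e.g. an allergy) is a violation in one context and a
# necessary disclosure in another.
profile = {"city": "Paris", "ssn": "123-45-6789", "allergy": "peanuts"}
contexts = [
    TaskContext("book_flight", necessary={"city"},
                inappropriate={"ssn", "allergy"}),
    TaskContext("order_meal", necessary={"allergy"},
                inappropriate={"ssn"}),
]
report = evaluate(profile, contexts, lambda p, ctx: set(p))
```

The point of the compositional design is visible in the toy contexts: `allergy` flips between inappropriate and necessary depending on the task, so appropriateness is a property of the (attribute, context) pair, not of the attribute alone.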
Compositional design with flexible memory and multi-task composition
The benchmark features a novel compositional design that allows dynamic variation of which attributes are necessary versus inappropriate across different settings, and measures cumulative information disclosure across multiple tasks per user to study how violations accumulate over time.
[16] Memory in Large Language Models: Mechanisms, Evaluation and Evolution
[40] CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs
[53] The sum leaks more than its parts: Compositional privacy risks and mitigations in multi-agent collaboration
[54] The Routledge Handbook of Behavioural Accounting Research
[55] Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing
[56] User profiling and satisfaction inference in public information access services
[57] Retaining privileged information for multi-task learning
[58] Strategic information provision in multidimensional environments
[59] Context Parametrization with Compositional Adapters
[60] A cut principle for information flow
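The cumulative-disclosure measurement described for this contribution can be sketched as follows. The data format (one `(disclosed, inappropriate)` attribute-set pair per task) is an assumption for illustration, not the benchmark's actual representation: the idea is simply that violations are unioned over a user's whole task sequence, so the curve shows how leaks accumulate even when each individual task discloses little.

```python
def cumulative_violations(task_runs):
    """`task_runs`: list of (disclosed, inappropriate) attribute-set pairs,
    one per task executed for the same user, in order."""
    leaked = set()   # unique inappropriate attributes revealed so far
    curve = []
    for disclosed, inappropriate in task_runs:
        leaked |= disclosed & inappropriate
        curve.append(len(leaked))  # running count after each task
    return curve

# Three tasks for one user: a new leak, another new leak, then a repeat
# of the first leak, which does not inflate the cumulative count.
runs = [
    ({"city", "ssn"}, {"ssn"}),
    ({"allergy"}, {"allergy"}),
    ({"ssn"}, {"ssn"}),
]
curve = cumulative_violations(runs)
```

Tracking the union rather than a per-task average is what lets this kind of metric capture how violations accumulate over time across a user's tasks.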
Scalable contextual integrity labeling using privacy personas
The authors develop a scalable method for generating contextual integrity ground truth labels by using multiple privacy personas from established surveys, sampling labels multiple times per persona, and assigning final labels only where all personas agree, thereby respecting the inherent subjectivity in privacy norms while enabling large-scale evaluation.
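The unanimity rule described above can be sketched in a few lines. This is not the authors' implementation: `label_fn` and its signature are illustrative stand-ins for an LLM call that labels an (attribute, task) pair from one persona's viewpoint, and the persona names follow Westin's classic privacy segmentation (fundamentalist, pragmatist, unconcerned) as an assumed example of survey-derived personas.

```python
def unanimous_label(label_fn, personas, attribute, task, n_samples=3):
    """Assign a ground-truth label only if every sample from every persona
    agrees; return None (unlabeled) on any disagreement, reflecting the
    subjectivity of privacy norms."""
    votes = {
        label_fn(persona, attribute, task, i)
        for persona in personas
        for i in range(n_samples)
    }
    return votes.pop() if len(votes) == 1 else None

# Toy deterministic labeler: the fundamentalist persona objects to
# sharing anything beyond the user's city.
def toy_label(persona, attribute, task, i):
    if persona == "fundamentalist" and attribute != "city":
        return "inappropriate"
    return "appropriate"

personas = ["fundamentalist", "pragmatist", "unconcerned"]
agreed = unanimous_label(toy_label, personas, "city", "book_flight")
contested = unanimous_label(toy_label, personas, "income", "book_flight")
```

In the toy run, `city` receives a final label because all personas agree across all samples, while `income` is left unlabeled because the fundamentalist persona dissents; only the unanimous pairs would enter the benchmark's ground truth.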