Abstract:

The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically aware alignment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EAPrivacy, a benchmark for evaluating physical-world privacy awareness in LLM-powered embodied agents. It resides in the Physical-World Privacy Assessment leaf, which contains only three papers total, indicating a relatively sparse research direction. The taxonomy shows this leaf is distinct from Digital Environment Privacy Evaluation (three papers focused on virtual interfaces and memory systems) and from broader safety frameworks. This positioning suggests the work addresses an emerging gap where privacy evaluation meets physical embodiment, a less crowded area compared to digital-only privacy assessments or general safety benchmarks.

The taxonomy reveals neighboring research in Privacy-Preserving Architectures (seven papers across tool-using agents, edge deployment, and healthcare robotics) and Safety and Contextual Reasoning Frameworks (three papers on risk assessment and dynamic adaptation). The paper's focus on evaluation distinguishes it from these mitigation-oriented branches. Within Privacy Evaluation and Benchmarking, the sibling papers in Physical-World Privacy Assessment share the embodied context but may differ in evaluation methodology or scenario design. The taxonomy's scope notes clarify that attack methods and deployment architectures are excluded from this evaluation-focused branch, helping position the work as diagnostic rather than defensive.

Across three contributions examined, the analysis reviewed thirty candidate papers total, with ten candidates per contribution. None of the contributions were clearly refuted by prior work in this limited search. The EAPrivacy benchmark contribution examined ten candidates with zero refutable matches, as did the four-tiered framework and PDDL-based representation contributions. This suggests that among the top-thirty semantically similar papers identified, none provided overlapping prior work on procedurally generated physical privacy scenarios with tiered complexity. The absence of refutations across all contributions indicates potential novelty within the examined scope, though the search was not exhaustive.

Given the limited search scope of thirty candidates and the sparse three-paper leaf in the taxonomy, the work appears to occupy relatively unexplored territory at the intersection of embodied agents and privacy evaluation. The analysis covers top-K semantic matches and does not claim comprehensive field coverage. The lack of refutable prior work among examined candidates, combined with the sparse taxonomy leaf, suggests the specific combination of physical-world scenarios, tiered evaluation, and PDDL-based representation may be distinctive within the surveyed literature.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating physical-world privacy awareness of large language models in embodied agents.

The field structure reflects a multifaceted approach to understanding and addressing privacy concerns when LLMs operate in physical environments. Privacy Evaluation and Benchmarking focuses on developing systematic assessments and metrics for privacy-aware behavior, often through specialized benchmarks and scenario-based testing. Privacy Attack Methods and Vulnerabilities explores how adversaries might exploit embodied agents, examining backdoor triggers and contextual manipulation. Privacy-Preserving Architectures and Mitigation investigates technical solutions such as federated learning and differential privacy mechanisms tailored to embodied settings. Safety and Contextual Reasoning Frameworks emphasizes broader safety considerations and the ability of agents to reason about context-dependent privacy norms. Finally, Data Generation and Application Domains addresses domain-specific challenges in healthcare, surveillance, and other real-world deployments where privacy stakes are particularly high.

Several active lines of work reveal key trade-offs between agent capability and privacy protection. One cluster examines privacy in social and assistive robotics, where agents must balance helpfulness with discretion in sensitive environments; works like Social Robot Privacy[3] and Privacy Aware Robot[15] explore how robots navigate household and care settings. Another thread investigates memory and data retention risks, as seen in Agent Memory Privacy[2] and Privacyasst[1], highlighting tensions between personalization and information leakage.

Physical Privacy Awareness[0] sits squarely within the physical-world assessment branch, sharing concerns with Social Robot Privacy[3] and Privacy Aware Robot[15] about embodied contexts, yet it emphasizes systematic evaluation of LLM reasoning about privacy rather than architectural defenses. Compared to these neighbors, Physical Privacy Awareness[0] appears more focused on probing the inherent privacy awareness capabilities of foundation models in realistic physical scenarios, complementing works that address mitigation or domain-specific deployment challenges.

Claimed Contributions

EAPrivacy benchmark for physical-world privacy awareness evaluation

The authors introduce EAPrivacy, a novel benchmark that systematically evaluates LLM-powered agents' privacy awareness in physical environments through four progressive tiers: sensitive object identification, privacy in shifting environments, inferential privacy under task conflicts, and social norms versus personal privacy. The benchmark comprises over 400 procedurally generated scenarios across more than 60 unique physical scenes.

10 retrieved papers
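To make the claimed design concrete, the tiered procedural generation described above could be sketched roughly as follows. This is a minimal illustrative Python sketch, not the authors' implementation: the tier names follow the contribution description, but the scene and object pools, the `generate_scenario` function, and all field names are invented for illustration.

```python
import random

# Tier names follow the paper's four-tier description; everything else is
# a hypothetical stand-in for the benchmark's actual generation pipeline.
TIERS = {
    1: "sensitive object identification",
    2: "privacy in shifting environments",
    3: "inferential privacy under task conflicts",
    4: "social norms versus personal privacy",
}

SCENES = ["bedroom", "home office", "clinic waiting room"]
SENSITIVE_OBJECTS = ["diary", "prescription bottle", "bank statement"]
DISTRACTORS = ["mug", "plant", "stapler", "lamp"]

def generate_scenario(tier, rng):
    """Procedurally assemble one scenario: a scene, clutter, and a planted
    sensitive object the agent is expected to handle appropriately."""
    scene = rng.choice(SCENES)
    target = rng.choice(SENSITIVE_OBJECTS)
    clutter = rng.sample(DISTRACTORS, k=3)
    return {
        "tier": tier,
        "task": TIERS[tier],
        "scene": scene,
        "objects": clutter + [target],
        "sensitive": [target],
    }

rng = random.Random(0)  # fixed seed for reproducible scenario sets
scenarios = [generate_scenario(t, rng) for t in TIERS for _ in range(3)]
```

Seeding the generator is one plausible way such a benchmark could keep its 400+ scenarios reproducible across evaluation runs.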
Four-tiered evaluation framework for physically-grounded privacy

The authors design a four-tiered evaluation structure that progressively tests agents' abilities: recognizing sensitive objects in cluttered environments, adapting to dynamic physical contexts, resolving conflicts between tasks and inferred privacy constraints, and navigating ethical dilemmas where social norms conflict with personal privacy. Each tier addresses distinct aspects of privacy reasoning in physical settings.

10 retrieved papers
PDDL-based structured representation for physical environment evaluation

The authors employ PDDL (Planning Domain Definition Language) format to represent physical environments and spatial relationships, moving beyond simple text descriptions. This structured approach enables systematic evaluation of agents' ability to ground privacy concepts in concrete physical spaces with explicit spatial reasoning requirements.

10 retrieved papers
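The PDDL-based representation claimed above can be illustrated with a toy example. The Python sketch below renders a PDDL-style problem string with explicit spatial relations; the `household` domain name, the predicates, and the `make_pddl_problem` helper are all hypothetical and not taken from the paper.

```python
def make_pddl_problem(scene, objects, relations, sensitive):
    """Render a toy PDDL problem string encoding spatial relations and
    sensitivity annotations for a single physical scene."""
    obj_decl = " ".join(objects)
    init_facts = [f"({rel} {a} {b})" for rel, a, b in relations]
    init_facts += [f"(sensitive {obj})" for obj in sensitive]
    init = "\n    ".join(init_facts)
    return (
        f"(define (problem {scene})\n"
        f"  (:domain household)\n"
        f"  (:objects {obj_decl})\n"
        f"  (:init\n    {init})\n"
        f")"
    )

problem = make_pddl_problem(
    scene="bedroom-01",
    objects=["diary", "nightstand", "robot"],
    relations=[("on", "diary", "nightstand")],
    sensitive=["diary"],
)
print(problem)
```

Encoding scenes as structured facts like `(on diary nightstand)` rather than free-form text is what would let an evaluator check spatial grounding systematically, which is the advantage the contribution claims over plain descriptions.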

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EAPrivacy benchmark for physical-world privacy awareness evaluation

The authors introduce EAPrivacy, a novel benchmark that systematically evaluates LLM-powered agents' privacy awareness in physical environments through four progressive tiers: sensitive object identification, privacy in shifting environments, inferential privacy under task conflicts, and social norms versus personal privacy. The benchmark comprises over 400 procedurally generated scenarios across more than 60 unique physical scenes.

Contribution

Four-tiered evaluation framework for physically-grounded privacy

The authors design a four-tiered evaluation structure that progressively tests agents' abilities: recognizing sensitive objects in cluttered environments, adapting to dynamic physical contexts, resolving conflicts between tasks and inferred privacy constraints, and navigating ethical dilemmas where social norms conflict with personal privacy. Each tier addresses distinct aspects of privacy reasoning in physical settings.

Contribution

PDDL-based structured representation for physical environment evaluation

The authors employ PDDL (Planning Domain Definition Language) format to represent physical environments and spatial relationships, moving beyond simple text descriptions. This structured approach enables systematic evaluation of agents' ability to ground privacy concepts in concrete physical spaces with explicit spatial reasoning requirements.