Abstract:

The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically aware alignment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EAPrivacy, a benchmark for evaluating physical-world privacy awareness in LLM-powered embodied agents. It resides in the Physical-World Privacy Assessment leaf, which contains only three papers total, indicating a relatively sparse research direction. The taxonomy shows this leaf is distinct from Digital Environment Privacy Evaluation (three papers focused on virtual interfaces and memory systems) and from broader safety frameworks. This positioning suggests the work addresses an emerging gap where privacy evaluation meets physical embodiment, a less crowded area compared to digital-only privacy assessments or general safety benchmarks.

The taxonomy reveals neighboring research in Privacy-Preserving Architectures (seven papers across tool-using agents, edge deployment, and healthcare robotics) and Safety and Contextual Reasoning Frameworks (three papers on risk assessment and dynamic adaptation). The paper's focus on evaluation distinguishes it from these mitigation-oriented branches. Within Privacy Evaluation and Benchmarking, the sibling papers in Physical-World Privacy Assessment share the embodied context but may differ in evaluation methodology or scenario design. The taxonomy's scope notes clarify that attack methods and deployment architectures are excluded from this evaluation-focused branch, helping position the work as diagnostic rather than defensive.

Across three contributions examined, the analysis reviewed thirty candidate papers total, with ten candidates per contribution. None of the contributions were clearly refuted by prior work in this limited search. The EAPrivacy benchmark contribution examined ten candidates with zero refutable matches, as did the four-tiered framework and PDDL-based representation contributions. This suggests that among the top-thirty semantically similar papers identified, none provided overlapping prior work on procedurally generated physical privacy scenarios with tiered complexity. The absence of refutations across all contributions indicates potential novelty within the examined scope, though the search was not exhaustive.

Given the limited search scope of thirty candidates and the sparse three-paper leaf in the taxonomy, the work appears to occupy relatively unexplored territory at the intersection of embodied agents and privacy evaluation. The analysis covers top-K semantic matches and does not claim comprehensive field coverage. The lack of refutable prior work among examined candidates, combined with the sparse taxonomy leaf, suggests the specific combination of physical-world scenarios, tiered evaluation, and PDDL-based representation may be distinctive within the surveyed literature.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating physical-world privacy awareness of large language models in embodied agents.

The field structure reflects a multifaceted approach to understanding and addressing privacy concerns when LLMs operate in physical environments. Privacy Evaluation and Benchmarking focuses on developing systematic assessments and metrics for privacy-aware behavior, often through specialized benchmarks and scenario-based testing. Privacy Attack Methods and Vulnerabilities explores how adversaries might exploit embodied agents, examining backdoor triggers and contextual manipulation. Privacy-Preserving Architectures and Mitigation investigates technical solutions such as federated learning and differential privacy mechanisms tailored to embodied settings. Safety and Contextual Reasoning Frameworks emphasizes broader safety considerations and the ability of agents to reason about context-dependent privacy norms. Finally, Data Generation and Application Domains addresses domain-specific challenges in healthcare, surveillance, and other real-world deployments where privacy stakes are particularly high.

Several active lines of work reveal key trade-offs between agent capability and privacy protection. One cluster examines privacy in social and assistive robotics, where agents must balance helpfulness with discretion in sensitive environments; works like Social Robot Privacy[3] and Privacy Aware Robot[15] explore how robots navigate household and care settings. Another thread investigates memory and data retention risks, as seen in Agent Memory Privacy[2] and Privacyasst[1], highlighting tensions between personalization and information leakage.

Physical Privacy Awareness[0] sits squarely within the physical-world assessment branch, sharing concerns with Social Robot Privacy[3] and Privacy Aware Robot[15] about embodied contexts, yet it emphasizes systematic evaluation of LLM reasoning about privacy rather than architectural defenses. Compared to these neighbors, Physical Privacy Awareness[0] appears more focused on probing the inherent privacy awareness capabilities of foundation models in realistic physical scenarios, complementing works that address mitigation or domain-specific deployment challenges.

Claimed Contributions

EAPrivacy benchmark for physical-world privacy awareness evaluation

The authors introduce EAPrivacy, a novel benchmark that systematically evaluates LLM-powered agents' privacy awareness in physical environments through four progressive tiers: sensitive object identification, privacy in shifting environments, inferential privacy under task conflicts, and social norms versus personal privacy. The benchmark comprises over 400 procedurally generated scenarios across more than 60 unique physical scenes.

10 retrieved papers
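To make the claimed design concrete, the tiered procedural generation described above could be sketched roughly as follows. This is a minimal illustrative Python sketch, not the authors' implementation: the tier names follow the contribution description, but the scene and object pools, the `generate_scenario` function, and all field names are invented for illustration.

```python
import random

# Tier names follow the paper's four-tier description; everything else is
# a hypothetical stand-in for the benchmark's actual generation pipeline.
TIERS = {
    1: "sensitive object identification",
    2: "privacy in shifting environments",
    3: "inferential privacy under task conflicts",
    4: "social norms versus personal privacy",
}

SCENES = ["bedroom", "home office", "clinic waiting room"]
SENSITIVE_OBJECTS = ["diary", "prescription bottle", "bank statement"]
DISTRACTORS = ["mug", "plant", "stapler", "lamp"]

def generate_scenario(tier, rng):
    """Procedurally assemble one scenario: a scene, clutter, and a planted
    sensitive object the agent is expected to handle appropriately."""
    scene = rng.choice(SCENES)
    target = rng.choice(SENSITIVE_OBJECTS)
    clutter = rng.sample(DISTRACTORS, k=3)
    return {
        "tier": tier,
        "task": TIERS[tier],
        "scene": scene,
        "objects": clutter + [target],
        "sensitive": [target],
    }

rng = random.Random(0)  # fixed seed for reproducible scenario sets
scenarios = [generate_scenario(t, rng) for t in TIERS for _ in range(3)]
```

Seeding the generator is one plausible way such a benchmark could keep its 400+ scenarios reproducible across evaluation runs.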
Four-tiered evaluation framework for physically-grounded privacy

The authors design a four-tiered evaluation structure that progressively tests agents' abilities: recognizing sensitive objects in cluttered environments, adapting to dynamic physical contexts, resolving conflicts between tasks and inferred privacy constraints, and navigating ethical dilemmas where social norms conflict with personal privacy. Each tier addresses distinct aspects of privacy reasoning in physical settings.

10 retrieved papers
PDDL-based structured representation for physical environment evaluation

The authors employ PDDL (Planning Domain Definition Language) format to represent physical environments and spatial relationships, moving beyond simple text descriptions. This structured approach enables systematic evaluation of agents' ability to ground privacy concepts in concrete physical spaces with explicit spatial reasoning requirements.

10 retrieved papers
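The PDDL-based representation claimed above can be illustrated with a toy example. The Python sketch below renders a PDDL-style problem string with explicit spatial relations; the `household` domain name, the predicates, and the `make_pddl_problem` helper are all hypothetical and not taken from the paper.

```python
def make_pddl_problem(scene, objects, relations, sensitive):
    """Render a toy PDDL problem string encoding spatial relations and
    sensitivity annotations for a single physical scene."""
    obj_decl = " ".join(objects)
    init_facts = [f"({rel} {a} {b})" for rel, a, b in relations]
    init_facts += [f"(sensitive {obj})" for obj in sensitive]
    init = "\n    ".join(init_facts)
    return (
        f"(define (problem {scene})\n"
        f"  (:domain household)\n"
        f"  (:objects {obj_decl})\n"
        f"  (:init\n    {init})\n"
        f")"
    )

problem = make_pddl_problem(
    scene="bedroom-01",
    objects=["diary", "nightstand", "robot"],
    relations=[("on", "diary", "nightstand")],
    sensitive=["diary"],
)
print(problem)
```

Encoding scenes as structured facts like `(on diary nightstand)` rather than free-form text is what would let an evaluator check spatial grounding systematically, which is the advantage the contribution claims over plain descriptions.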

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EAPrivacy benchmark for physical-world privacy awareness evaluation

The authors introduce EAPrivacy, a novel benchmark that systematically evaluates LLM-powered agents' privacy awareness in physical environments through four progressive tiers: sensitive object identification, privacy in shifting environments, inferential privacy under task conflicts, and social norms versus personal privacy. The benchmark comprises over 400 procedurally generated scenarios across more than 60 unique physical scenes.

Contribution

Four-tiered evaluation framework for physically-grounded privacy

The authors design a four-tiered evaluation structure that progressively tests agents' abilities: recognizing sensitive objects in cluttered environments, adapting to dynamic physical contexts, resolving conflicts between tasks and inferred privacy constraints, and navigating ethical dilemmas where social norms conflict with personal privacy. Each tier addresses distinct aspects of privacy reasoning in physical settings.

Contribution

PDDL-based structured representation for physical environment evaluation

The authors employ PDDL (Planning Domain Definition Language) format to represent physical environments and spatial relationships, moving beyond simple text descriptions. This structured approach enables systematic evaluation of agents' ability to ground privacy concepts in concrete physical spaces with explicit spatial reasoning requirements.