VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

ICLR 2026 Conference SubmissionAnonymous Authors
BenchmarkSpeech Language ModelInteractional Privacy
Abstract:

As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user’s confidential schedule to another—a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually sensitive information (e.g., a user’s private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems still fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that the failures observed on synthetic data persist in real speech. We also demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve the model’s privacy-preserving capabilities while achieving fair robustness. To support future work, we are releasing the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to help the development of safer and more context-aware SLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VoxPrivacy, a benchmark for evaluating interactional privacy in speech language models (SLMs) operating in multi-user environments. It resides in the 'Interactional Privacy Evaluation and Benchmarking' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 22 papers across multiple branches. The sibling paper examines text-based multi-user privacy in LLMs, making VoxPrivacy the sole work explicitly addressing speech-specific interactional privacy evaluation. This positioning suggests the paper targets an underexplored niche where acoustic modalities and real-time multi-speaker interaction create distinct privacy challenges not covered by existing benchmarks.

The taxonomy reveals that neighboring research directions focus on technical anonymization methods (speaker anonymization, secure computation) and access control mechanisms rather than evaluation frameworks. The 'Privacy-Preserving Speech Processing Techniques' branch contains nine papers addressing cryptographic and anonymization approaches, while 'Access Control and Authentication' includes three papers on permission management. VoxPrivacy diverges from these by providing a measurement tool rather than a protection mechanism. The taxonomy's scope notes clarify that evaluation benchmarks are explicitly separated from technical privacy-preserving methods, positioning this work as complementary infrastructure for assessing existing systems rather than proposing new protection techniques.

Among 30 candidates examined, the three-tiered evaluation framework shows one refutable candidate from 10 examined, while the VoxPrivacy benchmark itself and the large-scale vulnerability findings show no clear refutations among their respective candidate sets. The framework contribution appears to have more substantial prior work overlap within the limited search scope, though the specific tiered structure for privacy capabilities may still offer differentiation. The benchmark and empirical findings appear more novel given the absence of refuting candidates, though this reflects the 30-paper search scope rather than exhaustive coverage. The speech-specific focus and 32-hour bilingual dataset represent concrete artifacts not directly matched in the examined literature.

Based on the limited search scope, the work addresses a demonstrably sparse research area with minimal direct competition in speech-based interactional privacy evaluation. The taxonomy structure confirms that evaluation frameworks for multi-user speech privacy remain underdeveloped compared to technical protection methods. However, the analysis covers top-30 semantic matches and does not capture the full landscape of privacy benchmarking in adjacent domains (e.g., text-based systems, general dialogue evaluation) that might inform assessments of incremental versus foundational contributions.

Taxonomy

Core-task Taxonomy Papers
22
3
Claimed Contributions
30
Contribution Candidate Papers Compared
1
Refutable Paper

Research Landscape Overview

Core task: evaluating interactional privacy in multi-user speech language models. The field addresses privacy challenges that arise when multiple speakers interact with voice-enabled systems, where protecting one user's information may conflict with another's needs or expectations. The taxonomy organizes work into four main branches. Privacy-Preserving Speech Processing Techniques focuses on cryptographic and anonymization methods that protect speech data at the signal or feature level, including approaches like speaker anonymization and federated learning for speech models. Multi-User Privacy Management in Speech Systems examines how systems handle privacy when multiple individuals are present, covering access control mechanisms, community-level privacy considerations, and evaluation frameworks for interactional privacy scenarios. General Multi-Party Privacy-Preserving Machine Learning provides foundational techniques such as secure multi-party computation and differential privacy adapted for collaborative settings. Multi-Party Conversational Systems explores the design and user experience of voice interfaces in shared environments, including smart speakers and conversational agents used by couples or families. Several active lines of work reveal key tensions in the field. One strand emphasizes technical anonymization and benchmarking, with studies like Multi-speaker Anonymization Benchmark[1] and Target Speaker Anonymization[12] developing methods to obscure speaker identity while preserving utility. Another explores policy and access mechanisms, as seen in Access Control Voice[6] and Community Privacy Speech[4], which address who should control privacy settings in shared contexts. VoxPrivacy[0] sits within the evaluation-focused cluster alongside Multi-user Privacy LLMs[2], both concerned with measuring how well systems respect privacy boundaries when multiple users interact. While Multi-user Privacy LLMs[2] examines text-based language models in collaborative scenarios, VoxPrivacy[0] extends this lens specifically to speech modalities, where acoustic information and real-time interaction introduce distinct privacy risks. This positioning highlights an emerging need for rigorous benchmarks that capture the nuanced, often conflicting privacy expectations in multi-speaker environments.

Claimed Contributions

VoxPrivacy benchmark for evaluating interactional privacy in SLMs

The authors present VoxPrivacy, a novel benchmark specifically designed to assess how well Speech Language Models maintain interactional privacy in multi-user spoken dialogues. It features a three-tiered task structure measuring capabilities from following direct secrecy commands to proactively protecting privacy, accompanied by a 32-hour bilingual dataset, a human-recorded validation subset, and a 4000-hour training set.

10 retrieved papers
Three-tiered evaluation framework for privacy capabilities

The authors develop a structured evaluation framework with three tiers of increasing cognitive difficulty: Tier 1 tests obedience to explicit secrecy commands, Tier 2 requires speaker-verified conditional disclosure using voice as a biometric key, and Tier 3 evaluates proactive privacy protection where models must autonomously infer sensitivity without instructions.

10 retrieved papers
Can Refute
Large-scale evaluation revealing widespread privacy vulnerabilities

The authors conduct a comprehensive evaluation of nine state-of-the-art SLMs, demonstrating that interactional privacy is a critical unsolved problem. Their findings show most open-source models achieve only around 50% accuracy on conditional privacy decisions, establishing clear baselines and identifying specific failure modes through controlled experiments and adversarial tests.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VoxPrivacy benchmark for evaluating interactional privacy in SLMs

The authors present VoxPrivacy, a novel benchmark specifically designed to assess how well Speech Language Models maintain interactional privacy in multi-user spoken dialogues. It features a three-tiered task structure measuring capabilities from following direct secrecy commands to proactively protecting privacy, accompanied by a 32-hour bilingual dataset, a human-recorded validation subset, and a 4000-hour training set.

Contribution

Three-tiered evaluation framework for privacy capabilities

The authors develop a structured evaluation framework with three tiers of increasing cognitive difficulty: Tier 1 tests obedience to explicit secrecy commands, Tier 2 requires speaker-verified conditional disclosure using voice as a biometric key, and Tier 3 evaluates proactive privacy protection where models must autonomously infer sensitivity without instructions.

Contribution

Large-scale evaluation revealing widespread privacy vulnerabilities

The authors conduct a comprehensive evaluation of nine state-of-the-art SLMs, demonstrating that interactional privacy is a critical unsolved problem. Their findings show most open-source models achieve only around 50% accuracy on conditional privacy decisions, establishing clear baselines and identifying specific failure modes through controlled experiments and adversarial tests.

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models | Novelty Validation