VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.5 Download Report PDF

BenchmarkSpeech Language ModelInteractional Privacy

As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user’s confidential schedule to another—a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually sensitive information (e.g., a user’s private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems still fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that the failures observed on synthetic data persist in real speech. We also demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve the model’s privacy-preserving capabilities while achieving fair robustness. To support future work, we are releasing the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to help the development of safer and more context-aware SLMs.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VoxPrivacy, a benchmark for evaluating interactional privacy in speech language models (SLMs) operating in multi-user environments. It resides in the 'Interactional Privacy Evaluation and Benchmarking' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 22 papers across multiple branches. The sibling paper examines text-based multi-user privacy in LLMs, making VoxPrivacy the sole work explicitly addressing speech-specific interactional privacy evaluation. This positioning suggests the paper targets an underexplored niche where acoustic modalities and real-time multi-speaker interaction create distinct privacy challenges not covered by existing benchmarks.

The taxonomy reveals that neighboring research directions focus on technical anonymization methods (speaker anonymization, secure computation) and access control mechanisms rather than evaluation frameworks. The 'Privacy-Preserving Speech Processing Techniques' branch contains nine papers addressing cryptographic and anonymization approaches, while 'Access Control and Authentication' includes three papers on permission management. VoxPrivacy diverges from these by providing a measurement tool rather than a protection mechanism. The taxonomy's scope notes clarify that evaluation benchmarks are explicitly separated from technical privacy-preserving methods, positioning this work as complementary infrastructure for assessing existing systems rather than proposing new protection techniques.

Among 30 candidates examined, the three-tiered evaluation framework shows one refutable candidate from 10 examined, while the VoxPrivacy benchmark itself and the large-scale vulnerability findings show no clear refutations among their respective candidate sets. The framework contribution appears to have more substantial prior work overlap within the limited search scope, though the specific tiered structure for privacy capabilities may still offer differentiation. The benchmark and empirical findings appear more novel given the absence of refuting candidates, though this reflects the 30-paper search scope rather than exhaustive coverage. The speech-specific focus and 32-hour bilingual dataset represent concrete artifacts not directly matched in the examined literature.

Based on the limited search scope, the work addresses a demonstrably sparse research area with minimal direct competition in speech-based interactional privacy evaluation. The taxonomy structure confirms that evaluation frameworks for multi-user speech privacy remain underdeveloped compared to technical protection methods. However, the analysis covers top-30 semantic matches and does not capture the full landscape of privacy benchmarking in adjacent domains (e.g., text-based systems, general dialogue evaluation) that might inform assessments of incremental versus foundational contributions.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: evaluating interactional privacy in multi-user speech language models. The field addresses privacy challenges that arise when multiple speakers interact with voice-enabled systems, where protecting one user's information may conflict with another's needs or expectations. The taxonomy organizes work into four main branches. Privacy-Preserving Speech Processing Techniques focuses on cryptographic and anonymization methods that protect speech data at the signal or feature level, including approaches like speaker anonymization and federated learning for speech models. Multi-User Privacy Management in Speech Systems examines how systems handle privacy when multiple individuals are present, covering access control mechanisms, community-level privacy considerations, and evaluation frameworks for interactional privacy scenarios. General Multi-Party Privacy-Preserving Machine Learning provides foundational techniques such as secure multi-party computation and differential privacy adapted for collaborative settings. Multi-Party Conversational Systems explores the design and user experience of voice interfaces in shared environments, including smart speakers and conversational agents used by couples or families. Several active lines of work reveal key tensions in the field. One strand emphasizes technical anonymization and benchmarking, with studies like Multi-speaker Anonymization Benchmark[1] and Target Speaker Anonymization[12] developing methods to obscure speaker identity while preserving utility. Another explores policy and access mechanisms, as seen in Access Control Voice[6] and Community Privacy Speech[4], which address who should control privacy settings in shared contexts. VoxPrivacy[0] sits within the evaluation-focused cluster alongside Multi-user Privacy LLMs[2], both concerned with measuring how well systems respect privacy boundaries when multiple users interact. While Multi-user Privacy LLMs[2] examines text-based language models in collaborative scenarios, VoxPrivacy[0] extends this lens specifically to speech modalities, where acoustic information and real-time interaction introduce distinct privacy risks. This positioning highlights an emerging need for rigorous benchmarks that capture the nuanced, often conflicting privacy expectations in multi-speaker environments.

Claimed Contributions

VoxPrivacy benchmark for evaluating interactional privacy in SLMs

10 retrieved papers

The authors present VoxPrivacy, a novel benchmark specifically designed to assess how well Speech Language Models maintain interactional privacy in multi-user spoken dialogues. It features a three-tiered task structure measuring capabilities from following direct secrecy commands to proactively protecting privacy, accompanied by a 32-hour bilingual dataset, a human-recorded validation subset, and a 4000-hour training set.

10 retrieved papers

Three-tiered evaluation framework for privacy capabilities

Can Refute

10 retrieved papers

The authors develop a structured evaluation framework with three tiers of increasing cognitive difficulty: Tier 1 tests obedience to explicit secrecy commands, Tier 2 requires speaker-verified conditional disclosure using voice as a biometric key, and Tier 3 evaluates proactive privacy protection where models must autonomously infer sensitivity without instructions.

10 retrieved papers

Can Refute

Large-scale evaluation revealing widespread privacy vulnerabilities

10 retrieved papers

The authors conduct a comprehensive evaluation of nine state-of-the-art SLMs, demonstrating that interactional privacy is a critical unsolved problem. Their findings show most open-source models achieve only around 50% accuracy on conditional privacy decisions, establishing clear baselines and identifying specific failure modes through controlled experiments and adversarial tests.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[2] Beyond Individual Concerns: Multi-user Privacy in Large Language Models PDF

Xiao Zhan, William Seymour, JosÃ© Such, Jose Such (2024) • International Conference on Conversational User Interfaces

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VoxPrivacy benchmark for evaluating interactional privacy in SLMs

[25] Audiotrust: Benchmarking the multifaceted trustworthiness of audio large language models PDF

Cannot Refute

[37] Privacylens: Evaluating privacy norm awareness of language models in action PDF

Cannot Refute

[43] The voiceprivacy 2024 challenge evaluation plan PDF

Cannot Refute

[44] The man behind the sound: Demystifying audio private attribute profiling via multimodal large language model agents PDF

Cannot Refute

[45] Privacy Disclosure of Similarity Rank in Speech and Language Processing PDF

Cannot Refute

[46] Game-Time: Evaluating Temporal Dynamics in Spoken Language Models PDF

Cannot Refute

[47] A comparative analysis of word-level metric differential privacy: Benchmarking the privacy-utility trade-off PDF

Cannot Refute

[48] Effectiveness of Privacy-preserving Algorithms in LLMs: A Benchmark and Empirical Analysis PDF

Cannot Refute

[49] Long-Form Speech Generation with Spoken Language Models PDF

Cannot Refute

[50] On Differential Privacy for Language Models PDF

Cannot Refute

Contribution

Three-tiered evaluation framework for privacy capabilities

[35] Exploring the Privacy Protection Capabilities of Chinese Large Language Models PDF

Can Refute

[33] MUSE: Machine Unlearning Six-Way Evaluation for Language Models PDF

Cannot Refute

[34] Privlm-bench: A multi-level privacy evaluation benchmark for language models PDF

Cannot Refute

[36] Hierarchical semantic encoding for contextual understanding in large language models PDF

Cannot Refute

[37] Privacylens: Evaluating privacy norm awareness of language models in action PDF

Cannot Refute

[38] Leveraging hierarchical representations for preserving privacy and utility in text PDF

Cannot Refute

[39] Privacy-preserving Prompt Personalization in Federated Learning for Multimodal Large Language Models PDF

Cannot Refute

[40] Data privacy and safety with large language models PDF

Cannot Refute

[41] Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models PDF

Cannot Refute

[42] Neural pathway embedding through hierarchical interchange networks in large language models PDF

Cannot Refute

Contribution

Large-scale evaluation revealing widespread privacy vulnerabilities

[23] Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework PDF

Cannot Refute

[24] Privacyasst: Safeguarding user privacy in tool-using large language model agents PDF

Cannot Refute

[25] Audiotrust: Benchmarking the multifaceted trustworthiness of audio large language models PDF

Cannot Refute

[26] Adversarial Speech for Voice Privacy Protection from Personalized Speech Generation PDF

Cannot Refute

[27] SPIRIT: Patching Speech Language Models against Jailbreak Attacks PDF

Cannot Refute

[28] Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models PDF

Cannot Refute

[29] Configurable privacy-preserving automatic speech recognition PDF

Cannot Refute

[30] How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities PDF

Cannot Refute

[31] Assessing the Vulnerabilities of the Open-Source Artificial Intelligence (AI) Landscape: A Large-Scale Analysis of the Hugging Face Platform PDF

Cannot Refute

[32] Backdoor Attacks Against Speech Language Models PDF

Cannot Refute

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[2] Beyond Individual Concerns: Multi-user Privacy in Large Language Models PDF

Contribution Analysis

VoxPrivacy benchmark for evaluating interactional privacy in SLMs

[25] Audiotrust: Benchmarking the multifaceted trustworthiness of audio large language models PDF

[37] Privacylens: Evaluating privacy norm awareness of language models in action PDF

[43] The voiceprivacy 2024 challenge evaluation plan PDF

[44] The man behind the sound: Demystifying audio private attribute profiling via multimodal large language model agents PDF

[45] Privacy Disclosure of Similarity Rank in Speech and Language Processing PDF

[46] Game-Time: Evaluating Temporal Dynamics in Spoken Language Models PDF

[47] A comparative analysis of word-level metric differential privacy: Benchmarking the privacy-utility trade-off PDF

[48] Effectiveness of Privacy-preserving Algorithms in LLMs: A Benchmark and Empirical Analysis PDF

[49] Long-Form Speech Generation with Spoken Language Models PDF

[50] On Differential Privacy for Language Models PDF

Three-tiered evaluation framework for privacy capabilities

[35] Exploring the Privacy Protection Capabilities of Chinese Large Language Models PDF

[33] MUSE: Machine Unlearning Six-Way Evaluation for Language Models PDF

[34] Privlm-bench: A multi-level privacy evaluation benchmark for language models PDF

[36] Hierarchical semantic encoding for contextual understanding in large language models PDF

[37] Privacylens: Evaluating privacy norm awareness of language models in action PDF

[38] Leveraging hierarchical representations for preserving privacy and utility in text PDF

[39] Privacy-preserving Prompt Personalization in Federated Learning for Multimodal Large Language Models PDF

[40] Data privacy and safety with large language models PDF

[41] Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models PDF

[42] Neural pathway embedding through hierarchical interchange networks in large language models PDF

Large-scale evaluation revealing widespread privacy vulnerabilities

[23] Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework PDF

[24] Privacyasst: Safeguarding user privacy in tool-using large language model agents PDF

[25] Audiotrust: Benchmarking the multifaceted trustworthiness of audio large language models PDF

[26] Adversarial Speech for Voice Privacy Protection from Personalized Speech Generation PDF

[27] SPIRIT: Patching Speech Language Models against Jailbreak Attacks PDF

[28] Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models PDF

[29] Configurable privacy-preserving automatic speech recognition PDF

[30] How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities PDF

[31] Assessing the Vulnerabilities of the Open-Source Artificial Intelligence (AI) Landscape: A Large-Scale Analysis of the Hugging Face Platform PDF

[32] Backdoor Attacks Against Speech Language Models PDF

Table of Contents