WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Speech Large Language Models, SLLM, Voice Assistant, Benchmark
Abstract:

Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WearVox, a benchmark comprising 3,842 multi-channel egocentric audio recordings collected via AI glasses across five tasks. It occupies the 'Egocentric Audio and Real-World Scenario Benchmarks' leaf within the Evaluation Methodologies branch. Notably, this leaf contains only the original paper itself—no sibling papers appear in the taxonomy. This suggests the research direction is relatively sparse, with WearVox potentially pioneering a focused evaluation paradigm for wearable voice assistants under realistic acoustic conditions.

The taxonomy reveals neighboring evaluation work in 'Automated Persona-Driven Testing' and 'Respiratory Signal Evaluation in Dialogue,' both addressing dialogue assessment but from different angles. Broader branches like Multimodal Context Integration and Health Applications contain numerous papers (e.g., GazePointAR, PhysioLLM) that build wearable systems but may not emphasize egocentric audio benchmarking. WearVox thus bridges a gap: while many works assume cleaner audio or focus on system design, this benchmark targets the acoustic and environmental realism that prior evaluation frameworks largely overlook.

Among 28 candidates examined, none clearly refute the three core contributions. The WearVox benchmark itself (8 candidates examined, 0 refutable) appears novel in its multi-channel, egocentric audio focus. The multi-channel audio approach (10 candidates, 0 refutable) and SLLM evaluation (10 candidates, 0 refutable) also show no substantial prior overlap within the limited search scope. This suggests that, at least among the top-30 semantic matches and their citations, the combination of egocentric recording, diverse tasks, and real-world acoustic conditions represents a distinct contribution.

Based on the limited literature search, WearVox appears to occupy a relatively unexplored niche in wearable voice assistant evaluation. The absence of sibling papers in its taxonomy leaf and the lack of refuting candidates among 28 examined works indicate novelty, though a broader search might uncover related benchmarks in adjacent fields. The analysis covers top-K semantic matches and does not claim exhaustive coverage of all evaluation methodologies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating voice assistants in realistic wearable scenarios.

The field encompasses a diverse set of research directions organized around eight major branches. Multimodal Context Integration and Sensing explores how wearables fuse audio with visual, physiological, and environmental signals to enrich interaction, as seen in works like GazePointAR[2] and Sensor Conversational AI[1]. Health and Wellness Applications focus on leveraging voice interfaces for monitoring and coaching, with examples ranging from PhysioLLM[4] to Sleep Health LLM[10]. Security, Privacy, and Authentication address the unique challenges of protecting user data and verifying identity in always-on wearable contexts, exemplified by Voice Spoofing Defense[3] and Handsfree Authentication[11]. Interaction Design and User Experience examine how users engage with voice-driven wearables across modalities and form factors, while Domain-Specific Task Assistance targets specialized workflows such as emergency response (EMSAssist[18]) or accessibility support. System Architecture and Technical Foundations underpin these applications with infrastructure for low-latency processing and multimodal fusion, and Specialized Medical and Assistive Devices cater to clinical and therapeutic use cases. Finally, Evaluation Methodologies and Benchmarking provides the frameworks and datasets needed to assess performance in real-world settings.

Within this landscape, a particularly active line of work centers on creating ecologically valid benchmarks that capture the acoustic and contextual complexity of everyday wearable use. WearVox[0] sits squarely in this Evaluation Methodologies branch, specifically under Egocentric Audio and Real-World Scenario Benchmarks, where it addresses the gap between controlled lab tests and the noisy, dynamic environments users actually inhabit. This contrasts with efforts like Memoro[5] and Prism QA[6], which emphasize multimodal memory and question-answering capabilities but may rely on less naturalistic evaluation protocols. Meanwhile, works such as Intelligent Wearable Assistants[7] and Wearables ChatGPT[8] push the boundaries of conversational AI integration but often assume cleaner audio conditions. By focusing on egocentric, real-world audio scenarios, WearVox[0] complements these system-building efforts with rigorous benchmarking that reflects the acoustic challenges—reverberation, background noise, and user motion—that wearable voice assistants must overcome in practice.

Claimed Contributions

WearVox benchmark for wearable voice assistants

The authors introduce WearVox, a novel benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. This benchmark is specifically designed to evaluate voice assistants under realistic wearable conditions, including motion, noise, and the need to distinguish device-directed speech from background conversations.

8 retrieved papers

Multi-channel audio approach for improved robustness

The authors develop and evaluate two new speech Large Language Models, one using single-channel audio and another leveraging multi-channel audio built on the Llama 4 Scout architecture. Their case study demonstrates that multi-channel audio inputs greatly improve model resilience to environmental noise and enhance discrimination between device-directed speech and background conversations.

10 retrieved papers

Comprehensive evaluation of state-of-the-art SLLMs

The authors conduct systematic experiments evaluating state-of-the-art open-source and proprietary speech Large Language Models on the WearVox benchmark. Their evaluation reveals that most real-time SLLMs achieve accuracies ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, establishing baseline performance metrics for wearable voice assistant research.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WearVox benchmark for wearable voice assistants

The authors introduce WearVox, a novel benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. This benchmark is specifically designed to evaluate voice assistants under realistic wearable conditions, including motion, noise, and the need to distinguish device-directed speech from background conversations.
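To make the benchmark's structure concrete, the sketch below shows how a single WearVox-style example might be represented. The field names and values are illustrative assumptions based only on the description above (multi-channel egocentric audio, one of five tasks, and per-recording metadata); the actual release format is not specified in this report.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record layout for one WearVox-style example.
# Field names are assumptions; the real benchmark schema may differ.
@dataclass
class WearVoxExample:
    example_id: str                          # unique identifier for the recording
    audio_channels: List[str]                # paths to per-microphone WAV files (multi-channel, egocentric)
    task: str                                # one of: "search_grounded_qa", "closed_book_qa",
                                             # "side_talk_rejection", "tool_calling", "speech_translation"
    environment: str                         # e.g. "indoor_office" or "outdoor_street" (metadata)
    device_directed: Optional[bool] = None   # True if the speech is addressed to the assistant
    reference_output: Optional[str] = None   # gold answer / translation / tool call, task dependent

example = WearVoxExample(
    example_id="wv_000042",
    audio_channels=["wv_000042_ch0.wav", "wv_000042_ch1.wav"],
    task="side_talk_rejection",
    environment="outdoor_street",
    device_directed=False,
    reference_output="REJECT",
)
```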

Contribution

Multi-channel audio approach for improved robustness

The authors develop and evaluate two new speech Large Language Models, one using single-channel audio and another leveraging multi-channel audio built on the Llama 4 Scout architecture. Their case study demonstrates that multi-channel audio inputs greatly improve model resilience to environmental noise and enhance discrimination between device-directed speech and background conversations.
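The report does not describe how the multi-channel model consumes its inputs beyond naming the Llama 4 Scout backbone. The following is a minimal sketch of one common way multi-channel audio could be fed to a speech LLM front-end: per-channel log-mel features are concatenated along the feature axis and projected to the LLM embedding width. This is an illustrative assumption, not the paper's architecture; real systems might instead use beamforming, cross-channel attention, or a learned spatial encoder.

```python
import torch
import torch.nn as nn
import torchaudio

class MultiChannelFrontend(nn.Module):
    """Hypothetical front-end: stack per-channel log-mel features, then project to the LLM width."""

    def __init__(self, n_channels: int = 2, n_mels: int = 80, llm_dim: int = 4096):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=n_mels)
        self.proj = nn.Linear(n_channels * n_mels, llm_dim)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        # waveforms: (batch, channels, samples)
        feats = torch.log(self.melspec(waveforms) + 1e-6)        # (batch, channels, n_mels, frames)
        b, c, m, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * m)   # (batch, frames, channels * n_mels)
        return self.proj(feats)                                  # (batch, frames, llm_dim)

frontend = MultiChannelFrontend()
dummy = torch.randn(1, 2, 16_000)     # 1 second of 2-channel audio at 16 kHz
print(frontend(dummy).shape)          # -> (1, frames, 4096)
```

A single-channel baseline under the same assumptions would simply set n_channels=1, which makes the comparison in the case study (single- vs. multi-channel inputs on otherwise matched models) easy to picture.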

Contribution

Comprehensive evaluation of state-of-the-art SLLMs

The authors conduct systematic experiments evaluating state-of-the-art open-source and proprietary speech Large Language Models on the WearVox benchmark. Their evaluation reveals that most real-time SLLMs achieve accuracies ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, establishing baseline performance metrics for wearable voice assistant research.
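The kind of breakdown implied by these results (overall accuracy plus a per-environment split showing degradation on noisy outdoor audio) can be sketched as below. The prediction format, metadata keys, and exact-match scoring are assumptions for illustration; the paper's tasks likely use task-specific metrics (rejection accuracy, tool-call matching, translation quality) rather than a single exact-match criterion.

```python
from collections import defaultdict

def accuracy_by_environment(results):
    """results: iterable of dicts with keys 'environment', 'prediction', 'reference'."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        env = r["environment"]
        totals[env] += 1
        correct[env] += int(r["prediction"] == r["reference"])
    overall = sum(correct.values()) / max(sum(totals.values()), 1)
    per_env = {env: correct[env] / totals[env] for env in totals}
    return overall, per_env

# Toy example only; real per-environment numbers come from the benchmark's metadata.
overall, per_env = accuracy_by_environment([
    {"environment": "indoor_office", "prediction": "A", "reference": "A"},
    {"environment": "outdoor_street", "prediction": "B", "reference": "A"},
])
print(overall, per_env)
```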