WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Speech Large Language Models, SLLM, Voice Assistant, Benchmark
Abstract:

Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WearVox, a benchmark comprising 3,842 multi-channel egocentric audio recordings collected via AI glasses across five tasks. It occupies the 'Egocentric Audio and Real-World Scenario Benchmarks' leaf within the Evaluation Methodologies branch. Notably, this leaf contains only the original paper itself—no sibling papers appear in the taxonomy. This suggests the research direction is relatively sparse, with WearVox potentially pioneering a focused evaluation paradigm for wearable voice assistants under realistic acoustic conditions.

The taxonomy reveals neighboring evaluation work in 'Automated Persona-Driven Testing' and 'Respiratory Signal Evaluation in Dialogue,' both addressing dialogue assessment but from different angles. Broader branches like Multimodal Context Integration and Health Applications contain numerous papers (e.g., GazePointAR, PhysioLLM) that build wearable systems but may not emphasize egocentric audio benchmarking. WearVox thus bridges a gap: while many works assume cleaner audio or focus on system design, this benchmark targets the acoustic and environmental realism that prior evaluation frameworks largely overlook.

Among 28 candidates examined, none clearly refute the three core contributions. The WearVox benchmark itself (8 candidates examined, 0 refutable) appears novel in its multi-channel, egocentric audio focus. The multi-channel audio approach (10 candidates, 0 refutable) and SLLM evaluation (10 candidates, 0 refutable) also show no substantial prior overlap within the limited search scope. This suggests that, at least among the top-30 semantic matches and their citations, the combination of egocentric recording, diverse tasks, and real-world acoustic conditions represents a distinct contribution.

Based on the limited literature search, WearVox appears to occupy a relatively unexplored niche in wearable voice assistant evaluation. The absence of sibling papers in its taxonomy leaf and the lack of refuting candidates among 28 examined works indicate novelty, though a broader search might uncover related benchmarks in adjacent fields. The analysis covers top-K semantic matches and does not claim exhaustive coverage of all evaluation methodologies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating voice assistants in realistic wearable scenarios.

The field encompasses a diverse set of research directions organized around eight major branches. Multimodal Context Integration and Sensing explores how wearables fuse audio with visual, physiological, and environmental signals to enrich interaction, as seen in works like GazePointAR[2] and Sensor Conversational AI[1]. Health and Wellness Applications focus on leveraging voice interfaces for monitoring and coaching, with examples ranging from PhysioLLM[4] to Sleep Health LLM[10]. Security, Privacy, and Authentication address the unique challenges of protecting user data and verifying identity in always-on wearable contexts, exemplified by Voice Spoofing Defense[3] and Handsfree Authentication[11]. Interaction Design and User Experience examine how users engage with voice-driven wearables across modalities and form factors, while Domain-Specific Task Assistance targets specialized workflows such as emergency response (EMSAssist[18]) or accessibility support. System Architecture and Technical Foundations underpin these applications with infrastructure for low-latency processing and multimodal fusion, and Specialized Medical and Assistive Devices cater to clinical and therapeutic use cases. Finally, Evaluation Methodologies and Benchmarking provides the frameworks and datasets needed to assess performance in real-world settings.

Within this landscape, a particularly active line of work centers on creating ecologically valid benchmarks that capture the acoustic and contextual complexity of everyday wearable use. WearVox[0] sits squarely in this Evaluation Methodologies branch, specifically under Egocentric Audio and Real-World Scenario Benchmarks, where it addresses the gap between controlled lab tests and the noisy, dynamic environments users actually inhabit. This contrasts with efforts like Memoro[5] and Prism QA[6], which emphasize multimodal memory and question-answering capabilities but may rely on less naturalistic evaluation protocols. Meanwhile, works such as Intelligent Wearable Assistants[7] and Wearables ChatGPT[8] push the boundaries of conversational AI integration but often assume cleaner audio conditions. By focusing on egocentric, real-world audio scenarios, WearVox[0] complements these system-building efforts with rigorous benchmarking that reflects the acoustic challenges—reverberation, background noise, and user motion—that wearable voice assistants must overcome in practice.

Claimed Contributions

WearVox benchmark for wearable voice assistants

The authors introduce WearVox, a novel benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. This benchmark is specifically designed to evaluate voice assistants under realistic wearable conditions, including motion, noise, and the need to distinguish device-directed speech from background conversations.

8 retrieved papers

Multi-channel audio approach for improved robustness

The authors develop and evaluate two new speech Large Language Models, one using single-channel audio and another leveraging multi-channel audio built on the Llama 4 Scout architecture. Their case study demonstrates that multi-channel audio inputs greatly improve model resilience to environmental noise and enhance discrimination between device-directed speech and background conversations.

10 retrieved papers

Comprehensive evaluation of state-of-the-art SLLMs

The authors conduct systematic experiments evaluating state-of-the-art open-source and proprietary speech Large Language Models on the WearVox benchmark. Their evaluation reveals that most real-time SLLMs achieve accuracies ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, establishing baseline performance metrics for wearable voice assistant research.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WearVox benchmark for wearable voice assistants

The authors introduce WearVox, a novel benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. This benchmark is specifically designed to evaluate voice assistants under realistic wearable conditions, including motion, noise, and the need to distinguish device-directed speech from background conversations.
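To make the benchmark's structure concrete, the sketch below shows how a single WearVox-style example might be represented. The field names and values are illustrative assumptions based only on the description above (multi-channel egocentric audio, one of five tasks, and per-recording metadata); the actual release format is not specified in this report.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record layout for one WearVox-style example.
# Field names are assumptions; the real benchmark schema may differ.
@dataclass
class WearVoxExample:
    example_id: str                          # unique identifier for the recording
    audio_channels: List[str]                # paths to per-microphone WAV files (multi-channel, egocentric)
    task: str                                # one of: "search_grounded_qa", "closed_book_qa",
                                             # "side_talk_rejection", "tool_calling", "speech_translation"
    environment: str                         # e.g. "indoor_office" or "outdoor_street" (metadata)
    device_directed: Optional[bool] = None   # True if the speech is addressed to the assistant
    reference_output: Optional[str] = None   # gold answer / translation / tool call, task dependent

example = WearVoxExample(
    example_id="wv_000042",
    audio_channels=["wv_000042_ch0.wav", "wv_000042_ch1.wav"],
    task="side_talk_rejection",
    environment="outdoor_street",
    device_directed=False,
    reference_output="REJECT",
)
```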

Contribution

Multi-channel audio approach for improved robustness

The authors develop and evaluate two new speech Large Language Models, one using single-channel audio and another leveraging multi-channel audio built on the Llama 4 Scout architecture. Their case study demonstrates that multi-channel audio inputs greatly improve model resilience to environmental noise and enhance discrimination between device-directed speech and background conversations.
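The report does not describe how the multi-channel model consumes its inputs beyond naming the Llama 4 Scout backbone. The following is a minimal sketch of one common way multi-channel audio could be fed to a speech LLM front-end: per-channel log-mel features are concatenated along the feature axis and projected to the LLM embedding width. This is an illustrative assumption, not the paper's architecture; real systems might instead use beamforming, cross-channel attention, or a learned spatial encoder.

```python
import torch
import torch.nn as nn
import torchaudio

class MultiChannelFrontend(nn.Module):
    """Hypothetical front-end: stack per-channel log-mel features, then project to the LLM width."""

    def __init__(self, n_channels: int = 2, n_mels: int = 80, llm_dim: int = 4096):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=n_mels)
        self.proj = nn.Linear(n_channels * n_mels, llm_dim)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        # waveforms: (batch, channels, samples)
        feats = torch.log(self.melspec(waveforms) + 1e-6)        # (batch, channels, n_mels, frames)
        b, c, m, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * m)   # (batch, frames, channels * n_mels)
        return self.proj(feats)                                  # (batch, frames, llm_dim)

frontend = MultiChannelFrontend()
dummy = torch.randn(1, 2, 16_000)     # 1 second of 2-channel audio at 16 kHz
print(frontend(dummy).shape)          # -> (1, frames, 4096)
```

A single-channel baseline under the same assumptions would simply set n_channels=1, which makes the comparison in the case study (single- vs. multi-channel inputs on otherwise matched models) easy to picture.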

Contribution

Comprehensive evaluation of state-of-the-art SLLMs

The authors conduct systematic experiments evaluating state-of-the-art open-source and proprietary speech Large Language Models on the WearVox benchmark. Their evaluation reveals that most real-time SLLMs achieve accuracies ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, establishing baseline performance metrics for wearable voice assistant research.
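The kind of breakdown implied by these results (overall accuracy plus a per-environment split showing degradation on noisy outdoor audio) can be sketched as below. The prediction format, metadata keys, and exact-match scoring are assumptions for illustration; the paper's tasks likely use task-specific metrics (rejection accuracy, tool-call matching, translation quality) rather than a single exact-match criterion.

```python
from collections import defaultdict

def accuracy_by_environment(results):
    """results: iterable of dicts with keys 'environment', 'prediction', 'reference'."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        env = r["environment"]
        totals[env] += 1
        correct[env] += int(r["prediction"] == r["reference"])
    overall = sum(correct.values()) / max(sum(totals.values()), 1)
    per_env = {env: correct[env] / totals[env] for env in totals}
    return overall, per_env

# Toy example only; real per-environment numbers come from the benchmark's metadata.
overall, per_env = accuracy_by_environment([
    {"environment": "indoor_office", "prediction": "A", "reference": "A"},
    {"environment": "outdoor_street", "prediction": "B", "reference": "A"},
])
print(overall, per_env)
```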