WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables
Overview
Overall Novelty Assessment
The paper introduces WearVox, a benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five tasks. It occupies the 'Egocentric Audio and Real-World Scenario Benchmarks' leaf within the Evaluation Methodologies branch. Notably, this leaf contains only the original paper itself; no sibling papers appear in the taxonomy. This suggests the research direction is relatively sparse, with WearVox potentially pioneering a focused evaluation paradigm for wearable voice assistants under realistic acoustic conditions.
The taxonomy reveals neighboring evaluation work in 'Automated Persona-Driven Testing' and 'Respiratory Signal Evaluation in Dialogue,' both addressing dialogue assessment but from different angles. Broader branches like Multimodal Context Integration and Health Applications contain numerous papers (e.g., GazePointAR, PhysioLLM) that build wearable systems but may not emphasize egocentric audio benchmarking. WearVox thus bridges a gap: while many works assume cleaner audio or focus on system design, this benchmark targets the acoustic and environmental realism that prior evaluation frameworks largely overlook.
Among 28 candidates examined, none clearly refute the three core contributions. The WearVox benchmark itself (8 candidates examined, 0 refutable) appears novel in its multi-channel, egocentric audio focus. The multi-channel audio approach (10 candidates, 0 refutable) and the speech large language model (SLLM) evaluation (10 candidates, 0 refutable) also show no substantial prior overlap within the limited search scope. This suggests that, at least among the top-30 semantic matches and their citations, the combination of egocentric recording, diverse tasks, and real-world acoustic conditions represents a distinct contribution.
Based on the limited literature search, WearVox appears to occupy a relatively unexplored niche in wearable voice assistant evaluation. The absence of sibling papers in its taxonomy leaf and the lack of refuting candidates among 28 examined works indicate novelty, though a broader search might uncover related benchmarks in adjacent fields. The analysis covers top-K semantic matches and does not claim exhaustive coverage of all evaluation methodologies.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce WearVox, a novel benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. This benchmark is specifically designed to evaluate voice assistants under realistic wearable conditions, including motion, noise, and the need to distinguish device-directed speech from background conversations.
The authors develop and evaluate two new speech large language models: one using single-channel audio and another leveraging multi-channel audio, built on the Llama 4 Scout architecture. Their case study demonstrates that multi-channel audio inputs greatly improve model resilience to environmental noise and enhance discrimination between device-directed speech and background conversations.
The authors conduct systematic experiments evaluating state-of-the-art open-source and proprietary speech large language models on the WearVox benchmark. Their evaluation reveals that most real-time SLLMs achieve accuracies ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, establishing baseline performance metrics for wearable voice assistant research.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
WearVox benchmark for wearable voice assistants
The authors introduce WearVox, a novel benchmark comprising 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. This benchmark is specifically designed to evaluate voice assistants under realistic wearable conditions, including motion, noise, and the need to distinguish device-directed speech from background conversations.
[51] Ego4D: Around the world in 3,000 hours of egocentric video
[52] EgoLife: Towards Egocentric Life Assistant
[53] HoloAssist: An egocentric human interaction dataset for interactive AI assistants in the real world
[54] The EarSAVAS dataset: Enabling subject-aware vocal activity sensing on earables
[55] Designing wearable personal assistants for surgeons: An egocentric approach
[56] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
[57] Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
[58] The performance of wearable speech enhancement system under noisy environment: an experimental study
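To make the data side of this contribution concrete, the sketch below shows one plausible way a multi-channel, egocentric recording could be represented and grouped by acoustic condition for scoring. The field names, task strings, and environment labels are illustrative assumptions, not the released WearVox schema.

```python
# Hypothetical representation of a WearVox-style example; field names and label
# values are assumptions for illustration, not the benchmark's actual schema.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WearVoxExample:
    example_id: str
    task: str                # one of the five benchmark tasks
    environment: str         # e.g. "indoor-quiet" or "outdoor-noisy"
    audio_paths: List[str]   # one file per microphone channel on the glasses
    device_directed: bool    # True if the utterance addresses the assistant
    reference_answer: str    # gold label or transcript used for scoring


def split_by_environment(examples: List[WearVoxExample]) -> Dict[str, List[WearVoxExample]]:
    """Group examples so robustness can be reported per acoustic condition."""
    buckets: Dict[str, List[WearVoxExample]] = {}
    for ex in examples:
        buckets.setdefault(ex.environment, []).append(ex)
    return buckets
```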
Multi-channel audio approach for improved robustness
The authors develop and evaluate two new speech large language models: one using single-channel audio and another leveraging multi-channel audio, built on the Llama 4 Scout architecture. Their case study demonstrates that multi-channel audio inputs greatly improve model resilience to environmental noise and enhance discrimination between device-directed speech and background conversations.
[59] Multi-modal multi-channel target speech separation
[60] Multichannel Modulo Sampling with Unlimited Noise
[61] Elevating Robust ASR By Decoupling Multi-Channel Speaker Separation and Speech Recognition
[62] AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition
[63] MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition
[64] Advances in microphone array processing and multichannel speech enhancement
[65] TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation
[66] Multi-channel transformer transducer for speech recognition
[67] Audio-visual end-to-end multi-channel speech separation, dereverberation and recognition
[68] Deep beamforming networks for multi-channel speech recognition
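As a rough illustration of why the multi-channel variant can be more robust, the sketch below contrasts a mono mix-down front end with one that keeps every microphone channel, using torchaudio. The channel-stacked mel-spectrogram features are an assumption made for illustration; the authors' models are built on Llama 4 Scout, and their actual audio front end may differ.

```python
# Minimal front-end sketch (assumed, not the paper's implementation): mono
# mix-down versus channel-preserving mel-spectrogram features for a speech LLM.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)


def single_channel_features(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                      # (channels, frames)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    mono = wav.mean(dim=0, keepdim=True)                 # mix-down discards spatial cues
    return mel(mono)                                     # (1, n_mels, time)


def multi_channel_features(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                      # (channels, frames)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    return mel(wav)                                      # (channels, n_mels, time)
```

Keeping the channels preserves inter-microphone level and timing differences, the kind of spatial cue that helps separate the wearer's device-directed speech from background talkers, which is consistent with the robustness gains the case study reports.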
Comprehensive evaluation of state-of-the-art SLLMs
The authors conduct systematic experiments evaluating state-of-the-art open-source and proprietary speech large language models on the WearVox benchmark. Their evaluation reveals that most real-time SLLMs achieve accuracies ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, establishing baseline performance metrics for wearable voice assistant research.
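A minimal sketch of how such a per-condition accuracy breakdown could be tabulated from model outputs is shown below; the environment labels and the numbers in the usage line are placeholders, not WearVox results.

```python
# Hypothetical helper for tabulating accuracy per acoustic condition; the
# condition names and example values are placeholders, not reported results.
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_condition(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (environment, was_the_model_correct) pairs, one per example."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for environment, is_correct in results:
        totals[environment] += 1
        correct[environment] += int(is_correct)
    return {env: correct[env] / totals[env] for env in totals}


# Placeholder usage, not real benchmark outcomes:
print(accuracy_by_condition([("indoor", True), ("indoor", False), ("outdoor-noisy", False)]))
```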