AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Research Landscape Overview
Claimed Contributions
The authors introduce AudioTrust, the first comprehensive benchmark designed to systematically evaluate the trustworthiness of Audio Large Language Models (ALLMs) across six critical dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework targets vulnerabilities that arise from audio's acoustic properties and that existing text-based evaluation frameworks cannot capture.
The authors construct a large-scale dataset of over 4,420 audio samples spanning 26 distinct sub-tasks and 18 experimental configurations. The samples are drawn from real-world scenarios such as daily conversations, emergency calls, and voice-assistant interactions, and are designed to probe trustworthiness across multiple dimensions.
The authors develop an automated evaluation pipeline that employs model-based evaluators (GPT-4o and Qwen3) with human validation to ensure rigorous and reproducible assessment. The pipeline achieves an agreement rate of over 97% with human experts and enables scalable quantification of model outputs across diverse high-risk audio scenarios.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[12] AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
AudioTrust benchmark framework for evaluating ALLM trustworthiness
The authors introduce AudioTrust, the first comprehensive benchmark designed to systematically evaluate the trustworthiness of Audio Large Language Models (ALLMs) across six critical dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework targets vulnerabilities that arise from audio's acoustic properties and that existing text-based evaluation frameworks cannot capture. A toy sketch of the six-dimension taxonomy, rendered as judge rubrics, follows the comparison list below.
[1] AudioBench: A Universal Benchmark for Audio Large Language Models PDF
[4] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models PDF
[12] AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs PDF
[60] AHELM: A Holistic Evaluation of Audio-Language Models PDF
[61] SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models PDF
[62] JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models PDF
[63] AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension PDF
[64] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models PDF
[65] Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models PDF
[66] MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models PDF
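To make the six-dimension taxonomy concrete, here is a minimal Python sketch that pairs each dimension with a judge rubric and assembles an evaluation prompt. The six dimension names come from the paper; the rubric wording, prompt format, and function name are illustrative assumptions, not AudioTrust's actual prompts or identifiers.

```python
# The six trustworthiness dimensions evaluated by AudioTrust. The rubric
# wording below is a hypothetical placeholder, not the benchmark's
# actual judge prompts.
RUBRICS = {
    "fairness": "Does the response treat speakers equitably regardless of accent, gender, or age cues in the audio?",
    "hallucination": "Does the response describe only content actually present in the audio?",
    "safety": "Does the response refuse or safely handle harmful spoken requests?",
    "privacy": "Does the response avoid disclosing personal information heard in the audio?",
    "robustness": "Does the response stay consistent under noise or adversarial perturbation?",
    "authentication": "Does the response resist voice-cloning or impersonation attempts?",
}

def build_judge_prompt(dimension: str, transcript: str, response: str) -> str:
    """Assemble a rubric-based prompt for an LLM judge (hypothetical format)."""
    return (
        f"Rubric: {RUBRICS[dimension]}\n"
        f"Audio transcript: {transcript}\n"
        f"Model response: {response}\n"
        "Answer 'pass' or 'fail' with a one-sentence justification."
    )

print(build_judge_prompt("privacy", "My card number is 4111...", "I can't repeat that."))
```

A dispatch table like this keeps the per-dimension evaluation criteria in one place, so adding a sub-task only means attaching it to an existing dimension key.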
Curated dataset of over 4,420 audio samples across 26 sub-tasks
The authors construct a large-scale dataset of over 4,420 audio samples spanning 26 distinct sub-tasks and 18 experimental configurations. The samples are drawn from real-world scenarios such as daily conversations, emergency calls, and voice-assistant interactions, and are designed to probe trustworthiness across multiple dimensions. A minimal sketch of one possible dataset manifest appears after the comparison list below.
[12] AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs PDF
[51] Windy Events Detection in Big Bioacoustics Datasets Using a Pre-Trained Convolutional Neural Network PDF
[52] A Zero-Shot Model for Diagnosing Unknown Composite Faults in Train Bearings Based on Label Feature Vector Generated Fault Features PDF
[53] Sounds of the Deep: How Input Representation, Model Choice, and Dataset Size Influence Underwater Sound Classification Performance PDF
[54] Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou PDF
[55] THAI Speech Emotion Recognition (THAI-SER) Corpus PDF
[56] An Invasive Species Model and Dataset for Bioacoustic Monitoring of Common Brushtail Possum PDF
[57] Acoustic-Based Industrial Diagnostics: A Scalable Noise-Robust Multiclass Framework for Anomaly Detection PDF
[58] COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening PDF
[59] Listener Acoustic Personalization Challenge (LAP24): Head-Related Transfer Function Dataset Harmonization PDF
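As referenced above, here is a minimal sketch of how such a dataset might be organized and sanity-checked. The JSONL manifest schema (fields audio, dimension, subtask, config) is a hypothetical illustration, not AudioTrust's released format; only the headline counts (over 4,420 samples, 26 sub-tasks, 18 configurations) come from the paper.

```python
import json
from collections import Counter

def load_manifest(path: str) -> list[dict]:
    """Load a benchmark manifest: one JSON record per line, each assumed
    to carry an audio path, a trustworthiness dimension, a sub-task
    label, and an experimental-configuration id (hypothetical schema)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            assert {"audio", "dimension", "subtask", "config"} <= rec.keys()
            records.append(rec)
    return records

def summarize(records: list[dict]) -> None:
    """Sanity-check coverage against the counts reported in the paper:
    over 4,420 samples, 26 sub-tasks, 18 experimental configurations."""
    print(f"samples:  {len(records)}")
    print(f"subtasks: {len({r['subtask'] for r in records})}")
    print(f"configs:  {len({r['config'] for r in records})}")
    print(Counter(r["dimension"] for r in records).most_common())
```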
Human-validated automated evaluation pipeline for scalable assessment
The authors develop an automated evaluation pipeline that employs model-based evaluators (GPT-4o and Qwen3) with human validation to ensure rigorous and reproducible assessment. The pipeline achieves an agreement rate of over 97% with human experts and enables scalable quantification of model outputs across diverse high-risk audio scenarios.
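A minimal sketch of the agreement metric behind the reported figure: the automated judge's verdicts are compared label-for-label against human expert annotations. The label vocabulary and the toy data here are hypothetical; only the judge models (GPT-4o, Qwen3) and the over-97% figure come from the paper.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of samples on which the automated judge matches the human
    expert label; the paper reports over 97% on its validation samples."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be aligned sample-for-sample")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Toy demonstration with hypothetical verdicts from a judge model
# (e.g., GPT-4o) and from human annotators on the same five samples.
judge = ["safe", "unsafe", "safe", "safe", "refusal"]
human = ["safe", "unsafe", "safe", "unsafe", "refusal"]
print(f"judge-human agreement: {agreement_rate(judge, human):.1%}")  # 80.0%
```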