AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.8 Download Report PDF

Audio Large Language Model

The rapid development and widespread adoption of Audio Large Language Models (ALLMs) require a rigorous assessment of their trustworthiness. However, existing evaluation frameworks, primarily designed for text, are not equipped to handle the unique vulnerabilities introduced by audio’s acoustic properties. We find that significant trustworthiness risks in ALLMs arise from non-semantic acoustic cues, such as timbre, accent, and background noise, which can be used to manipulate model behavior. To address this gap, we propose AudioTrust, the first framework for large-scale and systematic evaluation of ALLM trustworthiness concerning these audio-specific risks. AudioTrust spans six key dimensions: fairness, hallucination, safety, privacy, robustness, and authenticition. It is implemented through 26 distinct sub-tasks and a curated dataset of over 4,420 audio samples collected from real-world scenarios (e.g., daily conversations, emergency calls, and voice assistant interactions), purposefully constructed to probe the trustworthiness of ALLMs across multiple dimensions. Our comprehensive evaluation includes 18 distinct experimental configurations and employs human-validated automated pipelines to objectively and scalably quantify model outputs. Experimental results reveal the boundaries and limitations of 14 state-of-the-art (SOTA) open-source and closed-source ALLMs when confronted with diverse high-risk audio scenarios, thereby offering critical insights into the secure and trustworthy deployment of future audio models. Our platform and benchmark are publicly available at https://anonymous.4open.science/r/AudioTrust-8715/.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: trustworthiness evaluation of audio large language models. As audio-capable large language models become increasingly deployed, the field has organized around a multi-dimensional view of trustworthiness that spans adversarial robustness, hallucination detection, fairness, privacy, reliability, and explainability. The taxonomy reflects this breadth through branches addressing comprehensive assessment frameworks that evaluate models across multiple risk dimensions simultaneously (e.g., AudioTrust[0], Avtrustbench[12]), alongside specialized branches targeting individual concerns such as adversarial attacks (Audio Injection Robustness[3], Chat Audio Attacks[6]), modality conflicts (Audio Text Disagreement[4]), demographic bias (Spoken Dialogue Bias[41]), and privacy risks (Audio Private Profiling[9]). Additional branches cover instruction-following fidelity (IFEval Audio[14]), faithfulness and explainability (Faithfulness Audio Language[17]), and domain-specific applications ranging from mental health support (Mental Wellbeing Companions[21]) to speech quality assessment (Descriptive Speech Quality[19]). A particularly active line of work focuses on holistic benchmarking that integrates perception, reasoning, and safety dimensions, with AudioBench[1] and Holistic Audio Language Evaluation[2] establishing multi-faceted evaluation protocols. In contrast, many studies drill into specific vulnerabilities: adversarial robustness research explores injection attacks and jailbreaking (Chat Audio Attacks Benchmark[7], Multimodal Jailbreaking[36]), while hallucination studies examine object-level errors (Object Hallucination Audio[11], Audio Hallucination Assessment[26]) and mitigation strategies (AHAMask[8]). AudioTrust[0] sits within the comprehensive assessment branch, emphasizing multi-dimensional risk evaluation that spans adversarial, fairness, privacy, and reliability concerns in a unified framework. Compared to narrower benchmarks like Audio Injection Robustness[3] or domain-focused evaluations such as Interface Trust Health[5], AudioTrust[0] adopts a broader lens, aiming to capture the interplay among diverse trustworthiness facets rather than isolating individual threat models or application contexts.

Claimed Contributions

AudioTrust benchmark framework for evaluating ALLM trustworthiness

10 retrieved papers

The authors introduce AudioTrust, the first comprehensive benchmark designed to systematically evaluate the trustworthiness of Audio Large Language Models across six critical dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework addresses unique vulnerabilities introduced by audio's acoustic properties that existing text-based evaluation frameworks cannot capture.

10 retrieved papers

Curated dataset of over 4,420 audio samples across 26 sub-tasks

10 retrieved papers

The authors construct a large-scale dataset comprising over 4,420 audio samples spanning 26 distinct sub-tasks and 18 experimental configurations. The samples are purposefully collected from real-world scenarios such as daily conversations, emergency calls, and voice assistant interactions to probe trustworthiness across multiple dimensions.

10 retrieved papers

Human-validated automated evaluation pipeline for scalable assessment

Can Refute

10 retrieved papers

The authors develop an automated evaluation pipeline that employs model-based evaluators (GPT-4o and Qwen3) with human validation to ensure rigorous and reproducible assessment. The pipeline achieves over 97% agreement rate with human experts and enables scalable quantification of model outputs across diverse high-risk audio scenarios.

10 retrieved papers

Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[12] Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms PDF

Chowdhury, Sanjoy, Nag, Sayan, Sanjoy Chowdhury, Dasgupta, Subhrajyoti, Sayan Nag, Wang Yao-ting, Subhrajyoti Dasgupta, Elhoseiny, Mohamed, Yaoting Wang, Gao Ruohan, Mohamed Elhoseiny, Manocha, Dinesh, Ruohan Gao, Dinesh Manocha (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AudioTrust benchmark framework for evaluating ALLM trustworthiness

[1] AudioBench: A Universal Benchmark for Audio Large Language Models PDF

Cannot Refute

[4] When audio and text disagree: Revealing text bias in large audio-language models PDF

Cannot Refute

[12] Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms PDF

Cannot Refute

[60] Ahelm: A holistic evaluation of audio-language models PDF

Cannot Refute

[61] Safebench: A safety evaluation framework for multimodal large language models PDF

Cannot Refute

[62] JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models PDF

Cannot Refute

[63] Air-bench: Benchmarking large audio-language models via generative comprehension PDF

Cannot Refute

[64] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models PDF

Cannot Refute

[65] Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models PDF

Cannot Refute

[66] MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models PDF

Cannot Refute

Contribution

Curated dataset of over 4,420 audio samples across 26 sub-tasks

[12] Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms PDF

Cannot Refute

[51] Windy events detection in big bioacoustics datasets using a pre-trained Convolutional Neural Network PDF

Cannot Refute

[52] A zero-shot model for diagnosing unknown composite faults in train bearings based on label feature vector generated fault features PDF

Cannot Refute

[53] Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance PDF

Cannot Refute

[54] Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou PDF

Cannot Refute

[55] THAI Speech Emotion Recognition (THAI-SER) corpus PDF

Cannot Refute

[56] An invasive species model and dataset for bioacoustic monitoring of common brushtail possum PDF

Cannot Refute

[57] Acoustic-Based Industrial Diagnostics: A Scalable Noise-Robust Multiclass Framework for Anomaly Detection PDF

Cannot Refute

[58] COVID-19 sounds: a large-scale audio dataset for digital respiratory screening PDF

Cannot Refute

[59] Listener Acoustic Personalization ChallengeâLAP24: Head-Related Transfer Function Dataset Harmonization PDF

Cannot Refute

Contribution

Human-validated automated evaluation pipeline for scalable assessment

[75] Human-Calibrated Automated Testing and Validation of Generative Language Models PDF

Can Refute

[67] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models PDF

Cannot Refute

[68] PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization PDF

Cannot Refute

[69] Large Language Model-Powered Automated Assessment: A Systematic Review PDF

Cannot Refute

[70] Alignbench: Benchmarking chinese alignment of large language models PDF

Cannot Refute

[71] STAIR-AIG: Optimizing the automated item generation process through human-AI collaboration for critical thinking assessment PDF

Cannot Refute

[72] MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation PDF

Cannot Refute

[73] Videoautoarena: An automated arena for evaluating large multimodal models in video analysis through user simulation PDF

Cannot Refute

[74] IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering PDF

Cannot Refute

[76] Chartcap: Mitigating hallucination of dense chart captioning PDF

Cannot Refute

AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models

Overview

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[12] Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms PDF

Contribution Analysis

AudioTrust benchmark framework for evaluating ALLM trustworthiness

[1] AudioBench: A Universal Benchmark for Audio Large Language Models PDF

[4] When audio and text disagree: Revealing text bias in large audio-language models PDF

[12] Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms PDF

[60] Ahelm: A holistic evaluation of audio-language models PDF

[61] Safebench: A safety evaluation framework for multimodal large language models PDF

[62] JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models PDF

[63] Air-bench: Benchmarking large audio-language models via generative comprehension PDF

[64] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models PDF

[65] Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models PDF

[66] MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models PDF

Curated dataset of over 4,420 audio samples across 26 sub-tasks

[12] Avtrustbench: Assessing and enhancing reliability and robustness in audio-visual llms PDF

[51] Windy events detection in big bioacoustics datasets using a pre-trained Convolutional Neural Network PDF

[52] A zero-shot model for diagnosing unknown composite faults in train bearings based on label feature vector generated fault features PDF

[53] Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance PDF

[54] Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou PDF

[55] THAI Speech Emotion Recognition (THAI-SER) corpus PDF

[56] An invasive species model and dataset for bioacoustic monitoring of common brushtail possum PDF

[57] Acoustic-Based Industrial Diagnostics: A Scalable Noise-Robust Multiclass Framework for Anomaly Detection PDF

[58] COVID-19 sounds: a large-scale audio dataset for digital respiratory screening PDF

[59] Listener Acoustic Personalization ChallengeâLAP24: Head-Related Transfer Function Dataset Harmonization PDF

Human-validated automated evaluation pipeline for scalable assessment

[75] Human-Calibrated Automated Testing and Validation of Generative Language Models PDF

[67] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models PDF

[68] PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization PDF

[69] Large Language Model-Powered Automated Assessment: A Systematic Review PDF

[70] Alignbench: Benchmarking chinese alignment of large language models PDF

[71] STAIR-AIG: Optimizing the automated item generation process through human-AI collaboration for critical thinking assessment PDF

[72] MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation PDF

[73] Videoautoarena: An automated arena for evaluating large multimodal models in video analysis through user simulation PDF

[74] IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering PDF

[76] Chartcap: Mitigating hallucination of dense chart captioning PDF

Table of Contents

[59] Listener Acoustic Personalization ChallengeâLAP24: Head-Related Transfer Function Dataset Harmonization PDF