Abstract:

The rapid development and widespread adoption of Audio Large Language Models (ALLMs) require a rigorous assessment of their trustworthiness. However, existing evaluation frameworks, primarily designed for text, are not equipped to handle the unique vulnerabilities introduced by audio’s acoustic properties. We find that significant trustworthiness risks in ALLMs arise from non-semantic acoustic cues, such as timbre, accent, and background noise, which can be used to manipulate model behavior. To address this gap, we propose AudioTrust, the first framework for large-scale and systematic evaluation of ALLM trustworthiness concerning these audio-specific risks. AudioTrust spans six key dimensions: fairness, hallucination, safety, privacy, robustness, and authenticition. It is implemented through 26 distinct sub-tasks and a curated dataset of over 4,420 audio samples collected from real-world scenarios (e.g., daily conversations, emergency calls, and voice assistant interactions), purposefully constructed to probe the trustworthiness of ALLMs across multiple dimensions. Our comprehensive evaluation includes 18 distinct experimental configurations and employs human-validated automated pipelines to objectively and scalably quantify model outputs. Experimental results reveal the boundaries and limitations of 14 state-of-the-art (SOTA) open-source and closed-source ALLMs when confronted with diverse high-risk audio scenarios, thereby offering critical insights into the secure and trustworthy deployment of future audio models. Our platform and benchmark are publicly available at https://anonymous.4open.science/r/AudioTrust-8715/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers
50
3
Claimed Contributions
30
Contribution Candidate Papers Compared
1
Refutable Paper

Research Landscape Overview

Core task: trustworthiness evaluation of audio large language models. As audio-capable large language models become increasingly deployed, the field has organized around a multi-dimensional view of trustworthiness that spans adversarial robustness, hallucination detection, fairness, privacy, reliability, and explainability. The taxonomy reflects this breadth through branches addressing comprehensive assessment frameworks that evaluate models across multiple risk dimensions simultaneously (e.g., AudioTrust[0], Avtrustbench[12]), alongside specialized branches targeting individual concerns such as adversarial attacks (Audio Injection Robustness[3], Chat Audio Attacks[6]), modality conflicts (Audio Text Disagreement[4]), demographic bias (Spoken Dialogue Bias[41]), and privacy risks (Audio Private Profiling[9]). Additional branches cover instruction-following fidelity (IFEval Audio[14]), faithfulness and explainability (Faithfulness Audio Language[17]), and domain-specific applications ranging from mental health support (Mental Wellbeing Companions[21]) to speech quality assessment (Descriptive Speech Quality[19]). A particularly active line of work focuses on holistic benchmarking that integrates perception, reasoning, and safety dimensions, with AudioBench[1] and Holistic Audio Language Evaluation[2] establishing multi-faceted evaluation protocols. In contrast, many studies drill into specific vulnerabilities: adversarial robustness research explores injection attacks and jailbreaking (Chat Audio Attacks Benchmark[7], Multimodal Jailbreaking[36]), while hallucination studies examine object-level errors (Object Hallucination Audio[11], Audio Hallucination Assessment[26]) and mitigation strategies (AHAMask[8]). AudioTrust[0] sits within the comprehensive assessment branch, emphasizing multi-dimensional risk evaluation that spans adversarial, fairness, privacy, and reliability concerns in a unified framework. Compared to narrower benchmarks like Audio Injection Robustness[3] or domain-focused evaluations such as Interface Trust Health[5], AudioTrust[0] adopts a broader lens, aiming to capture the interplay among diverse trustworthiness facets rather than isolating individual threat models or application contexts.

Claimed Contributions

AudioTrust benchmark framework for evaluating ALLM trustworthiness

The authors introduce AudioTrust, the first comprehensive benchmark designed to systematically evaluate the trustworthiness of Audio Large Language Models across six critical dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework addresses unique vulnerabilities introduced by audio's acoustic properties that existing text-based evaluation frameworks cannot capture.

10 retrieved papers
Curated dataset of over 4,420 audio samples across 26 sub-tasks

The authors construct a large-scale dataset comprising over 4,420 audio samples spanning 26 distinct sub-tasks and 18 experimental configurations. The samples are purposefully collected from real-world scenarios such as daily conversations, emergency calls, and voice assistant interactions to probe trustworthiness across multiple dimensions.

10 retrieved papers
Human-validated automated evaluation pipeline for scalable assessment

The authors develop an automated evaluation pipeline that employs model-based evaluators (GPT-4o and Qwen3) with human validation to ensure rigorous and reproducible assessment. The pipeline achieves over 97% agreement rate with human experts and enables scalable quantification of model outputs across diverse high-risk audio scenarios.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AudioTrust benchmark framework for evaluating ALLM trustworthiness

The authors introduce AudioTrust, the first comprehensive benchmark designed to systematically evaluate the trustworthiness of Audio Large Language Models across six critical dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework addresses unique vulnerabilities introduced by audio's acoustic properties that existing text-based evaluation frameworks cannot capture.

Contribution

Curated dataset of over 4,420 audio samples across 26 sub-tasks

The authors construct a large-scale dataset comprising over 4,420 audio samples spanning 26 distinct sub-tasks and 18 experimental configurations. The samples are purposefully collected from real-world scenarios such as daily conversations, emergency calls, and voice assistant interactions to probe trustworthiness across multiple dimensions.

Contribution

Human-validated automated evaluation pipeline for scalable assessment

The authors develop an automated evaluation pipeline that employs model-based evaluators (GPT-4o and Qwen3) with human validation to ensure rigorous and reproducible assessment. The pipeline achieves over 97% agreement rate with human experts and enables scalable quantification of model outputs across diverse high-risk audio scenarios.

AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models | Novelty Validation