UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Audio Language Model, Audio Understanding, Audio Generation
Abstract:

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-R1, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces UALM, a unified audio-language model integrating understanding, text-to-audio generation, and multimodal reasoning within a single autoregressive framework. It resides in the 'Autoregressive Unified Models' leaf, which contains only three papers: the submission itself, Step Audio, and one other sibling. This is a relatively sparse direction within a taxonomy of fifty papers across thirty-six topics, suggesting that autoregressive unification of audio tasks remains an emerging area compared with more populated branches such as general audio understanding or diffusion-based generation.

The taxonomy reveals that UALM sits at the intersection of multiple research streams. Its closest neighbors include diffusion-based audio-language models and multi-agent architectures within the unified models branch, while adjacent branches cover audio reasoning with chain-of-thought mechanisms and large-scale multimodal foundation models. The autoregressive approach contrasts with diffusion paradigms employed by models in the sibling leaf, and the emphasis on single-model unification diverges from modular multi-agent systems. The taxonomy's scope and exclude notes clarify that UALM's autoregressive token prediction distinguishes it from non-autoregressive alternatives, positioning it within a specific architectural philosophy.

Among the thirty candidates examined, the analysis finds varying degrees of overlap with prior work across the three contributions. For UALM-Gen, ten candidates were examined, two of which appear to provide overlapping prior work on LLM-based text-to-audio generation. For the unified UALM model, ten candidates were likewise examined and one refutable match was found, suggesting some precedent for unified audio understanding and generation architectures. For UALM-R1's cross-modal generative reasoning, ten candidates were examined with zero refutable matches, indicating that this contribution may represent a more novel direction within the limited search scope. These statistics reflect a focused semantic search rather than exhaustive coverage, leaving open the possibility of relevant work beyond the top-thirty matches.

Based on the limited literature search, UALM-R1's cross-modal reasoning appears most distinctive, while UALM-Gen and the unified model show some overlap with existing autoregressive and unified approaches. The sparse population of the autoregressive unified models leaf suggests the overall direction is less crowded, though the presence of sibling papers indicates concurrent exploration. The analysis covers top-thirty semantic matches and does not claim comprehensive field coverage, particularly for work outside autoregressive paradigms or published after the search cutoff.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: unified audio understanding, generation, and multimodal reasoning. The field encompasses a broad spectrum of approaches organized into seven main branches. Unified Audio-Language Models focus on end-to-end architectures that jointly process and generate audio and language, often employing autoregressive frameworks like UALM[0] and Step Audio[1]. Audio Understanding and Reasoning emphasizes interpretive capabilities, including chain-of-thought methods such as Audio Chain-of-Thought[12] and reasoning-focused systems like Audio Reasoner[17]. Multimodal Audio-Visual Generation targets synthesis tasks that combine sound with visual input, exemplified by MMAudio[29] and AudioGen Omni[19]. Large-Scale Multimodal Foundation Models represent comprehensive systems like Gemini[2] and Unified IO[4] that handle diverse modalities at scale. Multimodal Perception and Integration explores how different sensory streams are fused, while Human Multimodal Perception and Cognition investigates cognitive phenomena such as the McGurk Effect[41] and Bayesian Causal Inference[37]. Applied Multimodal Systems and Evaluation addresses practical deployment and benchmarking challenges.

Recent work reveals contrasting emphases between autoregressive unified models and specialized reasoning pipelines. Autoregressive approaches like UALM[0] and Step Audio[1] prioritize seamless generation and understanding within a single framework, trading explicit reasoning transparency for architectural simplicity. Meanwhile, systems such as Audio Chain-of-Thought[12] and ThinkSound[20] incorporate structured reasoning steps to enhance interpretability and complex problem-solving.

UALM[0] sits within the autoregressive unified branch alongside Step Audio[1] and shares an architectural philosophy with Unified IO[4], yet distinguishes itself by focusing specifically on audio-language integration rather than Unified IO[4]'s broader modality coverage. Compared to Audio Comprehension Enhancement[5], which targets incremental improvements in understanding, UALM[0] pursues a more holistic generation-understanding duality. Open questions persist around balancing model unification against task-specific performance and determining the optimal granularity of reasoning mechanisms across diverse audio contexts.

Claimed Contributions

UALM-Gen: LLM-based text-to-audio generation model

The authors introduce UALM-Gen, a decoder-only language model for text-to-audio generation that directly predicts audio tokens. Through data scaling, classifier-free guidance, and direct preference optimization, UALM-Gen achieves quality comparable to state-of-the-art diffusion-based models.

Retrieved papers: 10. Verdict: Can Refute.
UALM: unified model for audio understanding, generation, and text reasoning

The authors present UALM, a single language model that simultaneously handles audio understanding, text-to-audio generation, and text problem solving. Using careful data blending and a modality alignment stage, UALM matches specialized state-of-the-art models in each domain.

Retrieved papers: 10. Verdict: Can Refute.
UALM-R1: multimodal reasoning model with cross-modal generative reasoning

The authors introduce UALM-R1, which enables multimodal reasoning that uses both text and audio in intermediate thinking steps. This includes enrichment, dialogue, and self-reflection capabilities for complex generation tasks, representing the first demonstration of cross-modal generative reasoning in audio research.

Retrieved papers: 10.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
