UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Audio Language Model, Audio Understanding, Audio Generation
Abstract:

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-R1, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces UALM, a unified audio-language model integrating understanding, text-to-audio generation, and multimodal reasoning within a single autoregressive framework. It resides in the 'Autoregressive Unified Models' leaf, which contains only three papers: the submission itself, Step Audio, and one other sibling. This is a relatively sparse direction within a taxonomy of fifty papers across thirty-six topics, suggesting that autoregressive unification of audio tasks remains an emerging area compared with more populated branches such as general audio understanding or diffusion-based generation.

The taxonomy reveals that UALM sits at the intersection of multiple research streams. Its closest neighbors include diffusion-based audio-language models and multi-agent architectures within the unified models branch, while adjacent branches cover audio reasoning with chain-of-thought mechanisms and large-scale multimodal foundation models. The autoregressive approach contrasts with diffusion paradigms employed by models in the sibling leaf, and the emphasis on single-model unification diverges from modular multi-agent systems. The taxonomy's scope and exclude notes clarify that UALM's autoregressive token prediction distinguishes it from non-autoregressive alternatives, positioning it within a specific architectural philosophy.

Among the thirty candidates examined, the analysis finds varying degrees of overlap with prior work across the three contributions. For UALM-Gen, ten candidates were examined, two of which appear to provide overlapping prior work on LLM-based text-to-audio generation. For the unified UALM model, ten candidates were likewise examined and one refutable match was found, suggesting some precedent for unified audio understanding and generation architectures. For UALM-R1's cross-modal generative reasoning, ten candidates were examined with zero refutable matches, indicating that this contribution may represent a more novel direction within the limited search scope. These statistics reflect a focused semantic search rather than exhaustive coverage, leaving open the possibility of relevant work beyond the top-thirty matches.

Based on the limited literature search, UALM-R1's cross-modal reasoning appears most distinctive, while UALM-Gen and the unified model show some overlap with existing autoregressive and unified approaches. The sparse population of the autoregressive unified models leaf suggests the overall direction is less crowded, though the presence of sibling papers indicates concurrent exploration. The analysis covers top-thirty semantic matches and does not claim comprehensive field coverage, particularly for work outside autoregressive paradigms or published after the search cutoff.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: unified audio understanding, generation, and multimodal reasoning. The field encompasses a broad spectrum of approaches organized into seven main branches. Unified Audio-Language Models focus on end-to-end architectures that jointly process and generate audio and language, often employing autoregressive frameworks like UALM[0] and Step Audio[1]. Audio Understanding and Reasoning emphasizes interpretive capabilities, including chain-of-thought methods such as Audio Chain-of-Thought[12] and reasoning-focused systems like Audio Reasoner[17]. Multimodal Audio-Visual Generation targets synthesis tasks that combine sound with visual input, exemplified by MMAudio[29] and AudioGen Omni[19]. Large-Scale Multimodal Foundation Models represent comprehensive systems like Gemini[2] and Unified IO[4] that handle diverse modalities at scale. Multimodal Perception and Integration explores how different sensory streams are fused, while Human Multimodal Perception and Cognition investigates cognitive phenomena such as the McGurk Effect[41] and Bayesian Causal Inference[37]. Applied Multimodal Systems and Evaluation addresses practical deployment and benchmarking challenges.

Recent work reveals contrasting emphases between autoregressive unified models and specialized reasoning pipelines. Autoregressive approaches like UALM[0] and Step Audio[1] prioritize seamless generation and understanding within a single framework, trading explicit reasoning transparency for architectural simplicity. Meanwhile, systems such as Audio Chain-of-Thought[12] and ThinkSound[20] incorporate structured reasoning steps to enhance interpretability and complex problem-solving.

UALM[0] sits within the autoregressive unified branch alongside Step Audio[1] and shares an architectural philosophy with Unified IO[4], yet distinguishes itself by focusing specifically on audio-language integration rather than Unified IO[4]'s broader modality coverage. Compared to Audio Comprehension Enhancement[5], which targets incremental improvements in understanding, UALM[0] pursues a more holistic generation-understanding duality. Open questions persist around balancing model unification against task-specific performance and determining the optimal granularity of reasoning mechanisms across diverse audio contexts.

Claimed Contributions

UALM-Gen: LLM-based text-to-audio generation model

The authors introduce UALM-Gen, a decoder-only language model for text-to-audio generation that directly predicts audio tokens. Through data scaling, classifier-free guidance, and direct preference optimization, UALM-Gen achieves quality comparable to state-of-the-art diffusion-based models.

Retrieved papers: 10. Verdict: Can Refute.
UALM: unified model for audio understanding, generation, and text reasoning

The authors present UALM, a single language model that simultaneously handles audio understanding, text-to-audio generation, and text problem solving. Using careful data blending and a modality alignment stage, UALM matches specialized state-of-the-art models in each domain.

Retrieved papers: 10. Verdict: Can Refute.
UALM-R1: multimodal reasoning model with cross-modal generative reasoning

The authors introduce UALM-R1, which enables multimodal reasoning that uses both text and audio in intermediate thinking steps. This includes enrichment, dialogue, and self-reflection capabilities for complex generation tasks, representing the first demonstration of cross-modal generative reasoning in audio research.

Retrieved papers: 10.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
