Declarative Audio Editing with Audio Language Model

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Audio editing, Latent diffusion model, Audio language model
Abstract:

Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio. These models fail to handle declarative audio editing, where the user declares what the desired outcome should be while leaving the details of the editing operations to the system. We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events. These operations are then executed by a diffusion model trained to manipulate stereo audio. To support this, we design a data synthesis pipeline that produces paired examples of high-level instructions, atomic edit operations, and audio before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods. Demos are provided in the supplementary file. Code and data will be released upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SmartDJ contributes a framework that decomposes high-level user instructions into atomic stereo audio edit operations via language model reasoning, then executes these operations using a diffusion model. The paper resides in the 'Declarative Audio Editing with Reasoning' leaf, which contains only four papers total within a 20-paper taxonomy. This indicates a relatively sparse and emerging research direction, suggesting the work addresses a problem space that has not yet attracted extensive prior exploration. The small sibling set implies limited direct competition in this specific paradigm of reasoning-driven stereo editing.

The taxonomy reveals that SmartDJ's nearest neighbors include 'Text-Conditioned Audio Generation' (covering music and general audio synthesis) and 'Spatial Audio Generation and Understanding' (addressing immersive 3D soundscapes and stereo positioning). While text-to-audio models generate content from prompts without explicit reasoning decomposition, and spatial audio systems focus on multi-channel formats or ambisonics, SmartDJ bridges these areas by combining declarative language control with stereo-specific manipulation. The taxonomy's scope notes clarify that template-based or direct synthesis approaches belong elsewhere, positioning SmartDJ's reasoning-based editing as a distinct methodological contribution within the broader language-guided audio landscape.

Across three core contributions, the analysis examined 30 candidate papers total, with 10 candidates per contribution. None of the contributions were clearly refuted by prior work among these 30 candidates. The SmartDJ framework itself, the data synthesis pipeline, and the audio language model as edit planner each showed no overlapping prior work within the examined set. This suggests that among the top-30 semantic matches and citation-expanded candidates, no single paper appears to provide the same combination of declarative reasoning, stereo editing, and atomic operation decomposition. However, the limited search scope means more exhaustive exploration could reveal closer precedents.

Given the sparse taxonomy leaf and absence of refuting candidates among 30 examined papers, SmartDJ appears to occupy a relatively novel position within the analyzed literature. The work's integration of language model reasoning with stereo diffusion editing distinguishes it from both template-driven audio tools and purely generative text-to-audio systems. Nonetheless, the analysis is constrained by top-K semantic search and does not cover the full breadth of audio processing or multimodal research, leaving open the possibility of related work in adjacent domains not captured here.

Taxonomy

Core-task Taxonomy Papers: 20
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Declarative stereo audio editing with language model reasoning.

The field encompasses several interconnected branches that together address how audio content can be generated, manipulated, and understood through computational methods. Language-Guided Audio Generation and Editing focuses on systems that accept natural language instructions to create or modify soundscapes, often leveraging diffusion models or autoregressive architectures. Spatial Audio Generation and Understanding tackles the challenge of producing and interpreting multi-channel or binaural audio with realistic spatial cues, essential for immersive experiences. Audio Representation Learning and Codecs develops efficient encodings and latent representations that enable high-fidelity reconstruction and manipulation. Multimodal Audio-Visual Systems integrate audio with visual or other sensory modalities to support richer content generation and cross-modal reasoning. Finally, Audio Processing Frameworks and Architectures provides the foundational tools and design patterns that underpin practical implementations across these domains.

Within Language-Guided Audio Generation and Editing, a particularly active line of work explores declarative editing paradigms where users specify high-level intentions rather than low-level parameters. Declarative Audio Editing [0] exemplifies this trend by combining language model reasoning with stereo audio manipulation, enabling users to describe desired transformations in natural language. This approach contrasts with earlier methods that required manual parameter tuning or domain-specific scripting. Nearby works such as Guiding Audio Editing [1] and ThinkSound [4] similarly emphasize reasoning-driven workflows, though they may differ in how they balance generative flexibility versus precise control over spatial attributes. Meanwhile, systems like Immersediffusion [3] and SALM [6] push toward richer spatial representations, highlighting an ongoing tension between declarative simplicity and the complexity of multi-channel audio scenes. Declarative Audio Editing [0] sits at the intersection of these themes, prioritizing interpretable language-based commands while addressing stereo-specific challenges that distinguish it from mono-focused or purely generative counterparts.

Claimed Contributions

SmartDJ framework for declarative stereo audio editing

The authors propose SmartDJ, which uses an Audio Language Model (ALM) to decompose declarative user instructions into atomic edit operations and a Latent Diffusion Model (LDM) to execute these operations sequentially on stereo audio. This enables users to specify desired outcomes declaratively rather than procedurally.
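The paper does not publish its operation schema, but the decomposition described above can be illustrated with a minimal sketch. All names below (`AtomicEdit`, `plan_from_instruction`, the field names) are hypothetical; the stand-in planner simply returns a fixed plan for one example instruction to show the kind of structured output an ALM planner would hand to the diffusion executor.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AtomicEdit:
    """Hypothetical schema for one atomic edit operation."""
    op: str                           # "add" | "remove" | "relocate"
    event: str                        # sound event the operation targets
    position: Optional[float] = None  # stereo pan in [-1, 1]; -1 = hard left

def plan_from_instruction(instruction: str) -> List[AtomicEdit]:
    """Stand-in for the ALM planner: returns a canned plan for one
    example instruction, purely to illustrate the output format."""
    if "party" in instruction:
        return [
            AtomicEdit(op="remove", event="crowd chatter"),
            AtomicEdit(op="add", event="soft jazz"),
            AtomicEdit(op="relocate", event="soft jazz", position=-0.5),
        ]
    return []

plan = plan_from_instruction("make the party sound calmer, music from the left")
for step in plan:
    print(step.op, step.event)
```

In the actual system each `AtomicEdit` would condition one diffusion pass, with the edited audio fed back as input to the next step.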

10 retrieved papers
Scalable data synthesis pipeline for declarative audio editing

The authors develop a data generation pipeline that uses GPT-4o as a designer to create declarative instructions and atomic operations, and audio signal processing as a composer to render paired examples of instructions, operations, and before-and-after audio for training and evaluation.
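The "composer" half of this pipeline is classical signal processing. As a hedged sketch of how an "add" operation with a spatial position could be rendered into a stereo pair, the snippet below uses constant-power panning; the function names and the choice of panning law are illustrative assumptions, not the paper's stated implementation.

```python
import numpy as np

def pan_stereo(mono: np.ndarray, pan: float) -> np.ndarray:
    """Constant-power pan of a mono event into stereo.
    pan in [-1, 1]: -1 = hard left, 0 = center, +1 = hard right."""
    theta = (pan + 1.0) * np.pi / 4.0    # map pan to [0, pi/2]
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right])       # shape (2, n_samples)

def add_event(mix: np.ndarray, event: np.ndarray, start: int) -> np.ndarray:
    """Render an 'add' operation: overlay a stereo event at a sample offset."""
    out = mix.copy()
    out[:, start:start + event.shape[1]] += event
    return out

sr = 16000
mix = np.zeros((2, sr))                                 # 1 s of stereo silence
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)
mix = add_event(mix, pan_stereo(tone, -1.0), start=0)   # event placed hard left
```

Pairing the "before" mix, the rendered "after" mix, and the operation that produced it yields exactly the kind of supervised triple the pipeline needs.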

10 retrieved papers
Audio Language Model as edit planner

The authors introduce the use of an Audio Language Model to serve as a planner that interprets both the original audio and declarative instructions, generating a sequence of atomic editing steps such as adding, removing, or spatially relocating sound events.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SmartDJ framework for declarative stereo audio editing

The authors propose SmartDJ, which uses an Audio Language Model (ALM) to decompose declarative user instructions into atomic edit operations and a Latent Diffusion Model (LDM) to execute these operations sequentially on stereo audio. This enables users to specify desired outcomes declaratively rather than procedurally.

Contribution

Scalable data synthesis pipeline for declarative audio editing

The authors develop a data generation pipeline that uses GPT-4o as a designer to create declarative instructions and atomic operations, and audio signal processing as a composer to render paired examples of instructions, operations, and before-and-after audio for training and evaluation.

Contribution

Audio Language Model as edit planner

The authors introduce the use of an Audio Language Model to serve as a planner that interprets both the original audio and declarative instructions, generating a sequence of atomic editing steps such as adding, removing, or spatially relocating sound events.