Declarative Audio Editing with Audio Language Model

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Audio editing, Latent diffusion model, Audio language model
Abstract:

Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio. These models fail to handle declarative audio editing, where the user declares what the desired outcome should be while leaving the details of the editing operations to the system. We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events. These operations are then executed by a diffusion model trained to manipulate stereo audio. To support this, we design a data synthesis pipeline that produces paired examples of high-level instructions, atomic edit operations, and audio before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods. Demos are provided in the supplementary file. Code and data will be released upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SmartDJ contributes a framework that decomposes high-level user instructions into atomic stereo audio edit operations via language model reasoning, then executes these operations using a diffusion model. The paper resides in the 'Declarative Audio Editing with Reasoning' leaf, which contains only four papers total within a 20-paper taxonomy. This indicates a relatively sparse and emerging research direction, suggesting the work addresses a problem space that has not yet attracted extensive prior exploration. The small sibling set implies limited direct competition in this specific paradigm of reasoning-driven stereo editing.

The taxonomy reveals that SmartDJ's nearest neighbors include 'Text-Conditioned Audio Generation' (covering music and general audio synthesis) and 'Spatial Audio Generation and Understanding' (addressing immersive 3D soundscapes and stereo positioning). While text-to-audio models generate content from prompts without explicit reasoning decomposition, and spatial audio systems focus on multi-channel formats or ambisonics, SmartDJ bridges these areas by combining declarative language control with stereo-specific manipulation. The taxonomy's scope notes clarify that template-based or direct synthesis approaches belong elsewhere, positioning SmartDJ's reasoning-based editing as a distinct methodological contribution within the broader language-guided audio landscape.

Across three core contributions, the analysis examined 30 candidate papers total, with 10 candidates per contribution. None of the contributions were clearly refuted by prior work among these 30 candidates. The SmartDJ framework itself, the data synthesis pipeline, and the audio language model as edit planner each showed no overlapping prior work within the examined set. This suggests that among the top-30 semantic matches and citation-expanded candidates, no single paper appears to provide the same combination of declarative reasoning, stereo editing, and atomic operation decomposition. However, the limited search scope means more exhaustive exploration could reveal closer precedents.

Given the sparse taxonomy leaf and absence of refuting candidates among 30 examined papers, SmartDJ appears to occupy a relatively novel position within the analyzed literature. The work's integration of language model reasoning with stereo diffusion editing distinguishes it from both template-driven audio tools and purely generative text-to-audio systems. Nonetheless, the analysis is constrained by top-K semantic search and does not cover the full breadth of audio processing or multimodal research, leaving open the possibility of related work in adjacent domains not captured here.

Taxonomy

Core-task Taxonomy Papers: 20
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Declarative stereo audio editing with language model reasoning.

The field encompasses several interconnected branches that together address how audio content can be generated, manipulated, and understood through computational methods. Language-Guided Audio Generation and Editing focuses on systems that accept natural language instructions to create or modify soundscapes, often leveraging diffusion models or autoregressive architectures. Spatial Audio Generation and Understanding tackles the challenge of producing and interpreting multi-channel or binaural audio with realistic spatial cues, essential for immersive experiences. Audio Representation Learning and Codecs develops efficient encodings and latent representations that enable high-fidelity reconstruction and manipulation. Multimodal Audio-Visual Systems integrate audio with visual or other sensory modalities to support richer content generation and cross-modal reasoning. Finally, Audio Processing Frameworks and Architectures provides the foundational tools and design patterns that underpin practical implementations across these domains.

Within Language-Guided Audio Generation and Editing, a particularly active line of work explores declarative editing paradigms where users specify high-level intentions rather than low-level parameters. Declarative Audio Editing [0] exemplifies this trend by combining language model reasoning with stereo audio manipulation, enabling users to describe desired transformations in natural language. This approach contrasts with earlier methods that required manual parameter tuning or domain-specific scripting. Nearby works such as Guiding Audio Editing [1] and ThinkSound [4] similarly emphasize reasoning-driven workflows, though they may differ in how they balance generative flexibility versus precise control over spatial attributes. Meanwhile, systems like Immersediffusion [3] and SALM [6] push toward richer spatial representations, highlighting an ongoing tension between declarative simplicity and the complexity of multi-channel audio scenes. Declarative Audio Editing [0] sits at the intersection of these themes, prioritizing interpretable language-based commands while addressing stereo-specific challenges that distinguish it from mono-focused or purely generative counterparts.

Claimed Contributions

SmartDJ framework for declarative stereo audio editing

The authors propose SmartDJ, which uses an Audio Language Model (ALM) to decompose declarative user instructions into atomic edit operations and a Latent Diffusion Model (LDM) to execute these operations sequentially on stereo audio. This enables users to specify desired outcomes declaratively rather than procedurally.
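The paper does not publish its operation schema, but the decomposition described above can be illustrated with a minimal sketch. All names below (`AtomicEdit`, `plan_from_instruction`, the field names) are hypothetical; the stand-in planner simply returns a fixed plan for one example instruction to show the kind of structured output an ALM planner would hand to the diffusion executor.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AtomicEdit:
    """Hypothetical schema for one atomic edit operation."""
    op: str                           # "add" | "remove" | "relocate"
    event: str                        # sound event the operation targets
    position: Optional[float] = None  # stereo pan in [-1, 1]; -1 = hard left

def plan_from_instruction(instruction: str) -> List[AtomicEdit]:
    """Stand-in for the ALM planner: returns a canned plan for one
    example instruction, purely to illustrate the output format."""
    if "party" in instruction:
        return [
            AtomicEdit(op="remove", event="crowd chatter"),
            AtomicEdit(op="add", event="soft jazz"),
            AtomicEdit(op="relocate", event="soft jazz", position=-0.5),
        ]
    return []

plan = plan_from_instruction("make the party sound calmer, music from the left")
for step in plan:
    print(step.op, step.event)
```

In the actual system each `AtomicEdit` would condition one diffusion pass, with the edited audio fed back as input to the next step.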

10 retrieved papers
Scalable data synthesis pipeline for declarative audio editing

The authors develop a data generation pipeline that uses GPT-4o as a designer to create declarative instructions and atomic operations, and audio signal processing as a composer to render paired examples of instructions, operations, and before-and-after audio for training and evaluation.
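The "composer" half of this pipeline is classical signal processing. As a hedged sketch of how an "add" operation with a spatial position could be rendered into a stereo pair, the snippet below uses constant-power panning; the function names and the choice of panning law are illustrative assumptions, not the paper's stated implementation.

```python
import numpy as np

def pan_stereo(mono: np.ndarray, pan: float) -> np.ndarray:
    """Constant-power pan of a mono event into stereo.
    pan in [-1, 1]: -1 = hard left, 0 = center, +1 = hard right."""
    theta = (pan + 1.0) * np.pi / 4.0    # map pan to [0, pi/2]
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right])       # shape (2, n_samples)

def add_event(mix: np.ndarray, event: np.ndarray, start: int) -> np.ndarray:
    """Render an 'add' operation: overlay a stereo event at a sample offset."""
    out = mix.copy()
    out[:, start:start + event.shape[1]] += event
    return out

sr = 16000
mix = np.zeros((2, sr))                                 # 1 s of stereo silence
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)
mix = add_event(mix, pan_stereo(tone, -1.0), start=0)   # event placed hard left
```

Pairing the "before" mix, the rendered "after" mix, and the operation that produced it yields exactly the kind of supervised triple the pipeline needs.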

10 retrieved papers
Audio Language Model as edit planner

The authors introduce the use of an Audio Language Model to serve as a planner that interprets both the original audio and declarative instructions, generating a sequence of atomic editing steps such as adding, removing, or spatially relocating sound events.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SmartDJ framework for declarative stereo audio editing

The authors propose SmartDJ, which uses an Audio Language Model (ALM) to decompose declarative user instructions into atomic edit operations and a Latent Diffusion Model (LDM) to execute these operations sequentially on stereo audio. This enables users to specify desired outcomes declaratively rather than procedurally.

Contribution

Scalable data synthesis pipeline for declarative audio editing

The authors develop a data generation pipeline that uses GPT-4o as a designer to create declarative instructions and atomic operations, and audio signal processing as a composer to render paired examples of instructions, operations, and before-and-after audio for training and evaluation.

Contribution

Audio Language Model as edit planner

The authors introduce the use of an Audio Language Model to serve as a planner that interprets both the original audio and declarative instructions, generating a sequence of atomic editing steps such as adding, removing, or spatially relocating sound events.