SpeechOp: Inference-Time Task Composition for Generative Speech Processing

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: speech generation, TTS, enhancement, diffusion, latent diffusion
Abstract:

While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SpeechOp, a multi-task latent diffusion model that adapts pre-trained TTS models for universal speech processing with inference-time task composition. It resides in the 'Latent Diffusion Models for Speech-to-Speech Processing' leaf, which currently contains only this paper, with no siblings. This position indicates a relatively sparse research direction within the broader taxonomy of 36 topics and 50 papers, suggesting the work explores a less crowded niche focused on latent diffusion for speech-to-speech tasks rather than text-to-audio or multi-modal generation.

The taxonomy reveals neighboring directions including foundational audio transformers (e.g., Fugatto) and instruction-following audio-language models, which pursue unified multi-task learning through extensive pretraining rather than inference-time composition. Another adjacent branch covers LLM-based task decomposition frameworks (e.g., Wavjourney, Audio-agent) that orchestrate specialized modules via language models. SpeechOp diverges by embedding compositional control within a single latent diffusion backbone, avoiding both the modularity overhead of LLM orchestration and the fixed task sets of end-to-end multi-task models.

Among 30 candidates examined across the three contributions, none were found to clearly refute the proposed methods. For the SpeechOp model, 10 candidates were reviewed with no refutable overlap; similarly, 10 candidates each were examined for Task-Composition Classifier-Free Guidance and for Implicit Task Composition, without identifying prior work that directly anticipates these mechanisms. This limited search scope suggests that, within the top-30 semantic matches, the specific combination of latent diffusion adaptation, inference-time task composition, and ASR-guided enhancement appears relatively unexplored, though the analysis does not claim exhaustive coverage of the broader literature.

Given the sparse taxonomy leaf and absence of refutable candidates in the limited search, the work appears to occupy a novel intersection of latent diffusion, TTS adaptation, and compositional inference. However, the analysis is constrained by examining only 30 candidates from semantic search, leaving open the possibility of relevant prior work outside this scope. The taxonomy context suggests the approach is distinctive within the surveyed field, though broader validation would strengthen claims of novelty.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: inference-time task composition for generative speech processing. The field encompasses diverse approaches to building flexible, multi-task generative systems for speech and audio. At the highest level, one branch focuses on compositional audio generation via LLM-based task decomposition, where large language models orchestrate sequences of specialized modules to handle complex user requests (e.g., Wavjourney[1], Audio-agent[6]). Another major direction involves unified generative models for multi-task speech and audio, which train single architectures, often diffusion or flow-based, to handle multiple tasks end-to-end without explicit decomposition (e.g., Fugatto[3], Lumina[2]).

Cross-modal synchronization and alignment addresses the challenge of keeping audio, visual, or textual streams coherent during generation, while flexible multi-modal generation with flow-based architectures explores continuous-time modeling for richer expressiveness. Additional branches cover compositional cognitive architectures that integrate generative models with symbolic reasoning (e.g., Autotelic AI[9], Vygotskian AI[11]), specialized applications targeting domain-specific optimization (e.g., Gaming Soundtrack[12], Power Optimization[13]), and data processing infrastructure for foundation model training (e.g., Data-Juicer[8]).

Within the unified generative models branch, a particularly active line of work explores latent diffusion models for speech-to-speech processing, balancing expressiveness with computational efficiency by operating in learned latent spaces. SpeechOp[0] sits squarely in this cluster, emphasizing inference-time composition of multiple speech transformations within a single diffusion framework.
This contrasts with LLM-orchestrated pipelines like Wavjourney[1] or Audio-agent[6], which decompose tasks across separate modules at inference time, and with end-to-end multi-task models like Fugatto[3] that rely on extensive multi-task pretraining rather than flexible runtime composition. A key trade-off emerges between the modularity and interpretability of LLM-based decomposition versus the compactness and potential for emergent capabilities in unified architectures. SpeechOp[0] occupies a middle ground by enabling compositional control within a unified latent diffusion backbone, offering a pathway to flexible task combination without the overhead of multiple specialist models or the rigidity of fixed multi-task training.

Claimed Contributions

SpeechOp multi-task latent diffusion model

The authors introduce SpeechOp, a multi-task framework that adapts pre-trained text-to-speech models to handle diverse speech-to-speech processing tasks such as enhancement, separation, and acoustic matching. This adaptation not only enables versatile speech processing but also improves the underlying TTS quality through multi-task learning.

10 retrieved papers
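The adaptation described above amounts to task-conditioned denoising: one shared network serves every speech task, selected by a task embedding passed alongside the noisy latent and the conditioning signal. A minimal numpy sketch of that conditioning pathway, assuming a toy linear "network" and illustrative dimensions (this is not the authors' architecture):

```python
import numpy as np

# One shared denoiser handles several speech tasks by receiving a task
# embedding alongside the noisy latent and task-specific conditioning.
TASKS = {"tts": 0, "enhance": 1, "separate": 2, "acoustic_match": 3}
LATENT_DIM, EMB_DIM = 16, 4

rng = np.random.default_rng(0)
task_table = rng.standard_normal((len(TASKS), EMB_DIM))        # learned in practice
W = rng.standard_normal((LATENT_DIM + EMB_DIM, LATENT_DIM)) * 0.1  # toy "network"

def denoise(x_t, cond, task):
    """Toy stand-in for the shared score network eps(x_t, cond, task)."""
    emb = task_table[TASKS[task]]
    # Broadcast the task embedding across latent frames and concatenate
    # it with the (latent + conditioning) features before the "network".
    h = np.concatenate(
        [x_t + cond, np.broadcast_to(emb, (x_t.shape[0], EMB_DIM))], axis=-1
    )
    return h @ W  # predicted noise, one vector per latent frame

frames = 10
x_t = rng.standard_normal((frames, LATENT_DIM))   # noisy speech latents
cond = rng.standard_normal((frames, LATENT_DIM))  # e.g. text or noisy-audio features
eps_tts = denoise(x_t, cond, "tts")
eps_enh = denoise(x_t, cond, "enhance")
```

Because only the task embedding differs between calls, the same weights produce different denoising directions per task, which is what lets multi-task training also feed back into core TTS quality.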
Task-Composition Classifier-Free Guidance (TC-CFG)

The authors develop TC-CFG, a principled method for composing speech tasks at inference time without requiring joint training. This approach decomposes score functions to isolate discriminative guidance from generative priors, enabling effective combination of operations like enhancement and TTS while maintaining acoustic quality and providing tunable control over content restoration versus acoustic fidelity.

10 retrieved papers
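The score decomposition described above can be sketched with the standard classifier-free-guidance identity: each difference (eps_cond - eps_uncond) isolates a task's discriminative direction, which is then reweighted and added back onto the generative prior. A minimal numpy sketch under that assumption (function and weight names are illustrative, not the paper's API):

```python
import numpy as np

def compose_scores(eps_uncond, task_scores, weights):
    """Compose task-conditional noise predictions at inference time.

    eps_uncond  -- unconditional prediction (the generative prior)
    task_scores -- dict: task name -> conditional prediction for that task
    weights     -- dict: task name -> guidance weight w_i

    Each (eps_cond - eps_uncond) term isolates the discriminative
    guidance for one task; the weighted terms are summed onto the prior.
    """
    eps = eps_uncond.copy()
    for name, eps_cond in task_scores.items():
        eps += weights[name] * (eps_cond - eps_uncond)
    return eps

rng = np.random.default_rng(1)
shape = (10, 16)
eps_uncond = rng.standard_normal(shape)
scores = {"enhance": rng.standard_normal(shape),
          "tts": rng.standard_normal(shape)}

# Raising the "tts" weight pushes sampling toward the transcript
# (content restoration); raising "enhance" favors acoustic fidelity.
eps = compose_scores(eps_uncond, scores, {"enhance": 1.5, "tts": 2.0})
```

Setting all weights to zero recovers the unconditional prior, and a single task with weight 1 recovers plain conditional sampling, which is what makes the control tunable rather than retrained.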
Implicit Task Composition (ITC) pipeline

The authors propose ITC, which integrates ASR-derived transcripts from models like Whisper with their TC-CFG method to guide speech enhancement. This pipeline achieves state-of-the-art content preservation by robustly combining web-scale speech understanding from discriminative models with SpeechOp's generative capabilities, without requiring paired noisy-clean-transcript training data.

10 retrieved papers
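The two-step flow above can be sketched as: an off-the-shelf ASR model transcribes the noisy input, and the transcript then conditions enhancement alongside the noisy audio. In this sketch both `asr_transcribe` and `speechop_enhance` are hypothetical placeholders (a real system would call Whisper and the diffusion sampler respectively):

```python
import numpy as np

def asr_transcribe(noisy_audio):
    # Placeholder for a web-scale ASR model such as Whisper, which is
    # robust to noise and supplies the content signal.
    return "hello world"

def speechop_enhance(noisy_audio, transcript, w_enhance=1.5, w_text=2.0):
    # Placeholder for the enhancement sampler: in the real pipeline the
    # transcript and the noisy audio would condition composed guidance
    # terms with weights w_text and w_enhance.
    rng = np.random.default_rng(0)
    clean = noisy_audio - 0.1 * rng.standard_normal(noisy_audio.shape)  # toy denoise
    meta = {"w_enhance": w_enhance, "w_text": w_text, "transcript": transcript}
    return clean, meta

noisy = np.random.default_rng(2).standard_normal(16000)   # 1 s of "audio" at 16 kHz
transcript = asr_transcribe(noisy)                        # step 1: ASR gives content
enhanced, info = speechop_enhance(noisy, transcript)      # step 2: guided enhancement
```

The point of the pipeline is that no paired noisy-clean-transcript corpus is needed: the transcript is produced on the fly by the discriminative ASR model and consumed through inference-time composition.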

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
