SpeechOp: Inference-Time Task Composition for Generative Speech Processing

Justin Lovelace, Rithesh Kumar, Jiaqi Su, Ke Chen, Kilian Q. Weinberger, Zeyu Jin

Abstract

While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities.

The following examples demonstrate SpeechOp's speech enhancement and foreground/background separation capabilities. Compare the enhancement results with and without TTS guidance (Implicit Task Composition), which uses an ASR-derived transcript to improve content preservation via our TC-CFG inference-time task composition strategy.

Noisy Speech (Example 1)

Noisy Input

Enhancement

SpeechOp Enhancement

Enhancement + TTS Guidance

Foreground / Background Separation

Foreground

Background

Noisy Speech (Example 2)

Noisy Input

Enhancement

SpeechOp Enhancement

Enhancement + TTS Guidance

Foreground / Background Separation

Foreground

Background

Reverberant Speech

Clean Reference (DAPS)

Reverberant Input (DAPS)

Enhancement

SpeechOp Enhancement

ITC Guided Enhancement

Personalized Enhancement

Foreground Separation

Foreground Separation + TTS Guidance

Speaker Separation

SpeechOp separates overlapping speakers from a mixture, using a reference speech sample to identify the target speaker. We show results on LibriMix with both clean and noisy backgrounds, with and without TTS guidance for improved content preservation.

Clean Background

Mixture

Speaker 0

Ground Truth

SpeechOp Separation

SpeechOp Separation + TTS Guidance

Speaker 1

Ground Truth

SpeechOp Separation

SpeechOp Separation + TTS Guidance

Noisy Background

Mixture

Speaker 0

Ground Truth

SpeechOp Separation

SpeechOp Separation + TTS Guidance

Speaker 1

Ground Truth

SpeechOp Separation

SpeechOp Separation + TTS Guidance

Zero-Shot Text-to-Speech

SpeechOp generates speech in a target speaker's voice given only a short speech prompt. Multi-task training on speech processing tasks actually improves TTS quality compared to a TTS-only baseline.

Example 1

Original Prompt

SpeechOp Result

Example 2

Original Prompt

SpeechOp Result

Speech Editing

SpeechOp edits a specific segment of an utterance while preserving the surrounding speech, maintaining speaker identity and natural prosody across the edit boundary.

Example 1

Original Audio

SpeechOp Edit

Example 2

Original Audio

SpeechOp Edit