Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation

Miseul Kim$^{1,2}$, Soo-Whan Chung$^{1}$, Youna Ji$^{1}$, Hong-Goo Kang$^{2}$, Min-Seok Choi$^{1}$

$^1$Naver Cloud, South Korea, $^2$Yonsei University, South Korea

AST-LDM transfers the input speech (content prompt) to the target acoustic scene (reference prompt).

[Paper]

Abstract

This paper introduces a novel task in generative speech processing, Acoustic Scene Transfer (AST), which aims to transfer the acoustic scenes of speech signals to diverse environments. AST promises an immersive experience in speech perception by adapting the acoustic scene behind speech signals to desired environments. We propose AST-LDM for the AST task, which generates speech signals accompanied by the target acoustic scene of the reference prompt. Specifically, AST-LDM is a latent diffusion model conditioned on CLAP embeddings that describe target acoustic scenes in either the audio or the text modality. The contributions of this paper include introducing the AST task and implementing its baseline model. For AST-LDM, we emphasize its core framework, which is to preserve the input speech and generate audio that is consistent with both the given speech and the target acoustic environment. Experiments, including objective and subjective tests, validate the feasibility and efficacy of our approach.
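To make the conditioning idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes only what the abstract states: a reference prompt (text or audio) is mapped to one shared embedding space, and a diffusion process in a latent space is conditioned on that embedding together with the content speech. Everything else is hypothetical: `embed_text` and `embed_audio` are placeholders and do not call CLAP, `toy_denoiser` is a closed-form stand-in for a trained noise predictor, the dimensions and noise schedule are arbitrary, and the deterministic DDIM-style update is chosen here purely for illustration.

```python
import numpy as np

EMBED_DIM = 8           # hypothetical embedding size (not the real CLAP size)
LATENT_SHAPE = (4, 16)  # hypothetical latent shape: (channels, frames)
NUM_STEPS = 20          # hypothetical number of sampling steps


def embed_text(prompt: str) -> np.ndarray:
    """Placeholder text encoder (NOT CLAP): deterministic toy embedding.

    Assumes a non-empty prompt.
    """
    codes = np.frombuffer(prompt.encode("utf-8"), dtype=np.uint8).astype(np.float64)
    vec = np.resize(codes, EMBED_DIM)  # repeat or truncate to EMBED_DIM values
    return vec / (np.linalg.norm(vec) + 1e-8)


def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder audio encoder (NOT CLAP): chunk-wise RMS energy.

    Assumes the waveform has at least EMBED_DIM samples.
    """
    chunks = np.array_split(np.asarray(waveform, dtype=np.float64), EMBED_DIM)
    vec = np.array([np.sqrt(np.mean(c ** 2)) for c in chunks])
    return vec / (np.linalg.norm(vec) + 1e-8)


def toy_denoiser(z_t, alpha_bar_t, content_latent, scene_emb, proj):
    """Stand-in for a trained noise predictor.

    It pretends the clean latent is the content latent plus a projection of
    the scene embedding, and returns the noise implied by that guess.
    """
    x0_hat = content_latent + (proj @ scene_emb).reshape(LATENT_SHAPE)
    return (z_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)


def sample(content_latent, scene_emb, proj, seed=0):
    """Deterministic DDIM-style reverse process, conditioned on content and scene."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.2, NUM_STEPS)
    alpha_bars = np.cumprod(1.0 - betas)
    z = rng.standard_normal(LATENT_SHAPE)  # start from Gaussian noise
    for t in range(NUM_STEPS - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps_hat = toy_denoiser(z, a_t, content_latent, scene_emb, proj)
        x0_hat = (z - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return z


rng = np.random.default_rng(1)
proj = 0.1 * rng.standard_normal((LATENT_SHAPE[0] * LATENT_SHAPE[1], EMBED_DIM))
content_latent = rng.standard_normal(LATENT_SHAPE)  # stands in for encoded input speech

# The same sampler accepts a scene embedding from either modality.
text_scene = embed_text("A female speaks in a forest with birds singing melodiously.")
audio_scene = embed_audio(rng.standard_normal(16000))

out_text = sample(content_latent, text_scene, proj)
out_audio = sample(content_latent, audio_scene, proj)
print(out_text.shape, out_audio.shape)
```

The point of the sketch is the interface: because both prompt types land in the same embedding space, the sampler does not need to know whether the reference prompt was text or audio. In a real system the generated latent would still have to be decoded back to a waveform, a step this toy omits.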

AST with LLM Text reference prompt

Case 1 (Clean-to-Clean)

Reference prompt

Content prompt

Generated audio

"A male speaks in a quiet room."

"A female speaks in a quiet room."

"A male speaks in a quiet room."

Case 2 (Clean-to-Env)

Reference prompt

Content prompt

Generated audio

"A female speaks in a forest with birds singing melodiously."

"A female speaks in a large library with silence enveloping."

"A male speaks in a small concert hall with instruments tuning."

Case 3 (Env-to-Clean)

Reference prompt

Content prompt

Generated audio

"A female speaks in a quiet room."

"A male speaks in a quiet room."

"A male speaks in a quiet room."

Case 4 (Env-to-Env)

Reference prompt

Content prompt

Generated audio

"A female speaks in a large bedroom with thunder rumbling."

"A female speaks in a small concert hall with instruments tuning."

"A male speaks in a hilltop with birds chirping happily."

AST with Simulated Audio reference prompt

Case 1 (Clean-to-Clean)

Reference prompt

Content prompt

Generated audio

Case 2 (Clean-to-Env)

Reference prompt

Content prompt

Generated audio

Case 3 (Env-to-Clean)

Reference prompt

Content prompt

Generated audio

Case 4 (Env-to-Env)

Reference prompt

Content prompt

Generated audio

AST with AudioCaps Text reference prompt

Reference prompt

Content prompt

Generated audio

AudioLDM

"A man speaks into an echoing, a crowd laughs and cheers"

N/A

VoiceLDM

"A man speaks into an echoing, a crowd laughs and cheers"

"Nature of the effect produced by early impressions"

AST-LDM

"A man speaks into an echoing, a crowd laughs and cheers"

Reference prompt

Content prompt

Generated audio

"Crumpling paper noise with female speech"

N/A

"Crumpling paper noise with female speech"

"Is there not a median everywhere"

"Crumpling paper noise with female speech"

AST with AudioCaps Audio reference prompt

Reference prompt

Content prompt

Generated audio

AudioLDM

N/A

VoiceLDM

"To be or not to be that is the question whether its nobler"

AST-LDM

Reference prompt

Content prompt

Generated audio

N/A

"Again again"

LLM Text reference prompts

Here we share the text reference prompts generated with an LLM.

Feel free to download the file if you need it.

LLM_prompts.json