This paper introduces a novel task in generative speech processing, Acoustic Scene Transfer (AST), which aims to transfer the acoustic scene of a speech signal to diverse environments. AST promises an immersive experience in speech perception by adapting the acoustic scene behind speech signals to desired environments. We propose AST-LDM for the AST task, which generates speech signals accompanied by the target acoustic scene of the reference prompt. Specifically, AST-LDM is a latent diffusion model conditioned on CLAP embeddings that describe target acoustic scenes in either the audio or the text modality. The contributions of this paper are the introduction of the AST task and the implementation of a baseline model for it. The core of AST-LDM's framework is to preserve the input speech while generating audio that is consistent with both the given speech and the target acoustic environment. Experiments, including objective and subjective tests, validate the feasibility and efficacy of our approach.
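To make the conditioning idea in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the AST-LDM implementation: `embed_prompt` is a hypothetical stand-in for a CLAP encoder, `toy_denoiser` is a placeholder for a trained network, and the DDPM-style ancestral sampler with a linear beta schedule is an assumption, since the abstract only states that the model is a latent diffusion model. The sizes `EMBED_DIM` and `LATENT_DIM` are likewise illustrative. The only point it illustrates is that a text prompt and an audio prompt both map to a vector of the same shape, so the sampler can be conditioned on either one alongside the content speech.

```python
import math
import random

EMBED_DIM = 8   # hypothetical size; real CLAP embeddings are much larger
LATENT_DIM = 8  # hypothetical latent size


def _normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def embed_prompt(prompt, modality):
    """Stand-in for a CLAP encoder (NOT the real model).

    `prompt` is a caption string (text) or a list of samples (audio).
    Both modalities return a unit vector of the same size, which is the
    only property the sampler below relies on.
    """
    if modality == "text":
        seed = sum(ord(c) for c in prompt)
    elif modality == "audio":
        seed = int(sum(abs(s) for s in prompt) * 1000)
    else:
        raise ValueError(f"unknown modality: {modality}")
    rng = random.Random(seed)
    return _normalize([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])


def toy_denoiser(x_t, t, content_latent, scene_embedding):
    """Placeholder noise predictor; a trained network would go here."""
    return [x - 0.5 * c - 0.5 * e
            for x, c, e in zip(x_t, content_latent, scene_embedding)]


def sample(content_latent, scene_embedding, steps=50, seed=0):
    """DDPM-style ancestral sampling with a linear beta schedule (assumed)."""
    rng = random.Random(seed)
    betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    # Start from Gaussian noise in the latent space.
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for t in reversed(range(steps)):
        # The noise estimate sees both the speech content and the scene.
        eps = toy_denoiser(x, t, content_latent, scene_embedding)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


# Either modality yields a conditioning vector of the same shape.
text_scene = embed_prompt(
    "A female speaks in a forest with birds singing melodiously.", "text")
audio_scene = embed_prompt([0.1, -0.2, 0.3], "audio")
content = [0.1] * LATENT_DIM  # placeholder for an encoded content prompt
latent_from_text = sample(content, text_scene)
latent_from_audio = sample(content, audio_scene)
```

In a real system, the returned latent would be decoded back to a waveform; that stage is omitted here because the abstract does not describe it.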
Reference prompt
Content prompt
Generated audio
"A male speaks in a quiet room."
"A female speaks in a quiet room."
"A male speaks in a quiet room."
Reference prompt
Content prompt
Generated audio
"A female speaks in a forest with birds singing melodiously."
"A female speaks in a large library with silence enveloping."
"A male speaks in a small concert hall with instruments tuning."
Reference prompt
Content prompt
Generated audio
"A female speaks in a quiet room."
"A male speaks in a quiet room."
"A male speaks in a quiet room."
Reference prompt
Content prompt
Generated audio
"A female speaks in a large bedroom with thunder rumbling."
"A female speaks in a small concert hall with instruments tuning"
"A male speaks in a hilltop with birds chirping happily."
Reference prompt
Content prompt
Generated audio
"A man speaks into an echoing, a crowd laughs and cheers"
N/A
"A man speaks into an echoing, a crowd laughs and cheers"
"Nature of the effect produced by early impressions"
"A man speaks into an echoing, a crowd laughs and cheers"
Reference prompt
Content prompt
Generated audio
"Crumpling paper noise with female speech"
N/A
"Crumpling paper noise with female speech"
"Is there not a median everywhere"
"Crumpling paper noise with female speech"
Reference prompt
Content prompt
Generated audio
N/A
"To be or not to be that is the question whether its nobler"
Reference prompt
Content prompt
Generated audio
N/A
"Again again"
Here, we share the text reference prompts generated using an LLM.
Please feel free to download the text files if you need them!