Speech Stimuli Creation Using Deep-Learning-Based Voice Conversion

Authors: Anders Bargum1, Cumhur Erkut1, Stefania Serafin1

1Aalborg University

Background: In computational audiology, advances in deep-learning-driven voice conversion offer exciting prospects for improving auditory interventions. Current techniques preserve linguistic content during speaker transformation but rely mainly on pitch and formant adjustments through signal-processing methods such as the WORLD vocoder and phase vocoding. Parametric speaker manipulation with deep learning is therefore of interest: it can modify attributes such as timbre, pitch, and emotional nuance, opening new avenues for innovation and customization in auditory stimuli creation.
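As a point of reference for the signal-processing baseline mentioned above, the sketch below shows a typical pitch and formant manipulation using the pyworld Python bindings to the WORLD vocoder. The function name and scaling factors are illustrative choices, not details taken from this abstract.

import numpy as np
import pyworld as pw

def shift_pitch_and_formants(x, fs, pitch_factor=1.2, formant_factor=1.1):
    """Resynthesize speech with scaled F0 and a frequency-warped envelope."""
    x = x.astype(np.float64)                  # WORLD expects double precision
    f0, t = pw.harvest(x, fs)                 # F0 contour estimation
    sp = pw.cheaptrick(x, f0, t, fs)          # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)                 # aperiodicity

    f0_shifted = f0 * pitch_factor            # global pitch scaling

    # Crude formant shift: warp the spectral envelope along the frequency axis.
    n_bins = sp.shape[1]
    src_bins = np.clip(np.arange(n_bins) / formant_factor, 0, n_bins - 1)
    sp_warped = np.array(
        [np.interp(src_bins, np.arange(n_bins), frame) for frame in sp]
    )

    return pw.synthesize(f0_shifted, sp_warped, ap, fs)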

Method: Drawing inspiration from generative timbre-transfer models used in music generation, we propose a voice-conversion approach that operates directly in the time domain at high sampling rates. By combining speech representation learning with conditioning on external speaker information, the method guides an auto-encoder network towards linguistically relevant representations that are free of speaker information.
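The conditioning idea can be sketched as follows, assuming a convolutional auto-encoder implemented in PyTorch; layer sizes, kernel choices, and names are assumptions for illustration rather than the exact architecture used.

import torch
import torch.nn as nn

class ConditionedAutoEncoder(nn.Module):
    def __init__(self, content_dim=64, speaker_dim=128):
        super().__init__()
        # Content encoder: waveform -> representation intended to be speaker-free.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=15, stride=4, padding=7),
            nn.ReLU(),
            nn.Conv1d(128, content_dim, kernel_size=15, stride=4, padding=7),
        )
        # Decoder: content code plus broadcast speaker embedding -> waveform.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(content_dim + speaker_dim, 128,
                               kernel_size=16, stride=4, padding=6),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 1, kernel_size=16, stride=4, padding=6),
        )

    def forward(self, wav, speaker_emb):
        # wav: (batch, 1, samples); speaker_emb: (batch, speaker_dim)
        content = self.encoder(wav)
        spk = speaker_emb.unsqueeze(-1).expand(-1, -1, content.shape[-1])
        return self.decoder(torch.cat([content, spk], dim=1))

At conversion time, the speaker embedding of a target voice would be swapped in while the content encoding of the source utterance is kept, which is what allows the voice attributes to be manipulated parametrically.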

Results: Through objective metrics and subjective assessments, including mean opinion scores, we demonstrate that our method separates and re-inserts voice-specific attributes using disentanglement techniques. The converted speech is comparable to state-of-the-art techniques in naturalness, quality, and intelligibility. We show that speaker timbre is successfully disentangled from linguistic content, and we hypothesize that similar approaches could be applied to rhythm and emotion modification.
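One common way to quantify such disentanglement, given here as an assumption about the evaluation rather than a procedure stated above, is a speaker-identification probe trained on the content representation: probe accuracy near chance level indicates that little speaker information remains in the latent code.

import torch
import torch.nn as nn

def speaker_probe_accuracy(latents, speaker_ids, n_speakers, epochs=50):
    """latents: (N, D) content codes; speaker_ids: (N,) integer speaker labels."""
    probe = nn.Linear(latents.shape[1], n_speakers)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(latents), speaker_ids).backward()
        opt.step()
    # Evaluated on the same data for brevity; a held-out split is preferable.
    with torch.no_grad():
        pred = probe(latents).argmax(dim=1)
    return (pred == speaker_ids).float().mean().item()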

Conclusion: We provide a pipeline that effectively disentangles speaker-specific attributes from linguistic content, enabling the creation of parametric voice stimuli at high sampling rates. This approach holds promise as a replacement for current signal-processing techniques, offering greater flexibility and precision in auditory interventions tailored to individual needs and preferences.