Having choices for enhancing voices: Target speaker extraction in noisy multi-talker environments using deep neural networks

Authors: Iordanis Thoidis¹*; Tobias Goehring²

¹Aristotle University of Thessaloniki

²University of Cambridge

Background: Following and understanding speech in noisy environments, especially in situations with two or more competing speakers, is a challenging task, particularly for hearing-impaired listeners but also for normal-hearing listeners. Despite their popularity and ongoing improvement, most assistive listening devices and speech enhancement (SE) approaches still do not perform well enough in noisy multi-talker environments, as they fail to restore the listener's ability to focus on one source of interest among competing sources.

Method: A quasi-causal SE model based on the dual-path recurrent neural network architecture was trained to extract the voice of a target speaker, indicated by a short enrollment utterance, from a noisy mixture containing one, two, or three speakers. To quantify the effect on speech perception, a double-blind sentence recognition test was conducted with 14 normal-hearing Greek listeners. Stimuli comprised mixtures of one, two, or three speakers in restaurant noise, presented in three conditions: (a) unprocessed, (b) processed by a speaker-uninformed SE model, and (c) processed by the proposed speaker-informed SE model.
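The abstract gives no implementation details, so the following PyTorch sketch is only a hypothetical illustration of the kind of model described: a learned convolutional encoder, a speaker embedding computed from the enrollment utterance, FiLM-style conditioning to inject the speaker identity (an assumption; the actual conditioning mechanism is not stated), simplified dual-path blocks with a unidirectional inter-chunk RNN in line with quasi-causal processing, and a mask-and-decode stage. All layer names and sizes are illustrative, not the authors' configuration.

    import torch
    import torch.nn as nn

    class DualPathBlock(nn.Module):
        """Simplified dual-path block: a bidirectional LSTM models structure
        within each chunk; a unidirectional LSTM across chunks keeps that
        stage causal, consistent with quasi-causal processing."""
        def __init__(self, dim, hidden):
            super().__init__()
            self.intra = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            self.intra_proj = nn.Linear(2 * hidden, dim)
            self.inter = nn.LSTM(dim, hidden, batch_first=True)
            self.inter_proj = nn.Linear(hidden, dim)

        def forward(self, x):                               # x: (B, N, K, S)
            B, N, K, S = x.shape
            h = x.permute(0, 3, 2, 1).reshape(B * S, K, N)   # process within chunks
            h = self.intra_proj(self.intra(h)[0])
            x = x + h.view(B, S, K, N).permute(0, 3, 2, 1)
            h = x.permute(0, 2, 3, 1).reshape(B * K, S, N)   # process across chunks
            h = self.inter_proj(self.inter(h)[0])
            x = x + h.view(B, K, S, N).permute(0, 3, 1, 2)
            return x

    class TargetSpeakerExtractor(nn.Module):
        """Hypothetical speaker-informed extractor: encode, condition on the
        enrollment embedding, run dual-path blocks, mask, and decode."""
        def __init__(self, feats=64, hidden=128, blocks=4,
                     kernel=16, stride=8, chunk=100):
            super().__init__()
            self.chunk = chunk
            self.encoder = nn.Conv1d(1, feats, kernel, stride=stride, bias=False)
            self.spk_rnn = nn.GRU(feats, feats, batch_first=True)  # enrollment -> embedding
            self.film = nn.Linear(feats, 2 * feats)    # FiLM-style conditioning (assumed)
            self.blocks = nn.ModuleList(DualPathBlock(feats, hidden) for _ in range(blocks))
            self.mask = nn.Conv1d(feats, feats, 1)
            self.decoder = nn.ConvTranspose1d(feats, 1, kernel, stride=stride, bias=False)

        def forward(self, mixture, enrollment):              # both: (B, samples)
            mix = self.encoder(mixture.unsqueeze(1))         # (B, N, T)
            enr = self.encoder(enrollment.unsqueeze(1))
            emb = self.spk_rnn(enr.transpose(1, 2))[0][:, -1]        # (B, N)
            gamma, beta = self.film(emb).chunk(2, dim=-1)
            x = mix * gamma.unsqueeze(-1) + beta.unsqueeze(-1)       # inject speaker identity
            B, N, T = x.shape
            pad = (-T) % self.chunk                          # pad to whole chunks
            x = nn.functional.pad(x, (0, pad)).view(B, N, -1, self.chunk).transpose(2, 3)
            for block in self.blocks:
                x = block(x)
            x = x.transpose(2, 3).reshape(B, N, -1)[:, :, :T]
            est = mix * torch.sigmoid(self.mask(x))          # mask the mixture features
            return self.decoder(est).squeeze(1)

    # Example: extract one target voice from a noisy multi-talker mixture.
    model = TargetSpeakerExtractor()
    mixture = torch.randn(1, 32000)     # 2 s of audio at 16 kHz
    enrollment = torch.randn(1, 16000)  # short enrollment utterance of the target
    estimate = model(mixture, enrollment)   # (1, 32000) extracted target voice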

Results: Objective evaluation based on computational metrics demonstrated that the proposed speaker-informed SE model effectively extracts the target speaker from a mixture. Preliminary results of the subjective evaluation showed that both SE models increased word recognition scores by about 20% at -3 dB SNR, but not at 0 dB SNR, where performance was at ceiling. In the multi-speaker conditions, preliminary results indicate that the speaker-informed model yielded larger improvements than the speaker-uninformed model and the unprocessed condition.
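The computational metrics are not named in the abstract. The scale-invariant signal-to-distortion ratio (SI-SDR) is a standard objective metric for speaker extraction; a minimal sketch, assuming time-domain estimate and reference tensors of equal length:

    import torch

    def si_sdr(estimate, reference, eps=1e-8):
        """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
        # remove the mean so the metric ignores DC offsets
        estimate = estimate - estimate.mean(dim=-1, keepdim=True)
        reference = reference - reference.mean(dim=-1, keepdim=True)
        # project the estimate onto the reference to isolate the target component
        scale = (estimate * reference).sum(-1, keepdim=True) / (
            (reference ** 2).sum(-1, keepdim=True) + eps)
        target = scale * reference
        noise = estimate - target
        return 10 * torch.log10((target ** 2).sum(-1) / ((noise ** 2).sum(-1) + eps))

    # e.g. si_sdr(model_output, clean_target) on (batch, samples) tensors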

Conclusion: We demonstrate that speaker-informed SE approaches can improve speech perception in noisy multi-talker environments beyond what speaker-uninformed SE approaches achieve, across a range of conditions. The proposed method is feasible for real-time processing and requires only a short enrollment utterance of the target speaker.