Leveraging Cochlear Implant-Based Spatial Cues for Location-Guided Target Speaker Extraction in Dynamic Environments

Authors: Feyisayo Olalere1, Kiki van der Heijden2, Christiaan Stronks3, Jeroen Briaire3, Yagmur Guecluetuerk1, Johan Frijns3, Marcel van Gerven1

1Radboud University
2Donders Institute
3Leiden University Medical Center

Background: Selective attention, effortless for listeners with typical hearing, poses challenges for cochlear implant (CI) users. In everyday auditory environments with multiple sound sources and reverberation, CI users struggle to follow conversations, which compromises speech comprehension. Advanced CI technologies use beamforming to suppress background noise; another approach has CI users tag their target speaker with a dedicated microphone. Recent advances in deep learning have improved speech enhancement, with speaker extraction models surpassing traditional CI methods. This study investigates a compact time-domain deep neural network (DNN) that uses the spatial information of a moving target to extract speech from noisy backgrounds while retaining spatial cues.

Method: A time-domain DNN was trained on a realistic speech corpus containing speech from two speakers plus random noise. The clean speech was convolved with CI-specific head-related impulse responses (CI-HRIRs) and simulated room impulse responses (RIRs). The model learned to extract speech from implicit bilateral spatial cues, aided by the initial location of the target speaker. Subsequent optimization incorporated explicit spatial cues, such as the inter-microphone phase difference (IPD) and interaural level difference (ILD), to enhance model performance. The model’s output consisted of the extracted bilateral target speech with preserved spatial information.
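To make the cue computation concrete, the sketch below simulates a bilateral signal by convolving a source with left/right impulse responses and then derives per time-frequency ILD and IPD features from the two STFTs. The signal and impulse responses are synthetic placeholders, not the CI-HRIRs or RIRs used in the study; the exact feature definitions the model consumes are an assumption here.

```python
import numpy as np
from scipy.signal import fftconvolve, stft

rng = np.random.default_rng(0)
fs = 16000
speech = rng.standard_normal(fs)  # 1 s of noise standing in for clean speech

# Hypothetical left/right impulse responses standing in for measured
# CI-HRIRs convolved with simulated RIRs: the right ear receives a
# delayed, attenuated copy of the source.
hrir_left = np.r_[1.0, np.zeros(15)]
hrir_right = np.r_[np.zeros(8), 0.5, np.zeros(7)]

left = fftconvolve(speech, hrir_left)[: len(speech)]
right = fftconvolve(speech, hrir_right)[: len(speech)]

# Bilateral STFTs; spatial cues are computed per time-frequency bin.
_, _, L = stft(left, fs=fs, nperseg=512)
_, _, R = stft(right, fs=fs, nperseg=512)
eps = 1e-8
ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level difference, dB
ipd = np.angle(L * np.conj(R))                              # phase difference, radians

print(ild.shape, float(np.median(ild)))
```

With the 0.5 gain on the right channel, the ILD concentrates near 20·log10(2) ≈ 6 dB, illustrating how a lateralized source leaves a consistent signature that a DNN can exploit as an explicit input feature.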

Results: Preliminary findings indicate that the model can learn representations of the target talker and reconstruct speech in near real-time while retaining spatial cues as the target moves. The model demonstrated the ability to extract target speech using implicit cues alone, similar to the separation abilities of individuals with typical hearing.

Conclusion: Initial results suggest the feasibility of training DNNs for target talker extraction in real-world listening scenarios, enabling CI users to localize speakers naturally and enhance speech perception.