A WaveNet-based cochlear filtering and hair cell transduction model for applications in speech and music processing

Authors: Anil Nagathil1*; Ian C. Bruce2

1Ruhr-Universität Bochum

2McMaster University

Background Computational models of the auditory periphery help to understand hearing mechanisms and can lay the foundation for bio-inspired speech and audio enhancement algorithms in hearing devices. Such models simulate cochlear processing and neural transduction in the hair cells and auditory nerve. While they can describe auditory processing accurately, they often entail high computational complexity, which prevents their use in real-time signal processing algorithms or machine-learning tasks. To circumvent these restrictions, auditory models can be approximated by deep neural networks (DNNs), which learn the non-linear and time-varying relationship between an input signal and its neural response. Such approximations execute faster and are fully differentiable, which makes them applicable in DNN-based speech and audio enhancement.

Method In this work we present a WaveNet-based approximation of the normal-hearing cochlear filtering and hair-cell transduction stages of a widely used auditory model. The WaveNet model was trained on a large data set of (noisy) speech and music presented at a wide range of sound pressure levels. It was evaluated with previously unseen speech and music signals and, additionally, with pure tones and click sounds.
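
To illustrate the kind of architecture meant by "WaveNet-based", the following is a minimal NumPy sketch of a stack of gated, dilated causal convolutions that maps a waveform to one output channel per characteristic frequency. It is not the trained model described here: the function names (`approximate_ihc`, `init_params`), the channel count, kernel size, dilation pattern, and number of output channels are all hypothetical choices made for illustration, and the weights are random.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def causal_dilated_conv(x, w, dilation):
    """x: (T, C_in), w: (K, C_in, C_out). y[t] uses only x[t], x[t-d], ..."""
    K, T = w.shape[0], x.shape[0]
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)  # left-pad only
    y = np.zeros((T, w.shape[2]))
    for k in range(K):
        # tap k looks back (K - 1 - k) * dilation samples
        y += xp[k * dilation:k * dilation + T] @ w[k]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)

def init_params(rng, channels, kernel_size, dilations, n_out):
    """Random weights; all sizes are illustrative assumptions."""
    s = 0.1
    blocks = [
        {
            "filt": s * rng.standard_normal((kernel_size, channels, channels)),
            "gate": s * rng.standard_normal((kernel_size, channels, channels)),
            "res": s * rng.standard_normal((channels, channels)),
        }
        for _ in dilations
    ]
    return {
        "w_in": s * rng.standard_normal((1, channels)),
        "blocks": blocks,
        "w_out": s * rng.standard_normal((channels, n_out)),
    }

def approximate_ihc(wave, params, dilations):
    """wave: (T,) waveform -> (T, n_out) outputs, one column per output channel."""
    h = wave[:, None] @ params["w_in"]
    for blk, d in zip(params["blocks"], dilations):
        # gated activation unit: tanh filter branch times sigmoid gate branch
        z = np.tanh(causal_dilated_conv(h, blk["filt"], d)) * sigmoid(
            causal_dilated_conv(h, blk["gate"], d)
        )
        h = h + z @ blk["res"]  # residual connection
    return h @ params["w_out"]
```

Because every convolution is padded on the left only, each output sample depends solely on current and past input samples, and doubling the dilation per layer lets the receptive field grow with the sum of the dilations while the layer count stays small.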

Results The WaveNet model approximates the original auditory model accurately for all test signals and reproduces cochlear excitation patterns as well as the DC and AC components of inner hair cell (IHC) receptor potentials. Even if the original auditory model implementation is executed on four CPUs in parallel, the WaveNet model runs 5 times faster on a single CPU and up to 250 times faster on a GPU.
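
As a hedged illustration of what comparing DC and AC components can look like for a pure-tone stimulus, the sketch below takes the DC component as the mean of the steady-state response relative to a resting value and the AC component as the amplitude of the response at the tone frequency. This is one common convention and not necessarily the analysis used in this work; the function name `dc_ac_components` and its parameters are assumptions.

```python
import numpy as np

def dc_ac_components(v, fs, f_tone, v_rest=0.0):
    """Estimate DC and AC components of a steady-state response to a pure tone.

    v      : response samples (e.g. a simulated receptor potential)
    fs     : sampling rate in Hz
    f_tone : stimulus frequency in Hz
    v_rest : resting value to subtract from the mean (assumed 0 by default)
    """
    v = np.asarray(v, dtype=float)
    t = np.arange(len(v)) / fs
    mean_v = np.mean(v)
    dc = mean_v - v_rest
    # Project the zero-mean response onto a complex exponential at f_tone;
    # twice the magnitude is the amplitude of that sinusoidal component.
    ac = 2.0 * np.abs(np.mean((v - mean_v) * np.exp(-2j * np.pi * f_tone * t)))
    return dc, ac
```

The estimate is most reliable when the analysis window covers a whole number of stimulus periods; the same function can then be applied to the reference model output and to the DNN output for a side-by-side comparison.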

Conclusion The proposed WaveNet-based auditory model is accurate and computationally efficient. Future work will extend the model towards hearing-impaired auditory processing, facilitating time-efficient DNN-based hearing loss compensation for speech and music signals.