Binaural prediction of speech intelligibility based on a blind model using automatic phoneme recognition

Jana Roßbach1,3, Saskia Röttges1,2, Christopher F. Hauth1,2, Thomas Brand1,2, Bernd T. Meyer1,3

1 Communication Acoustics, Carl von Ossietzky University, Oldenburg, Germany; 2 Medical Physics, Carl von Ossietzky University, Oldenburg, Germany; 3 Cluster of Excellence Hearing4all

Background: Models for speech intelligibility (SI) prediction are important tools in the development of signal processing algorithms and could be used to estimate the benefit of hearing aids. We explore the use of binaural information for modelling SI in reverberant conditions. We propose a model that is blind with respect to the separate speech and noise signals, in contrast to intrusive models, which require this prior information.

Methods: The model borrows an algorithm from automatic speech recognition (ASR), is referred to as BAPSI (binaural ASR-based prediction of speech intelligibility), and was first introduced in Roßbach et al. (2021), "Non-intrusive binaural prediction of speech intelligibility based on phoneme classification," Proc. ICASSP. The model receives a stereo signal (speech in noise) and uses a binaural frontend based on an equalization-cancellation mechanism (Hauth et al., 2020). The resulting signal serves as input to a deep neural network for phoneme classification, and the uncertainty of this classification is used for the prediction. The model is evaluated with data from normal-hearing listeners in three room conditions (anechoic, office, and cafeteria) and several azimuth angles of the noise. We compare the speech recognition thresholds (SRTs) of listeners to BAPSI and two intrusive baseline models: the binaural SI model BSIM06 (Beutelmann and Brand, 2006) and HASPI (Kates and Arehart, 2014) combined with better-ear listening (HASPI+BE).
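The abstract does not specify how classification uncertainty is quantified; a common choice for DNN classifiers is the mean per-frame entropy of the phoneme posteriors, which rises as noise flattens the network's output distribution. A minimal sketch under that assumption (the EC frontend and the trained DNN are presupposed; `phoneme_uncertainty` and the entropy measure are illustrative, not the published method):

```python
import numpy as np

def softmax(logits, axis=-1):
    # numerically stable softmax over the phoneme dimension
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phoneme_uncertainty(logits):
    """Mean per-frame entropy of phoneme posteriors.

    Hypothetical degradation measure: peaked posteriors (confident
    classification) give low entropy, flat posteriors give high entropy.
    logits: array of shape (frames, phonemes), e.g. DNN outputs per frame.
    """
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return entropy.mean()

# Toy check: sharp vs. near-uniform classifier outputs
rng = np.random.default_rng(0)
confident = rng.normal(size=(50, 40)) * 10.0   # peaked posteriors
degraded = rng.normal(size=(50, 40)) * 0.1     # near-uniform posteriors
assert phoneme_uncertainty(confident) < phoneme_uncertainty(degraded)
```

Mapping this uncertainty value to an SRT would additionally require a calibration step against reference data, which the sketch omits.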

Results: The root mean squared errors (RMSEs) of BAPSI (0.6-2.1 dB) are similar to the RMSEs of BSIM06 (0.3-1.8 dB) and lower than the RMSEs of HASPI+BE (3.1-3.7 dB). Additionally, the correlation coefficients are high (0.71-1.00) for all three models.

Conclusion: Classifiers based on deep learning appear promising for predicting speech intelligibility in binaural acoustic scenes.

Figure: (A) Signals are processed with a binaural processing stage; the resulting signal is converted to phoneme probabilities using a DNN, and the degradation of these probabilities is used to predict binaural SI. (B) Results in terms of the speech recognition threshold (SRT) for subjective data, our model (BAPSI), and the baseline models.