Predicting fusion of dichotic vowels in normal hearing listeners with a physiologically-based model

Langchen Fan1, Michelle R. Molis2, Lina A. J. Reiss1,2

1 Department of Otolaryngology, Oregon Health & Science University, Oregon, US; 2 National Center for Rehabilitative Auditory Research, VA Portland Health Care System, Oregon, US

Background: Fundamental frequency (F0) is an important cue for speech segregation. Listeners with normal hearing are more likely to correctly identify both vowels of a dichotically presented pair if the two vowels differ in F0; when the F0s are the same, however, they overwhelmingly identify the pair as a single vowel (i.e., fuse the two vowels) (Reiss & Molis, 2021). A physiologically based model was applied to explore the neural mechanism underlying this fusion.

Method: Six subjects with normal hearing participated. First, to obtain individualized vowel identification maps, subjects identified 90 single synthetic vowels whose first and second formant values varied evenly across the vowel space. Next, subjects listened to dichotic vowel pairs (selected from four exemplars) and identified the vowel(s) heard. Vowel identification was modeled with a phenomenological auditory nerve (AN) model (Zilany et al., 2014), a relay cochlear nucleus model, and a same-frequency inhibition-excitation inferior colliculus (IC) model (Carney et al., 2015). AN model output was used to determine whether the F0s of the dichotic vowel pair were the same or different. When the F0s were the same, the IC responses to the two vowels were averaged to simulate a fused vowel percept and predict a single vowel response. Otherwise, the IC responses to the two vowels were used separately to predict two vowel responses. Template responses were obtained for the 90 vowels used for the vowel identification map. Predicted vowel identification was based on the similarity between those templates and the model output (e.g., compare Fig. 1D to Fig. 1A-C).
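
The decision rule described in the Method can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each IC response is summarized as a 1-D vector (for example, rate across characteristic frequency), the F0 estimates are passed in directly rather than derived from AN model output, "same F0" means within a hypothetical tolerance `f0_tol_hz`, and similarity is measured by Pearson correlation, which the abstract does not specify. The function names `most_similar_template` and `predict_dichotic` are illustrative only.

```python
import numpy as np


def most_similar_template(response, templates, labels):
    """Return the label of the template most correlated with `response`.

    Assumption: similarity is Pearson correlation between 1-D response
    profiles; the abstract does not state which similarity measure was used.
    """
    r = response - response.mean()
    t = templates - templates.mean(axis=1, keepdims=True)
    corr = (t @ r) / (np.linalg.norm(t, axis=1) * np.linalg.norm(r))
    return labels[int(np.argmax(corr))]


def predict_dichotic(ic_left, ic_right, f0_left, f0_right,
                     templates, labels, f0_tol_hz=1.0):
    """Predict the vowel(s) reported for a dichotic vowel pair.

    Same F0 (within the hypothetical tolerance `f0_tol_hz`): average the two
    IC responses to simulate a fused percept and return one label.
    Different F0s: classify each ear's IC response separately and return two.
    """
    if abs(f0_left - f0_right) <= f0_tol_hz:
        fused = (ic_left + ic_right) / 2.0  # binaural averaging
        return [most_similar_template(fused, templates, labels)]
    return [
        most_similar_template(ic_left, templates, labels),
        most_similar_template(ic_right, templates, labels),
    ]
```

In this sketch, `templates` would hold one row per single vowel from the identification map (90 rows in the study) and `labels` the vowel category each subject assigned to it, so predictions are individualized by swapping in each subject's labels. Templates with a constant profile would need special handling, since their correlation is undefined.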

Results: Consistent with previous studies, subjects often fused dichotic vowel pairs with the same F0, but not those with different F0s. The model predictions for fused vowels were similar to the responses of human subjects, especially for low F0s.

Conclusion: A physiologically-based model utilizing binaural averaging can simulate some dichotic vowel fusion percepts in human listeners.
