Machine Recognition of Overlapping Speech Encoded in the Midbrain

Authors: Samuel Smith1, Mark Wallace1, Ananthakrishna Chintanpalli2, Michael Akeroyd1, Michael Heinz3, Christian Sumner4
1Univ. of Nottingham, Nottingham, United Kingdom;
2Birla Inst. of Technol. & Sci., Rajasthan, India;
3Purdue Univ., West Lafayette, IN; 4Nottingham Trent Univ., Nottingham, United Kingdom

Background: Humans are able to identify a conversation partner’s speech even when it is obscured by an interfering talker. A common conceptual model holds that the auditory system first performs speech segregation, utilising physical ‘bottom-up’ cues (e.g. pitch, onset asynchronies, glimpsing) prior to recognition. We instead ask whether ‘top-down’ recognition better describes perception: do listeners appear to be predicting and recognising the combined neural representation of overlapping speech?

Methods: Three speech-identification paradigms were explored: concurrently presented vowels with differing pitches, overlapping syllables with varying onset times, and syllables amongst a variety of clean and modified sentence segments. For each, neural responses were recorded from the midbrain of anaesthetised guinea pigs. A naïve Bayes classifier was trained to identify neural responses to combinations of speech sounds. Machine recognition was then compared with the performance of human listeners.

Results: The neural classifier accurately predicted human recognition of overlapping speech sounds. Improved identification of concurrent vowels with increasing pitch differences was quantitatively predicted. Improved identification of syllables as a function of temporal onset lag was quantitatively predicted. Improved identification of syllables amongst sentence segments with increasing spectro-temporal glimpses was quantitatively predicted. Further, the probabilistic classifier was able to predict listeners’ specific micro-decisions.

Conclusions: Machine recognition of overlapping speech encoded in the midbrain mimicked human perception. The advantages of basic auditory cues emerged from a general, prediction-driven strategy that had no explicit knowledge of those cues.


  • I have one clarification question: What is the input to the naive Bayesian classifier, and what are the output classes? I also did not understand how the model predicts human performance… Thanks in advance!

    • Hi Joanna. Thank you for your question.

      Generally speaking, the classifier was trained/tested on neural responses (PSTHs) to overlapping speech tokens.

      For example, in the concurrent-vowel identification task, the output classes were all pairwise combinations of vowels, e.g. /i,a/, /i,u/… . Each class was defined by a distribution of neural activity in response to the corresponding concurrent-vowel pair. To predict human recognition of a concurrent-vowel pair, the naïve Bayes classifier was stochastically fed (cross-validated) the associated neural responses (PSTHs) and tasked with identifying the class with the maximum a posteriori probability.
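      To make the scheme above concrete, here is a minimal sketch of that kind of classifier: each class (a concurrent-vowel pair) is modelled by per-bin Gaussians over neural PSTH responses, and a held-out response is assigned to the class with the maximum a posteriori probability (here with a flat prior, so MAP reduces to maximum likelihood). All names, parameters, and the synthetic "neural" data are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

CLASSES = ["/i,a/", "/i,u/", "/a,u/"]  # pairwise vowel combinations (illustrative)
N_BINS = 50                            # PSTH time bins
N_TRIALS = 20                          # training responses per class

# Synthetic stand-in for neural data: each class has a characteristic PSTH
# shape, and individual responses are noisy copies of it.
templates = {c: rng.poisson(5.0, N_BINS).astype(float) for c in CLASSES}
train = {c: templates[c] + rng.normal(0.0, 1.0, (N_TRIALS, N_BINS))
         for c in CLASSES}

# Fit a per-class, per-bin Gaussian (the naive Bayes independence assumption:
# bins are treated as conditionally independent given the class).
means = {c: train[c].mean(axis=0) for c in CLASSES}
vars_ = {c: train[c].var(axis=0) + 1e-6 for c in CLASSES}  # floor avoids /0

def log_likelihood(psth, c):
    """Sum of per-bin Gaussian log-densities for class c."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * vars_[c])
                         + (psth - means[c]) ** 2 / vars_[c])

def classify(psth):
    """MAP decision under a flat prior over classes."""
    return max(CLASSES, key=lambda c: log_likelihood(psth, c))

# A noisy test response generated from the /i,a/ template:
test_psth = templates["/i,a/"] + rng.normal(0.0, 1.0, N_BINS)
print(classify(test_psth))
```

      In the study itself, the posterior probabilities (rather than only the winning class) are what allow trial-by-trial comparison with listeners' confusions, since graded posteriors can be matched against the distribution of human responses.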

      If you would like to follow up on this response or have other questions, please email me at