Modeling Auditory Attention with Deep Neural Networks

Authors: Ian Griffith¹, R. Preston Hess², Josh McDermott²

¹Harvard University
²Massachusetts Institute of Technology

Background: Attention enables communication in settings with multiple talkers by selecting sources of interest based on prior knowledge of their features. Neurophysiology experiments implicate multiplicative gains in selective attention, but it is unclear whether such gains are sufficient to account for human attention-mediated behavior.

Methods: We optimized a deep neural network (DNN) to report the words spoken by a cued talker in a multi-source mixture from binaural audio input (a “cocktail party” setting). Audio was spatialized in simulated reverberant rooms using head-related transfer functions. Attentional gains, implemented as learnable logistic functions of the time-averaged representation of the cued talker, were applied multiplicatively to the representation of the mixture, scaling its activations up or down (a sketch of this mechanism follows below). The gain functions were optimized jointly with the DNN to maximize word recognition. Task performance was measured as word recognition accuracy as a function of the target-to-distractor ratio (SNR) and the spatial proximity of target and distractor.
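
As a minimal sketch of the gain mechanism described above (in PyTorch; the module name, parameter names, and the per-channel logistic parameterization are illustrative assumptions, not the published implementation):

    import torch
    import torch.nn as nn

    class LogisticAttentionGain(nn.Module):
        """Hypothetical sketch: learnable logistic functions map the
        time-averaged representation of the cued talker to per-channel
        multiplicative gains applied to the mixture representation."""

        def __init__(self, n_channels: int):
            super().__init__()
            # Learnable slope, midpoint, and ceiling of the logistic gain,
            # one per feature channel (parameterization assumed).
            self.slope = nn.Parameter(torch.ones(n_channels))
            self.midpoint = nn.Parameter(torch.zeros(n_channels))
            self.max_gain = nn.Parameter(torch.full((n_channels,), 2.0))

        def forward(self, mixture: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
            # mixture, cue: (batch, time, channels)
            cue_avg = cue.mean(dim=1, keepdim=True)  # time-averaged cue representation
            gain = self.max_gain * torch.sigmoid(self.slope * (cue_avg - self.midpoint))
            return mixture * gain  # scale mixture activations up or down

In a setup like this, the gain parameters would be optimized jointly with the DNN weights against the word recognition objective, as described above.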

Results: The model learned to correctly report the words of the cued talker and to ignore the distractor talker(s). Like humans, the model benefited from both spatial separation and differences in voice timbre between target and distractor, and was more accurate with single-talker distractors than with multi-talker distractors. Analysis of the model’s internal representations revealed that attentional selection occurred only at later model stages (one possible analysis is sketched below).
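
One way such stage-wise selection could be quantified (a hypothetical analysis sketch, assuming per-stage activations are available for the attended mixture and for each talker presented in isolation; the function name and the cosine-similarity measure are our assumptions):

    import torch
    import torch.nn.functional as F

    def stagewise_selection_index(mix_acts, target_acts, distractor_acts):
        """For each model stage, compare attended-mixture activations to
        target-alone vs. distractor-alone activations; positive values
        indicate the representation tracks the cued talker.
        Inputs: dicts mapping stage name -> (batch, ...) tensors."""
        index = {}
        for stage, m in mix_acts.items():
            m = m.flatten(1)
            t = target_acts[stage].flatten(1)
            d = distractor_acts[stage].flatten(1)
            sim_t = F.cosine_similarity(m, t, dim=1).mean()
            sim_d = F.cosine_similarity(m, d, dim=1).mean()
            index[stage] = (sim_t - sim_d).item()
        return index

Under a measure like this, selection emerging only at later stages would appear as an index near zero early in the model and increasingly positive at later stages.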

Conclusions: We introduce a framework for quantitatively modeling feature-based auditory attention using machine learning. The model yields hypotheses for how attention might be expected to modulate neural responses at different stages of the auditory system, and can help identify the conditions in which attentional selection is intrinsically difficult.