Computational modeling of attentive voice tracking

Authors: Joanna Luberadzka1, Hendrik Kayser1, Volker Hohmann1
1Auditory Signal Processing and Cluster of Excellence “Hearing4all”, Department of Medical Physics and Acoustics, University of Oldenburg, Germany

Background: Humans are able to follow a chosen speaker even in challenging acoustic environments. The perceptual mechanisms underlying this ability remain unclear. The research field known as ‘computational auditory scene analysis’ (CASA) investigates this topic by developing computer models that mimic the abilities of the human auditory system. In this study, we contribute to CASA research by presenting a model of attentive voice tracking.

Methods: In our model, the auditory scene is organized into two attention-dependent streams: an attended foreground, corresponding to the target voice, and an unattended background, comprising the remaining clutter. The acoustic mixture is processed by four main computational blocks, ordered along a scale from bottom-up to top-down processing: glimpsed feature extraction, foreground-background segregation, state estimation, and top-down knowledge. Algorithmically, the model combines salient periodicity-based feature extraction, sequential Monte Carlo sampling, and statistical models of voice properties.
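
As a rough illustration of how sequential Monte Carlo sampling can serve as a state-estimation stage, the sketch below runs a bootstrap particle filter that tracks a single time-varying F0 from noisy observations. It is not the implementation used in this study: the function name track_f0, the random-walk voice dynamics, the Gaussian observation likelihood, the uniform prior over 80 to 400 Hz, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def track_f0(observations, n_particles=500, step_std=3.0, obs_std=10.0,
             f0_range=(80.0, 400.0), seed=0):
    """Bootstrap particle filter over a single F0 state in Hz (illustrative)."""
    rng = np.random.default_rng(seed)
    low, high = f0_range
    # Initialise particles uniformly over a plausible F0 range (assumed prior).
    particles = rng.uniform(low, high, n_particles)
    estimates = []
    for obs in observations:
        # Predict: random-walk state transition (assumed voice dynamics).
        particles = particles + rng.normal(0.0, step_std, n_particles)
        particles = np.clip(particles, low, high)
        # Update: Gaussian likelihood of the observation (assumed feature model).
        log_w = -0.5 * ((obs - particles) / obs_std) ** 2
        weights = np.exp(log_w - log_w.max())  # subtract max for stability
        weights /= weights.sum()
        # Estimate: posterior mean of the weighted particle set.
        estimates.append(float(np.sum(weights * particles)))
        # Resample: multinomial resampling proportional to the weights.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return np.array(estimates)


# Usage: a synthetic, slowly varying F0 contour observed in Gaussian noise.
rng = np.random.default_rng(1)
true_f0 = 150.0 + 30.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 200))
observed = true_f0 + rng.normal(0.0, 10.0, true_f0.size)
estimated = track_f0(observed)
rmse_filter = float(np.sqrt(np.mean((estimated - true_f0) ** 2)))
rmse_raw = float(np.sqrt(np.mean((observed - true_f0) ** 2)))
```

In this toy setting the prior over state transitions plays the role of top-down knowledge about how a voice parameter evolves, while the likelihood stands in for the evidence delivered by the extracted features. Extending the state to several dimensions (for example F0, F1 and F2) would only change the shape of the particle array and the likelihood term.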

Results: We evaluated the model by comparing it with data obtained in a psychoacoustic experiment that measured the ability to track one of two competing voices with time-varying parameters (fundamental frequency (F0) and formants (F1, F2)). We tested three model versions, which differed in the segregation stage: version 1 segregates foreground and background based on oracle F0, version 2 uses estimated F0, and version 3 uses estimated F0 together with oracle F1 and F2. Version 1 outperformed human listeners, version 2 was not sufficient to explain human performance, and version 3 came closest to explaining it.

Conclusions: The results of version 1 show that optimally segregated salient periodicity-based features convey more information than is needed to explain human performance; hence, they are suitable for modeling attentive voice tracking. The results of versions 2 and 3 show that several parameter dimensions must be considered to perform this task successfully, supporting the idea that the auditory system uses a combination of features to track a chosen voice through acoustic space.