Cortical speech tracking with different lip synchronization algorithms in virtual environments

Authors: Juergen Otten * 1 ; Volker Hohmann 1 ; Giso Grimm 1 ; Mareike Daeglau ; Stefan Debener 1 ; Bojana Mirkovic 1

Affiliations:
1 Carl von Ossietzky University Oldenburg

Background: The comprehension of speech in challenging listening scenarios can be aided by facial expressions and lip movements. In this study, we utilized virtual environments (VEs) and mobile electroencephalography (EEG) to investigate the cortical tracking of ongoing speech. Specifically, we aimed to compare cortical tracking between videos of real speakers and their virtual avatars animated with two different lip-synchronization algorithms. We predicted that real speakers would result in better cortical speech tracking, particularly in the presence of background noise, and that visible lip movements would provide an additional benefit.

Methods: Eighteen participants were presented with audio-visual scenes comprising one of six speakers at a time, telling unscripted stories. The videos showed either real speakers or their virtual avatars with visible lip movements. The avatars were animated with one of two lip-synchronization algorithms: the first was rule-based, controlling three blendshapes according to the signal energy in different frequency bins; the second was an image-based network that used video frames as input to generate seven blendshapes as output. To manipulate listening difficulty, we presented speech with and without babble noise. Conditions changed every 30 seconds in a pseudo-randomized order while the story naturally unfolded. Concurrent mobile EEG recordings were used to measure cortical tracking of speech.
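The rule-based approach described above can be sketched roughly as follows: the short-time spectrum of each audio frame is split into coarse frequency bands, and the band energies drive three mouth blendshape weights. This is a minimal illustration only; the band boundaries, window length, and energy-to-blendshape mapping are assumptions, as the abstract does not specify them.

```python
import numpy as np

def rule_based_blendshapes(frame, sample_rate=16000):
    """Map the short-time spectrum of one audio frame to three
    blendshape weights in [0, 1] (e.g. jaw-open, lip-wide, lip-round).

    The three frequency bands below are hypothetical; the actual
    algorithm's bin ranges and mapping rules are not given in the text.
    """
    # Windowed magnitude spectrum of the frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Sum spectral energy in three coarse bands (low / mid / high).
    bands = [(0, 500), (500, 2000), (2000, 8000)]
    energies = np.array(
        [spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
    )

    total = energies.sum()
    if total < 1e-8:            # silence -> mouth closed
        return np.zeros(3)
    # Normalize so the weights can drive blendshapes directly.
    return np.clip(energies / total, 0.0, 1.0)
```

In a real-time pipeline, a function like this would run on consecutive audio frames, with the resulting weights smoothed over time before being applied to the avatar mesh.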

Results: Preliminary results showed that cortical speech tracking was higher for audio-visual scenes showing real speakers than for those showing virtual avatars. No significant difference was found between the two lip-synchronization algorithms.

Conclusion: Our study confirms that cortical speech tracking measures are useful for the development and validation of realistic VE communication scenarios. Natural lip movements aid cortical speech tracking, particularly in challenging listening environments. Lip movements of virtual avatars may not yet provide the same benefit as those of real speakers.