Authors: Sigrid Polspoel1, Sophia E. Kramer1, Bas Van Dijk2, Cas Smits1
1Amsterdam UMC, Vrije Universiteit Amsterdam, Otolaryngology – Head and Neck Surgery, Ear & Hearing, Amsterdam Public Health research institute, De Boelelaan 1117, Amsterdam, Netherlands
2Cochlear, Advanced Innovation – Algorithms and Application – Cochlear Technology Centre, Schaliënhoevedreef 20i, 2800 Mechelen, Belgium
Background The digits-in-noise (DIN) test is a successful hearing test that is used as a screening instrument, as a diagnostic tool in clinics, and as a self-administered home test for cochlear implant (CI) users. A current limitation of the test is that, since the speech stimuli are language specific, it needs to be developed separately for each language. This makes test development time-consuming and expensive, and leaves room for improvement. Another limitation is that the DIN test is not customized for CI users, yielding less accurate test results in this group. In this project, these issues will be tackled by applying artificial intelligence techniques to automate the entire development procedure.
Goal The aim of the Automatic LAnguage-independent Development of the Digits-In-Noise test (Aladdin) project is to create a test development procedure for the automatic generation of digits-in-noise tests. This procedure will employ text-to-speech (TTS) and automatic speech recognition (ASR) systems to design DIN tests in various languages and for different target populations such as CI users. As all new DIN tests will share the same development procedure, test results will become more comparable across languages than is currently the case. Moreover, this project has the potential to make the DIN test affordable for low- and middle-income countries by drastically reducing development costs.
Method Multiple studies will be conducted to assess whether the current development procedure (Smits et al., 2013) can be replaced by an automatic one. First, we will evaluate whether speech produced by a TTS system can replace a human voice in the context of hearing tests. Next, speech recognition functions of the speech items will be obtained as a reference for the ASR system in three target groups: normal-hearing listeners, listeners with hearing loss, and CI users. Finally, ASR systems will be trained to construct speech recognition functions of the synthesized speech material, including stimuli that have been processed by a CI processor. The speech recognition functions of the ASR systems will be compared to those obtained in the study with human listeners. The ultimate result is a system in which the TTS system creates the spoken digits and the ASR system equalizes recognition of the individual digits, resulting in accurate DIN tests in any language (Figure 1). We aim to have the Aladdin project completed by the end of 2023.
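To make the equalization step concrete, the sketch below shows one way per-digit speech recognition functions could be fitted and turned into level corrections. It is a minimal, hypothetical example, assuming a logistic psychometric function and hard-coded proportions correct; the SNR grid, digit scores, and parameterization are illustrative assumptions, not the Aladdin implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr, srt, slope):
    """Speech recognition function: probability correct vs. SNR (dB);
    srt is the SNR at 50% correct, slope the steepness at that point."""
    return 1.0 / (1.0 + np.exp(-4.0 * slope * (snr - srt)))

def fit_recognition_function(snrs, proportions):
    """Fit SRT and slope to measured proportions correct per SNR."""
    (srt, slope), _ = curve_fit(logistic, snrs, proportions,
                                p0=[np.median(snrs), 0.15])
    return srt, slope

# Hypothetical per-digit scores (proportion correct) at fixed SNRs.
snrs = np.array([-16.0, -12.0, -8.0, -4.0, 0.0])
digit_scores = {
    "3": np.array([0.05, 0.20, 0.55, 0.85, 0.98]),
    "7": np.array([0.10, 0.35, 0.70, 0.92, 0.99]),
}

srts = {d: fit_recognition_function(snrs, p)[0] for d, p in digit_scores.items()}
mean_srt = np.mean(list(srts.values()))

# Level corrections (dB): present each digit at a level that shifts its SRT
# to the mean SRT, so all digits are equally intelligible in the final test.
corrections = {d: round(srt - mean_srt, 1) for d, srt in srts.items()}
print(corrections)
```

In the full procedure, the proportions correct would of course come from ASR (or listener) responses to the TTS-generated digits rather than from hard-coded values.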

References
Smits C., Theo Goverts S., Festen J. M., The digits-in-noise test: Assessing auditory speech recognition abilities in noise. J. Acoust. Soc. Am. 133, 1693–1706 (2013).
Dear Sigrid,
Very nice video clip! Did you get help from some charming robots? 😉
I wondered whether your procedure would also work for regional differences within a language, such as Dutch versus Flemish, or the many varieties of Swiss-German dialects. For the subject being tested, it might give the most ecologically relevant results if the test is performed in their native dialect. Do digital speech corpora exist for those dialects?
Another question I had is about prosody. Are current text-to-speech systems able to convey prosody? This is maybe less important for the DIN test, but I guess it becomes an issue if you create synthetic sentence-in-noise tests.
Dear Jan-Willem,
Thank you for your questions. The robots were useless but utterly flattered that you find them charming.
Which dialects/languages we can use in the Aladdin procedure depends on the text-to-speech (TTS) system we select. Some TTS systems offer many variants of one language (e.g. Indian, American, British, Australian, and New Zealand English), while other systems focus on less common languages/dialects (such as Frisian or Flemish). Ideally, we want to select a system that supports as many languages as possible, so that the DIN test can be created in each dialect. Big players like Amazon Polly and Google TTS are continuously working on making more languages and dialects available in their TTS systems.
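Purely as an illustration of checking dialect coverage (not part of the Aladdin procedure itself), one could query a cloud TTS vendor for the voices it offers per language variant. The sketch below assumes Amazon Polly via the boto3 SDK with credentials already configured; the language codes are examples, and coverage differs per vendor.

```python
import boto3

# Ask Amazon Polly which voices it offers for a few language variants.
# Assumes AWS credentials and a region are already configured.
polly = boto3.client("polly")

for language in ("en-GB", "en-IN", "en-AU", "nl-NL"):
    voices = polly.describe_voices(LanguageCode=language)["Voices"]
    print(language, "->", ", ".join(v["Id"] for v in voices) or "no voices")
```

Coverage of smaller dialects (e.g. Flemish or Frisian) would have to be checked per vendor in the same way.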
Regarding your question about prosody: most TTS systems don’t explicitly model prosody, but current systems based on deep neural networks (DNNs) are often trained on expressive datasets such as audiobooks, which frequently contain character voices with considerable variation. In addition, some systems allow the user to alter the prosody manually with a markup language to create very natural-sounding speech. As you said, this is not very relevant for the creation of DIN tests, but it definitely opens up opportunities for creating other speech tests in the future. This is something we’re also looking into.
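As an example of such markup (again only a hedged sketch, not part of our procedure), prosody can be adjusted with SSML; the snippet below assumes Amazon Polly via boto3, and the voice ID, prosody values, and output settings are arbitrary illustrative choices.

```python
import boto3

polly = boto3.client("polly")

# SSML <prosody> slows down and lowers the pitch of a spoken digit;
# the values below are arbitrary and only meant to show the mechanism.
ssml = '<speak><prosody rate="90%" pitch="low">drie</prosody></speak>'

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",
    VoiceId="Lotte",        # a Dutch voice; assumed to be available
    OutputFormat="pcm",
    SampleRate="16000",
)

with open("digit_drie.pcm", "wb") as f:
    f.write(response["AudioStream"].read())
```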
Dear Sigrid,
It sounds to me like no TTS system can be the best in all variants of all languages. Would that be an argument for making the Aladdin procedure TTS vendor-agnostic, so that you select the best TTS system for each specific language?
Dear Jan-Willem,
That’s a valid point. For the time being, we’re looking for one TTS system that can produce high-quality speech in different languages. If we were to select a different system for each language, the procedure would still be rather time-consuming, which is what we want to avoid.
Dear Sigrid,
Very nice video! I wonder if you have seen our paper in Trends in Hearing: https://journals.sagepub.com/doi/full/10.1177/2331216519862982
We used TTS for the German matrix test and compared it to the natural speaker. It worked out very well.
Success for your research,
Inga
Hi Inga,
Thank you very much! Yes, I have already read your paper; it was very interesting and encouraging for our research. Thanks for reaching out.
Best regards,
Sigrid