Machine Access to Audiology Content on English Wikipedia

Authors: Hector Gabriel Corrale de Matos1, Priscila Carvalho Cruz1, Thais Catalani Morata2, Kátia de Freitas Alvarenga1, Lilian Cássia Bornia Jacob1

1Department of Speech-Language Pathology and Audiology, University of São Paulo, Bauru Campus
2National Institute for Occupational Safety and Health of the Centers for Disease Control and Prevention

Background: Wikipedia is an open encyclopedia available in 326 languages. Its English edition has more than 6.8M articles and received 11.5B page views in March 2024. Wikipedia is used to train Large Language Models, as a source for Retrieval-Augmented Generation, and as a reference for search engines. Educational activities and global awareness campaigns to improve hearing health content are steadily adding evidence-based information to Wikipedia. Our aim was to map the availability and accessibility of hearing health information on Wikipedia, which is being used for machine-generated content.

Methods: We used the open-access tools PetScan and Pageviews to gather data on article categories of the English Wikipedia. The analyzed categories were Sound, Hearing, Audiology, Acoustics, Otology, Ear, and Noise. PetScan listed the articles in each main category and its subcategories at three levels of content relationship depth: directly related (D0), closely related (D1), and more distantly related (D2). Pageviews supplied page view counts for Users (views by humans) and Machines (search engines, retrieval models, and other computer-based programs). Page view data covered 2023, and the PetScan listings were collected in 2024. The collected data were tabulated and analyzed descriptively.
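As an illustration of how per-article page views can be split by agent type, the sketch below builds requests against the public Wikimedia Pageviews REST API and sums the returned monthly counts. It is not the procedure used in this study, which relied on the Pageviews web tool. The helper names, the example article title, and the mapping of "Machines" to the API agent types `spider` and `automated` are assumptions made for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Public Wikimedia REST endpoint for per-article page views.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

# Assumed mapping: "Users" -> agent "user"; "Machines" -> "spider" + "automated".
MACHINE_AGENTS = ("spider", "automated")


def pageviews_url(article, agent, start="20230101", end="20231231",
                  project="en.wikipedia", access="all-access",
                  granularity="monthly"):
    """Build the request URL for one article and one agent type."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


def total_views(payload):
    """Sum the 'views' field over all items of a decoded API response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_views(article, agent):
    """Fetch and total the 2023 views for one article (requires network access)."""
    # Hypothetical User-Agent string; replace with real contact details.
    request = Request(pageviews_url(article, agent),
                      headers={"User-Agent": "audiology-pageviews-sketch/0.1"})
    with urlopen(request) as response:
        return total_views(json.load(response))
```

For example, `fetch_views("Hearing loss", "user")` would return the human views of one article, and summing `fetch_views` over `MACHINE_AGENTS` would give a machine total under the assumed mapping.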

Results: The English Wikipedia contains 932 (D0), 2,341 (D1), and 6,988 (D2) articles pertaining to the selected categories. Total visits to these categories reached 52.6M, with Users accounting for 37.7M and Machines for 14.8M. The categories most accessed by Machines were Sound (5.3M), Acoustics (4M), and Audiology (1.6M). These were also the categories most accessed by Users. Categories less accessed by Machines included Noise (482K) and Hearing (997K).
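To put these totals in proportion, the short sketch below derives the User and Machine shares of all visits, and each reported category's share of Machine visits, from the figures given above. The inputs are the rounded values reported here, so the resulting percentages are approximate. The variable and function names are illustrative.

```python
# Reported totals, in millions of page views (rounded values from the Results).
TOTAL_VISITS = 52.6
USER_VISITS = 37.7
MACHINE_VISITS = 14.8

# Machine page views per category, in millions, for the categories reported.
MACHINE_BY_CATEGORY = {
    "Sound": 5.3,
    "Acoustics": 4.0,
    "Audiology": 1.6,
    "Hearing": 0.997,
    "Noise": 0.482,
}


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


user_share = share(USER_VISITS, TOTAL_VISITS)        # share of all visits
machine_share = share(MACHINE_VISITS, TOTAL_VISITS)  # share of all visits

# Each category's share of the Machine total.
machine_category_share = {
    name: share(views, MACHINE_VISITS)
    for name, views in MACHINE_BY_CATEGORY.items()
}

print(f"Users: {user_share}%  Machines: {machine_share}%")
for name, pct in machine_category_share.items():
    print(f"{name}: {pct}% of Machine visits")
```

Under these rounded inputs, Machines account for roughly 28% of all visits to the selected categories, and Sound alone accounts for about a third of Machine visits.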

Conclusions: The access metrics and the number of articles accessed by both humans and machines justify activities that promote the curation of hearing health information on Wikipedia.