Seattle, May 14
Author: Elle O’Brien, Data Scientist at DVC (Twitter)
Recently, there has been considerable excitement about the role of big data and machine learning (ML) in the future of audiology, both as a clinical practice and as an area of research. With the rise of new digital tools for collecting data on scales never before seen in our field, coupled with new modeling techniques from deep learning, the enthusiasm is certainly warranted.
If the trajectory of ML in industry is any indication, though, we may find ourselves in a peculiar situation within a few years: using ML to model data becomes trivially easy, while managing a lab that uses big data and ML becomes profoundly hard. This could happen because adding ML and big data to a project typically brings a large increase in project complexity, and building the infrastructure to support this new breed of scientific experiments will not be trivial.
What do I mean by infrastructure? Very generally:
- Hardware: every experiment has some hardware requirements. In the simplest case, all that is required is a modern laptop. But there are frequent exceptions:
- A deep learning experiment requires specialized computing equipment (GPUs) to train and evaluate a model
- A neuroimaging dataset requires several terabytes of disk space for storage
- A new signal processing algorithm for cochlear implants may need to be studied on the device
- Software: the programs that must be installed on a researcher’s computer (a minimal sketch of one way to record these dependencies follows this list). Some common software hurdles include:
- An active license is required to run a MATLAB script
- Several libraries must be installed first to run an R script
- A script is written so it will only run on a Windows machine
- Protocols: a lab’s cultural practices, including:
- How labs name, organize, and store datasets, including how they meet patient privacy regulations
- Whether ongoing projects are tracked with a version control system
- How digital materials are shared between teammates
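To make the software hurdle a little more concrete, here is a deliberately minimal sketch of one protocol a lab could adopt: have every analysis record the exact library versions it was run with, so a collaborator can rebuild the same environment later. The example assumes a Python-based analysis, and the package names and output file name are placeholders rather than a prescription; labs working in R or MATLAB would use the analogous mechanisms in those ecosystems.

```python
# Minimal sketch: write the Python version and the exact versions of the
# libraries an analysis depends on to a small text file, so the software
# environment can be reconstructed later. The package names below are
# hypothetical placeholders for whatever the analysis actually imports.
import platform
from importlib.metadata import version, PackageNotFoundError

DEPENDENCIES = ["numpy", "scipy", "pandas"]  # placeholder dependency list


def snapshot_environment(outfile="environment.txt"):
    """Record the interpreter and package versions used for this analysis."""
    lines = [f"python=={platform.python_version()}"]
    for pkg in DEPENDENCIES:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            lines.append(f"# {pkg} is not installed")
    with open(outfile, "w") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    snapshot_environment()
```

Committing a file like this alongside the analysis code is a small habit, but it removes a surprising amount of “it runs on my machine” friction when scripts are shared between groups.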
Before big data and ML, organizing a lab’s computing infrastructure in a way that ensures replicability and auditability was already a significant burden for many research groups. A recent report by the U.S. National Academies of Sciences, Engineering, and Medicine laid out the critical role of computing practices in ensuring the replicability of scientific work; many of their guidelines are still aspirational and will take time to become embedded in the culture of academic science.
If we are to encourage large-scale data collection and modeling, we can expect these challenges to become even more apparent: a 2019 case study of research and development teams using ML across Microsoft found that data engineering (“discovering, managing, and versioning datasets”) was consistently rated as the most difficult part of the ML workflow (see Figure 1 of their manuscript, below, for an overview of the typical ML workflow from start to finish). The study also reported serious challenges in sharing and reusing ML models across teams, and in keeping track of the complex dependencies between models, code, and data within teams.
That these problems exist in an organization like Microsoft, with a seemingly limitless budget for engineering, is a sure sign that we’ll also have to contend with them. Several likely scenarios come to mind:
- A team designs an app for monitoring hearing longitudinally. Anonymized user data are aggregated into a large dataset that changes every day. The dataset must be monitored for quality control, and because it is dynamic, any analysis of this data will depend heavily on the exact version of the dataset used.
- A research group creates a deep learning model for denoising speech in background noise. Another research group wants to use transfer learning to re-train the model on background noise from a particular environment. They will have to get access to the model, as well as the software and hardware required to re-train it.
- A translational research center builds a machine learning model to predict which patients are at risk of cognitive decline based on their hearing status. To receive clearance for clinical use, they need to provide regulatory agencies with a complete and transparent record of the dataset, the software used to build and train the model, and the trained model itself for further evaluation.
There are some encouraging signs that researchers are meeting these challenges. Consensus is growing around “best practices” for releasing ML research papers: that is, publishing them with the code, data, and even computing resources needed to replicate an experiment. Computational researchers have created online ecosystems for sharing trained models, along with detailed specifications of those models, to facilitate further work (see, for example, PyTorch Hub and NASA’s Community Coordinated Modeling Center). Similarly, data catalogs are growing in popularity (e.g., the National Cancer Institute Data Catalog and NASA’s Planetary Data System). There are also a growing number of open source tools for managing the complexity of ML workflows and large datasets (I work on one, Data Version Control; others include Quilt Data and MLflow).
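To give a flavor of what dataset versioning looks like from the researcher’s side, the sketch below uses Data Version Control’s Python API to read a specific, named revision of a dataset rather than whatever the file happens to contain today. The repository URL, file path, and revision tag are hypothetical placeholders; this is an illustration of the pattern, not a recommendation of any particular tool.

```python
# Minimal sketch: pin an analysis to an exact version of a dataset tracked
# with DVC. The repository URL, file path, and revision tag are hypothetical.
import dvc.api

DATA_REPO = "https://github.com/example-lab/hearing-app-data"  # hypothetical
DATA_PATH = "data/daily_export.csv"                            # hypothetical
DATA_REV = "v2020-05-01"                                       # hypothetical Git tag

# Read the dataset exactly as it existed at the tagged revision.
csv_text = dvc.api.read(DATA_PATH, repo=DATA_REPO, rev=DATA_REV)

# ...run the analysis on csv_text, and record DATA_REV alongside the results
# so the exact dataset version becomes part of the audit trail.
```

This is precisely the property the first scenario above calls for: when the underlying dataset changes every day, the analysis carries an explicit record of which version it was run against.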
However, rather than proposing an immediate technological fix or a community initiative, it seems prudent to first take stock of the priorities and considerations unique to audiology. In other words, I propose discussing and planning how we will balance strong infrastructure with creative, energetic research directions. There is no shortage of great ideas in the audiology and hearing science communities for scaling up healthcare and research, or for beginning to take advantage of ML methods. How to do this sustainably warrants thoughtful discussion while computational audiology is still young enough to be intentionally shaped. In particular, I think we need to ask two questions:
- What are the minimum standards for data, code, and ML model management that we require to advance as a discipline?
- What are the practical and cultural barriers that prevent research groups from reaching these standards? For example:
- Career incentives. Are researchers disincentivized from investing sufficient time and resources in computing infrastructure?
- Literacy. Is training in these skills unavailable?
- Resources. Do groups lack financial resources to develop their infrastructure?
At the upcoming Computational Audiology meeting, I’ll be co-organizing a workshop about infrastructure, starting with data management and engineering. To create the most useful and relevant discussion, I’d like to ask for some feedback: how do you handle data now, and what do you see as the most substantial barriers to rigorously managing data collection, sharing, and long-term maintenance? Your comments will help shape the workshop. We’ve provided a Google Form where you can share your thoughts.
Thank you for taking the time to read and contribute thoughts. I hope to see you at the workshop in June!
Further reading suggestions:
- The importance of transparency and reproducibility in artificial intelligence research: a preprint on the difficulties of replicating an ML approach that Google published for breast cancer screening
- A practical taxonomy of reproducibility for machine learning research: a review of software and hardware barriers to sharing ML studies, with some practical suggestions for researchers