Predictive data models are only as useful as the data they incorporate.
Epidemiologists, journalists, citizen data scientists and anyone else working with COVID-19 data sets realize this truism. While artificial intelligence using deep learning is exciting, overcoming data integration and data sharing obstacles appears to be the best way to maximize most COVID-19-related analytics efforts. Indeed, “data quantity, quality and availability” rivals “business and process challenges” as the greatest constraint on AI usage, according to the more than 1,000 executives surveyed for MIT Technology Review Insights’ The Global AI Agenda.
Integrating unstructured data is the top data challenge encountered when developing AI, cited by 57% of the survey’s respondents, with interfacing with open-data platforms mentioned by 53%. Only 10% think there is not enough data; instead, the problem is what to do with the data that is already available.
Readers of The New Stack know there are countless technical challenges and business risks associated with sharing data. Despite these obstacles, two-thirds of surveyed executives say their firm is at least somewhat willing to share internal data with third parties for the purpose of building new value chains, products or services. Speed and visibility across supply chains and improved product development are seen as the top benefits of sharing data with peers.
Of course, not all industries are as keen on the idea as others, with government and financial services least likely to be open to data sharing. Plus, from a consumer perspective, there is deep concern about how data used for COVID-19 initiatives will impact privacy and human rights.
Progress towards making data more open and accessible is a long-term trend in government, academia and even the private sector. Yet, despite the adage “sharing is caring,” challenges abound for both top-down and bottom-up efforts to make data more open and accessible. If you want to learn more about data collaborative responses to COVID-19, check out a 70-page repository maintained by the Data Stewards Network.
In addition to the importance of data sharing, our review of news coverage and “The Coronavirus Tech Handbook” led us to come up with a few additional takeaways about data models and COVID-19:
- R, not Python, is the preferred tool for COVID-19-related data analysis. The R programming language is particularly strong among people involved with life sciences and has a wider variety of model types to choose from, according to DataCamp. The data science training company notes that a majority of deep learning is done in Python and that leading tools like Keras and PyTorch have “Python-first” development. The R Epidemics Consortium (RECON) is a good resource to review for more information.
- Deep learning and high-performance computing have their place. Models are being trained on data sets of x-rays and other medical images. High-performance computing can speed up drug testing and genome sequencing, so free computing resources from the COVID-19 HPC Consortium can advance some initiatives. Six projects have already been approved, most of them related to testing drugs.
- Skepticism abounds about the value of many initiatives. As explained by Shay Palachy in InfoQ, companies that specialize in predicting disease outbreaks are not demonstrably better than human experts. Furthermore, while BigQuery is at the center of much citizen scientist activity, at least one developer advocate at Google thinks “amateur COVID-19 predictions are worthless.”
- This article is just the tip of the iceberg. Here are two more organizations worthy of your attention:
- Covid Act Now: A distributed team of volunteers working with some of the nation’s preeminent epidemiologists and public health experts to develop the U.S. Intervention Model, a data platform that projects COVID-19 infections, hospitalizations and deaths across the United States and models how public health interventions contain the spread of the virus.
- COVID-19 Hospital Impact Model for Epidemics (CHIME): A tool for projecting the impact of the epidemic on hospitals. Its estimates are generated using a SIR (Susceptible, Infected, Recovered) model, a standard epidemiological modeling technique.
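To give a sense of the technique behind CHIME, here is a minimal sketch of a discrete-time SIR model in Python. The parameter values (`beta`, `gamma`, population size) are illustrative assumptions for this example, not CHIME’s actual defaults, and CHIME layers hospital-specific projections on top of this basic compartmental core.

```python
# Minimal SIR (Susceptible, Infected, Recovered) model using a simple
# discrete Euler update. Parameters below are illustrative assumptions.

def sir_step(s, i, r, beta, gamma, n):
    """Advance the SIR model one time step.

    beta  - transmission rate (new infections per S-I contact)
    gamma - recovery rate (fraction of infected who recover per step)
    n     - total population, held constant
    """
    new_infections = beta * s * i / n
    new_recoveries = gamma * i
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)

def simulate(s, i, r, beta, gamma, days):
    """Run the model for `days` steps and return the full trajectory."""
    n = s + i + r
    history = [(s, i, r)]
    for _ in range(days):
        s, i, r = sir_step(s, i, r, beta, gamma, n)
        history.append((s, i, r))
    return history

# Hypothetical scenario: 1,000 infected in a population of one million,
# basic reproduction number R0 = beta / gamma = 2.5.
history = simulate(s=999_000, i=1_000, r=0, beta=0.25, gamma=0.1, days=120)
peak_infected = max(i for _, i, _ in history)
```

Because R0 > 1 in this scenario, the infected compartment grows before burning out; the total population stays constant across compartments, which is a useful sanity check on any SIR implementation.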