
The Challenges to Building a Predictive COVID-19 Model

4 May 2020 12:00pm, by

If we took a shot every time we heard a model referenced in the news, well, the managerial myths about decreased productivity when working from home would finally be proven true. As it is, most of the modeling around the COVID-19 pandemic, with grim predictions ranging from where the next hotspot will be to when infection and death rates will plateau, is mostly a mystery. The predictions could be coming from deep learning models or artificial intelligence, or they could just be coming from Excel spreadsheets.

And they are all so different. Predictions can range from 200,000 to more than two million deaths in the U.S. alone. There are a lot of multi-table variables intersecting and conflicting. There’s no simple way to identify all these unknowns.

Even if we knew the prediction modeling that went into them, there’s a huge problem of explainability. So much that goes into machine learning and AI is locked away in a black box. It’s what keeps machine learning hidden in innovation labs, siloed from the rest of the enterprise. It’s not that AI isn’t delivering benefits; it’s just that it’s usually too hard to explain to the rest of an organization.

Today we don’t pretend to know all the answers, but we give a basic primer on the most likely prediction models that are being used. And we try to break apart that black box to hopefully share some understanding in a time of so much uncertainty.

COVID-19 Models: Where to Even Begin

Epidemiologists, governments and non-governmental organizations are all trying to use modeling to figure out how to respond to and treat the pandemic. And then, as polling analyst site FiveThirtyEight notes, we lay folk are grasping at models to answer two questions:

  • How bad will this really get?
  • Seriously, how long am I going to have to live cooped up like this?

The problem is that all these models are so different. So are the data sets, methodologies and reporting in each country. Plus there is all the socioeconomic intersectionality thrown in. Let’s not even get started on testing inconsistencies. Nor touch on the ethical debate around contact tracing.

The only things we do know for sure are that COVID-19 is highly infectious and that some people die from it. While factual, those are not exactly quantifiable statements.

For COVID-19, so many of the predictions depend on human behavior. A video by the mathematics education channel 3blue1brown offers a series of mathematical models illustrating the main variables that individuals can affect.

Of course, there are many more systemic, cultural and political variables to factor in. However, the video is a good example of explaining the effects of complete social distancing versus mostly social distancing (like still going to a market or school once a week) and of widespread testing. It’s mostly based on manim, an open source math animation engine. And it’s a good way to start diving into the modeling of COVID-19.

What Could the COVID-19 Models Be?

While there’s no way to know yet which mathematical models are being applied to this pandemic, it is possible to make an educated guess at the most likely kinds.

“There’s all sorts of different models in play,” said Johnny Kelsey, deep learning engineer for Hazy, a provider of synthetic data.

Kelsey has a doctorate from University College London in mathematics, and has researched and worked heavily with nature-inspired computation and immune-system related organisms.

He predicts that these are unlikely to be deep learning models because of the explainability issue. There can be millions of model parameters configured into a single model, many of which don’t come from the data scientist, but from the learned model itself. It makes it impossible to explain to a layperson like a policymaker why these models are outputting those figures.

Kelsey posits that these are most likely epidemiological models, which he says vary from the fairly simple to “very, very complicated infectious disease models.”

Deterministic and SIR Models

The most common model is probably SIR, which is short for susceptible, infectious, recovered (or deceased). Each of these three compartments represents a number of people in a given area at a given time. At the beginning of a pandemic, the susceptible count starts high but falls rapidly as people move into the other two compartments. Kelsey calls SIR the original mathematical infection model and says that it is actually nonlinear, which usually makes a model harder to solve, but that the equations are “quite simple.”

Kelsey says the popular SIR “takes the point of view that once you’ve actually had this illness you won’t have it again. There’s no reason to think that,” positing that this coronavirus could reemerge in different forms, as the flu and the common cold do.
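To make the three compartments concrete, here is a minimal Python sketch of a textbook SIR model, stepped forward with simple Euler integration. It is not any particular epidemiologist’s model. The function name simulate_sir, the parameters beta (transmission rate) and gamma (recovery rate), and every number used are illustrative assumptions, not estimates for COVID-19. The optional waning parameter is a hypothetical add-on that lets recovered people become susceptible again, which is the possibility Kelsey raises.

```python
def simulate_sir(population, initial_infected, beta, gamma, days,
                 steps_per_day=10, waning=0.0):
    """Integrate the textbook SIR equations with simple Euler steps.

    dS/dt = -beta * S * I / N + waning * R
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I - waning * R

    With waning=0.0 (the default) this is plain SIR: immunity is permanent.
    Returns a list of (S, I, R) tuples, one per day, starting with day 0.
    """
    n = float(population)
    s, i, r = n - initial_infected, float(initial_infected), 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            lost_immunity = waning * r * dt
            s += lost_immunity - new_infections
            i += new_infections - new_recoveries
            r += new_recoveries - lost_immunity
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Illustrative values only, not fitted to any real outbreak.
    history = simulate_sir(1_000_000, 10, beta=0.3, gamma=0.1, days=200)
    peak_day = max(range(len(history)), key=lambda d: history[d][1])
    print("infections peak on day", peak_day)
    print("still susceptible at the end:", round(history[-1][0]))
```

The term beta * S * I / N multiplies two of the unknowns together, which is what makes the model nonlinear even though each equation is short.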

Stochastic Models

Like the SIR model, stochastic models are based on systems of equations. However, they use random variables and output probabilities.

Kelsey explained, “We look at the probability of an infectious disease spreading out over a population, not incidence rates.”

Deterministic models are used to make predictions about very large populations, such as whole countries, while stochastic models are likely involved when people reference things like the likelihood that a particular population or demographic will become infected.
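As a rough illustration of the difference, the sketch below runs a small stochastic version of SIR many times and reports the fraction of runs that turn into a large outbreak. It is a generic daily “chain-binomial” simulation written for this article, not a model Kelsey described. The names run_outbreak and outbreak_probability, the population of 200 and the rates are all assumptions chosen to keep the example fast.

```python
import random


def binomial(rng, trials, p):
    """Number of successes in `trials` independent tries (fine for small counts)."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def run_outbreak(rng, population=200, initial_infected=1,
                 beta=0.3, gamma=0.1, max_days=365):
    """One random run of a daily chain-binomial SIR model.

    Returns how many people were ever infected, including the initial cases.
    """
    s, i = population - initial_infected, initial_infected
    for _ in range(max_days):
        if i == 0:
            break
        # Chance that a given susceptible person is infected today.
        p_infect = 1.0 - (1.0 - beta / population) ** i
        new_infections = binomial(rng, s, p_infect)
        # Each infectious person recovers today with probability gamma.
        new_recoveries = binomial(rng, i, gamma)
        s -= new_infections
        i += new_infections - new_recoveries
    return population - s


def outbreak_probability(runs=200, threshold=0.1, seed=0, **kwargs):
    """Fraction of runs in which more than `threshold` of the population is infected."""
    rng = random.Random(seed)
    population = kwargs.get("population", 200)
    big = sum(1 for _ in range(runs)
              if run_outbreak(rng, **kwargs) > threshold * population)
    return big / runs


if __name__ == "__main__":
    # Illustrative parameters only.
    print("estimated chance of a large outbreak:", outbreak_probability())
```

The same starting conditions can fizzle out in one run and take off in another, so the output is a probability rather than a single curve.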

Time Series Analysis

Time series analysis is certainly being applied to track pandemics in real time.

He explained: “You have a data point such as yesterday’s rate of registered infections, and today’s. If you put all these together — yesterday and the day before that and the day before that,” a prediction can be made for the future, while the actual outcome won’t be verified yet.
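One common way to “put all these together” is to slice the daily series into windows of past values, each paired with the value that followed. The Python sketch below shows that step along with a deliberately naive baseline forecast. The function names and the case counts are made up for illustration, and a real forecaster would put a proper model where the simple average is.

```python
def make_windows(series, n_lags):
    """Turn a daily series into (previous n_lags values, next value) pairs."""
    inputs, targets = [], []
    for end in range(n_lags, len(series)):
        inputs.append(series[end - n_lags:end])
        targets.append(series[end])
    return inputs, targets


def naive_forecast(series, n_lags):
    """Baseline only: predict tomorrow as the mean of the last n_lags days."""
    recent = series[-n_lags:]
    return sum(recent) / len(recent)


if __name__ == "__main__":
    # Made-up daily counts of registered infections, for illustration.
    daily_cases = [12, 15, 21, 30, 41, 55, 70]
    inputs, targets = make_windows(daily_cases, n_lags=3)
    for window, target in zip(inputs, targets):
        print(window, "->", target)
    print("naive forecast for tomorrow:", naive_forecast(daily_cases, n_lags=3))
```

Each pair keeps the days in their original order, which is the property that separates time series data from an unordered table of observations.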

Time series data are a form of sequential data: a list of data points that occur in a certain sequence, one after another. Another example of sequential data is language, spoken or written. An archetypal application of deep learning is natural language processing, where it has been wildly successful at tasks such as translating a text from one language into another. A deep neural network takes a sequence in, say, English as input and learns to output a sentence in, say, French.

“And that output has to make sense. It takes an enormous amount of training. If you abstract that at a high enough layer it effectively looks at sequences and outputs sequences,” Kelsey said.

As the name time series suggests, these models only make sense in the correct time order and not in reverse. Deep learning networks have been applied to time series problems such as stock market predictions and economic forecasting, with increasingly good results.

Kelsey calls these models “very good,” but says they suffer from a lack of explainability or transparency: how do we get that particular value?

“Trying to justify your model can be difficult except if you’re trying to do it in an empirical way — ‘A prediction has been right and it’s been right for the last three years, therefore, it must be right’,” he explained.

However, with COVID-19, he says we simply don’t have the masses of data yet to train these sequential deep learning models.

“For deep learning and neural networks, the more data you give them the better they tend to perform,” Kelsey said.

That’s why language translation is improving rapidly: you can export all of French and English Wikipedia to train a pretty good translation model.

Autoregression, Moving Average and ARMA

ARMA is the amalgamation of autoregression and moving average models. It combines the values of a time series from today, yesterday and the day before with a moving average of recent random fluctuations. Autoregression says that the process that you’re trying to model is likely to be similar to and influenced by its own previous values, so today’s stock price is going to be similar to yesterday’s stock price and yesterday’s is going to be similar to the day before that. The moving average part accounts for the lingering effect of recent random shocks, meaning the gaps between what the model expected and what actually happened.
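In its usual textbook form, the simplest such model, ARMA(1,1), says that today’s value equals phi times yesterday’s value, plus today’s random shock, plus theta times yesterday’s shock. The Python sketch below simulates that process and then recovers the autoregressive coefficient by least squares. The names simulate_arma11 and fit_ar1 and all parameter values are assumptions for illustration. Fitting the moving average part as well normally calls for a statistics library and is left out here.

```python
import random


def simulate_arma11(n, phi, theta, seed=0, noise_sd=1.0):
    """Simulate x_t = phi * x_{t-1} + e_t + theta * e_{t-1} with Gaussian shocks e_t."""
    rng = random.Random(seed)
    xs, prev_x, prev_e = [], 0.0, 0.0
    for _ in range(n):
        e = rng.gauss(0.0, noise_sd)
        x = phi * prev_x + e + theta * prev_e
        xs.append(x)
        prev_x, prev_e = x, e
    return xs


def fit_ar1(xs):
    """Least-squares estimate of phi in x_t ~ phi * x_{t-1} (no intercept)."""
    num = sum(today * yesterday for today, yesterday in zip(xs[1:], xs[:-1]))
    den = sum(yesterday * yesterday for yesterday in xs[:-1])
    return num / den


def forecast_next(xs, phi):
    """One-step-ahead forecast using only the autoregressive part."""
    return phi * xs[-1]


if __name__ == "__main__":
    # theta=0.0 reduces the process to pure autoregression, AR(1).
    series = simulate_arma11(5000, phi=0.8, theta=0.0, seed=1)
    phi_hat = fit_ar1(series)
    print("estimated phi:", round(phi_hat, 3))
    print("forecast for the next point:", round(forecast_next(series, phi_hat), 3))
```

With only one or two coefficients to report, a model like this is easy to explain, which is the trade-off against deep learning that the article describes.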

Of course, much of what we are going through has no modern data history.

ARMA models are much simpler and have fewer parameters to train, which makes them potentially less accurate than machine learning or deep learning models.

On the other hand, they are usually more transparent and easier to explain, which matters in the 24-hour news cycle.

In the end, Kelsey predicts most epidemiologists are using mathematical models or time series models. If they are more statistically minded, he says, they’ll be using econometric time series models. And if they are more biology-minded, they’ll be using infection rate models like SIR.

Without Enough Data, No Model Will Work

Kelsey surmises that we probably don’t have enough data points to make a deep learning model yet, except maybe in South Korea, which has been more open with its data and approach.

“At the moment we only have very, very approximate figures in the UK. Mainly because we aren’t testing enough people. Mortality rate is suspect — only people that die in hospital are being registered as victims of COVID-19,” Kelsey said.

If the UK were testing at least 100,000 “random” British people a day, he believes the models would be more accurate. He goes on to call the UK’s data itself “probably dodgy” because the sampling rate is too low. Fewer than 400,000 people had been tested in the UK at the time of writing.

“We know that COVID-19 is very infectious with a latency period for symptoms — a week or longer. If you aren’t testing, or the tests aren’t very accurate, or both, then we actually have no idea who is actually infecting whom,” Kelsey said.

“If your sampling rate is too low it’s very difficult to draw any conclusions about the data. Your basis is too small to draw any great precise conclusions from it,” he continued.
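A back-of-the-envelope way to see why sample size matters is the standard margin of error for a proportion estimated from a simple random sample. The Python sketch below uses the normal approximation with a 95% confidence level. The 5% infection rate is a made-up figure for illustration, and the formula assumes truly random sampling, which is the condition Kelsey stresses and not what happens when only people with symptoms are tested.

```python
import math


def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


if __name__ == "__main__":
    assumed_rate = 0.05  # hypothetical share of people infected, illustration only
    for n in (1_000, 10_000, 100_000):
        moe = margin_of_error(assumed_rate, n)
        print(f"n={n:>7}: {assumed_rate:.1%} plus or minus {moe:.2%}")
```

Because the margin shrinks with the square root of the sample size, testing 100 times as many people narrows the uncertainty by a factor of 10.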

Disclosure: The author of this post does consulting work for Hazy.

Feature image: 3blue1brown
