
Why COVID-19 Contact Tracing Requires a Distributed Database

23 Apr 2020 2:15pm, by

As the COVID-19 global pandemic continues to work its way through the population, a discussion has arisen over the best way of tracking the virus. One proposition floated, for the United States anyway, has been contact tracing by way of everyone’s cell phones. This approach of tracking people’s movements worked well in East Asia. The idea is that if a user comes in contact with someone else who has the virus, they will be notified by the phone and can take appropriate precautions, such as staying at home and wearing a mask.

So what would the database requirements be for building such a data collection system? At first glance, it would be an easy checkbox for any RFP, given the variety of open source distributed databases that can easily do the job, noted Patrick McFadin, DataStax’s chief evangelist for Apache Cassandra.

Of course, keeping a central database with this information has raised privacy concerns, so when Google and Apple proposed incorporating Bluetooth-driven tracking software for their respective mobile devices — set to roll out next month — they were smart to design it so the phones themselves would do the heavy lifting, rather than relying on some centralized database.

They were careful not to state what government agency or other entities would actually be collecting and managing the data. Rather, they would provide the conduit to the phones so other entities can parse the data.

Apple and Google’s architecture does not involve tracking user locations or identifying data. Instead, each device will create for itself a “beacon” number, generated by the phone’s hardware. As phones come into proximity with other devices during their owners’ travels, they will share their beacons, with each phone keeping a list of all the other phones it came in contact with.

Someone contracting the virus would then volunteer their status to the app, which would relay its 64-bit beacon number (generated by the Bluetooth chip) and the associated timestamp to the backend database, which would be the first contact with the organization doing the tracking. The database would periodically send out a list of infected keys, so to speak, and the user’s phone itself would check the infected keys against its own list.
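The exchange described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Apple/Google protocol itself: the `Phone` class, its method names, and the use of a random 64-bit integer as the beacon are all hypothetical stand-ins chosen for illustration.

```python
import secrets
import time


class Phone:
    """Toy handset model; class and method names are illustrative only."""

    def __init__(self):
        # Assumption: a random 64-bit number stands in for the
        # hardware-generated beacon described in the article.
        self.beacon = secrets.randbits(64)
        # Beacons heard from nearby phones, mapped to the time they were seen.
        self.seen = {}

    def meet(self, other, when=None):
        """Both phones record each other's beacon when in proximity."""
        when = time.time() if when is None else when
        self.seen[other.beacon] = when
        other.seen[self.beacon] = when

    def exposures(self, infected_keys):
        """Check the published infected keys against the local list."""
        return set(infected_keys).intersection(self.seen)


alice, bob, carol = Phone(), Phone(), Phone()
alice.meet(bob)

# Bob volunteers his status; the backend publishes his beacon.
published = [bob.beacon]

print(bool(alice.exposures(published)))  # Alice met Bob
print(bool(carol.exposures(published)))  # Carol did not
```

The matching happens entirely on the handset: the backend only ever sees the beacons of people who volunteered their status, never the contact lists.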

In other words, “This is not really a database problem,” McFadin said. The database would only store the key, and the time and date it arrived. Even the whole United States, with a population of 328 million, would probably have at most a few hundred thousand keys at any given time. Easy-peasy work for any enterprise-ready database management system.
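A rough back-of-the-envelope calculation shows why the volume is so modest. The figures below are assumptions for illustration: 300,000 keys stands in for “a few hundred thousand,” and an 8-byte timestamp is assumed alongside the 64-bit key. Real storage would add index and replication overhead.

```python
KEY_BYTES = 8            # 64-bit beacon number, as described above
TIMESTAMP_BYTES = 8      # assumption: a 64-bit epoch timestamp
ASSUMED_KEYS = 300_000   # assumption: "a few hundred thousand" keys


def raw_size_bytes(num_keys, key_bytes=KEY_BYTES, ts_bytes=TIMESTAMP_BYTES):
    """Raw payload size, ignoring index and replication overhead."""
    return num_keys * (key_bytes + ts_bytes)


size = raw_size_bytes(ASSUMED_KEYS)
print(f"{size / 1_000_000:.1f} MB")  # 4.8 MB
```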

Where it gets interesting, however, is in who will collect the data and, if it is multiple entities, how they will coordinate their info. States may collect data for their citizens, or perhaps insurance companies could do the same. In order for the system to work, however, its coverage should be as comprehensive as possible. In other words, this approach would only work through a federated model of multiple data collection agencies.

This is where having a distributed database, such as Cassandra, will come in handy, McFadin noted.

Multiple parties could each keep the data in their own Cassandra database, but all the databases can share a single keyspace, and each database can keep a copy of all the keys gathered by all the databases.

“The way that Cassandra works is that a cluster can have multiple data centers, and each data center stores all the data. Let’s say you wanted to run a database that spans two clouds, like Amazon and Google. You would set up a ‘data center’ in Amazon, and a ‘data center’ in Google. And when you create the cluster, in the Cassandra world, you would create one keyspace and that keyspace is for all the tables that span both data centers. So when you insert data into the Google data center, it will appear in the Amazon data center within a few milliseconds,” he said.
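In Cassandra, the single keyspace McFadin describes is typically declared with the `NetworkTopologyStrategy` replication class, which takes a replica count per data center. The helper below is a minimal sketch that only builds the CQL statement as a string; the function name, the keyspace name `tracing`, and the data center names `aws_dc` and `gcp_dc` are hypothetical, and real data center names must match those configured in the cluster.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that replicates to every listed
    data center. Names passed in are assumed to be trusted identifiers."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    options = ", ".join(parts)
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{options}}};"


# Hypothetical layout: three replicas in each of two clouds.
statement = keyspace_cql("tracing", {"aws_dc": 3, "gcp_dc": 3})
print(statement)
```

The resulting statement could then be run through `cqlsh` or a client driver. Tables created in that keyspace would be replicated to both data centers.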

The other requirement for the database would be to generate a list of all those identifiers that have been submitted within the last 14 days, or whatever time window officials choose. A simple query based on the timestamps could do this job, though McFadin admits that a pretty robust service layer would be required, as millions of phones would be periodically requesting this data. Obviously a Content Delivery Network (CDN) would be handy here.
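The time-window filter itself is simple. The sketch below applies it to an in-memory list rather than a real database; the function name, the `(key, received_at)` record layout, and the sample values are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def recent_keys(records, now, window_days=14):
    """Return keys whose submission time falls inside the window.

    `records` is an iterable of (key, received_at) pairs, a hypothetical
    layout mirroring the key-plus-timestamp rows described above.
    """
    cutoff = now - timedelta(days=window_days)
    return [key for key, received_at in records if received_at >= cutoff]


now = datetime(2020, 4, 23, tzinfo=timezone.utc)
records = [
    (0xA1, now - timedelta(days=1)),
    (0xB2, now - timedelta(days=13)),
    (0xC3, now - timedelta(days=20)),  # outside the 14-day window
]
print(recent_keys(records, now))
```

Because the result is the same for every phone until new keys arrive, it is the kind of payload that can be generated once and cached, which is where a CDN would help.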

Rich Data Sets

Another distributed database, the open source document-oriented MongoDB, has been used in similar data collection duties, albeit at a smaller scale, according to a spokesperson for that company.

Boston Children’s Hospital used MongoDB Atlas to build CovidNearYou, a website where users can report current symptoms and are identified only by zip code. This app provides health officials with real-time updates on virus hotspots in the area.

MongoDB is also hosting a copy of the Johns Hopkins University COVID-19 dataset, which is used in the university’s widely consulted case tracker for U.S. infections. Hopkins itself offers the data as a flat CSV file, which must be downloaded in full each time the data is updated. MongoDB’s own copy can be queried for only the updates, making it much more practical as the basis for additional applications.

MongoDB’s document data model comes with certain advantages for this type of work, according to the company. Such data collection requires a flexible data model, so that new types of data can be tracked and easily handled by developers. Support for GeoJSON objects, which would record location information, would be instrumental in tracking the physical spread of the virus.
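To make the flexible-document idea concrete, here is a hypothetical symptom report expressed as a plain Python dictionary with an embedded GeoJSON point. The field names and sample values are invented for illustration and are not taken from CovidNearYou or any real schema. The one fixed convention is GeoJSON’s own: a `Point` lists longitude first, then latitude.

```python
import json


def make_report(zip_code, symptoms, lon, lat):
    """Build a document-style record with a GeoJSON Point location."""
    return {
        "zip": zip_code,
        "symptoms": symptoms,
        # GeoJSON orders coordinates as [longitude, latitude].
        "location": {"type": "Point", "coordinates": [lon, lat]},
    }


# Illustrative values only.
report = make_report("02115", ["fever", "cough"], -71.0589, 42.3601)

# A new kind of data can be added to one document without a schema change.
report["tested"] = False

print(json.dumps(report, indent=2))
```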

In terms of replication, MongoDB’s document model is distributed “by nature,” according to the company. Each case would be captured in a discrete document which then could be shared across multiple locations. Native sharding can ensure the database can scale to whatever size needed, as well as provide an easy path to data distribution.

Graph Traversal

One of the big drawbacks of the Apple/Google approach is that it can’t offer any sort of COVID-19 transmission data to health care providers. The database lists only the number of people who volunteered their status, not the places they’ve visited, nor the people they’ve interacted with. This is a big limitation for any type of contact tracing, which should notify not only users when they’ve been in range of the virus but also the health care community, noted Aron Szanto, CEO of a new contact tracing project called Zerobase.

To date, Szanto has counted 140 different contact tracing proposals, and most suffer from the same weakness: They require everyone to download an app. “Not everyone is going to download the same app,” he pointed out. A sizable portion of the U.S. doesn’t even have the latest iPhone or Android needed to make use of the app. The app-based approach also doesn’t work in other cases where location is vital, such as an ATM that gets used by multiple people during the day.

Zerobase’s proposed approach does not require an app. Instead, it operates on a QR code mesh-style network. It does ask all the “essential businesses,” such as pharmacies and grocery stores that stay open during periods of lockdown, to post a QR code at their entrances, which everyone entering the business can then scan. Scanning the code will open a Web page that deposits a unique identifier, like a cookie, onto the user’s phone. It also pairs that number to that location. As the user goes about their day, they check into other sites, creating an anonymous overview of where that user has been, and what other users have been to those locations as well.

Identifying someone carrying the coronavirus could be done where the person is diagnosed. As health care professionals collect patient information during a screening, they can use the same QR code to identify the person, again anonymously, to the system.

In this approach, the back-end database would be a graph database, a database designed to map out a network of connections. Zerobase uses the Neptune graph database from Amazon Web Services, which would easily handle large-scale use.

In terms of a graph network, users and the places they visit are “nodes.” When a user visits a place, an “edge” between the two is created.

“When someone gets tested positive for COVID-19, you can think about their node as turning bright red. And then you just follow the edges of that node to all the different places they’ve been, and then you go forward in time and figure out all the people who have been to those places, maybe within two days, and then the others that those people may have contacted,” Szanto explained. “And so just by creating and building out this graph, you have immediate access to the fundamental bit of contact tracing, which is ‘who else checked into this location after the time that a sick person did.’”
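The traversal Szanto describes can be sketched without a graph database at all. The Python below is a minimal in-memory model, not Zerobase’s implementation or the Neptune API: the `CheckInGraph` class, its method names, and the sample identifiers are hypothetical, and it follows only the first hop (people who checked into the same place within two days after the sick person).

```python
from collections import defaultdict
from datetime import datetime, timedelta


class CheckInGraph:
    """Users and places are nodes; each check-in is a timestamped edge."""

    def __init__(self):
        self.by_user = defaultdict(list)   # user -> [(place, time), ...]
        self.by_place = defaultdict(list)  # place -> [(user, time), ...]

    def check_in(self, user, place, when):
        self.by_user[user].append((place, when))
        self.by_place[place].append((user, when))

    def exposed(self, sick_user, window=timedelta(days=2)):
        """Users who visited the same places at or after the sick user's
        visit, within the time window."""
        found = set()
        for place, sick_time in self.by_user.get(sick_user, []):
            for user, when in self.by_place[place]:
                if user != sick_user and sick_time <= when <= sick_time + window:
                    found.add(user)
        return found


graph = CheckInGraph()
t0 = datetime(2020, 4, 20, 9, 0)
graph.check_in("u1", "pharmacy", t0)                       # later tests positive
graph.check_in("u2", "pharmacy", t0 + timedelta(hours=3))  # same place, soon after
graph.check_in("u3", "pharmacy", t0 + timedelta(days=5))   # same place, too late
graph.check_in("u4", "grocery", t0 + timedelta(hours=1))   # different place
graph.check_in("u5", "pharmacy", t0 - timedelta(hours=1))  # visited beforehand

print(graph.exposed("u1"))  # {'u2'}
```

Reaching “the others that those people may have contacted” would mean repeating the same step from each newly found user, which is the kind of multi-hop query a graph database is built to run efficiently.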

Amazon Web Services, DataStax and MongoDB are sponsors of The New Stack.

Feature image: Johns Hopkins Coronavirus Resource Center.
