Biotechnology startup Grail is building its own infrastructure to support the kind of data science work it’s doing at the same time it’s building its models. It recently open sourced two big data-focused serverless projects — Bigmachine and Bigslice — as it doubles down on Go as its language of choice.
“In this situation where we’re doing a lot of R&D and discovery and model-building, we need to be able to turn around experiments very, very quickly. It’s essential that we have agile and fast infrastructure,” said Marius Eriksen, software engineer at Grail.
The Menlo Park, Calif., startup is focused on detecting cancer early, when it can be effectively treated. It is working on developing a single blood test for multiple types of cancer by looking at cell fragments in the blood. It’s using machine learning technology to detect cancer and to determine where in the body the cancer originates.
The Food and Drug Administration recently granted the Breakthrough Device designation — fast-tracking development, assessment, and review — for Grail’s Circulating Cell-free Genome Atlas (CCGA) study. Overall, the company aims to enroll 165,000 participants. It has enrolled 100,000 women undergoing mammograms for its STRIVE substudy and expects to report results from it next year.
Because genome sequencing for each person in the trials can generate more than a terabyte of data, the company deals with massive datasets. Its approach to building models relies on quick feedback cycles in which researchers build models to test their ideas on real data and learn from results.
“We believe that computation is really core to what we do. We need to be very good at running large-scale computing jobs,” Eriksen said.
“Grail was founded on the idea of taking technology infrastructure very seriously. We decided we wanted to follow modern software engineering practices, and we really wanted to focus on scalability and speed.”
The company runs two major types of data processing pipelines: bioinformatic processing, which makes biological sense of raw sequencing data, and machine learning work, which covers ad hoc analyses and building the classifiers that ultimately determine the performance of the product.
The company started using Go for these two types of workloads and decided it needed to build infrastructure around the code, rather than the other way around.
“It’s a very simple language. It’s very approachable for newcomers,” Eriksen said. “A lot of bioinformatic processing can be very performance intense; there are parts of the code that just needs to run very, very fast. Go provides tools that let you reason about the performance of your code and write very highly performant code. We found it to be a good fit.”
The company also found it needed to distribute those workloads across more machines.
“… We needed some way of doing cluster computing. Rather than pouring our code into some existing cluster-computing system, like Spark, for example, we thought it would be easier to bring infrastructure to the code we already had. And also at the same time, kind of rethink how these cluster computing systems could actually work from a modern cloud-computing context,” Eriksen said.
“Systems like Spark, for example, were built before the modern cloud era and in a sense not cloud-native. We wanted to build infrastructure that fully embraced the kinds of abstractions that are made available to you from a cloud-computing provider.”
Cloud native, in this case, means that rather than relying on third-party infrastructure like Kubernetes, the company built directly against the APIs offered by the various cloud providers, he said.
Bigmachine provides a library and mechanism for writing programs that distribute themselves across clusters. Its API lets a driver process form an ad hoc cluster of machines to which user code is distributed. User code is exposed through services, which are stateful Go objects associated with each machine.
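To make the idea of a service as a stateful Go object more concrete, here is a minimal sketch that uses only Go's standard net/rpc package. It is not the Bigmachine API. The Counter type and the startMachine and runDriver helpers are hypothetical names chosen for illustration, and the "machine" is simulated in-process over an in-memory pipe.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Counter is a hypothetical stateful service. Its total lives with the
// "machine" that hosts it and persists across calls from the driver.
type Counter struct {
	total int
}

// Add follows the net/rpc method convention: one argument, one reply
// pointer and an error result.
func (c *Counter) Add(n int, total *int) error {
	c.total += n
	*total = c.total
	return nil
}

// startMachine simulates a remote machine in-process. It serves the
// service on one end of an in-memory pipe and returns a client that
// talks to it over the other end.
func startMachine(svc *Counter) (*rpc.Client, error) {
	server := rpc.NewServer()
	if err := server.RegisterName("Counter", svc); err != nil {
		return nil, err
	}
	driverEnd, machineEnd := net.Pipe()
	go server.ServeConn(machineEnd)
	return rpc.NewClient(driverEnd), nil
}

// runDriver plays the driver: it calls the service once per input and
// returns the running total reported by the last call.
func runDriver(inputs []int) (int, error) {
	client, err := startMachine(&Counter{})
	if err != nil {
		return 0, err
	}
	defer client.Close()

	var total int
	for _, n := range inputs {
		if err := client.Call("Counter.Add", n, &total); err != nil {
			return 0, err
		}
	}
	return total, nil
}

func main() {
	total, err := runDriver([]int{1, 2, 3})
	if err != nil {
		panic(err)
	}
	fmt.Println("total held by the service:", total)
}
```

The driver never touches the counter directly. It only sends method calls, and the state accumulates on the side that hosts the service, which is the general shape the article describes.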
“I can write a program using the Bigmachine APIs that will distribute itself to as many machines as I like,” Eriksen said. “It’s entirely specified in the program you write. There’s no external configuration or operations required to do so.
“…Bigmachine takes care of everything from managing clusters of machines that are added on an ad hoc basis to how they securely communicate together to how they’re monitored. It takes care of all the operational aspects of being able to handle a large cluster of machines on a cloud provider. It’s almost like having Kubernetes, but as a library that’s used directly for a single program instead of shared infrastructure.”
Bigmachine so far supports clusters on Amazon EC2, but support for other systems can be added by implementing a Go interface.
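The general pattern of hiding a cloud backend behind a Go interface might look like the sketch below. The System interface, the Machine type and the in-memory localSystem backend are all hypothetical and are not taken from Bigmachine. The sketch only illustrates how a job written against an interface can stay unchanged when the backend is swapped.

```go
package main

import (
	"fmt"
	"sync"
)

// Machine is a hypothetical handle to one allocated worker.
type Machine struct {
	Addr string
}

// System is a hypothetical backend interface: anything that can start
// and stop machines can host a cluster.
type System interface {
	Start(n int) ([]Machine, error)
	Stop(m Machine) error
}

// localSystem is a stand-in backend that only tracks machines in memory.
// A cloud backend would call the provider's API in these methods instead.
type localSystem struct {
	mu      sync.Mutex
	next    int
	running map[string]bool
}

func newLocalSystem() *localSystem {
	return &localSystem{running: make(map[string]bool)}
}

// Start registers n new machines and returns handles to them.
func (s *localSystem) Start(n int) ([]Machine, error) {
	if n < 0 {
		return nil, fmt.Errorf("cannot start %d machines", n)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	machines := make([]Machine, 0, n)
	for i := 0; i < n; i++ {
		addr := fmt.Sprintf("local-%d", s.next)
		s.next++
		s.running[addr] = true
		machines = append(machines, Machine{Addr: addr})
	}
	return machines, nil
}

// Stop releases one machine, failing if it is not currently running.
func (s *localSystem) Stop(m Machine) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.running[m.Addr] {
		return fmt.Errorf("machine %s is not running", m.Addr)
	}
	delete(s.running, m.Addr)
	return nil
}

// Running reports how many machines are still allocated.
func (s *localSystem) Running() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.running)
}

// runJob depends only on the System interface, so swapping backends
// does not change it. It allocates machines, then releases them all.
func runJob(sys System, n int) (int, error) {
	machines, err := sys.Start(n)
	if err != nil {
		return 0, err
	}
	for _, m := range machines {
		if err := sys.Stop(m); err != nil {
			return 0, err
		}
	}
	return len(machines), nil
}

func main() {
	sys := newLocalSystem()
	used, err := runJob(sys, 3)
	if err != nil {
		panic(err)
	}
	fmt.Printf("used %d machines, %d still running\n", used, sys.Running())
}
```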
“The result of that is that you can build a Bigslice program simply by having credentials to your cloud computing environment. It’s able to make direct use of those resources without any other operational considerations,” Eriksen said.
“When you run a Bigslice program, it will go and directly allocate machines on your cloud provider, distribute itself completely transparently and distribute its workload to machines that have been allocated. For the user, there’s no additional setup or things to consider. You literally just have to have some credentials. … It doesn’t rely on other software to intermediate that.”
A Bigslice program compiles just like a regular Go binary. When run, the binary itself performs the computation: it allocates nodes, distributes itself and handles all the operational aspects of running the cluster. When the job is done or needs fewer resources, the cluster is automatically downsized accordingly, he said.
Bigslice exposes a composable API that lets you express data processing tasks as a series of data transformations that invoke user code. Users write sequential computations that spell out step by step how data is to be transformed, and Bigslice parallelizes the process, splitting datasets into pieces small enough to fit in memory so that transformations can be performed in parallel across many machines. It also can rearrange data as needed for operations like join or reduce.
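The split, transform and rearrange pattern described above can be sketched on a single machine with goroutines standing in for cluster nodes. This is an illustration of the general technique using only the Go standard library, not Bigslice's own API. The shard, partition and wordCount functions are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
	"sync"
)

// shard splits the input round-robin into n pieces.
func shard(lines []string, n int) [][]string {
	shards := make([][]string, n)
	for i, line := range lines {
		shards[i%n] = append(shards[i%n], line)
	}
	return shards
}

// partition picks which reducer owns a key, so that every occurrence
// of the same word ends up in the same place.
func partition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// wordCount counts words using nshards parallel mappers and reducers.
func wordCount(lines []string, nshards int) map[string]int {
	if nshards < 1 {
		nshards = 1
	}

	// Map phase: each shard is processed in parallel, and its counts
	// are written into one bucket per reducer.
	partials := make([][]map[string]int, nshards) // [mapper][reducer]
	var wg sync.WaitGroup
	for i, s := range shard(lines, nshards) {
		wg.Add(1)
		go func(i int, s []string) {
			defer wg.Done()
			buckets := make([]map[string]int, nshards)
			for r := range buckets {
				buckets[r] = make(map[string]int)
			}
			for _, line := range s {
				for _, w := range strings.Fields(line) {
					buckets[partition(w, nshards)][w]++
				}
			}
			partials[i] = buckets
		}(i, s)
	}
	wg.Wait()

	// Shuffle and reduce phase: reducer r merges bucket r from every mapper.
	reduced := make([]map[string]int, nshards)
	for r := 0; r < nshards; r++ {
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			out := make(map[string]int)
			for _, buckets := range partials {
				for w, c := range buckets[r] {
					out[w] += c
				}
			}
			reduced[r] = out
		}(r)
	}
	wg.Wait()

	// Collect: keys are disjoint across reducers, so a plain copy suffices.
	result := make(map[string]int)
	for _, m := range reduced {
		for w, c := range m {
			result[w] = c
		}
	}
	return result
}

func main() {
	counts := wordCount([]string{"a b a", "b c", "a"}, 2)
	fmt.Println(counts["a"], counts["b"], counts["c"])
}
```

The user-facing logic is still sequential: split lines into words, then sum per word. The sharding and the hash-based rearrangement are what a framework would handle on the user's behalf across machines.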
“We could achieve a lot of the same things by using off-the-shelf systems like Spark. With Go as our language of choice, using Go on top of something like Spark is fairly unwieldy,” he said. “We believe it’s simpler to build the infrastructure around the code we already have rather than pour that code into a system not native to how that code functions.”