Machine Learning for Drug Discovery Using the Google Kubernetes Engine
Traditional pharmaceutical development is a slow, costly process. The long delay from research to market makes it hard for start-ups to enter the market, and the cost of discovering potential therapies and testing them can discourage companies from researching therapies for rare diseases. Recursion Pharmaceuticals sees opportunities to accelerate this process by using machine learning, especially neural networks and representational learning.
“Recursion’s goal as a company, broadly, is to use artificial intelligence methods to accelerate and change the entire pipeline of the drug discovery process,” explains Ron Alfa, senior vice president of translational discovery at Recursion.
Recursion does this with a machine learning pipeline that runs on Google Kubernetes Engine, deployed both in Google Cloud and on-prem, which the company uses to run hundreds of thousands of experiments every week.
The first step in the process is to isolate parts of cells, manipulate them in different ways, and take measurements and images of the cell components and individual cells before and after the manipulation. “We can manipulate cells in hundreds of thousands of different ways and probe their biology by looking at their responses using algorithms,” Alfa explains. This, he says, allows Recursion to quickly see how cells respond to different types of changes, molecules or potential drugs without having to understand why the cell responds that way.
Recursion uses the open-source CellProfiler to measure the cells and cell component images. “At the end of the day we get back measurements at the individual cell level,” explains Ben Mabey, vice president of engineering at Recursion. “Those measurements then become feature vectors that get piped into our machine learning algorithms.”
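To make the idea of measurements becoming feature vectors concrete, here is a minimal sketch in Python. It is not Recursion's code: the measurement names and values are invented for illustration and are not taken from CellProfiler's actual output. It only shows the general step of turning named per-cell measurements into fixed-order numeric vectors that a learning algorithm can consume.

```python
# Hypothetical measurement names, chosen for illustration only; they are
# not taken from CellProfiler's real output schema.
FEATURE_NAMES = ("area", "mean_intensity", "eccentricity")


def to_feature_vector(cell):
    """Turn one cell's named measurements into a fixed-order list of floats."""
    return [float(cell[name]) for name in FEATURE_NAMES]


def to_feature_matrix(cells):
    """Stack per-cell vectors into a matrix with one row per cell."""
    return [to_feature_vector(cell) for cell in cells]


# Made-up measurements for two cells.
cells = [
    {"area": 100, "mean_intensity": 0.5, "eccentricity": 0.25},
    {"area": 120, "mean_intensity": 0.75, "eccentricity": 0.5},
]
matrix = to_feature_matrix(cells)
```

Fixing the order of the features matters more than the particular names: every cell has to land in the same columns so that downstream models compare like with like.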
From there, Recursion uses one of two approaches. The first is a more traditional artificial-intelligence-powered analysis. The second uses convolutional neural networks for representational learning. This produces a learned representation in which the more “similar” cells are to each other, based on certain measurements or features, the closer together they sit. One of the most common applications of this technique is facial recognition; what Recursion does is like facial recognition for cells.
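The notion of similar cells sitting close together can be illustrated with a small Python sketch. It assumes a trained network has already turned each image into a short vector of numbers (an embedding); the vectors and labels below are invented, and cosine similarity is just one common choice of closeness measure, not necessarily the one Recursion uses.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the label of the candidate embedding closest to the query."""
    return max(candidates, key=lambda label: cosine_similarity(query, candidates[label]))


# Invented embeddings; in practice these would come from a trained network.
untreated = [1.0, 0.0, 0.0]
treated = {
    "compound_a": [0.9, 0.1, 0.0],
    "compound_b": [0.0, 1.0, 0.0],
}
closest = most_similar(untreated, treated)
```

In this toy data, cells treated with the hypothetical compound_a look almost like untreated cells, while compound_b moves them somewhere quite different, which is the kind of signal a similarity search over embeddings is meant to surface.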
“We’ve found that machine learning algorithms can detect things that humans going through images one by one just can’t,” Mabey explains.
“One of the big challenges we have is we have petabytes worth of data that we need to process and train on,” Mabey says, describing a problem common to many machine learning applications. At first, Recursion handled this with on-prem GPU clusters, but the company has increasingly been moving its machine learning to Google Cloud. A major rationale was to take advantage of tensor processing units (TPUs), which Google developed and first announced in 2016. TPUs are application-specific integrated circuits designed specifically for neural network machine learning.
“TPUs right now are exclusive to TensorFlow, so right now the majority of our research and deep learning is on TensorFlow,” Mabey says, adding that he expects Recursion to stick with TensorFlow for most research and deep learning in the future.
All of the clusters are orchestrated using Kafka Streams on top of the Google Kubernetes Engine, regardless of whether the experiment is running in the cloud or on-prem. “All the bookkeeping of if we’re done with an experiment, that all gets handled in Kafka Streams,” Mabey says. “It abstracts away all of the complexities. With Kafka Streams, if something breaks, it just picks up right where it left off. We can write distributed applications without having to worry about all the typical worries that engineers at Confluent have already dealt with.”
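The “picks up right where it left off” behavior Mabey describes rests on a simple idea: record how far processing has gotten, and advance that marker only after a unit of work succeeds. The Python sketch below is a toy stand-in for that idea, not the Kafka Streams API (which is a Java library and keeps its progress in Kafka itself). The class name, the experiment labels and the simulated crash are all hypothetical.

```python
class CheckpointedConsumer:
    """Toy offset tracker; illustrative only, not the Kafka Streams API."""

    def __init__(self, log):
        self.log = log        # ordered list of records to process
        self.committed = 0    # index of the next unprocessed record

    def run(self, handler):
        """Process records starting from the last committed offset."""
        while self.committed < len(self.log):
            handler(self.log[self.committed])
            self.committed += 1  # advance only after the handler succeeds


log = ["exp-1", "exp-2", "exp-3"]
done = []
failed_once = set()


def flaky_handler(record):
    """Fail the first time exp-2 is seen to simulate a worker crash."""
    if record == "exp-2" and record not in failed_once:
        failed_once.add(record)
        raise RuntimeError("simulated worker crash")
    done.append(record)


consumer = CheckpointedConsumer(log)
try:
    consumer.run(flaky_handler)
except RuntimeError:
    pass  # the crash leaves the committed offset pointing at exp-2

consumer.run(flaky_handler)  # a restart resumes at exp-2, not at exp-1
```

Because the offset moves only after a record is handled, a crash never skips work, though a record that was partly processed before the crash may be handled again on restart.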
Beyond the usual machine-learning challenges, Recursion is also trying to solve one that is cultural and organizational. “One of the things that makes us really unique, especially in the biotech space, is that we’re very collaborative and cross-function in our teams,” Mabey says. “Our data scientists are working side-by-side with our software engineers and our biologists.”
This creates technical challenges, too — and technical solutions. “In some stacks, you’ll see engineers taking data scientists’ work in Python and recoding it in Java. I don’t think that’s the right call,” Mabey says. “We want to give data scientists the flexibility to use the tools they want. We’ve done that by leveraging Kubernetes. As long as you can package it up in a container, you can distribute it and our production pipeline can run it.”
In the future, Recursion will also likely start using Kubeflow to operationalize training and model building, Mabey says. The ability to easily spin up Jupyter notebooks with Kubeflow and create training jobs on Tensor Processing Units (TPUs) with a few lines of YAML using GKE has the potential to dramatically simplify the process of setting up the distributed model training Recursion relies on.
Recursion’s business model involves turning a biological problem into a machine-learning problem, and leveraging a cross-functional team of software engineers, data scientists and biologists to do so. The technology stack the company uses is designed to bridge the gaps in workflow and methodology between data science and software engineering while using artificial intelligence to see subtle changes in cell features that humans would miss. At the moment, the company has two drugs in clinical trials, both for relatively rare brain-related diseases.