Federated Learning Lets Data Stay Distributed
Not all data can be moved, which makes it difficult to train artificial intelligence models, especially in regulated industries like health care.
That can be a problem when trying to train models that might benefit from more data, but regulatory issues restrict that data's movement, according to Steve Irvine, co-founder and CEO of integrate.ai.
“[For] a lot of industries, like health care, it’s prohibited moving the data across jurisdiction, and so some of the most meaningful use cases that you and I would hope could come into the world — and developers want to bring into the world — are blocked because the data can’t move,” Irvine said.
This is where federated learning can help. Federated learning flips the paradigm: instead of moving data to a central model, it brings the training function to the data, Irvine told The New Stack.
“Instead of data having to come to a central location to train the machine learning model, versions of the model gets sent out to the location where the data resides,” he explained. “Assuming the data is distributed, sitting across different silos, it pushes the model out to train locally in all of those environments, and then small updates come back; and it averages it, so you replicate a centralized model.”
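The averaging scheme Irvine describes is essentially federated averaging (FedAvg): each silo trains the model on its own data, only the weight updates travel back, and the coordinator averages them. A minimal sketch with a toy linear model (all names and the training setup here are illustrative, not integrate.ai's implementation):

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Train a linear model on one silo's data; raw data never leaves."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_round(global_w, silos):
    """Push the model to each silo, collect updates, average them."""
    updates = [local_train(global_w, X, y) for X, y in silos]
    # Weight each silo's update by its sample count, as in standard FedAvg
    counts = np.array([len(y) for _, y in silos], dtype=float)
    return np.average(updates, axis=0, weights=counts)

# Three data silos that cannot share raw data
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
silos = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    silos.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, silos)
print(np.round(w, 2))  # converges toward the true weights [2., -1.]
```

No silo's raw examples ever reach the coordinator; only the averaged weights approximate the model that centralized training would have produced.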
Early Use Cases
integrate.ai made its federated learning platform generally available in August after running an early access program that attracted over 100 users across eight industries. The company has raised $50 million in funding to date, according to a spokesperson.
One of the companies that leveraged integrate.ai's federated approach is DNAstack, which offers software that helps scientists analyze genomic and biomedical data. Genetic and health datasets are large, sensitive and often globally distributed, which makes them difficult to pool: a perfect fit for federated learning.
DNAstack is using integrate.ai to support federated learning in its work on autism and the Autism Sharing Initiative, an international collaboration to create the largest federated network of autism data. It's a use case integrate.ai highlighted in its general availability announcement.
“Federated learning will empower us to ask new questions about autism across global networks while preserving privacy of research participants,” Marc Fiume, co-founder and CEO of DNAstack, is quoted as saying.
Prior to federated learning, a researcher would literally have had to fly to another country to access health data containing sensitive information, Irvine said.
“The problem is that people want to collaborate and they don’t have the infrastructure to do it in a way that meets a lot of the regulatory bans,” he said. “The researcher literally needs to fly to London, England, sit inside the physical location where the data exists, and run their model in that compute environment, get the learnings to understand what happens, and then come back and understand if they’re compatible.”
Accessing Distributed Data
DNAstack is able to leverage integrate.ai's APIs in the background to train its AI models. The tool spins up a Docker virtual environment that's preset with the libraries and tools needed to train the model locally. Once the training is done, the virtual environment is "torn down," Irvine said. The platform then coordinates across the nodes to optimize the model, he added.
“What you end up with is a fully optimized model, as if it had trained on all of the data, except it’s averaging the weights of the model to be able to get there, as opposed to just centralizing the data,” Irvine said. “All of those actions behind the scenes — setting up the network, training the model, averaging the model, and ensuring that’s all done in a privacy-safe way — that’s all controllable through APIs.”
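The lifecycle Irvine describes, an ephemeral environment per silo, local training, teardown, and coordinated averaging, can be mimicked in a short sketch. Everything here is an illustrative assumption rather than integrate.ai's actual API; a real deployment would launch and destroy Docker containers where this sketch uses a context manager:

```python
from contextlib import contextmanager
import numpy as np

@contextmanager
def ephemeral_env(silo_id):
    """Stand-in for spinning up a preset Docker environment on a silo."""
    state = {"silo": silo_id, "running": True}
    try:
        yield state
    finally:
        state["running"] = False  # environment is "torn down" after training

def train_locally(X, y):
    """Toy local step: least-squares fit on this silo's data only."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def coordinate(silos):
    """Coordinator: run each silo's training, then average the updates."""
    updates, counts = [], []
    for silo_id, (X, y) in enumerate(silos):
        with ephemeral_env(silo_id):  # raw data never leaves this block
            updates.append(train_locally(X, y))
            counts.append(len(y))
    return np.average(updates, axis=0, weights=counts)

rng = np.random.default_rng(1)
true_w = np.array([1.5, 0.5])
silos = []
for _ in range(3):
    X = rng.normal(size=(40, 2))
    silos.append((X, X @ true_w))

w = coordinate(silos)
print(np.round(w, 2))  # the averaged model matches the true weights
```

The design point is the one Irvine makes: the coordinator only ever sees weight updates, yet the averaged result behaves as if it had been trained on the pooled data.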
While training AI on distributed research data is an obvious use case, others include training models on Internet of Things (IoT) data. One application Irvine said he has seen is training models for predictive maintenance of IoT-enabled equipment.
“There’s a lot of data coming back from it, but it’s not networked; so you can look across that equipment and be able to understand the patterns more effectively,” he said.
It's still early days for this technology market; Irvine said integrate.ai is in a "first wave of commercial applications" being used in the wild. Its competitors are primarily open source solutions, including:
- Nvidia’s Federated Learning Application Runtime Environment (FLARE), a standalone Python library designed to enable federated learning;
- Google AI, which supports building federated models and offers a cartoon explaining the approach;
- Alibaba’s FederatedScope framework;
- IBM’s federated learning library;
- The charity OpenMined, which offers its PySyft library for accessing federated data stored on PyGrid; and
- Flower, a federated learning framework.