For all the advances in the development of artificial intelligence algorithms and models, the majority of potential applications never make it to production because of the time and expense of labeling data to train the model. That’s a problem Snorkel.ai has set out to automate.
“The not-so-hidden secret about AI today is that despite all the technological and tooling enhancements, 80 to 90% of the cost, for many use cases, just goes into manually collecting and labeling and curating this data, this training data that the model learns from,” said company co-founder and CEO Alex Ratner.
Ratner concedes that this is not the first field or even the first decade in which appropriately labeled data has been considered paramount. In a contributed post to TNS last year, Vikram Bahl outlined the challenges of preparing data for machine learning and AI.
“There’s been just amazing progress in the machine-learning models, the algorithms. There’s so much of that out in the open source, they’re becoming so powerful, but also so data-hungry. And, really, that’s what motivates everything we do is that data is what’s blocking people now,” Ratner said.
A 2020 Cognilytica report found that data preparation and engineering tasks represent more than 80% of the time spent on most AI and machine learning projects. It projects the market for third-party data-labeling solutions to reach more than $4.1 billion by 2024.
In many settings, the data requires experts to label it, is private, and/or changes all the time, requiring constant re-labeling, so projects are simply never deployed. Those companies — banks, healthcare organizations, government agencies — are Snorkel's clients, many of them organizations that for compliance reasons can't send data out to a third party to prepare it.
Born in Academia
The Snorkel project began in the Stanford AI lab in 2016, with work on a DARPA project on fighting human trafficking. It has been co-developed and deployed with leading organizations like Google, Intel, Apple, and Stanford Medicine, and has been part of more than 40 peer-reviewed research papers.
“We started off with a bunch of NIH-sponsored user studies and saying, ‘Hey, how could we get a subject matter expert, like a doctor or a biomedical data scientist, to write these labeling functions? Is it actually going to be faster than labeling by hand?’ And that was actually one of the big motivations of why we spun out of Stanford in 2019 — to maybe spend a little bit less time proving new theorems about modeling or weak supervision, and a little bit more time getting back to this original [idea]. The very first DARPA funding of the project was, ‘how do you get the subject-matter expert in the loop with modern AI technologies?’” Ratner said.
“Rather than having this just be a manual ad hoc process, if you want to train a medical triage model or doctor assist model, can you ask your doctor friend to sit down for, [as in] one of our papers with Stanford radiology and the VA healthcare system, eight to 14 person-months, just to get a first model up and running? Can we allow these subject-matter experts to teach the model programmatically — say I’m looking for these features, or these words, or these blobs in the image — and teach the model that way?”
The company came out of stealth last July and recently announced $35 million in Series B funding, led by Lightspeed Venture Partners. Its total funding has reached $50.3 million.
Stressing Iterative Workflow
The Snorkel approach relies on weak supervision, the concept of using general guidelines rather than strict rules to guide a model. It’s considered less precise per label, but far faster and cheaper than hand labeling.
Users start by writing labeling functions, each of which either produces a label such as SPAM = 1 or NOT_SPAM = 0, or abstains from labeling (ABSTAIN = -1). These functions don’t have to be perfectly precise. After they are run over sets of data, comparing the areas where their labels agree or disagree helps refine the probability that, say, an image is a stop sign or a cat.
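In code, a labeling function is just a small heuristic that votes for a label or abstains. The following is a minimal pure-Python sketch of the idea, not Snorkel's actual SDK; the example texts, function names and heuristics are hypothetical, and a simple majority vote stands in for Snorkel's probabilistic label model:

```python
# Illustrative weak-supervision sketch: several noisy labeling
# functions vote on each example, and votes are combined.
# All names, heuristics and data here are hypothetical.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_free(text):
    # Heuristic: the word "free" often signals spam.
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_link(text):
    # Heuristic: embedded links are a weak spam signal.
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_greeting(text):
    # Heuristic: messages opening with a greeting are usually legitimate.
    return NOT_SPAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_link, lf_short_greeting]

def label(text):
    """Combine the noisy votes by simple majority, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

examples = [
    "Claim your FREE prize at http://example.com",  # both spam heuristics fire
    "Hi Sam, lunch tomorrow?",                      # greeting heuristic fires
]
print([label(t) for t in examples])
```

In the real system, rather than a flat majority vote, a label model estimates each function's accuracy from where the functions agree and disagree across the dataset, and weights the votes accordingly to produce probabilistic training labels.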
The company maintains that this general approach is faster and more flexible, enables larger training data sets that improve accuracy, and is accurate enough to offer companies vast savings in labeling cost and in time to get AI applications into production. And if the data changes, users can update just a portion of the training data rather than re-label everything from scratch.
The technology is trying to bridge the brittle, rigid rules-based systems of the ‘80s with machine-learning models that are good at generalizing. While those rules-based systems enable users to codify the knowledge of subject-matter experts precisely, black-box machine-learning models can’t easily be audited for bias and errors — a huge problem, according to Ratner.
The company recently announced a no-code UI as part of Application Studio, its visual builder, to go along with its Python SDK — part of its push to enable non-developers such as legal analysts, nurses or even journalists to manage training data for modeling. Application Studio offers labeling templates, or the ability to custom-build one, based on industry-specific use cases such as contract intelligence, news analytics and customer interaction routing, or on common AI tasks such as text and document classification, named entity recognition and information extraction.
It logs the entire workflow pipeline with versioning, so it can easily be audited. Though the industry still has much to do in the field of explainability, Ratner said, and handling bias still requires human intervention, with Snorkel users can go into the code and revise it to eliminate bias.
“Snorkel.ai addresses key points of pain for enterprises that need to digitally transform their businesses with production ML. Their data teams struggle to build, train and deploy accurate models at scale because the coding is complex and data volumes keep rising. They need to optimize their use of existing code, accelerate model development and organize training data more efficiently. They also need to collaborate on a common platform that supports the full ML lifecycle,” said Kevin Petrie, vice president of research at Eckerson Group.
A host of startups in the AI space are each focused on a particular AI pain point, including Fiddler, taking on explainability; Iterative.ai on versioning; Dolt and TerminusDB on collaboration; and Seldon and Determined AI on management.
The company has been focused on the iterative process of AI application development and deployment in an effort to make it closer to the workflow process of any other type of software development. It has embraced integration with other open source options, allowing users to mix and match their choice of tools.
With data volumes rising, not every company has the resources in time and expertise to tackle Google-scale challenges, Ratner said.
“People are blocked more on just getting the data labeled in the settings that we service, and then just getting something up and running and deployed much more than they need to have the most sophisticated point solution, so we’re very much focused on bringing a certain workflow to bear, where you’re able to kind of very iteratively develop as quickly as possible. And we have found a lot of pull from customers on that,” he said.