Predibase Takes Declarative Approach to AutoML

2 Jun 2022 10:36am

It’s no secret that creating and deploying machine learning models takes too long. In Algorithmia’s “2021 Enterprise Trends in Machine Learning,” 25% of respondents said creating a model took one week to one month, while 24% put that time at one month to one quarter. And 37% said it took one quarter to one year to deploy a model.

• At Uber, an intent classification system that involved 1,500 lines of TensorFlow took five months to create and seven months to deploy.

• A second machine learning project, fraud detection, with 900 lines of PyTorch, took five months to create and four months to deploy.

• A product recommendation tool with 1,200 lines of PyTorch took six months to create and seven months to deploy.

San Francisco-based startup Predibase is out to change that by providing a low-code declarative ML platform that both data scientists and non-experts can use, easing the pressure on organizations to hire more scarce and expensive data scientists. Users can just state what they want to do — starting with just six lines of Python code — and let the system figure out how to do it and the infrastructure required.

It’s built atop two machine learning technologies created by the Predibase founders at Uber: Ludwig and Horovod. Ludwig is an open source, declarative machine learning framework that provides the simplicity of an autoML solution with the flexibility of writing your own PyTorch code. Horovod, an open source component of Uber’s Michelangelo deep learning toolkit, makes it easier to start and speed up distributed deep learning projects with TensorFlow.

“The experience is that data science organizations have to basically reinvent the wheel and create a bespoke solution for every single one of these products, and there’s not much in common among them. Because of that, the whole organization becomes a bottleneck for machine learning adoption,” said Predibase CEO Piero Molino. The result is that it just takes too long for machine learning models to bring value to an organization.

In contrast, he compares a declarative configuration system to what Kubernetes has done for infrastructure.

“Our vision is to make machine learning as easy as writing a SQL query,” Molino said.

The basic idea is to let users specify entire model pipelines as configurations — the parts they care about — and automate the rest.
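As a rough illustration of that idea, the sketch below shows a user-written configuration being completed with defaults. The `input_features` and `output_features` keys follow the layout of Ludwig's configuration files; the default values, the `trainer` section contents and the `fill_defaults` helper are assumptions made up for this example, not Predibase's actual behavior.

```python
import copy

# Assumed defaults for illustration only; a real platform chooses its own.
FEATURE_DEFAULTS = {
    "text": {"encoder": "default_text_encoder", "lowercase": True},
    "category": {"encoder": "default_category_encoder"},
}
TRAINER_DEFAULTS = {"epochs": 10, "learning_rate": 0.001}


def fill_defaults(user_config):
    """Return a full config: user-specified values win, defaults fill the gaps."""
    full = copy.deepcopy(user_config)
    for section in ("input_features", "output_features"):
        for feature in full.get(section, []):
            for key, value in FEATURE_DEFAULTS.get(feature["type"], {}).items():
                feature.setdefault(key, value)
    trainer = full.setdefault("trainer", {})
    for key, value in TRAINER_DEFAULTS.items():
        trainer.setdefault(key, value)
    return full


# The user states only what they care about: inputs, outputs and one override.
user_config = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
    "trainer": {"epochs": 5},
}

full_config = fill_defaults(user_config)
```

The user's `epochs` setting survives, while everything left unspecified is filled in by the system.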

“Traditional machine learning projects involve a complicated ML life cycle that spans feature and data engineering; model development and training; and model production and governance. Cross-functional data science teams struggle to manage these phases in a coherent and sustainable way,” said Kevin Petrie, vice president of research at Eckerson Group.

“Predibase represents a level of innovation to simplify the ML life cycle. Predibase proposes to let data science teams specify the desired inputs and outputs for their ML model. That is, they create configuration files that Predibase then figures out how to implement. Data science teams still can customize as many parameters, etc. as they like by making modular changes to meet new or changing customer requirements.

“In short, Predibase proposes to minimize the complexity of the ML life cycle, which is the biggest barrier to success with data science projects.”

It’s easy to get started. That Uber intent classification system, for example, could be created with six lines of code. You get something that is readable, reproducible and shareable, Molino said.

But one of the advantages is that you retain all the flexibility and control that an expert needs. You can specify all the details about the models through the configuration: the model architecture, the training parameters and the preprocessing of the data. Each is accessible through a parameter in the configuration, which makes it easy to iterate and improve models. Making a change takes just a new configuration.
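One way to picture this kind of iteration: each experiment is a copy of a base configuration with a single parameter changed. The `with_override` helper and the `trainer.learning_rate` path are hypothetical names chosen for this sketch, not part of the Predibase or Ludwig APIs.

```python
import copy


def with_override(config, dotted_key, value):
    """Return a copy of config with one nested parameter changed."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate sections if the base config lacks them.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


base = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
    "trainer": {"learning_rate": 0.001, "batch_size": 128},
}

# Each experiment is just a new configuration derived from the base one.
experiment = with_override(base, "trainer.learning_rate", 0.0001)
```

Because the base configuration is never mutated, every variant stays reproducible on its own.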

It’s also extensible. So if you’re an expert developer, you can add your own keys to the configuration. You can extend this by adding your own piece of PyTorch, for instance, and then it can be referenced from the configuration.
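A common way to make custom code addressable from a configuration is a name-based registry. The sketch below uses that pattern with a toy encoder in plain Python. The `register_encoder` decorator, the `build_encoder` helper and the `char_count` name are all invented for this example and should not be read as Ludwig's actual extension API, where a custom component would typically be a PyTorch module.

```python
ENCODERS = {}


def register_encoder(name):
    """Decorator that makes a class available under `name` in configs."""
    def decorator(cls):
        ENCODERS[name] = cls
        return cls
    return decorator


@register_encoder("char_count")
class CharCountEncoder:
    """Toy custom component; a real one would be a PyTorch module."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def encode(self, text):
        return [len(text) * self.scale]


def build_encoder(feature_config):
    """Look up the encoder named in the config and pass it the remaining keys."""
    reserved = ("name", "type", "encoder")
    params = {k: v for k, v in feature_config.items() if k not in reserved}
    return ENCODERS[feature_config["encoder"]](**params)


# The configuration refers to the custom component by its registered name.
feature = {"name": "utterance", "type": "text", "encoder": "char_count", "scale": 0.5}
encoder = build_encoder(feature)
vector = encoder.encode("book a ride")
```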

Highly Knowledgeable Team

The company has deep expertise in machine learning.

Molino, the creator of Ludwig, previously was staff research scientist at Stanford University and co-founder and senior research scientist at Uber AI.

Fellow Predibase co-founders are:

  • Travis Addair, previously senior software engineer and tech lead manager of Uber’s Deep Learning Training team in Seattle. He co-authored the Horovod project and authored Elastic Horovod.
  • Devvret Rishi, formerly product manager on Google Cloud AI and other projects. He was the first product manager for the Kaggle machine learning community and an Artificial Intelligence Teaching Fellow at Harvard University.
  • Chris Ré, associate professor in the Stanford AI Lab and the Machine Learning Group. At Apple, he created Overton, a proprietary system similar to Ludwig.

Predibase enables users to connect to structured and unstructured data stored anywhere on the cloud data stack; write model pipeline configurations and run them on a scalable distributed infrastructure, training models as easily as on a single machine; and deploy model pipelines with the click of a button and query them immediately.

“Predibase is building the first declarative ML platform that enables enterprises to develop and operationalize models, from data to deployment, without having to choose between simplicity and the power of fine-grained controls. The rapid success of both the open source foundations and the beta of its commercial platform in the Fortune 500 has been incredibly exciting,” Greylock Partner Saam Motamedi said at the recent announcement of a $16.25 million Series A round.

Still in private beta with Fortune 500 customers, Predibase is looking toward a general release in the second half of this year.

Fine-Grained Control

Customers have been using datasets of about 1 billion to 2 billion rows, with about 100 to 200 columns and several hundred gigabytes in size. Internal benchmarking has run up to 2 terabytes. Ludwig and Horovod, however, have been tested on even larger datasets, according to Rishi.

The company maintains it takes a different approach than other automated machine learning products.

“Thinking of something like DataRobot or Google Cloud AutoML, for example, [they] provide these interfaces where you kind of bring in data, click a button and you get models out,” explained Molino. “We found that that’s actually pretty unsatisfying for a lot of users and customers because they tend to be black boxes that don’t have any configurability or control. So the minute that the platform doesn’t give you a good out-of-the-box model, you’re kind of stuck, and you end up graduating out.”

Users can access the capabilities in Predibase purely through Python, through the UI or through PQL (Predictive Query Language), an extension of SQL.

The PQL extension includes predicates that allow you to bring machine learning and data together, Rishi explained. Its flexibility puts machine learning in the language of data users, so they can use “filter”, “group by aggregate”, “join” or any other commands that they’re familiar with in SQL. It’s also extensible: new features can be added as additional predicates. Predibase makes it just as easy to use text, image and other types of fields as standard tabular fields.

“This is really simple. It brings machine learning into the hands of a broader set of users that are familiar with SQL, but at the same time, behind the scenes, the power and flexibility of the Ludwig configuration system provide state-of-the-art performance on both structured and unstructured data, and the combination of the two,” Molino said.

“And finally, we also abstract away the infrastructure … based on Horovod, they can train and deploy models at scale. And it’s basically a big-tech-level infrastructure without the need to have a big-tech-level engineering team to build it, right. It’s already built for you.”

Models can be queried as REST APIs, through the Python SDK and through the PQL language. Though the entire process is encapsulated in the platform, the models also can be exported, should the user need to run them elsewhere.

A model repositories page summarizes each model as its configuration, which makes it easy to compare model versions.
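When a model version is fully described by its configuration, comparing two versions comes down to comparing two nested dictionaries. The sketch below is one minimal way to do that in plain Python. The `flatten` and `diff_configs` helpers and the sample configurations are assumptions for illustration and say nothing about how Predibase implements its comparison page.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts and lists into {"dotted.path": value} pairs."""
    if isinstance(config, dict):
        children = config.items()
    elif isinstance(config, list):
        children = enumerate(config)
    else:
        return {prefix: config}
    items = {}
    for key, value in children:
        path = f"{prefix}.{key}" if prefix else str(key)
        items.update(flatten(value, path))
    return items


def diff_configs(old, new):
    """Return {path: (old_value, new_value)} for every setting that differs."""
    flat_old, flat_new = flatten(old), flatten(new)
    return {
        path: (flat_old.get(path), flat_new.get(path))
        for path in sorted(flat_old.keys() | flat_new.keys())
        if flat_old.get(path) != flat_new.get(path)
    }


v1 = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "trainer": {"epochs": 5},
}
v2 = {
    "input_features": [{"name": "utterance", "type": "text", "encoder": "custom"}],
    "trainer": {"epochs": 10},
}

# Settings missing from one version show up as None on that side.
changes = diff_configs(v1, v2)
```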

The company is spending the first half of this year making the product enterprise-ready by improving robustness, adding enterprise-grade security and enabling multicloud deployments, Molino said. After a GA launch, it wants to pursue integrations with the wider ML ecosystem, with tools like dbt, for instance, and eventually make Predibase self-service.
