This has been called the year of the feature store, with Databricks and Google among the most recent vendors announcing this technology to smooth the path for harnessing machine learning models in production. Twitter, Facebook, Comcast, Netflix, Pinterest and others also offer feature store platforms.
Not to be confused with Tekton, the open-source framework for creating CI/CD systems, the commercial enterprise feature store Tecton aims to standardize and automate the management of features in production machine learning (ML) applications.
Before Michelangelo, Uber’s internal machine learning platform, data scientists at Uber would create models, then pass them on to engineers who cobbled together open source tools to manage them, said Mike Del Balso, Tecton’s co-founder and CEO. The company had no standardized system for building reliable and reproducible pipelines for creating ML models. Models could not be larger than what would fit on a data scientist’s desktop, and there was no centralized storage for training experiments or any way to compare them.
“That data management side of machine learning is really the unique thing that we built. And that’s what really inspired us to build Tecton, because we saw how useful that was at catalyzing this explosion of machine learning [that] enabled the company to go from zero to tens of thousands of models in production,” he said.
“We’re trying to bring that same change to the rest of the industry by bringing that same kind of data layer for machine learning, especially for real-time machine learning applications, to other organizations who are trying to figure this stuff out.”
Del Balso, who before his work at Uber helped build the machine learning system for Google’s ad division, notes that Tecton is focused on operational machine learning: applying data a company already has to decision-making in its products, rather than more research-based or analytical uses of data.
“Data scientists often work locally, training models and building the pipelines of data that feed them. But taking that local model into at-scale production is an arduous, time-consuming process, subject to constraints that just aren’t present in the training environment. Furthermore, models trained offline have to be pushed online, and operate on the same type of data (called features) in order to give sensible results. But the tooling to standardize, govern and collaborate around ML data is still incredibly immature,” Martin Casado, general partner at the venture capital firm Andreessen Horowitz, wrote of its investment in Tecton. The company has raised $60 million to date.
Full ML Lifecycle
The technology is more than just a database of features, the variables or attributes (such as name, age or sex) used in machine learning models.
“Tecton allows for the data scientists to be empowered throughout that machine learning lifecycle, and allows them to both build the prototype. But then in the process, the data pipelines are automatically productionized,” Del Balso said. “So the engineering teams, they have a much easier job because there’s not a lot of cumbersome and error-prone rebuilding of different pipelines along the way. … There’s this kind of like prototyping transformation, the productionization, and there’s an element of monitoring and quality management along the way.”
The Tecton platform consists of:
- Feature pipelines for transforming raw data into features or labels
- A feature store for storing historical feature and label data
- A feature server for serving the latest feature values in production
- An SDK for retrieving training data and manipulating feature pipelines
- A web UI for managing and tracking features, labels, and data sets
- A monitoring engine for detecting data quality or drift issues and alerting
It includes the transformation of features; storage, which consists of an online and an offline store for fast retrieval and slow retrieval; feature serving and then a governance layer, “to help ensure, ‘Hey, these features are only accessible to these teams,’ ‘Help me understand the lineage of different features,’ all the metadata and collaboration that’s needed in building these machine learning applications. And then a data quality and monitoring layer for features to understand the debugging processes that you have with data in your machine learning applications,” he said.
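To make the data quality and monitoring layer more concrete, here is a minimal sketch of one possible drift check: it measures how far the recent mean of a feature has moved from a baseline, in units of the baseline's standard deviation. This is not how Tecton's monitoring engine is implemented; the function names, the sample values and the threshold of 3.0 are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_score(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Shift of the recent mean, measured in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        # A constant baseline: any change at all counts as unbounded drift.
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / spread


def has_drifted(baseline: Sequence[float], recent: Sequence[float],
                threshold: float = 3.0) -> bool:
    """Flag the feature when the score exceeds the (assumed) threshold."""
    return drift_score(baseline, recent) > threshold


# Hypothetical feature values seen at training time and in production.
baseline_values = [10.0, 12.0, 11.0, 9.0, 13.0]
stable_values = [11.0, 10.0, 12.0]
shifted_values = [20.0, 21.0, 19.0]

print(has_drifted(baseline_values, stable_values))
print(has_drifted(baseline_values, shifted_values))
```

A production monitor would more likely compare whole distributions and handle missing values, but the basic shape is the same: compare what the model sees in production with what it saw in training, and alert on the difference.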
Features are defined as code for any Python environment using the Tecton SDK. The platform can pull existing features from external data sources, and it can also compute features from raw data using PySpark, Spark SQL or Python transformations on batch and streaming data.
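As a rough illustration of what "features as code" can look like, the sketch below registers plain Python transformation functions under feature names and runs them over raw records. It deliberately does not use the Tecton SDK: the `feature` decorator, `FEATURE_REGISTRY` and the transaction fields are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

# Hypothetical registry of feature definitions; not part of the Tecton SDK.
FEATURE_REGISTRY: Dict[str, Callable[[List[dict]], float]] = {}


def feature(name: str):
    """Register a transformation function under a feature name."""
    def decorator(fn: Callable[[List[dict]], float]):
        FEATURE_REGISTRY[name] = fn
        return fn
    return decorator


@feature("user_transaction_count")
def transaction_count(transactions: List[dict]) -> float:
    # Number of raw transaction records for one user.
    return float(len(transactions))


@feature("user_avg_transaction_amount")
def avg_transaction_amount(transactions: List[dict]) -> float:
    # Mean transaction amount; 0.0 when the user has no transactions.
    if not transactions:
        return 0.0
    return sum(t["amount"] for t in transactions) / len(transactions)


def compute_features(transactions: List[dict]) -> Dict[str, float]:
    """Run every registered transformation over one user's raw records."""
    return {name: fn(transactions) for name, fn in FEATURE_REGISTRY.items()}


raw_transactions = [{"amount": 20.0}, {"amount": 40.0}]
features = compute_features(raw_transactions)
print(features)
```

Because each definition is ordinary code, it can be versioned, reviewed and reused across models like any other source file.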
The offline store contains historical feature values across time and is used to generate training data in batch. The offline feature store is configurable but defaults to Delta Lake. The online store uses AWS DynamoDB to provide the latest feature values for low-latency retrieval.
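The split between the two stores can be sketched with a toy in-memory class, shown below. Plain dictionaries stand in for Delta Lake and DynamoDB, and the class name, method names and integer timestamps are all assumptions made for this example rather than anything from Tecton.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (entity id, feature name)


class ToyFeatureStore:
    """In-memory stand-in for an offline and an online feature store."""

    def __init__(self) -> None:
        # Offline: full history per key, kept sorted by timestamp.
        self._offline: Dict[Key, List[Tuple[int, float]]] = defaultdict(list)
        # Online: only the most recent value per key.
        self._online: Dict[Key, Tuple[int, float]] = {}

    def write(self, entity: str, name: str, ts: int, value: float) -> None:
        key = (entity, name)
        bisect.insort(self._offline[key], (ts, value))
        current = self._online.get(key)
        if current is None or ts >= current[0]:
            self._online[key] = (ts, value)

    def get_online(self, entity: str, name: str) -> Optional[float]:
        """Latest value, as a model would request it at prediction time."""
        row = self._online.get((entity, name))
        return None if row is None else row[1]

    def get_as_of(self, entity: str, name: str, ts: int) -> Optional[float]:
        """Value known at time ts, as needed when building training rows."""
        history = self._offline.get((entity, name), [])
        idx = bisect.bisect_right(history, (ts, float("inf")))
        return None if idx == 0 else history[idx - 1][1]


store = ToyFeatureStore()
store.write("user_1", "txn_count_7d", ts=100, value=3.0)
store.write("user_1", "txn_count_7d", ts=200, value=5.0)

print(store.get_online("user_1", "txn_count_7d"))
print(store.get_as_of("user_1", "txn_count_7d", ts=150))
```

The as-of lookup is what keeps training data honest: a training row for an event at time 150 only sees the value that existed at that moment, not the later one the online store now serves.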
You can specify configuration such as how far back in time to backfill features, the schedule for future jobs, a time to live (TTL) and more.
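To show how these three settings interact, the sketch below models them as a small dataclass and derives the backfill job windows and TTL expiry from it. The field and function names are hypothetical and do not reflect Tecton's actual configuration options.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class FeatureJobConfig:
    # All field names are hypothetical, chosen for this sketch.
    backfill_start: datetime      # earliest date to compute features for
    schedule_interval: timedelta  # how often a materialization job runs
    ttl: timedelta                # how long a feature value stays valid


def backfill_windows(cfg: FeatureJobConfig,
                     now: datetime) -> List[Tuple[datetime, datetime]]:
    """Split the span from backfill_start to now into complete job windows."""
    windows = []
    start = cfg.backfill_start
    while start + cfg.schedule_interval <= now:
        end = start + cfg.schedule_interval
        windows.append((start, end))
        start = end
    return windows


def is_expired(cfg: FeatureJobConfig, written_at: datetime,
               now: datetime) -> bool:
    """A value older than the TTL should no longer be served."""
    return now - written_at > cfg.ttl


cfg = FeatureJobConfig(
    backfill_start=datetime(2021, 1, 1),
    schedule_interval=timedelta(days=1),
    ttl=timedelta(days=2),
)
now = datetime(2021, 1, 4)

windows = backfill_windows(cfg, now)
print(len(windows))
print(is_expired(cfg, datetime(2021, 1, 1), now))
```

With a daily schedule and a backfill start three days in the past, the sketch yields three one-day windows, and a value written on the first day has outlived its two-day TTL.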
Training datasets are delivered as pandas or Spark dataframes. Once you have your dataset, you can use existing tools such as XGBoost, TensorFlow or PyTorch to train and deploy models.
Tecton lets data scientists use more of the data they already have in their models by bringing data sources together in real time and using that real-time data in their applications, Del Balso said.
In April, the San Francisco-based company announced it was hiring Willem Pienaar, founder of the open source feature store Feast, and becoming a major contributor to the project. Feast was created in conjunction with Google while Pienaar led the data science team at Indonesian ride-hailing startup Gojek. Feast recently released version 0.10.
“It’s just like something that allows people to get started really easily with feature stores. And we expect to have a lot of additional elements like compatibility between the Feast user experience and the Tecton user experience over time,” Del Balso said. “Today, they’re separate platforms; tomorrow, they may not be. Our goal is to make it really easy for there to be a bridge between them.”
Going forward, the company plans deeper integrations with the data warehouse ecosystem, including first-class integrations with Snowflake and Redshift this year, and support for clouds beyond Amazon Web Services. It wants to help users generate better features for their models, find the data most relevant to their decision-making, and piece together their ML infrastructure into an architecture that makes sense for their use case, he said. It also wants to offer users templates for building a fraud application, a recommendation application or a prediction application, “and have all of the data flows be pre-built for that organization, so they just plug us into their data. This is a pretty big thing that we are spending a lot of time on,” Del Balso said.
Amazon Web Services is a sponsor of The New Stack.