Q&A: Bridging Data and ML Models with Feast, the Open Source Feature Store
Raw data fuels the training and predictive power of today’s machine learning platforms. But all that raw data must be transformed by data scientists before it can be used effectively. This practice of extracting useful features from data, as part of the process of feature engineering, helps to avoid serious problems down the line when a machine learning (ML) application is scaled up and complexity increases. Enter the concept of the feature store: a tool that automatically manages and serves up features.
Launched back in 2019 as a collaboration between Google and Indonesian startup Gojek, Feast (Feature Store) is one such open source feature store for ML. Created as an operational data system that acts as a bridge between data engineering and machine learning, Feast helps to address some of the key challenges that arise in producing machine learning systems. We caught up with Feast’s creator, Willem Pienaar, to get a better idea of how feature stores work and what kind of impact they have on the evolution of MLOps. Initially inspired by Uber’s Michelangelo ML feature store, Feast has since grown considerably. With the aim of making feature stores more widely accessible to the greater ML community, Pienaar will now be joining Tecton, a company founded by the creators of Michelangelo.
What is Feast, and what was the motivation behind the creation of this system?
Basically, we had a bunch of teams (10+) building and deploying ML systems at Gojek. These were business-critical systems for pricing, match-making, fraud detection, recommendation systems. All these teams needed to serve features to their models in production, yet they were manually deploying and managing the necessary data infrastructure themselves. Simultaneously all these projects and systems were siloed, so a lot of work was duplicated, and feature reuse was non-existent. Feast was created to address these issues.
What are features, feature stores, and how do feature stores fit into the overall process of integrating machine learning into today’s applications?
A feature is an individual measurable property of an entity. For example, a feature could be the average amount of daily transactions that a customer (the entity) makes. Feature data is the input both for training models and for models served in production.
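As a rough illustration of the example above, the sketch below derives one such feature from raw transaction records. It assumes that "average amount of daily transactions" means the mean number of transactions per active day; the records, the customer IDs, and the function name `avg_daily_transaction_count` are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transactions: (customer_id, day, amount).
transactions = [
    ("c1", date(2021, 1, 1), 10.0),
    ("c1", date(2021, 1, 1), 5.0),
    ("c1", date(2021, 1, 2), 20.0),
    ("c2", date(2021, 1, 1), 7.0),
]


def avg_daily_transaction_count(rows):
    """Return {customer_id: mean number of transactions per active day}."""
    # Count transactions per customer per day.
    per_day = defaultdict(lambda: defaultdict(int))
    for customer_id, day, _amount in rows:
        per_day[customer_id][day] += 1
    # Average the daily counts over the days on which the customer was active.
    return {
        customer_id: sum(days.values()) / len(days)
        for customer_id, days in per_day.items()
    }


features = avg_daily_transaction_count(transactions)
```

Here the customer is the entity and the resulting number is the feature value attached to it.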
Feature stores are operational data systems. They provide a central interface through which teams can create and publish new features, and from which other teams (or systems) can consume features. A typical flow is for data scientists to either push data into a feature store for storage or to register transformations with a feature store that will generate data to be stored within the feature store. Once the data is available within the feature store, another team can consume those features for training a model, and can also retrieve features from the feature store for online serving.
So essentially, a feature store is the layer between models and data.
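To make the publish-and-consume flow more concrete, here is a deliberately small in-memory sketch. It is not Feast's API: the class `ToyFeatureStore` and its methods are hypothetical names chosen for illustration, and a real feature store would back the two stores with a data warehouse and a low-latency database rather than Python objects.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ToyFeatureStore:
    """In-memory stand-in for a feature store (hypothetical, not Feast's API)."""

    offline_log: list = field(default_factory=list)  # full history, for training
    online: dict = field(default_factory=dict)  # latest (event_time, value) per key

    def ingest(self, entity_id, feature, value, event_time):
        """Publish one feature value; a single write feeds both stores."""
        self.offline_log.append((entity_id, feature, value, event_time))
        key = (entity_id, feature)
        current = self.online.get(key)
        # Keep only the most recent value for online serving.
        if current is None or event_time >= current[0]:
            self.online[key] = (event_time, value)

    def get_online_features(self, entity_id, features):
        """Look up the latest values, as a model would at serving time."""
        return {
            name: self.online.get((entity_id, name), (None, None))[1]
            for name in features
        }

    def get_training_rows(self, feature):
        """Return the full history of one feature, as used to build a training set."""
        return [(e, v, t) for e, f, v, t in self.offline_log if f == feature]


store = ToyFeatureStore()
store.ingest("c1", "avg_daily_transactions", 1.0, datetime(2021, 1, 1))
store.ingest("c1", "avg_daily_transactions", 1.5, datetime(2021, 1, 2))

online = store.get_online_features("c1", ["avg_daily_transactions"])
history = store.get_training_rows("avg_daily_transactions")
```

The point of the design is that the team publishing a feature writes once, while training and serving each read through their own interface over the same data.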
What kinds of challenges or problems do feature stores solve?
To summarize, feature stores:
- Accelerate the feature lifecycle. In the pre-feature store scenario, data scientists would typically implement features in individual silos, then hand over their code to a data engineering team to reimplement production-ready pipelines. This process of reimplementing pipelines can add months to a project and requires complex coordination between teams. With a feature store, data scientists can build production-ready features and self-serve them to production.
- Increase feature accuracy, and hence prediction accuracy. A feature store simplifies the process of building features, and enables ML teams to use all of their available data to make predictions. They can more easily incorporate batch, streaming, and real-time data, to extract more predictive value. Additionally, a feature store ensures that there is a single source of truth for feature data across an organization. Training data is consistent with serving data. By eliminating potential data skew, models can be more accurate.
- Sharing and re-use of features. Features are shared between all the data scientists in an organization, facilitating the sharing and re-use of features across models.
- DevOps-like engineering processes. Data scientists are by nature not software engineers. They typically don’t build production-ready code, and instead rely on separate teams like data engineers to reimplement production-ready code. A feature store allows data scientists to build production-ready features and self-serve them to production. This in turn allows DevOps-like best practices to be implemented, with data scientists owning their “code” all the way from development to production, similarly to the way software engineers own their code.
Where, why and how did feature stores first take shape?
The term “feature store” was originally coined, and the concept designed, by Tecton’s founders at Uber, who created the Michelangelo machine learning platform.
Most tech companies have data infrastructure that tries to address ML use cases to some degree, but feature stores are unique in that they attempt to bring these tools together specifically for ML use cases. For example, feature stores unify the way features are retrieved in both training and online serving.
To some degree, the emergence of feature stores is a result of teams attempting to deploy more and more machine learning systems. Previously, many open source and enterprise software solutions attempted to address model serving, but ML teams realized that data was often the harder (and more rewarding) problem to solve. Not just feature engineering, but also the operational side of making the data available at scale to production systems.
Large technology companies ran into these problems first, which led them to create feature stores to address their use cases. You can have a look here for some other companies doing the same.
Why is this recent emergence of feature stores significant at this point in time for machine learning?
It shows that the space is maturing. The focus up until this point was heavily on generic data tooling and infrastructure, and also on ML-specific tools that were limited to development, but not operations. But given the number of organizations wanting to run business-critical systems on ML, we are seeing a new wave of interest in operational ML. Feature stores are one of these MLOps tools. It’s significant because these problems are still unsolved for many teams, but we believe the technology will be mainstream in a few years.
What other data management methods are/were being used previously by data scientists and data engineers, and how do they compare to tools like Feast?
You don’t have to deploy a feature store in order to engineer features or serve data to models. For example, you can use Spark for data processing, train models from your data warehouse, push data into Redis, and have your model serving layer interact directly with Redis.
The problem is that teams have:
- No centralized infrastructure, meaning each system requires a new deployment of data infrastructure like online stores and ETL pipelines
- No structured way to define and publish features that other teams can use
- No way to browse and use features from other teams
- No way to retrieve features for training and serving in a consistent way (temporally, or in the shape of the output data)
- Hard coupling between models and data infrastructure
- No means of monitoring data processing or data being served to models
So the natural progression of disparate ML infrastructure is toward a feature store design.
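One of the gaps listed above, retrieving features consistently in time, is usually handled with a point-in-time lookup: each training example only sees the feature value that was known at the moment of its label. The sketch below is a minimal standard-library illustration of that idea, not how Feast implements it; the data and the helper `value_as_of` are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one customer, sorted by event time.
feature_history = [
    (datetime(2021, 1, 1), 1.0),
    (datetime(2021, 1, 5), 2.5),
    (datetime(2021, 1, 9), 4.0),
]


def value_as_of(history, as_of):
    """Return the latest feature value with event time <= as_of, or None."""
    times = [event_time for event_time, _ in history]
    # bisect_right gives the count of entries at or before as_of.
    idx = bisect_right(times, as_of)
    return history[idx - 1][1] if idx else None


# Timestamps of training labels; feature values must not leak from the future.
label_times = [
    datetime(2020, 12, 31),
    datetime(2021, 1, 5),
    datetime(2021, 1, 7),
]
training_values = [value_as_of(feature_history, t) for t in label_times]
```

Online serving is the special case where the lookup time is "now", which is why the same rule can keep training and serving data aligned.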
What are some examples of companies or organizations that are using Feast, and how has Feast improved their workflow or overall operations?
One example is Postmates, an American company that offers local delivery of restaurant-prepared meals and other goods. It uses Feast for fraud detection, and benefits from being able to publish features to a given topic and automatically have the offline and online stores persist the data for model training and serving.
Agoda, an online travel agency and metasearch engine, is yet another example: it uses Feast as the online serving layer throughout its data centers. It specifically values the ability to publish features stream-first through Kafka, since the ML teams have a centralized Kafka team that provides a “managed” Kafka setup for them. The ML team also values having a standardized definition of features in production, and being able to serve feature data to models at scale for online use cases.
What improvements are there in the future for Feast?
The plan is to make Feast the best, full-featured open source feature store. On the immediate roadmap:
- Python-first: First-class support for running a minimal version of Feast entirely from a notebook, with all infrastructural dependencies becoming optional enhancements.
- Production-ready: A collection of battle-tested components built for production.
- Composability: Modular components with clear extension, integration, and upgrade points that allow for high composability.
- Cloud-agnostic: Removal of all hard coupling to cloud-specific services, and inclusion of portable technologies like Apache Spark for data processing and Parquet for offline storage.
And with the Tecton contributions to Feast, we’ll look to add capabilities in the following areas:
- Feature transformations. Currently, Feast ingests data from externally-managed pipelines. We’ll look to add the ability to orchestrate data pipelines in Feast itself.
- Monitoring of operational service levels and data quality.
- DevOps-like management capabilities for features.
- Simple migration between Feast and Tecton, with the compatibility of serving APIs to be transparent to models and applications consuming features.
What can we expect in the future from Feast and Tecton joining forces?
Tecton is backing Feast in a meaningful way. Tecton is becoming a core contributor, and I will be joining Tecton full-time. This is significant because Tecton is the leading commercial vendor in feature stores. The company was founded by the original creators of Uber’s Michelangelo, which was the first instantiation of a feature store. Up to now, feature stores have only been accessible to leading technology companies like Uber, Airbnb, Netflix, and Gojek. Yet any company that is putting ML into production has a need for this technology. Tecton’s backing means that companies will now have a choice of using Feast or the Tecton commercial cloud service, both offering advanced feature store capabilities. Users will have more options and this will advance the state of operational ML.