Instacart Speeds ML Deployments with Hybrid MLOps Platform
Instacart began developing its machine learning infrastructure in 2016 with Lore, an open-source framework. After years of rapid growth in the number, diversity, and complexity of its ML applications, Lore’s monolithic architecture had become a bottleneck.
This bottleneck led to the development of Griffin, a hybrid, extensible platform that supports diverse data management systems and integrates with multiple ML tools and workflows. Sahil Khanna’s recent blog post describes Griffin in detail, including its benefits, components, and workflows.
Instacart relies heavily on machine learning for product and operational innovations. Such innovations don’t come easily, as multiple machine learning models often must work together to provide a service. Griffin, built by the machine learning infrastructure team, now plays a foundational role in supporting Instacart’s machine learning applications and empowering innovation.
In short, Griffin benefits the service in the following ways:
- Aids customers with locating the correct item in a catalog of over 1 billion products.
- Supports 600,000+ shoppers with the delivery of products to millions of customers in the US and Canada.
- Incorporates AI into Instacart’s support of its 800+ retailers across 70,000+ stores in 5,000+ cities in the US and Canada.
- Enables 5,000+ brand partners to connect their products to potential customers.
Griffin: Instacart’s MLOps Platform
To let Instacart keep pace with the state of the art in ML operations (MLOps) while still deploying specialized and diverse solutions, Griffin was designed as a hybrid platform. It lets machine learning engineers (MLEs) use third-party solutions such as Snowflake, Amazon Web Services, Databricks, and Ray to support diverse use cases, and it provides in-house abstraction layers that give unified access to those solutions.
Griffin was created with the main goals of helping MLEs quickly iterate on machine learning models, effortlessly manage product releases, and closely track production applications. With that in mind, the system was built with these major considerations:
- Scalability: It needs to support thousands of machine learning applications.
- Extensibility: It needs to be flexible enough to extend and integrate with a number of data management and machine learning tools.
- Generality: It needs to provide a unified workflow and consistent user experience despite broad integration with third-party solutions.
The diagram below illustrates Griffin’s system architecture.
These considerations are reflected in the diagram above. Griffin integrates multiple SaaS solutions, including Redis, Scylla, and S3, which demonstrates its extensibility and lets it scale as Instacart grows. The integrated interface for MLEs reflects Griffin’s generality.
Four foundational components, introduced below, let Instacart develop specialized solutions for distinct use cases such as real-time recommendations.
- MLCLI: The in-house machine learning command-line interface for developing machine learning applications and managing the model lifecycle.
- Workflow Manager and ML Launcher: The orchestrator that schedules and manages machine learning pipelines and containerizes task execution.
- Feature Marketplace: Uses third-party platforms for real-time and batch feature engineering.
- Training and Inference Platform: The framework-agnostic training and inference platform for adopting open-source frameworks.
MLCLI
MLCLI allows MLEs to customize and execute tasks such as training, evaluation, and inference in their applications within containers (for example, Docker). Containerization eliminates bugs caused by differences between execution environments and provides a unified interface.
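MLCLI itself is an in-house tool, so its real commands and flags are not public. As a rough sketch of the idea of running each task inside a container through one interface, the hypothetical command-line tool below maps a task name to a `docker run` invocation. The program name `mlcli-sketch`, the flags, and the `python -m app` entry point are all assumptions made for illustration.

```python
import argparse

# Task names taken from the tasks the article mentions; the CLI shape is assumed.
TASKS = ("train", "evaluate", "inference")


def build_parser():
    """Build a parser for a hypothetical MLCLI-like tool."""
    parser = argparse.ArgumentParser(prog="mlcli-sketch")
    parser.add_argument("task", choices=TASKS, help="which task to run")
    parser.add_argument("--image", required=True, help="container image of the app")
    parser.add_argument("--config", default="config.yaml", help="task configuration file")
    return parser


def build_docker_command(task, image, config):
    """Return the container command for a task.

    Every task goes through the same image, so the execution
    environment is identical no matter where the command runs.
    "python -m app" is a made-up entry point inside the image.
    """
    return ["docker", "run", "--rm", image,
            "python", "-m", "app", task, "--config", config]


def main(argv=None):
    """Parse arguments and print the command instead of executing it."""
    args = build_parser().parse_args(argv)
    command = build_docker_command(args.task, args.image, args.config)
    print(" ".join(command))
    return command
```

Calling `main(["train", "--image", "my-app:latest"])` prints the container command rather than executing it, which keeps the sketch safe to run on a machine without Docker.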
The diagram below illustrates MLCLI features used by MLEs during ML application development.
Workflow Manager and ML Launcher
Workflow Manager schedules and manages machine learning pipelines. It leverages Airflow to schedule containers and uses ML Launcher, an in-house abstraction, to containerize task execution.
ML Launcher integrates third-party compute backends such as SageMaker, Databricks, and Snowflake to perform container runs and meet the unique hardware requirements of ML. Instacart chose this design because it can scale to hundreds of Directed Acyclic Graphs (DAGs) with thousands of tasks in a short period without concern about Airflow run time.
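The division of labor described above can be sketched in a few lines: a scheduler decides the order of tasks in a DAG, and a launcher abstraction hands each containerized task to a named compute backend. This is not Airflow or Instacart's code. The standard-library `graphlib` stands in for the scheduler, and `Task`, `MLLauncherSketch`, and `run_pipeline` are hypothetical names.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Task:
    """One containerized pipeline step (hypothetical structure)."""
    name: str
    image: str
    backend: str
    depends_on: List[str] = field(default_factory=list)


class MLLauncherSketch:
    """Routes each task to a registered compute backend."""

    def __init__(self):
        self._backends: Dict[str, Callable[[Task], str]] = {}

    def register_backend(self, name, submit):
        # `submit` stands in for a real backend client call.
        self._backends[name] = submit

    def launch(self, task):
        if task.backend not in self._backends:
            raise ValueError(f"unknown backend: {task.backend}")
        return self._backends[task.backend](task)


def run_pipeline(tasks, launcher):
    """Launch tasks in dependency order and collect the run handles."""
    by_name = {task.name: task for task in tasks}
    # graphlib expects a mapping from node to its predecessors.
    graph = {task.name: set(task.depends_on) for task in tasks}
    order = TopologicalSorter(graph).static_order()
    return [launcher.launch(by_name[name]) for name in order]
```

Because the scheduler only submits work and the backends do the heavy lifting, adding a new backend in this sketch means registering one more callable. That mirrors, in miniature, why offloading container runs keeps the scheduler itself light.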
The diagram below illustrates the Architecture Design of Workflow Manager and ML Launcher.
Feature Marketplace (FM)
With data at the center of any MLOps platform, Instacart developed FM to support both real-time and batch feature engineering. FM manages feature computation, provides feature storage, supports feature discoverability, eliminates offline/online feature drift, and allows feature sharing. This product uses third-party platforms such as Snowflake, Spark, and Flink and integrates multiple storage backends, including Scylla, Redis, and S3.
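One common way to avoid offline/online feature drift is to compute each feature from a single definition and write the same value to both the offline and online stores. The sketch below illustrates that idea only. It is not FM's actual design, and `FeatureDefinition`, `InMemoryStore`, and `FeatureMarketplaceSketch` are hypothetical names, with in-memory dictionaries standing in for backends such as Scylla, Redis, or S3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class FeatureDefinition:
    """A single source of truth for how a feature is computed."""
    name: str
    entity_key: str
    compute: Callable[[dict], float]


class InMemoryStore:
    """Stand-in for a real storage backend."""

    def __init__(self):
        self._values: Dict[Tuple[str, str], float] = {}

    def put(self, feature, entity_id, value):
        self._values[(feature, entity_id)] = value

    def get(self, feature, entity_id):
        return self._values[(feature, entity_id)]


class FeatureMarketplaceSketch:
    def __init__(self, offline_store, online_store):
        self._definitions: Dict[str, FeatureDefinition] = {}
        self._offline = offline_store
        self._online = online_store

    def register(self, definition):
        self._definitions[definition.name] = definition

    def list_features(self):
        # Discoverability: anyone can see which features already exist.
        return sorted(self._definitions)

    def materialize(self, name, rows):
        definition = self._definitions[name]
        for row in rows:
            # Compute once, write to both stores, so training and
            # serving read identical values.
            value = definition.compute(row)
            entity_id = str(row[definition.entity_key])
            self._offline.put(name, entity_id, value)
            self._online.put(name, entity_id, value)

    def get_offline(self, name, entity_id):
        return self._offline.get(name, entity_id)

    def get_online(self, name, entity_id):
        return self._online.get(name, entity_id)
```

In this sketch, sharing a feature amounts to registering it once and letting other applications look it up by name.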
The diagram below illustrates the Architecture Design of Feature Marketplace.
Inference and Training Platform
The Inference and Training Platform allows MLEs to define the model architecture and inference routine to customize their applications, which allowed Instacart to triple the number of ML applications in one year. Instacart standardized package, metadata, and code management to support diverse frameworks and ensure reliable model deployment. Frameworks already adopted include TensorFlow, XGBoost, and Faiss.
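A framework-agnostic platform usually rests on a small contract that every model must satisfy, plus standardized metadata packaged alongside the trained model. The sketch below shows one possible shape of such a contract. The names `ModelRoutine`, `ModelPackage`, `train_and_package`, and `serve` are hypothetical, and a toy pure-Python baseline stands in for a TensorFlow or XGBoost model.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


class ModelRoutine(Protocol):
    """Contract the platform relies on, regardless of framework."""

    def train(self, targets: List[float]) -> None: ...

    def predict(self, inputs: List[Any]) -> List[float]: ...


class MeanBaseline:
    """Toy model: always predicts the mean of the training targets."""

    def __init__(self):
        self.mean = 0.0

    def train(self, targets):
        self.mean = sum(targets) / len(targets)

    def predict(self, inputs):
        return [self.mean for _ in inputs]


@dataclass
class ModelPackage:
    """A trained model bundled with standardized metadata."""
    model: ModelRoutine
    metadata: Dict[str, str]


def train_and_package(model, targets, framework, version):
    """Train any conforming model and attach uniform metadata."""
    model.train(targets)
    metadata = {
        "framework": framework,
        "version": version,
        "model_class": type(model).__name__,
    }
    return ModelPackage(model=model, metadata=metadata)


def serve(package, inputs):
    """Inference goes through the same contract for every framework."""
    return package.model.predict(inputs)
```

The platform code in this sketch never imports a specific ML framework. Only the model class does, which is what would let a new framework be adopted without changing the training or serving path.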
The diagram below illustrates the Architecture Design of the Inference and Training Platform.
A Few Key Learnings
Some valuable lessons were learned during the development of Griffin.
- Buy vs. Build: Using third-party solutions is important for supporting a quickly growing feature set and avoiding reinventing the wheel. Careful platform integration is key to switching seamlessly between solutions while keeping migration overhead low.
- Make Incremental Progress: Regular hands-on codelabs and onboarding sessions encouraged early feedback and collaboration, which kept the design simple and prevented engineers from going down the rabbit hole of wanting to design the “perfect” platform.