Machine learning models typically come in two flavors: those used for batch predictions and those used to make real-time predictions in a production application. These are known as offline and online models, respectively. Offline models, which require little engineering overhead, are helpful in visualizing, planning, and forecasting toward business decisions.
On the other hand, online models require substantial engineering effort and are used to personalize a customer’s experience via recommendations. Understanding which model to use based on project needs is critical because it not only dictates the deployment process, but also influences how the model is trained. In this post, I will discuss some of the challenges we faced when deploying our first online machine learning model in a real-time production application, as well as how we addressed those challenges.
When we started to create our model, we did so with little regard for the actual deployment of the model to our application. We built our model in a Jupyter notebook that loaded clean data models from our data warehouse, transformed individual columns, and produced a model. While this approach worked very well for quick iteration and experimentation, it led to several difficulties in the deployment process, particularly in the areas of feature extraction, feature transformation, and scalability. The notebook code could not simply be moved into production; we needed a way to perform these operations accurately while remaining scalable and flexible enough to reuse for future deployments. With a little more foresight, we could have greatly reduced the engineering effort required to move the model from the notebook to a deployed application.
It’s an often-repeated rule of thumb in the industry that a data scientist should expect to spend as much as 80% of their time preparing data. That’s because creating a good training data set typically requires us to gather data from multiple raw data sources and then use that data to create new features that are likely to be predictive of our target variable. At Clearcover, we have a very talented team of data analysts who have already built complex data models from our raw data in order to track our business objectives. We’ve found that these pre-transformed features are valuable for modeling. While it can be tempting to export data models directly to use in predictive modeling (we were even guilty of this at first!), leaning on pre-transformed features without a plan for reproducing them at run time can lead to serious challenges with deployment to a real-time application.
Data models that live in a data warehouse for reporting purposes are typically optimized to run on large amounts of data on a predictable cadence. This is fundamentally different from how feature extraction in a production machine learning application should be approached. We found that we needed to extract features from a small amount of data on an unpredictable cadence, and that meant we could not use the data models “as is.” Further, we found that we needed to reimplement the feature extraction in Python instead of SQL, because our deployed model does not receive data in the tabular format provided by our data models. Python gives us the tools to parse semi-structured data in an easily maintainable way. Writing this logic in Python also lays the groundwork for future work in which features are not sourced from traditional data models.
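As a rough illustration of what extracting features from a single semi-structured record can look like, here is a minimal sketch. The payload shape and every field name (driver, vehicles, year, and so on) are hypothetical and chosen only for the example; they are not our actual schema.

```python
import json
from typing import Any, Dict, Optional


def extract_features(payload: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Flatten one semi-structured record into a flat feature dict.

    All field names here are hypothetical, chosen only for illustration.
    """
    driver = payload.get("driver") or {}
    vehicles = payload.get("vehicles") or []

    # Missing inputs are passed through as None so that a later
    # transformation step can decide how to impute them.
    vehicle_years = [v["year"] for v in vehicles if v.get("year") is not None]
    return {
        "driver_age": driver.get("age"),
        "vehicle_count": float(len(vehicles)),
        "oldest_vehicle_year": float(min(vehicle_years)) if vehicle_years else None,
    }


# One incoming record, as it might arrive in a request body.
raw = '{"driver": {"age": 34}, "vehicles": [{"year": 2015}, {"year": 2009}]}'
features = extract_features(json.loads(raw))
```

Because the function takes one record and returns one row of features, the same code can be mapped over historical records in a notebook and called per request in an application.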
Feature transformation is another important step in feature engineering for a training dataset. This can include missing value imputation, binning, and one-hot encoding, among other processes. Feature transformation is particularly tricky because production data is unpredictable. For example, there may be a column in the training data that has no missing values, and so the training code never imputes missing values for that column. However, that does not mean the field will never be missing in production. Overlooking an edge case like this means errors or delays in the best case, and silent errors in the model’s predictions in the worst. When we realized this, we had to completely rewrite our feature transformation code to ensure that each feature had a full set of transformations defined, even if those transformations resulted in no change to the training data.
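To make the "full set of transformations" idea concrete, here is a small sketch of an imputer that is fit on a complete training column anyway. The MedianImputer class and the driver_age feature are hypothetical, written for this example rather than taken from our codebase.

```python
from statistics import median
from typing import Dict, List, Optional


class MedianImputer:
    """Learns a fill value from training data and reuses it at scoring time."""

    def fit(self, values: List[Optional[float]]) -> "MedianImputer":
        observed = [v for v in values if v is not None]
        # The fill value is learned even if the training column is complete.
        self.fill_value_ = median(observed)
        return self

    def transform(self, value: Optional[float]) -> float:
        return self.fill_value_ if value is None else value


# Hypothetical training column with no missing values.
training_ages = [25.0, 31.0, 40.0, 52.0]
imputers: Dict[str, MedianImputer] = {
    "driver_age": MedianImputer().fit(training_ages)
}

# The imputer is a no-op on the training data...
unchanged = [imputers["driver_age"].transform(v) for v in training_ages]
# ...but a production record with a missing value is still handled.
filled = imputers["driver_age"].transform(None)
```

The transformation changes nothing during training, yet it defines what happens when the field arrives empty in production.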
At first, we used different pandas functions to perform these operations. But we quickly found that when it came to deployment, they could not meet our needs. For example, pandas has a function called get_dummies, which is extremely useful for one-hot encoding but can only be used on training data because it requires knowledge of all possible categories for a feature. In a real-time environment with only one record to score at a time, it would see only one category, so the one-hot encoded columns would not match the columns the model was trained on.
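The mismatch is easy to reproduce. In this sketch the vehicle_type column and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical categorical feature with three levels in the training data.
train = pd.DataFrame({"vehicle_type": ["sedan", "suv", "truck", "sedan"]})
train_encoded = pd.get_dummies(train, columns=["vehicle_type"])

# A single production record contains only one level.
record = pd.DataFrame({"vehicle_type": ["suv"]})
record_encoded = pd.get_dummies(record, columns=["vehicle_type"])

train_columns = list(train_encoded.columns)
record_columns = list(record_encoded.columns)
```

Here train_columns has one column per training category, while record_columns contains only vehicle_type_suv, so the scored record no longer lines up with the model's expected inputs.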
Scikit-learn’s transformers proved to be the best solution to this problem, even though they are slightly less user friendly. Scikit-learn provides an API that allows these feature transformations to be fit and applied in the same way a predictive model is fit and applied. That way, knowledge from the training data can be “remembered.” Using transformers like these requires us to treat feature transformation the same way we treat the model itself: each feature transformation is fit on the training data, pickled, and then deployed alongside our model.
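A minimal sketch of that fit, pickle, and apply cycle, using scikit-learn's OneHotEncoder. The vehicle_type feature is hypothetical, and choosing handle_unknown="ignore" is an assumption made for this example, not necessarily the setting we use.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data for one categorical feature.
train = pd.DataFrame({"vehicle_type": ["sedan", "suv", "truck", "sedan"]})

# handle_unknown="ignore" encodes an unseen category as all zeros
# instead of raising an error at scoring time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["vehicle_type"]])

# Serialize the fitted transformer; in practice this would be written
# to a file or artifact store and shipped with the model.
blob = pickle.dumps(encoder)

# --- at scoring time ---
loaded = pickle.loads(blob)
record = pd.DataFrame({"vehicle_type": ["suv"]})
encoded = loaded.transform(record[["vehicle_type"]]).toarray()

# A category never seen in training still yields the same column layout.
unseen = loaded.transform(pd.DataFrame({"vehicle_type": ["van"]})).toarray()
```

Because the categories are stored on the fitted encoder, a single record is encoded into the same three columns the model saw during training.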
Solution: the Feature Store
While we have been able to overcome these challenges in the short term, it is apparent that we need to rethink our feature engineering infrastructure at a fundamental level with two thoughts in mind:
- Feature engineering should begin exclusively with data that is available at run time.
- Product applications and experimental Jupyter notebooks should be able to share logic for feature extraction and feature transformation.
At Clearcover, our vision for this infrastructure is a feature store. In this context, a feature store is a centralized database that curates machine learning-ready features from the raw data produced by our applications. The feature store needs to maintain not just the data itself, but also the logic used to create the data. That way, our production models can transform raw data in real time. By centralizing the feature creation process, we ensure feature consistency between training and production, while at the same time reducing the engineering resources required for deployment.
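The core idea of keeping the feature logic in one place can be sketched with a simple in-memory registry. This is a toy illustration under our own assumptions, not how any particular feature store product works; FeatureRegistry and the feature names are hypothetical.

```python
from typing import Any, Callable, Dict, List

FeatureFn = Callable[[Dict[str, Any]], Any]


class FeatureRegistry:
    """Holds named feature definitions so every caller runs the same logic."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureFn] = {}

    def register(self, name: str) -> Callable[[FeatureFn], FeatureFn]:
        def decorator(fn: FeatureFn) -> FeatureFn:
            self._features[name] = fn
            return fn
        return decorator

    def compute(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        # Every registered feature is computed from the same raw record.
        return {name: fn(raw) for name, fn in self._features.items()}


registry = FeatureRegistry()


@registry.register("vehicle_count")
def vehicle_count(raw: Dict[str, Any]) -> int:
    return len(raw.get("vehicles") or [])


@registry.register("has_prior_policy")
def has_prior_policy(raw: Dict[str, Any]) -> bool:
    return raw.get("prior_policy") is not None


# Training: apply the definitions to historical raw records.
history: List[Dict[str, Any]] = [
    {"vehicles": [{}, {}], "prior_policy": {"id": 1}},
    {"vehicles": []},
]
training_rows = [registry.compute(r) for r in history]

# Serving: apply the same definitions to one incoming record.
live_row = registry.compute({"vehicles": [{}]})
```

A real feature store would add storage, versioning, and low-latency serving on top, but the benefit is the same: a notebook and an application call one definition of each feature.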
As a startup, we are often confronted with the build-or-buy dilemma, and this case is no exception. We are actively experimenting with both open source and paid feature stores, even though we may find that none of these meet our needs. The feature engineering infrastructure we choose is one of the most important decisions for determining the future success of data science at Clearcover — and it’s not a decision we take lightly. We are excited about the implications of this project for machine learning in the insurance industry as a whole, and we hope it may help others innovate as well. As we progress along this path, we’ll continue to share our challenges, successes, and lessons learned.