The TransmogrifAI library can be used to build highly automated machine learning workflows that run on Apache Spark, using the relational data that all organizations keep on hand. TransmogrifAI (pronounced trans-mog-ri-phi) addresses one of the major challenges of setting up machine learning (ML) in production settings: establishing a workflow for quickly developing and testing models that can be used to predict future outcomes.
“It automates the entire machine learning workflow. You can build a good machine learning model on a given dataset in a couple hours, instead of weeks or months,” said Shubha Nabar, Salesforce’s senior director of data science for Einstein. “It significantly reduces the time and expertise to use machine learning.”
Salesforce first developed the library for its own Einstein AI-as-a-service, to help customers predict customer churn, sales forecasts, lead conversions, equipment failures, and late payments. Because each customer’s needs and data were different, not to mention private, the company had to build and deploy thousands of machine learning models on a case-by-case basis.
“Every customer’s data is so different — different schemas, different shapes, different biases that are introduced by their processes,” Nabar said. “For a machine learning model to do a good job, it has to be built on the customer’s data. But at Salesforce-scale, if you have to build a model for every single use case, it just doesn’t scale.”
So the company built automation tools to do as much of the grunt work as possible. Today, models built with the library power over three billion predictions a day.
Given structured data from a relational database, TransmogrifAI can work through the steps of producing a model that can be used to make predictions about future behavior. This includes data preparation (“feature inference”), converting data into a numerical representation (the “transmogrification” or “feature engineering”), and removing any data with no predictive power (“feature validation”).
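To make those three steps concrete, here is a minimal sketch in plain Python. It is not TransmogrifAI's API (the library itself is written in Scala); the function names, the toy rows, and the zero-variance rule used for validation are all assumptions chosen for illustration.

```python
from statistics import pvariance

# Hypothetical toy records standing in for rows from a relational table.
rows = [
    {"age": "34", "plan": "pro", "region": "us"},
    {"age": "51", "plan": "free", "region": "us"},
    {"age": "27", "plan": "pro", "region": "us"},
]
columns = ["age", "plan", "region"]


def infer_type(values):
    """Feature inference: guess whether a column is numeric or categorical."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "categorical"


def transmogrify(rows, columns):
    """Feature engineering: turn every column into numeric vector slots."""
    names = []
    vectors = [[] for _ in rows]
    for col in columns:
        values = [r[col] for r in rows]
        if infer_type(values) == "numeric":
            names.append(col)
            for vec, v in zip(vectors, values):
                vec.append(float(v))
        else:
            # One-hot encode: one slot per distinct category level.
            for level in sorted(set(values)):
                names.append(f"{col}={level}")
                for vec, v in zip(vectors, values):
                    vec.append(1.0 if v == level else 0.0)
    return names, vectors


def validate(names, vectors):
    """Feature validation (simplified): drop slots that never vary."""
    keep = [
        i for i in range(len(names))
        if pvariance([vec[i] for vec in vectors]) > 0
    ]
    kept_names = [names[i] for i in keep]
    kept_vectors = [[vec[i] for i in keep] for vec in vectors]
    return kept_names, kept_vectors


names, vectors = transmogrify(rows, columns)
kept_names, kept_vectors = validate(names, vectors)
print(kept_names)
```

In this toy data every row has the same region, so the one-hot slot for it carries no signal and is removed. A production-grade validator would likely look at more than variance, for example a feature's relationship to the value being predicted.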
Finally, the software trains several different machine learning algorithms (or “models”) on the data and picks the best one, offering a summary of each algorithm’s performance. The software also offers hyperparameter optimization, or the ability to tune each algorithm’s settings for best performance. (A blog post from Nabar explains each of these steps in greater detail.)
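The idea of trying several candidates, scoring each, and keeping the winner can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: the data is made up, the candidates are a majority-class baseline and a tiny nearest-neighbor classifier, the only hyperparameter searched is the neighbor count `k`, and a single hold-out set stands in for whatever validation scheme a real tool would use.

```python
# Hypothetical toy data: (feature value, label) pairs.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1),
         (5.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
valid = [(1.4, 0), (4.2, 0), (6.6, 1), (8.4, 1)]


def make_majority(data):
    """Baseline model: always predict the most common training label."""
    labels = [y for _, y in data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def make_knn(data, k):
    """k-nearest-neighbor classifier on one feature; k is the hyperparameter."""
    def predict(x):
        nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Build every candidate: one baseline plus one k-NN model per grid value.
candidates = {"majority": make_majority(train)}
for k in (1, 3, 9):
    candidates[f"knn(k={k})"] = make_knn(train, k)

# Score each candidate on held-out data, then keep the best one.
summary = {name: accuracy(model, valid) for name, model in candidates.items()}
best_name = max(summary, key=summary.get)

for name, score in summary.items():
    print(f"{name}: {score:.2f}")
print("selected:", best_name)
```

The `summary` dictionary plays the role of the per-algorithm performance report, and the loop over `k` is a very small grid search. With one noisy training point at 4.0, the smallest `k` overfits and the largest `k` collapses to the baseline, which is why tuning matters even in a toy case.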
Besides tying all of these tasks together in one package, TransmogrifAI could also save considerable time and even open up ML work to organizations that may not have in-house model-building expertise, Nabar explained. Currently, there appear to be many more data scientist job openings than people to fill them. Data scientist may very well be the hottest job in today’s market: IBM has predicted that we’ll see 2,720,000 job listings for data scientists by 2020, up from 364,000 today.
Written in Scala, TransmogrifAI builds on top of Spark ML Pipelines, using the Transformer and Estimator abstractions for transforming DataFrames, as well as its own abstraction over DataFrame columns, called Features. A Feature is a type-safe pointer to a column in a DataFrame, carrying all the information about that column, which allows developers to define, work with, and share features in much the same way they’d work with standard variables.
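A rough way to picture a typed pointer to a column is sketched below in Python. The `Feature` class here is hypothetical and written only for this illustration; it is not the library's class. The real library is Scala, where such type checks would happen at compile time, whereas this sketch can only check types when a value is read.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Feature(Generic[T]):
    """Hypothetical typed handle describing one column of a dataset."""
    name: str                     # column name
    value_type: type              # the type every extracted value must have
    is_response: bool             # True if this is the value to predict
    extract: Callable[[dict], T]  # how to pull the value out of a raw row

    def read(self, row: dict) -> T:
        """Extract the value and verify it has the declared type."""
        value = self.extract(row)
        if not isinstance(value, self.value_type):
            raise TypeError(
                f"feature {self.name!r} expected {self.value_type.__name__}, "
                f"got {type(value).__name__}"
            )
        return value


# Features are defined once and can be passed around like ordinary variables.
age: Feature[float] = Feature("age", float, False, lambda r: float(r["age"]))
plan: Feature[str] = Feature("plan", str, False, lambda r: r["plan"])
churned: Feature[int] = Feature("churned", int, True, lambda r: r["churned"])

row = {"age": "34", "plan": "pro", "churned": 1}
print(age.read(row), plan.read(row), churned.read(row))
```

Because each handle records its name, type, and role alongside the extraction logic, code further down a pipeline can refer to `age` or `churned` without knowing how the underlying table is laid out.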
The software is available under the BSD 3-Clause license, and the company welcomes outside contributions to further refine the library.