PredictionIO, one of the latest Apache Software Foundation projects to be elevated to top-level status, grew out of frustration, as many such projects do.
“Every time we needed to build something intelligent — doing a news recommendation or resume analysis, product matching, something like that — we had to build the machine learning tech stack from scratch every time,” said Simon Chan, co-founder of PredictionIO, along with Donald Szeto, Kenneth Chan and Thomas Stone.
“If you don’t have to build your database from scratch, why should you have to build your machine learning server from scratch? … The idea was to build something developers could use without a Ph.D.”
They were inspired by the MySQL model of the time, he said, and determined to offer an open source machine learning server in the same spirit.
Salesforce acquired their company in early 2016 and donated an open source version to ASF later that year. It’s since become the backbone of Salesforce Einstein, which enables customers to build their own AI-powered tools.
One of the project’s biggest assets is its template gallery, which lets users share their projects as templates. Companies building a similar application can simply download a template and run it as is or modify some of its components, said Chan, now senior director of product management for Salesforce Einstein.
The templates cover recommendation, classification, natural language processing and other tasks. Regression templates include forecasting energy use, electric load and Boston housing prices.
Building a recommendation engine, a project that typically could take a team of expert data scientists months, can be accomplished in a couple of weeks by one or two engineers using PredictionIO, he said in a blog post.
More than Algorithms
PredictionIO originally was built on Hadoop, but switched to Spark as Spark matured, Chan explained. It uses HBase as the data store; MLlib, the Spark machine learning library; Spray, a Scala toolkit for building REST/HTTP services; HDFS in some models; and Elasticsearch to store metadata.
An event server continuously collects data from your application, in real time or in batch, for model training and evaluation. The event server can also unify data from multiple sources for analysis.
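As a rough illustration of what sending data to the event server might look like from an application, here is a minimal Python sketch. The URL, port, access key and JSON field names (`entityType`, `targetEntityId` and so on) are assumptions for illustration and are not taken from this article; check them against the project's Event API documentation before relying on them. The sketch only builds the request and does not send it.

```python
import json
from datetime import datetime, timezone
from urllib import request

# Assumed address of a local event server; adjust for a real deployment.
EVENT_SERVER_URL = "http://localhost:7070/events.json"


def build_event(event, entity_type, entity_id,
                target_entity_type=None, target_entity_id=None,
                properties=None, event_time=None):
    """Build one event record as a plain dict (field names are assumptions)."""
    record = {
        "event": event,
        "entityType": entity_type,
        "entityId": entity_id,
        # Default to the current UTC time when no timestamp is supplied.
        "eventTime": (event_time or datetime.now(timezone.utc)).isoformat(),
    }
    if target_entity_type and target_entity_id:
        record["targetEntityType"] = target_entity_type
        record["targetEntityId"] = target_entity_id
    if properties:
        record["properties"] = properties
    return record


def build_request(record, access_key, url=EVENT_SERVER_URL):
    """Wrap an event in an HTTP POST request without sending it."""
    return request.Request(
        f"{url}?accessKey={access_key}",
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical event: user u1 gives item i42 a rating of 4.
    evt = build_event("rate", "user", "u1", "item", "i42", {"rating": 4})
    req = build_request(evt, access_key="YOUR_ACCESS_KEY")
    print(req.full_url)
    print(json.dumps(evt, indent=2))
    # To actually send it (requires a running event server):
    # request.urlopen(req)
```

Keeping the payload construction separate from the network call makes the same records easy to send one at a time or to collect into a batch.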
The PredictionIO engine then builds models with one or more algorithms using that data. Once deployed as a web service, the engine listens for queries from the application and responds through a REST API with predicted results in real time.
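A query to a deployed engine could look roughly like the Python sketch below. The endpoint URL, the query fields (`user`, `num`) and the response shape (`itemScores`) are assumptions modeled on a typical recommendation template; each template defines its own query and result format, so treat these names as hypothetical.

```python
import json
from urllib import request

# Assumed address of a locally deployed engine; not taken from the article.
ENGINE_URL = "http://localhost:8000/queries.json"


def build_query_request(query, url=ENGINE_URL):
    """Build (but do not send) a JSON POST request carrying one query."""
    return request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_item_scores(body):
    """Parse a hypothetical response: {"itemScores": [{"item": ..., "score": ...}]}."""
    payload = json.loads(body)
    return [(row["item"], row["score"]) for row in payload.get("itemScores", [])]


if __name__ == "__main__":
    # Hypothetical query: ask for 4 recommendations for user "1".
    req = build_query_request({"user": "1", "num": 4})
    print(req.full_url, req.data.decode("utf-8"))
    # With a running engine, the round trip would be:
    # with request.urlopen(req) as resp:
    #     print(parse_item_scores(resp.read()))
```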
While ASF has machine learning libraries such as Mahout and MLlib, machine learning remains in its early stages and needs contributions from researchers, developers and companies to succeed, according to Chan.
“If you’re not really [knowledgeable] about the machine learning stack, you can’t really do much with just algorithms,” Chan said. There’s a whole pipeline that PredictionIO handles, including preparing data, building algorithms, training, managing and evaluating models.
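To make the stages of such a pipeline concrete, here is a toy Python sketch that prepares data, trains a model, makes predictions and evaluates them. It is not PredictionIO's API: the function names, the event format and the choice of a simple popularity model are all assumptions made purely for illustration.

```python
from collections import Counter


def prepare(events):
    """Data preparation: keep only (user, item) pairs from 'view' events."""
    return [(e["user"], e["item"]) for e in events if e["event"] == "view"]


def train(pairs):
    """Training: a popularity model, i.e. items ranked by view count."""
    counts = Counter(item for _, item in pairs)
    # Sort by count (descending), then by item name for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked]


def predict(model, num):
    """Prediction: return the top `num` items for any user."""
    return model[:num]


def evaluate(model, held_out, num):
    """Evaluation: share of held-out views whose item is in the top `num`."""
    if not held_out:
        return 0.0
    top = set(predict(model, num))
    hits = sum(1 for _, item in held_out if item in top)
    return hits / len(held_out)


if __name__ == "__main__":
    # Hypothetical events collected from an application.
    events = [
        {"event": "view", "user": "u1", "item": "a"},
        {"event": "view", "user": "u2", "item": "a"},
        {"event": "view", "user": "u3", "item": "b"},
        {"event": "buy", "user": "u1", "item": "c"},
        {"event": "view", "user": "u2", "item": "c"},
        {"event": "view", "user": "u3", "item": "a"},
    ]
    model = train(prepare(events))
    print(predict(model, 2))
    print(evaluate(model, [("u4", "a"), ("u5", "c")], 2))
```

Even in a toy like this, most of the code deals with data handling and evaluation rather than the algorithm itself, which is the point Chan makes about needing more than algorithms.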
SystemML, another ASF project created at IBM, was also built on Spark and MLlib. It provides the ability to scale data analysis from a small laptop to large clusters without the need to rewrite the entire codebase.
PredictionIO has been extended to support any JDBC (Java Database Connectivity) data store and multiple other algorithms, Chan said.
He sees more inter-project collaboration with ASF projects such as Spark and Beam as a result of the top-level status.
“Usability is still a big problem in machine learning. We need to make the development process even easier, and that’s going to require a lot of resources from the open source community,” he said.
It’s been encouraging to see how companies have been using both the open source and internal Salesforce implementations of PredictionIO, Chan said.
At the recent Dreamforce ’17 conference, for instance, staff from Ulster Bank in Scotland talked about how the bank uses PredictionIO and Einstein for its Next Best Offer predictions, which draw on past customer behavior to determine which banking services to suggest next.
And the University of Palermo in Argentina has used PredictionIO to predict dropout rates for specific courses.