SystemML, the ‘SQL for Machine Learning,’ Is Now a Top-Level Apache Project
SystemML, the machine-learning technology created at IBM, has reached top-level project status at the Apache Software Foundation.
IBM developed SystemML to provide the ability to scale data analysis from a small laptop to large clusters without the need to rewrite the entire codebase.
Designed to be used with Apache Spark and the machine-learning library MLlib, it makes it possible to write one code base that applies to multiple industries and platforms, allowing developers to customize applications and integrate deep intelligence into their specialized processes.
Rob Thomas, vice president of development for IBM Analytics, explained the relationship between the three in a blog post:
“Think of Spark as the analytics operating system for any application that taps into huge volumes of streaming data. MLlib … provides developers with a rich set of machine learning algorithms. And SystemML enables developers to translate those algorithms so they can easily digest different kinds of data and to run on different kinds of computers.”
Running on top of Spark, SystemML scales automatically: it evaluates a script line by line to determine whether each piece of code should run on the driver or on an Apache Spark cluster.
The IBM Watson Health VBC service is using SystemML with Spark on a very large set of electronic health record data to predict emergency department visits among high-risk patients. The idea is to intervene in ways that improve care for those patients and prevent costly trips to the ER, according to Steve Beier, IBM vice president of Value-Based Care Platform.
“SystemML is like SQL for machine learning; it enables data scientists to concentrate on the problem at hand, working in a high-level script language like R, and all the optimizations and rewrites are handled by the very powerful SystemML optimizer that considers data and available resources to produce the best execution plan for the application,” said Luciano Resende, architect at the IBM Spark Technology Center and Apache SystemML incubator mentor.
While many data science problems can be addressed on one machine, SystemML takes on those that won’t fit on one machine, explains Fred Reiss, staff member at IBM Research Almaden in San Jose, Calif., where the project originated.
It sets itself apart by running at scale while still working efficiently on small data, he said in a video on the project site.
And because its scripting languages use R-like and Python-like syntax, data scientists don’t have to learn an entirely new language to use it.
SystemML simplifies the development and deployment of ML algorithms by separating algorithm semantics from underlying data representations and runtime execution plans. This gives data scientists the flexibility to create and customize ML algorithms independent of data and cluster characteristics, explained an IBM Research paper on how it uses compressed linear algebra to fit larger datasets into memory. To wit:
“SystemML’s language is expressive enough to cover a broad class of ML algorithms: descriptive statistics, classification, clustering, regression, matrix factorizations, dimensions reduction, and survival models for training and scoring. Generally, algorithms that can be expressed using vectorized operations are a good fit for SystemML.
“The SystemML cost-based compiler automatically generates hybrid runtime execution plans that are composed of single-node and distributed operations depending on data and cluster characteristics such as data size, data sparsity, cluster size, memory configurations, while exploiting the capabilities of underlying data-parallel frameworks such as MR [MapReduce] or Spark.”
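The idea of a hybrid plan can be illustrated with a toy sketch. The Python below is not SystemML's compiler or API; every name, constant, and threshold in it is a hypothetical chosen for illustration. It estimates the memory footprint of each operation's inputs from their dimensions and sparsity, then assigns the operation to a single node if the estimate fits within an assumed driver memory budget and to the cluster otherwise.

```python
from dataclasses import dataclass

# Hypothetical constants for illustration; not taken from SystemML.
BYTES_PER_DOUBLE = 8
SPARSE_BYTES_PER_NONZERO = 12  # assumed: 8-byte value plus 4-byte column index


@dataclass(frozen=True)
class MatrixStats:
    rows: int
    cols: int
    sparsity: float  # fraction of non-zero cells, from 0.0 to 1.0


def estimate_bytes(m: MatrixStats) -> int:
    """Estimate in-memory size using the cheaper of a dense or sparse layout."""
    cells = m.rows * m.cols
    dense = cells * BYTES_PER_DOUBLE
    sparse = round(cells * m.sparsity) * SPARSE_BYTES_PER_NONZERO
    return min(dense, sparse)


def choose_backend(inputs, driver_budget_bytes: int) -> str:
    """Pick single-node execution when all inputs fit in the driver budget."""
    total = sum(estimate_bytes(m) for m in inputs)
    return "single-node" if total <= driver_budget_bytes else "distributed"


def plan(script, driver_budget_bytes: int):
    """Decide a backend for each (operation name, inputs) pair, one at a time."""
    return [(name, choose_backend(inputs, driver_budget_bytes))
            for name, inputs in script]


# Example: three operations over matrices of very different sizes.
small = MatrixStats(rows=1_000, cols=100, sparsity=1.0)
large_dense = MatrixStats(rows=10_000_000, cols=1_000, sparsity=1.0)
large_sparse = MatrixStats(rows=10_000_000, cols=1_000, sparsity=0.00001)

script = [
    ("summarize_small", [small]),
    ("multiply_large_dense", [large_dense, small]),
    ("scan_large_sparse", [large_sparse]),
]

hybrid_plan = plan(script, driver_budget_bytes=1_000_000_000)  # assumed 1 GB budget
for op_name, backend in hybrid_plan:
    print(op_name, "->", backend)
```

In this sketch, the large dense matrix pushes its operation to the cluster, while the equally large but very sparse matrix stays on a single node because its estimated footprint is small. A real cost-based compiler weighs many more factors, such as cluster size and the cost of each operator, as the quoted paper describes.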
The Apache Software Foundation provides funding and support for top-level projects. Achieving top-level status involves demonstrating that the project has a sufficient community of committers to drive the project forward and a diversity of committers so that it’s not under the thumb of just one company.
SystemML was accepted into the Apache Incubator in November 2015, the same year IBM announced it would commit $300 million over the next few years to Apache Spark.
Around the same time, Google open-sourced its TensorFlow machine learning software, Facebook donated artificial intelligence and machine learning tools to the existing Torch open-source project, and Microsoft released its Distributed Machine Learning Toolkit to the open source community.