
Apache Incubates IBM SystemML for Scalable Machine Learning

25 Nov 2015 8:32am

IBM’s machine learning technology SystemML has been accepted as an Apache Incubator open source project. In June, the company announced plans to open source the technology, at the same time it said it would commit $300 million over the next few years to Apache Spark.

When used in conjunction with Spark and MLlib (Spark’s machine learning library), SystemML sets the stage for dynamic real-time data analysis, according to Rob Thomas, vice president of development for IBM Analytics.

“If you’re a bank today, and you build an algorithm around risk management, you build it one time for a particular data set. If new data comes in, you’ve basically got to refactor, rebuild that algorithm. If you want to use it on a different data set, you’ve got to rebuild that algorithm. It’s a real laborious process, which is why you lose any real-time engagement,” Thomas said.

IBM developed SystemML so that data analysis can scale from a single laptop to large clusters without rewriting the entire codebase. The library lays the groundwork for domain- or industry-specific machine learning, allowing developers to customize applications and integrate deep intelligence into their specialized processes, according to the company.

“We’re saying you can build the model like the old Java proposition – write once, run anywhere. You can build your model, it can federate data wherever it is, bring it into SystemML to be analyzed. It’s an engine and optimizer – it’s not an algorithm itself,” Thomas said. “This is a general-purpose platform that anybody can pick up.”

As Thomas further explained it in a blog post:

“Think of Spark as the analytics operating system for any application that taps into huge volumes of streaming data. MLlib … provides developers with a rich set of machine learning algorithms. And SystemML enables developers to translate those algorithms so they can easily digest different kinds of data and to run on different kinds of computers.”
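
The "write once, run anywhere" idea Thomas describes can be illustrated with a minimal Python sketch. This is not SystemML's API or its language. Every name here (`Backend`, `LocalBackend`, `ChunkedBackend`, `choose_backend`, `fit_line`) and the row-count threshold are hypothetical, chosen only to show an algorithm written once while a separate piece of code decides how to execute it.

```python
from typing import List, Protocol, Sequence, Tuple


class Backend(Protocol):
    """Hypothetical interface: the only operation the algorithm relies on."""

    name: str

    def mean(self, values: Sequence[float]) -> float: ...


class LocalBackend:
    """Computes everything in one pass, as on a single laptop."""

    name = "local"

    def mean(self, values: Sequence[float]) -> float:
        return sum(values) / len(values)


class ChunkedBackend:
    """Stand-in for a cluster: partial sums per chunk, combined at the end."""

    name = "chunked"

    def __init__(self, chunk_size: int = 1000) -> None:
        self.chunk_size = chunk_size

    def mean(self, values: Sequence[float]) -> float:
        total, count = 0.0, 0
        for start in range(0, len(values), self.chunk_size):
            chunk = values[start:start + self.chunk_size]
            total += sum(chunk)   # in a real cluster, each chunk would run remotely
            count += len(chunk)
        return total / count


def choose_backend(n_rows: int, threshold: int = 10_000) -> Backend:
    """Toy 'optimizer': pick an execution strategy from the data size (assumed rule)."""
    return LocalBackend() if n_rows <= threshold else ChunkedBackend()


def fit_line(xs: List[float], ys: List[float], backend: Backend) -> Tuple[float, float]:
    """Least-squares slope and intercept, written once against Backend."""
    mean_x = backend.mean(xs)
    mean_y = backend.mean(ys)
    cov = backend.mean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])
    var = backend.mean([(x - mean_x) ** 2 for x in xs])
    slope = cov / var
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    backend = choose_backend(len(xs))
    print(backend.name, fit_line(xs, ys, backend))
```

The `fit_line` function never changes. Only the backend chosen for it does, which is the separation between algorithm and engine that the quote points to.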

The technology has been on GitHub since August. Since then, contributors have added more than 320 patches including APIs, data ingestion, optimizations, language and runtime operators, additional algorithms, testing, and documentation.

Overall, SystemML is the latest in a growing set of tools that allow organizations to build machine-learning systems.

Google recently open-sourced its TensorFlow machine learning software, and Facebook donated artificial intelligence and machine learning tools to the existing Torch open-source project. Meanwhile, Microsoft released its machine learning toolkit DMTK to the open source community.

Those toolkits are more narrowly focused on neural networks, according to Thomas, while SystemML is general purpose.

“The reason we’ve done this in conjunction with Spark is that’s a huge community. The minute we open source this, there were thousands of users already coming into this via the Spark community,” he said.

The incubator status means the Apache Software Foundation will provide support for the project including infrastructure and financial support, Thomas said.

IBM has set up the Spark Technology Center in San Francisco, which it announced earlier, and has been “hiring like crazy,” Thomas said. Its engineers have made more than 90 contributions to the Apache Spark project since then, according to the company.

IBM is a sponsor of The New Stack.

Feature Image via Pixabay.
