Apache Ignite can train machine learning models efficiently because it eliminates extract, transform and load (ETL) steps and minimizes scalability issues.
Ignite removes some of the usual pain points, namely having to move data to a separate system (the ETL step) and handling data sets that are too big to run on one machine, according to Denis Magda, vice president of product management at GridGain and Ignite Project Management Committee chair, speaking at the recent ApacheCon North America in Las Vegas.
The project has created algorithms for the most common use cases — linear regression, decision tree classification, nearest neighbors, K-Means — designed specifically to be run on its distributed system.
Speed and Scale
GridGain open sourced the technology in late 2014; it was accepted into the Apache Incubator program that same year. GridGain provides a commercial offering of the open source technology.
Ignite emerged as an in-memory computing platform, caching data in memory and then providing APIs for transactions to enable highly scalable distributed systems, Magda explained. A number of vendors now provide that capability.
In effect, Ignite is a hybrid transaction/analytical processing (HTAP) database, touting that it provides in-memory speeds at the petabyte scale. It supports transactions and analytics as well as streaming data. Because it runs code on its host memory-centric distributed database, it can train, deploy, and update a machine learning model without having to move the data to another system, such as Apache Mahout or Apache Spark.
Spark, for example, has to pull data from other data sources.
“For developers, one of the primary bottlenecks in distributed systems is the network. If you can eliminate moving data over the network, you can achieve huge performance benefits,” he said.
Smaller data sets can be stored solely in memory, and data sets that are too large can be stored on disk, using memory as a caching layer. “Even if everything fits in memory, doing so might not be affordable. Keeping everything in RAM might not be reasonable,” Magda said.
“As a developer, you don’t have to know how all the data is distributed, how much data is in RAM. If Ignite sees something is missing in RAM, it will go to disk and pull it.”
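The behavior Magda describes, where a read that misses in RAM falls through to disk, can be illustrated with a toy two-tier store. This is a minimal sketch of the general idea and not Ignite's storage engine: the `TieredStore` class, its least-recently-used eviction policy, and the dictionary standing in for disk are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a bounded in-memory LRU cache in front of a 'disk' dict."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()   # hot records, most recently used last
        self.disk = {}                # stands in for the full data set on disk
        self.disk_reads = 0

    def put(self, key, value):
        self.disk[key] = value        # the disk tier always holds every record
        self._cache(key, value)

    def get(self, key):
        if key in self.memory:        # served from RAM
            self.memory.move_to_end(key)
            return self.memory[key]
        value = self.disk[key]        # missing in RAM: go to disk and pull it
        self.disk_reads += 1
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)   # evict the least recently used record


store = TieredStore(memory_capacity=2)
for i in range(5):
    store.put(i, f"record-{i}")

first = store.get(0)    # record 0 was evicted earlier, so this read goes to disk
second = store.get(0)   # now cached, so this read stays in memory
```

The caller uses the same `get` call in both cases and never needs to know which tier answered, which is the point of the quote above.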
The model can then be stored, with Ignite supporting an update interface enabling retraining of the model as new data comes in, providing near-real-time updates.
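To make the idea of retraining as new data arrives concrete, here is a small sketch of a model that can be updated batch by batch without revisiting old data. It is not Ignite's update interface, which the talk did not detail; the `OnlineLinearRegression` class and its method names are hypothetical, and the technique shown is a plain least-squares fit kept current from running sums.

```python
class OnlineLinearRegression:
    """Simple y = slope * x + intercept model kept up to date from running sums."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, batch):
        """Fold a new batch of (x, y) pairs into the model without revisiting old data."""
        for x, y in batch:
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y

    def coefficients(self):
        """Closed-form least-squares slope and intercept from the running sums."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept

    def predict(self, x):
        slope, intercept = self.coefficients()
        return slope * x + intercept


model = OnlineLinearRegression()
model.update([(0.0, 1.0), (1.0, 3.0)])   # initial training data
model.update([(2.0, 5.0), (3.0, 7.0)])   # new data arrives later
slope, intercept = model.coefficients()
```

Because the model keeps only a handful of sums, each update costs time proportional to the new batch rather than to the full history.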
Ignite fully supports SQL, making it compatible with other Big Data tools.
Partitioning of data is foundational to Ignite. Data is stored in key-value records.
To add a record, you assign a primary key and Ignite assigns the record to a partition. Each cache is split into multiple partitions, and each partition belongs to a specific cluster node.
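The key-to-partition-to-node chain can be sketched in a few lines. This is an illustration under stated assumptions rather than Ignite's affinity logic: the partition count of 8, the node names, the CRC32 hash, and the round-robin ownership table are all chosen for readability, and `PartitionedCache` is a hypothetical class.

```python
import zlib

PARTITIONS = 8                          # illustrative value only
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def partition_for(key):
    """Map a primary key to a partition with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % PARTITIONS


# Every partition has exactly one owning node (round-robin here for simplicity).
partition_owner = {p: NODES[p % len(NODES)] for p in range(PARTITIONS)}


def node_for(key):
    """Resolve a key to the node that owns its partition."""
    return partition_owner[partition_for(key)]


class PartitionedCache:
    """Toy cache whose key-value records are spread across per-node dictionaries."""

    def __init__(self):
        self.storage = {node: {} for node in NODES}

    def put(self, key, value):
        self.storage[node_for(key)][key] = value

    def get(self, key):
        return self.storage[node_for(key)].get(key)


cache = PartitionedCache()
for i in range(100):
    cache.put(f"user-{i}", {"id": i})
```

Any client that knows the hash and the ownership table can compute where a key lives without asking a coordinator, which fits a design where no node plays a special role.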
Ignite does not use master and slave nodes. All nodes are equal.
When you add a node, partitions might be redistributed to different nodes. For application developers, all of this happens in the background, Magda said.
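One general technique for reassigning partitions when a node joins is rendezvous (highest-random-weight) hashing, sketched below. The talk did not say which scheme Ignite uses, so treat this purely as an illustration; the 64 partitions, the node names, and the `score` and `assign` helpers are assumptions.

```python
import hashlib


def score(partition, node):
    """Deterministic pseudo-random weight for a (partition, node) pair."""
    digest = hashlib.sha256(f"{partition}:{node}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(partitions, nodes):
    """Each partition goes to the node with the highest score for it."""
    return {p: max(nodes, key=lambda n: score(p, n)) for p in range(partitions)}


before = assign(64, ["node-a", "node-b", "node-c"])
after = assign(64, ["node-a", "node-b", "node-c", "node-d"])

# Partitions whose owner changed once node-d joined the cluster.
moved = [p for p in before if before[p] != after[p]]
```

With this scheme a partition changes owner only if the new node outscores its current owner, so every entry in `moved` lands on `node-d` and the remaining partitions stay where they were.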
You can run SQL and other operations, and Ignite directs all the queries to the right nodes.
Ignite provides built-in fault-tolerance. It assigns extra copies of data on different nodes for redundancy. It also stores training context on the node along with the data, so in case of failure, you can restart training where it left off.
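Restarting training where it left off depends on saving the training context as the job progresses. The sketch below shows the idea with a one-parameter gradient descent and a checkpoint dictionary; it is not Ignite's mechanism, and the `train` function, its `fail_at` parameter, and the checkpoint layout are hypothetical.

```python
def train(data, steps, lr=0.1, checkpoint=None, fail_at=None):
    """Fit y = w * x by gradient descent, saving a checkpoint after every step."""
    if checkpoint is None:
        checkpoint = {"step": 0, "w": 0.0}
    w = checkpoint["w"]
    for step in range(checkpoint["step"], steps):
        if step == fail_at:
            raise RuntimeError("simulated node failure")
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
        # Training context: enough state to resume from the next step.
        checkpoint["step"] = step + 1
        checkpoint["w"] = w
    return w


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
uninterrupted = train(data, steps=20)

saved = {"step": 0, "w": 0.0}
try:
    train(data, steps=20, checkpoint=saved, fail_at=7)   # fails before step 7 runs
except RuntimeError:
    pass
resumed = train(data, steps=20, checkpoint=saved)        # continues from step 7
```

The resumed run repeats none of the completed steps and ends at the same weight as the run that never failed.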
From your laptop, it might look as if you’re working with a single node, but the cluster can have thousands, he said.
Machine learning models are embedded in the storage, so when you start the training, it knows how the data is distributed. The computation takes place on the server nodes with minimal data movement.
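Computing where the data lives can be illustrated with one K-Means update step. In this sketch each node summarizes its own points and only small per-centroid totals travel to the coordinator. It is a simplified one-dimensional illustration, not Ignite's implementation; the `local_stats` and `kmeans_step` functions and the node names are assumptions.

```python
def local_stats(points, centroids):
    """Runs 'on a node': assign local points to the nearest centroid and
    return only per-centroid [sum, count] pairs, not the points themselves."""
    stats = [[0.0, 0] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        stats[nearest][0] += p
        stats[nearest][1] += 1
    return stats


def kmeans_step(nodes, centroids):
    """Runs 'on the coordinator': merge the small per-node summaries."""
    merged = [[0.0, 0] for _ in centroids]
    for points in nodes.values():
        for i, (total, count) in enumerate(local_stats(points, centroids)):
            merged[i][0] += total
            merged[i][1] += count
    # Keep the old centroid if no point was assigned to it.
    return [total / count if count else old
            for (total, count), old in zip(merged, centroids)]


nodes = {
    "node-a": [1.0, 2.0, 9.0],
    "node-b": [3.0, 10.0, 11.0],
}
centroids = kmeans_step(nodes, centroids=[0.0, 8.0])
```

However many points a node holds, it sends back just one sum and one count per centroid, so the traffic between nodes stays small as the data grows.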
Ignite is written in Java and supports .NET, C++ and Scala, with R and Python in the works. The machine learning algorithms are designed specifically to take advantage of cluster computing.
The project has intentionally stepped back from deep learning, opting to provide algorithms for the most common use cases, such as fraud detection, predictions and recommendations, Magda said.
For deep learning, it’s focusing on integration with TensorFlow. Together, the two “provide a full toolset needed to work with operational and historical data, to perform data analysis and to build complex mathematical models based on neural networks,” according to the documentation.
The integration, called Ignite Dataset, enables Ignite to serve as a data source for TensorFlow for neural network training, inference and other computations. It gives TensorFlow fast access to Ignite’s distributed database and eliminates preprocessing by providing objects of any structure. Ignite also supports SSL, Windows and distributed training.
Still to come for the project are the ability to import models from Spark and XGBoost and a full Python API for the machine learning features, Magda said.