Spark Closes in on Real-Time Processing with Redis Pairing
Redis Labs has released a connector that lets the Spark data processing platform use the Redis in-memory data store.
Using Redis for Spark will allow users to “store a huge amount of data without paying a significant amount of money for infrastructure,” explained Yiftach Shoolman, co-founder and Chief Technology Officer of Redis Labs, noting that Redis can be a lower-cost alternative to a full-fledged in-memory database system. “Today we want the big data performance to be as close to real-time as possible. That is what we try to do.”
Specifically, the open source Spark-Redis connector package provides an easy way to run SparkSQL queries against data stored on Redis.
Running Spark against a Redis data store can make processing 135 times faster than using HDFS (Hadoop Distributed File System), and 45 times faster than using the Tachyon in-memory data store, according to benchmarks from Redis Labs.
Redis Labs is eager to make Redis the de facto data store for Spark, Shoolman asserted.
The package is a library for writing to and reading from a Redis cluster. It exposes all of Redis’ data structures (string, hash, list, set, sorted set, bitmaps, hyperloglogs) as Spark RDDs (Resilient Distributed Datasets) or through the Spark Dataset API.
The library minimizes the overhead that occurs with serialization and deserialization of large amounts of data.
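To make the idea of exposing Redis data structures as record collections more concrete, here is a minimal sketch in plain Python that needs neither Spark nor Redis. It is not the connector's code: the `store` dictionary, the `flatten` function and the flattening rules themselves are assumptions chosen for illustration, and bitmaps and hyperloglogs are left out for brevity.

```python
# Illustrative stand-in only: a dict plays the role of a Redis keyspace,
# and each value is flattened into a list of records, roughly the shape
# an RDD of elements or pairs would take. The rules are assumptions.

def flatten(kind, value):
    """Turn one Redis-style value into a list of records."""
    if kind == "string":
        return [value]                      # a single value
    if kind == "hash":
        return sorted(value.items())        # (field, value) pairs
    if kind == "list":
        return list(value)                  # elements, order preserved
    if kind == "set":
        return sorted(value)                # members (sorted for stable output)
    if kind == "zset":
        # (member, score) pairs ordered by score, then by member
        return sorted(value.items(), key=lambda kv: (kv[1], kv[0]))
    raise ValueError(f"unsupported kind: {kind}")


# Hypothetical keyspace: key -> (structure kind, value)
store = {
    "user:1:name": ("string", "ada"),
    "user:1": ("hash", {"name": "ada", "lang": "scala"}),
    "queue": ("list", ["job-a", "job-b"]),
    "tags": ("set", {"spark", "redis"}),
    "scores": ("zset", {"ada": 2.0, "bob": 1.0}),
}

records = {key: flatten(kind, value) for key, (kind, value) in store.items()}
```

In this sketch a hash becomes a collection of field/value pairs and a sorted set becomes member/score pairs ordered by score, which is the kind of tabular shape a processing engine can then partition and query.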
Spark itself has emerged as the chief successor to the Hadoop data processing platform thanks in no small part to an ability to process data in near-real time, rather than the batch processing of ‘big data’ that Hadoop originally offered.
“Apache Spark is becoming a default in-memory engine for high-performance data integration and analytics,” said Matt Aslett, research director, data platforms and analytics at 451 Research, in a statement. “The combination of Redis and Spark should enable high-performance, real-time analytics with extremely large and variable datasets.”