Hazelcast Launches an Open Source In-Memory Stream Processing Engine
This new Apache 2-licensed open source project is designed to enable processing in near real time for data-intensive applications such as smart home sensors, in-store e-commerce systems, social media platforms, log analysis, monitoring and fraud detection.
The company has also released version 3.8 of Hazelcast IMDG, which includes advanced capabilities for managing persistence and multi-data center deployments. Hazelcast IMDG runs in hundreds of thousands of installed clusters and records over 17 million server starts per month, according to the company.
Big Data processing with Hadoop and Spark has been complicated, according to Hazelcast CEO Greg Luck, reminiscent of the old mainframe days, when a dedicated staff would load and run your jobs on expensive dedicated hardware and then hand you the results.
“We think we’re part of a big shift in taking Big Data back to a stream-processing problem and just running it within the application that you’ve written — embedding it within the application rather than needing to run on separate infrastructure. We think that’s a profound shift,” he said.
“It makes the choice of Big Data platform and how to solve that problem an application concern within the control of developers and architects. Not even Ops needs to necessarily be involved in Hazelcast. Ops only needs to be involved if it’s a separate stand-alone cluster.”
Jet ingests data at high velocity via socket, file, HDFS or Kafka interfaces, and processes the business logic or complex computation on incoming data.
“We do think we can bring something new to this space. We didn’t want to do this unless we could be faster than everybody else. … What it brings to a relatively crowded market at this stage, Hazelcast is famously simple and easy to get started with,” he said.
Though Hazelcast IMDG supports six languages, Jet was launched with a focus on developers using Java 8. It plans to support other languages later on.
The Java 8 Streams API (java.util.stream) was designed for use within a single virtual machine, though; with Jet, the same programming model is distributed.
“You can use that same API to express what you want to do, but it will execute on our Jet grid. And our largest compute grid at the moment is 600 servers, so it can scale very, very high. But still, you can program it with this very simple API, and if you’re a Java developer, probably one that you already know,” he said.
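The API Luck refers to is the standard Java 8 Streams API. As an illustration of the programming model, here is an ordinary single-JVM word count written with plain java.util.stream; Jet's pitch is that code in this familiar style can execute on a distributed grid instead (this sketch uses only the standard library, not Jet's own classes):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    // Count word occurrences with the Java 8 Streams API.
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```

On a single machine this pipeline runs inside one JVM; the claim in the article is that Jet lets a developer express the same filter/map/collect logic and have it execute across the cluster.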
It’s built for speed and low latency, using a one-record-at-a-time architecture that processes incoming records as soon as possible, as opposed to accumulating records into micro-batches as Spark does. And though this approach sounds similar to Apache Flink, it doesn’t borrow anything from Flink, Spark or Hadoop, according to Luck.
Like other Big Data frameworks, Hazelcast Jet uses the Directed Acyclic Graph (DAG) abstraction to model computations, but takes some novel approaches to boost speed at low latency, including data locality; partition mapping affinity; single-producer, single-consumer (SP/SC) queues; and green threads.
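The DAG model these frameworks share can be sketched generically: vertices are processing steps, directed edges carry data between them, and the absence of cycles is what lets an engine schedule the steps in a valid order. The class and method names below are illustrative, not Jet's actual API:

```java
import java.util.*;

// Illustrative sketch of the DAG computation model (not Jet's API).
public class Dag {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    public Dag vertex(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    public Dag edge(String from, String to) {
        vertex(from);
        vertex(to);
        edges.get(from).add(to);
        return this;
    }

    // Kahn's algorithm: a topological order exists iff the graph is acyclic,
    // which is what allows an engine to schedule vertices without deadlock.
    public List<String> topologicalOrder() {
        Map<String, Integer> indegree = new HashMap<>();
        edges.keySet().forEach(v -> indegree.put(v, 0));
        edges.values().forEach(ts -> ts.forEach(t -> indegree.merge(t, 1, Integer::sum)));
        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((v, d) -> { if (d == 0) ready.add(v); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            order.add(v);
            for (String t : edges.get(v)) {
                if (indegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
            }
        }
        if (order.size() != edges.size()) throw new IllegalStateException("cycle detected");
        return order;
    }
}
```

A word-count job, for example, would be the chain read → tokenize → count, and the scheduler would run those vertices in that topological order.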
He explained Jet’s architectural approach this way:
- It keeps computation and data storage in memory by combining Hazelcast Jet with the Hazelcast data grid on the same servers. Depending on the use case, some or all of the data that Jet will process will be already in RAM on the same machine as the computation.
- Jet allows you to define an arbitrary object-to-partition mapping scheme on each edge. This allows parallel reads, with many threads per partition on each member, and thus per server. Jet can use this to harmonize and optimize throughput from other distributed data systems, whether HDFS, Spark or Hazelcast. When performing DAG processing, local edges can therefore be read and written locally, without incurring a network call and without waiting.
- Local edges are implemented with the most efficient kind of concurrent queue: the SP/SC bounded queue. It employs wait-free algorithms on both sides and avoids volatile writes by using the lazySet method.
- Vertices are implemented by one or more instances of “Processor” on each member. Each vertex can specify how many of its processors will run per cluster member using the “localParallelism” property, so it can use all the cores even in the largest machines. With many cores and execution threads, the key to Hazelcast Jet performance is coordinating them smoothly with cooperative multithreading, which has a much lower context-switching cost and precise knowledge of the status of a processor’s input and output buffers, which determines its ability to make progress.
- And Hazelcast Jet uses “green threads,” in which cooperative processors run in a loop serviced by the same native thread, to allow very high throughput.
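The SP/SC queue technique described above can be sketched in a few lines. This is an illustrative bounded single-producer/single-consumer queue in the spirit the article describes, not Jet's actual implementation: exactly one thread calls offer() and exactly one calls poll(), and each side publishes its progress with lazySet, an ordered store that avoids the full memory fence of a volatile write:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal SP/SC bounded queue sketch (illustrative, not Jet's code).
// Safe only when a single producer thread calls offer() and a single
// consumer thread calls poll().
public class SpscQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to consume
    private final AtomicLong tail = new AtomicLong(); // next slot to produce

    public SpscQueue(int capacityPowerOfTwo) {
        buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(E e) {                // producer side only
        long t = tail.get();
        if (t - head.get() == buffer.length()) {
            return false;                      // queue is full
        }
        buffer.lazySet((int) t & mask, e);     // ordered store of the element
        tail.lazySet(t + 1);                   // publish without a volatile-write fence
        return true;
    }

    public E poll() {                          // consumer side only
        long h = head.get();
        if (h == tail.get()) {
            return null;                       // queue is empty
        }
        int idx = (int) h & mask;
        E e = buffer.get(idx);
        buffer.lazySet(idx, null);             // allow GC of the consumed element
        head.lazySet(h + 1);
        return e;
    }
}
```

Because each counter has exactly one writer, neither side ever blocks or retries against the other, which is what makes both operations wait-free.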
In its internal Word Count benchmarks, reading initial words from the file system, Jet was 20 times faster than Hadoop, 2 to 3 times faster than Spark and 50 percent faster than Flink, Luck said. It should be noted, however, that Luck, prior to Jet’s release, got into a row with rival GridGain last year over benchmarking.
Launched in Turkey, Hazelcast is now headquartered in Palo Alto, California, and maintains offices in Istanbul and London. Its data grid customers include American Express, AT&T, General Dynamics, Ericsson and Domino’s Pizza.
In November, it announced a partnership with Striim, which offers a real-time data integration and streaming analytics platform, to launch the Hazelcast Striim Hot Cache to ensure continuous real-time synchronization between the cache and its underlying database.
It announced in January that the grid is now available as a tile on the Pivotal Cloud Foundry network.