Apache Geode Spawns ‘All Sorts of In-Memory Things’
Geode is a distributed, in-memory compute and data-management platform that elastically scales to provide high throughput and low latency for big data applications. It pools memory, CPU, and network resources — with the option to also use local disk storage — across multiple processes to manage application objects and behavior.
Using dynamic replication and data-partitioning techniques, it offers high availability, improved performance, scalability, and fault tolerance.
Geode is “a place where all sorts of in-memory things get developed and depending on how you use the building blocks that Geode is producing, you get various types of products,” said Roman Shaposhnik, director of open source at Pivotal and vice president of technology for the ODPi (Open Data Platform initiative) in an interview.
It can be used to build a database, a data grid, and more. It is being used for transaction processing, event notification and processing, and highly available distributed caching. For instance:
- DataTorrent, the company behind the streaming project Apache Apex, is using Geode together with Apex as an in-memory data exchange layer for big data projects.
- Canadian Imperial Bank of Commerce uses Geode as a data and compute grid to assess financial risk across multiple trading systems.
- Murex, a French vendor of IT risk management software for financial services, uses Geode and Storm for near-real-time aggregations and notifications. It uses Geode for immutable events and event logging, but also for continuous queries: distributed, scalable sets of in-memory listeners that identify matching data and push event notifications to clients.
- And Southwest Airlines is using it to improve real-time decision-making in its operations, cargo, and crew-scheduling systems, as well as on its website.
Geode's core abstractions are the cache and the region. The cache is a unit of data management, essentially a collection of regions.
A region is “a key/value store on steroids,” according to Lamba. Regions are distributed and highly concurrent. Data can be stored in memory, but also pushed to disk. A region can be replicated, where each member in the cluster gets the same data, or partitioned, where the data is spread across a number of nodes, which is the more scalable option.
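The replicated-versus-partitioned distinction can be pictured with a toy model. This is an illustrative sketch only, not Geode's API (Geode's real client API is Java); all names here are invented. It shows the core idea of a partitioned region: each key is routed to one node by hashing.

```python
# Toy model of a partitioned region: keys are spread across nodes by hash.
# Invented names; this is NOT the Apache Geode API.

class ToyPartitionedRegion:
    """A key/value region whose entries are split across several nodes."""

    def __init__(self, node_count):
        # Each "node" is just a local dict in this sketch.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Route each key to exactly one node by hashing it.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

region = ToyPartitionedRegion(node_count=3)
region.put("flight:SW123", {"status": "on-time"})
print(region.get("flight:SW123"))
```

A replicated region, by contrast, would write the same entry to every node, trading memory and write cost for local reads on any member.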
Because Geode prioritizes consistency, if you keep two redundant copies of the data, it ensures the primary and both redundant copies are consistent before returning an acknowledgment to the user that the put, for example, has completed.
For a replicated region, you can choose to send updates to other members asynchronously; that is not allowed for partitioned regions.
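The acknowledgment behavior described above can be sketched in a few lines. This is a conceptual toy, not Geode's implementation: the point is simply that the caller's put does not return until every redundant copy holds the same value.

```python
# Sketch of a consistency-first put: write the primary and every redundant
# copy synchronously, and acknowledge only once all copies agree.
# Toy code, not Geode's actual replication protocol.

def synchronous_put(primary, redundant_copies, key, value):
    primary[key] = value
    for copy in redundant_copies:
        copy[key] = value  # synchronous write to each redundant copy
    # Verify all copies match the primary before acknowledging the caller.
    assert all(copy[key] == primary[key] for copy in redundant_copies)
    return "ack"

primary, copy1, copy2 = {}, {}, {}
print(synchronous_put(primary, [copy1, copy2], "trade:42", 1000))
```

An asynchronous send, as allowed for replicated regions, would acknowledge before the copies are written, accepting a brief window of inconsistency in exchange for lower latency.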
A member is a role: a client or a server. It is a layer where you can store data, and it is embeddable in the application; it can also be defined as a client cache. There are different configurations and topologies you can set it up in, Lamba explained.
For data that is looked up frequently, you can create a local copy of that data, a client cache, within your application that is synchronized with the main copy.
Geode also has a discovery service called the locator, which helps Geode servers find each other and helps Geode clients find the servers.
It is a peer-to-peer system: if the locator goes down, the servers still know about one another and can resolve issues among themselves. You can add servers to scale, and they discover each other automatically.
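The locator's role can be reduced to a simple registry sketch. This is a toy illustration of the discovery pattern, not Geode's wire protocol or API: servers register themselves, and clients ask the locator for the current server list.

```python
# Toy discovery registry in the spirit of a locator. Servers register,
# clients ask where the servers are. Not Geode's actual protocol.

class ToyLocator:
    def __init__(self):
        self.servers = set()

    def register(self, address):
        # A server announces itself on startup.
        self.servers.add(address)

    def discover(self):
        # A client asks for the current membership view.
        return sorted(self.servers)

locator = ToyLocator()
locator.register("server-a:40404")
locator.register("server-b:40404")
print(locator.discover())
```

In Geode itself, once clients and servers have that membership view, they talk to each other directly, which is why the cluster keeps running if the locator is down.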
Geode can be used anytime you’re having trouble scaling your relational database to meet your application needs, according to Bawaskar.
“We’ve also seen it used to implement the CQRS (Command Query Responsibility Segregation) pattern. The idea is the ability to scale your reads and writes independently. Typically you would use a system like Kafka to deal with all the events coming into your system; then, because Geode has query semantics as well as continuous query functionality, you use Geode in the CQRS pattern.
“Also for applications that need to be deployed across geographies in an active-active manner, Geode can do that, too,” he said, such as a trading firm with operations in London, Tokyo and New York that wants to use its local cluster, but also replicate that data to the other sites to reconcile and do business processing.
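The continuous-query functionality mentioned above can be sketched as a predicate registered against a region, with a callback that fires whenever a newly written entry matches. This is a toy model under invented names; Geode's real continuous query API lives in its Java client.

```python
# Toy continuous query: clients register a predicate and a callback, and
# every matching put pushes a notification. Not the Apache Geode API.

class ToyRegion:
    def __init__(self):
        self.data = {}
        self.queries = []  # registered (predicate, callback) pairs

    def register_cq(self, predicate, callback):
        self.queries.append((predicate, callback))

    def put(self, key, value):
        self.data[key] = value
        for predicate, callback in self.queries:
            if predicate(value):
                callback(key, value)  # push the event to the client

notified = []
region = ToyRegion()
# A hypothetical risk rule: notify on any trade over one million.
region.register_cq(lambda trade: trade["amount"] > 1_000_000,
                   lambda key, trade: notified.append(key))
region.put("t1", {"amount": 500})        # no match, no notification
region.put("t2", {"amount": 2_000_000})  # matches, client is notified
print(notified)
```

This push model is what lets the query side of a CQRS setup react to events as they arrive, rather than polling.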
He pointed to the CAP theorem, which states that in a partitioned system you must choose between consistency and availability, because you cannot give up partition tolerance in the cloud. Geode lets users of partitioned systems choose whether to place greater emphasis on consistency or on availability.
The Geode codebase was originally developed under the name GemFire by GemStone Systems in 2002; GemStone was later acquired by VMware. The GemFire technology went to Pivotal when that company was spun out in 2013. Pivotal submitted it to the Apache Incubator in April 2015 under the project name Geode, and it became a top-level project last month.
It originally captured interest as the transactional data engine used in Wall Street trading platforms. Now more than 600 enterprises use the technology for high-scale business applications that require low latency and 24×7 availability.
Pivotal announced plans in 2015 to open source its entire big data suite, including GemFire, the analytical massively parallel processing (MPP) data warehouse Greenplum Database, and HAWQ, a SQL-on-Hadoop analytic engine.
Geode and Apache Ignite started out from a similar purpose, according to Shaposhnik, but he sees Ignite moving more toward a SAP HANA replacement.
Pivotal is still developing its commercial GemFire offerings, focusing on Spring integration that involves decoupling the application logic from the database logic, making it easier to switch databases in microservices architectures, according to Shaposhnik.
Since the project’s inception, the bulk of the work has been around refactoring the core of Geode to run on the latest version of JGroups; WAN replication and continuous query functionality; and native client functionality.
“I would love to see the project getting more diverse just by getting more companies participating,” Shaposhnik said. “We have different companies, but let’s be honest, it’s still dominated by Pivotal by a number of contributors.”
He said he believes the governance model will allow the project to achieve the level of diversity of Spark or Hadoop, in which 30 percent or fewer of the contributors come from any one company.
Integrations with Spring and Spark, as well as a Redis-to-Geode adapter, are in the works.