Splice Machine Hybrid SQL System Fuses Transactions and Analytics
Despite all the talk about big data and analytics, companies still lag in their ability to make decisions based on real-time data. That’s the space that Splice Machine is focused on. The startup has a hybrid database system aimed toward building applications involving transactional operations as well as real-time analytics, according to Monte Zweben, co-founder and CEO of Splice Machine.
Now the company is looking for a few good testers for its hybrid SQL database. The company just released its 1.5 version in October with enterprise features including ETL acceleration and BI tool compatibility. Its 2.0 version, now in beta, adds Spark.
Spark provides in-memory speed and performance as well as isolated resources, separating the CPUs and memory of OLAP (online analytical processing) and OLTP (online transaction processing) jobs with no interference between them, Zweben said.
Splice Machine has based its supporting stack on HBase, Spark and Apache Derby, a lightweight (<3 MB), Java-based ANSI SQL database that can be embedded into the HBase/Hadoop stack.
HBase provides proven auto-sharding, replication, and failover technology while Spark’s resilience to node failures is a plus. While other in-memory technologies drop all queries associated with a failed node, Spark uses the duplicative nodes to regenerate its in-memory datasets.
As Zweben explained it, when a statement is entered into the system, it’s evaluated based on the size of the result set it’s expected to return. If the result sets are going to be relatively large, it’s sent to Spark. If it’s a transactional action, like a short read of a record, it goes to the HBase computational engine.
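The routing rule Zweben describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Splice Machine's code: the `Statement` class, the `route` function, the `estimated_rows` field and the 10,000-row cutoff are all assumptions made for the example, since the article gives no actual threshold or API.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the article does not say where the real line is drawn.
ROW_THRESHOLD = 10_000


@dataclass
class Statement:
    sql: str
    estimated_rows: int  # assumed to come from a query planner's estimate


def route(stmt: Statement, threshold: int = ROW_THRESHOLD) -> str:
    """Return the name of the engine that should run the statement."""
    if stmt.estimated_rows > threshold:
        return "spark"  # large expected result set: analytical path
    return "hbase"      # short read or write: transactional path


# A single-record lookup versus a scan expected to return millions of rows.
lookup = Statement("SELECT * FROM orders WHERE id = 42", estimated_rows=1)
scan = Statement("SELECT * FROM orders WHERE total > 100", estimated_rows=5_000_000)

print(route(lookup), route(scan))
```

The point of the sketch is that the decision is made per statement, before execution, from an estimate rather than from the actual result.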
“Our system is deployed on a set of distributed nodes, with Spark and HBase running on the same servers, that can quickly interact with each other. So each of the statements is executed on each individual node in parallel, which is where you get speed, then we splice the results back together again from each of the parallel computations, hence the name of the company,” he said.
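The "splice" step Zweben describes is a scatter-gather pattern: run the same computation on each node's slice of the data, then combine the partial results. A minimal sketch, assuming three in-process lists stand in for the data on three nodes and a thread pool stands in for the cluster (the names `partitions`, `partial_sum` and `spliced_average` are invented for this example):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows held by one node.
partitions = [
    [12.0, 7.5, 3.0],
    [20.0, 1.5],
    [4.0, 4.0, 4.0, 4.0],
]


def partial_sum(rows):
    """Work done independently on one node: a local sum and a local row count."""
    return sum(rows), len(rows)


def spliced_average(parts):
    """Run partial_sum on every partition in parallel, then combine the pieces."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(partial_sum, parts))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


print(spliced_average(partitions))
```

Each node returns a sum and a count rather than its own average, because averages of unequal partitions cannot be combined correctly by averaging them again.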
It’s particularly well-suited for digital marketing, ETL acceleration, operational data lakes, data warehouse offloads, IoT applications, Web, mobile, and social applications, and operational applications, according to the company.
Zweben, Eugene Davis and John Leach founded Splice Machine in 2012. Collectively, they have backgrounds as serial entrepreneurs and have worked at NASA.
Earlier this month, Splice Machine was named one of three startups selected for the Wells Fargo Startup Accelerator program, a six-month, hands-on accelerator program that potentially includes a buyout.
Its initial accomplishment was offering ACID transactions on top of Hadoop, combining the scale-out advantages of NoSQL with the traditional benefits of a relational database management system, he said.
It did that by building a transaction engine on top of HBase, an MVCC [multiversion concurrency control] system that provides snapshot isolation, building on the work of Google and Yahoo, so that readers of the database are not locked out during updates, Zweben said.
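To see why MVCC keeps readers from being locked out, consider a toy in-memory store in which every write appends a new version instead of overwriting the old one. This is a simplified sketch of the general technique, not Splice Machine's engine: the `MVCCStore` class and its methods are invented for illustration, and it leaves out write-write conflict detection, rollback and garbage collection of old versions.

```python
class MVCCStore:
    """Toy multiversion store: every write adds a version, nothing is overwritten."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical timestamp of the latest commit

    def begin(self):
        """Start a read snapshot: the reader sees commits up to this timestamp."""
        return self._clock

    def write(self, key, value):
        """Commit a new version of key and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts, or None."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
snap = store.begin()          # a long-running reader starts here
store.write("balance", 250)   # a later update does not disturb that reader

print(store.read("balance", snap), store.read("balance", store.begin()))
```

The reader holding `snap` keeps seeing the value that was current when it started, while a reader that begins after the update sees the new one; neither has to wait for the writer.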
“We’re the only ones that provide ACID semantics to transactional and analytical workloads like this,” according to Zweben.
Splice Machine is just one of many new NoSQL and NewSQL solutions aimed at addressing applications’ need for real-time data transactions and analysis from high volumes of simultaneous users accessing data around the globe.
In a New Stack podcast, Antony Falco, CEO and co-founder of Orchestrate, a database-as-a-service (DBaaS) provider out of Portland that was acquired by CenturyLink Cloud, said he believes there are 35 to 40 databases in production that didn’t exist 10 years ago.
For instance, there’s CockroachDB, a distributed SQL database built on top of a transactional and consistent key-value store designed for ACID transactions, horizontal scalability and survivability (hence the name).
The Riak KV NoSQL database runs on the Apache Mesos resource manager, an integration that allows for “push button” scaling up and down as Mesos aggregates resources for the Riak nodes.
Yet Zweben doesn’t see startups as Splice Machine’s primary competitors. Instead, he points to the entrenched vendors – Oracle Exadata and SAP HANA – who are tackling the notion that combining OLTP and OLAP on a single platform never works. They’ve been working toward combining them for years.
“The difference between rivals and us is that we’re a scale-out solution, meaning we spread our data across lots of inexpensive commodity servers, and they have big, beefy systems that are highly engineered at every level of the stack from CPUs down to the network, so it’s a radically different approach to the same problem,” Zweben said.
Mark Madsen of consultancy Third Nature remains skeptical. He points to previous less-than-successful attempts to combine OLAP and OLTP processing, such as DATAllegro, which was bought out by Microsoft.
“Any time a company says they do OLTP and OLAP on the same database, one should be skeptical of claims until they can demonstrate otherwise,” he said. “What I find is that most often they are doing basic transaction processing and scaling OK, but the query portion of the system is usually very weak.”
Splice Machine agrees that, in the past, databases that claimed to do OLTP and OLAP were just OLTP databases with no special support for OLAP, but maintains it is different because it uses a dual-engine architecture, with an HBase engine for OLTP and a Spark engine for OLAP.
Madsen finds it concerning that the company describes OLAP as being supported by Spark without specifying SparkSQL, which is still new and unproven. Splice Machine makes clear that it uses only the more proven Spark Core, not SparkSQL.
Madsen sees another red flag in that Splice Machine’s website says: “Spark has very efficient in-memory processing that can spill to disk (instead of dropping the query) if the query processing exceeds available memory.”
“The moment you hit the memory wall on Spark, you hit a performance wall,” Madsen says. “It wasn’t intended to work that way. To make the disk read-write path fast like this is to start down the path of conventional database design. Optimizing for disk and memory is a lot different than optimizing for memory only.”
Zweben sees Hadoop and Spark as game-changing technologies in the quest to offer OLAP and OLTP processing simultaneously.
“Spill-to-disk lets you compute to completion and it’s a very powerful feature,” he said. “If your data set spills to disk, it is slower than if it can execute in memory. We’ve done tests in the same version of 1.0 and seen a significant performance improvement.
“Spill-to-disk is a way to allow an in-memory database to complete its computation when there are insufficient memory resources, you certainly pay a small performance hit on that, but overall, the Spark engine is unparalleled in speed.”
If the query does not fit in memory, then you have three choices – add more memory (generally later), drop the query, or spill to disk.
“Spill-to-disk will be slower, but every customer we talked to prefers that to dropping the query,” according to Zweben.
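The trade-off both men are describing shows up in any spill-to-disk algorithm. The sketch below uses a classic external merge sort as a stand-in; it is not how Spark implements spilling. The function names and the `max_in_memory` parameter are assumptions for this example, and the final result is returned as an in-memory list only to keep the sketch short.

```python
import heapq
import tempfile


def _spill(sorted_run):
    """Write one sorted run to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{x}\n" for x in sorted_run)
    f.seek(0)
    return f


def sort_with_spill(values, max_in_memory):
    """Sort integers while buffering at most max_in_memory of them at a time."""
    runs, buffer = [], []
    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            runs.append(_spill(sorted(buffer)))  # memory budget hit: spill
            buffer = []
    buffer.sort()
    if not runs:
        return buffer  # everything fit in memory, so no disk I/O at all
    # Stream each spilled run back from disk and merge it with the remainder.
    streams = [(int(line) for line in f) for f in runs]
    try:
        return list(heapq.merge(buffer, *streams))
    finally:
        for f in runs:
            f.close()


print(sort_with_spill([5, 3, 9, 1, 7, 2], max_in_memory=2))
```

When the input fits within the budget, the function never touches disk; when it does not, the job still completes, but it pays for extra writes and reads, which is the cost Madsen warns about and the completion guarantee Zweben's customers prefer.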