Google Cloud Spanner: A Resilient Distributed SQL Database Service
Quizlet, a San Francisco-based startup that offers online flashcards and other educational tools, has found Google’s new Cloud Spanner to be the best available solution for scaling and managing its high-throughput database.
Its rapid growth and the seasonal nature of its business have made preparing for traffic spikes a headache; it previously managed those spikes through vertical sharding and added replicas.
Google launched Cloud Spanner earlier this month, billing it as “the first system to distribute data at global scale and support externally-consistent distributed transactions.” It describes it as offering “ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.”
It has stretched the boundaries of the CAP theorem to the limit, according to Dor Laor, CEO and founder of NoSQL database company ScyllaDB. That theorem states that a distributed system can simultaneously provide only two of three properties: consistency, availability and partition tolerance.
In a blog post, Google’s Eric Brewer, who formulated the CAP theorem, says that users experience Spanner as a CA system. Technically it is a CP system, but it can achieve the target of five-nines availability even for multi-region use.
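The tradeoff Brewer describes can be shown with a toy simulation. This is a hypothetical sketch of the CAP principle itself, not of Spanner’s design: during a network partition, a replicated store must either refuse writes (staying consistent but sacrificing availability) or accept them on both sides (staying available but letting replicas diverge).

```python
# Toy illustration of the CAP tradeoff; not how Spanner is implemented.
class Replica:
    def __init__(self):
        self.value = None

class ReplicatedStore:
    """Two replicas; 'mode' picks which property to sacrifice under partition."""
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.partitioned = False
        self.a, self.b = Replica(), Replica()

    def write(self, replica, value):
        if not self.partitioned:
            # Healthy network: replicate synchronously to both sides.
            self.a.value = self.b.value = value
            return True
        if self.mode == "CP":
            # Consistent under partition: refuse the write (lose availability).
            return False
        # Available under partition: accept locally (replicas may diverge).
        replica.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value

cp = ReplicatedStore("CP")
cp.partitioned = True
assert cp.write(cp.a, "x") is False   # unavailable, but still consistent
assert cp.consistent()

ap = ReplicatedStore("AP")
ap.partitioned = True
ap.write(ap.a, "x")
ap.write(ap.b, "y")                   # available, but replicas diverge
assert not ap.consistent()
```

Brewer’s point is that Spanner sits in the CP camp, yet its network and replication machinery make partitions so rare that in practice it looks like a CA system.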
“Cloud Spanner is the most compelling cloud service we’ve seen for scaling a high-throughput relational workload, MySQL in our case. It has some rough edges as a production system, but its latency and scalability are unique. We have not seen another database, relational or otherwise, that can scale as smoothly and rapidly as Cloud Spanner. And the fact that it is hosted eliminates an entire category of maintenance,” Quizlet wrote.
The new cloud service has its pros and cons, of course.
Laor notes it was developed specifically to take advantage of Google’s massively scaled cloud and uses GPS and atomic clocks to coordinate its distributed nodes. Its global time consistency has been one of its most talked-about features.
Spanner’s timekeeping technology, TrueTime, is unique in the market right now, and like the rest of the project it is built on Google’s previous work, points out Spencer Kimball, CEO of Cockroach Labs. He and two other ex-Googlers set out to build an open source version of Spanner, called CockroachDB, which he says matches most of Spanner’s capabilities, but not all of them yet, and not TrueTime.
“The big difference is that you can’t build software in open source the same way that Google built Spanner,” he said.
Google itself has not released the software as open source.
“The reason Google won’t be able to open source it is that it’s built by layering new technology on top of existing systems. Spanner is built on top of Colossus, a massive distributed file system. Colossus is built on top of “D,” which exports disks to the network so other services can use them. And all these things use Chubby, which is Google’s equivalent to ZooKeeper [a key-value store]… The architecture relies on all these other complex systems that are continually evolving. That model has a lot of benefits for Google, but obviously [building all those parts is] not practical if you want to bring something to market for open source users.”
Both Spanner and Cockroach tout the ability to stay online even if some systems fail and to balance resources between servers automatically. But they have a different failover story, according to Kimball, recounting a scenario from when he was at payment vendor Square.
Traditionally, when a card is swiped, an authorization is made, the person signs and perhaps adds a tip, then it goes as a “capture” to the data center. If something goes wrong at that data center, an error message tells the merchant to re-swipe the credit card. That starts a second authorization that goes to an alternate data center; without sufficient communication between the two, the customer ends up with two transactions on the card that have to be reconciled, which means a lot of moving parts, Kimball says.
In this case, Spanner and Cockroach would use three data centers. If one is unavailable when a transaction begins, at least one of the two others will automatically handle it without re-swiping the card.
“There’s no explicit failover step; there’s no error message sent to the merchant. The underlying database guarantees that at least one of the other data centers will know about it. The software developers don’t have to build special capabilities into the software to know that these things can go wrong. That’s one of the fundamental survivability characteristics of Spanner and CockroachDB,” he said.
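The behavior Kimball describes, where the database rather than the application masks a data-center outage, follows from majority (quorum) replication: a commit needs acknowledgment from only two of the three data centers, so losing one is invisible to the merchant’s software. A minimal sketch with hypothetical names, not the actual Spanner or CockroachDB replication protocol:

```python
# Toy majority-quorum commit across three data centers; a hypothetical
# sketch of the idea, not the real Spanner/CockroachDB implementation.

class DataCenter:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.log = []          # transactions this replica has accepted

    def accept(self, txn):
        if not self.up:
            raise ConnectionError(f"{self.name} unreachable")
        self.log.append(txn)

def commit(txn, replicas):
    """Commit succeeds if a majority of replicas acknowledge; there is no
    explicit failover step and no error surfaced for a single outage."""
    acks = 0
    for dc in replicas:
        try:
            dc.accept(txn)
            acks += 1
        except ConnectionError:
            pass               # an unreachable data center is simply skipped
    return acks >= len(replicas) // 2 + 1

dcs = [DataCenter("us-east"), DataCenter("us-west"), DataCenter("europe")]
dcs[0].up = False                          # one data center fails mid-swipe
assert commit("card-capture-123", dcs)     # the commit still succeeds
assert sum("card-capture-123" in dc.log for dc in dcs) >= 2
```

Because at least two replicas record every committed transaction, the surviving data centers can serve it after a failure without the application ever re-swiping the card.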
The two also “allow the SQL database to get arbitrarily large,” addressing the scalability problems of traditional SQL relational databases such as Oracle, MySQL and Postgres, he said.
“The first databases that scale out were really focused only on analytics. Few people even remotely addressed transactional workloads,” he said. “We think it’s a validation that [transactional] databases are going scale-out.”
Spanner’s strength is high availability that works across data centers “in a really nice way,” but it’s a feature others in the market will soon catch up with, said Monte Zweben, CEO of Splice Machine.
On the downside, Zweben noted that the price of Spanner’s high availability is latency, at levels unacceptable in some applications.
If you’re executing banking transactions or airline reservations and can afford a minute of delay while you make sure they’re committed in multiple data centers, Spanner’s probably a good database for that, he said.
Interestingly, Quizlet found that Spanner queries have higher latency at low throughput than a virtual machine running MySQL, but at a certain point in a scale-out, Spanner’s latency held steady while MySQL’s grew.
Zweben says, however, if you’ve got a customer-service app that requires you to be extraordinarily responsive over millions of transactions and at the same time analyze profitability of a customer and maybe train a machine learning model, those simultaneous transactional and analytical requirements are better for a database like Splice Machine, which is built atop a Spark engine.
He maintains that Spanner is not optimized for simultaneous transactional and analytical processing, known as Hybrid Transactional/Analytical Processing (HTAP). Kimball argues that both Spanner and Cockroach have HTAP capabilities; he concedes they’re not at a Hadoop-like level, though he says that could easily be added.
“SQL is actually great for analytics,” he said. “You can use Spanner for massive workloads in distributed SQL query execution. Cockroach is not as far along as Spanner, but we’ve made strides in the past year,” he said.
Laor, meanwhile, said he doesn’t see Spanner sidelining NoSQL databases like Scylla.
The folks behind Scylla, a C++ implementation of Apache Cassandra, claim it is the world’s fastest NoSQL database, offering high availability, horizontal scalability and low latency.
It allows multi-data-center reads and writes out of the box, and partition tolerance is never compromised, he said.
“Many big data workloads do not need transactions, and thus the heavyweight penalties — transactions, joins, MVCC (multi-version concurrency control) — aren’t needed, and thus Scylla can easily outperform Spanner,” he said.
However, while it’s true that many companies don’t have Google’s need for global scale, even tiny startups see themselves growing to that point, raising a major concern about Spanner, according to Kimball.
“With Spanner, you’re making a commitment to running the service in Google Compute Engine (GCE) and probably running it there for the service’s lifetime. You’re not going to have an off-ramp if you choose to run your own stack,” he said.
He points to Dropbox as a company that started out on Amazon’s S3 storage service, then later decided to build its own infrastructure, a major undertaking. Snapchat, on GCE, will be facing a similar decision soon, he says.
“If you’re just talking about using the compute resources of AWS, GCE or Azure, that’s a fairly low-friction investment. You can relocate your services relatively simply because they’re all using the same kinds of underlying hardware — most of these things are being done with container-based technology now and Kubernetes. It’s all very portable,” he explained.
“But once you sink your data into a proprietary system, the off-ramp is murky at best and probably not possible unless you’re a company like Dropbox that’s willing to invest the resources and effort to provide their own solution.”
Cockroach does not offer a cloud service, though it will at some point, he said. It’s software you can run in your private or public cloud, and its flexibility allows you to migrate your service between them with a configuration change and no downtime for your users.
It’s also working on geo-partitioning, which Spanner doesn’t have: an easy way to relocate data regionally so it sits closer to accounts around the globe.