DataStax sponsored this post.
Just in time for the U.S. tax season’s delayed 2020 deadline, Intuit released the first framework to manage Apache Cassandra clusters. The Data Persistence Platform Team made our first major contribution to the Cassandra community with DSE Pronto. Pronto is an Infrastructure as a Service automation suite used to deploy and manage DataStax Cassandra clusters in Amazon Web Services (AWS).
Pronto provides a Cassandra automation framework in AWS, with years of experience running Cassandra, performance tuning and auto recovery baked in.
Pronto ties together an open source suite that includes Packer, Terraform and Ansible, all built into a Docker image. Because these tools are widely adopted, the framework should also be easily extensible to the other two cloud giants, Google Cloud Platform and Azure.
Pronto is the result of nine years of customizations we’ve made to Cassandra — a database system that is highly reliable and scalable, but not always intuitive.
DSE Pronto Abstracts the Complexity out of Managing Cassandra
My colleagues at Intuit, Ben Covi and Nancy Li, developed the Pronto GitHub repository after finding no third-party suite of similar tools with the desired configurability for self-managed clusters. It’s not easy to manage your own Cassandra cluster. Pronto solves that problem.
Pronto originated as a project within the Data Persistence Platform team, of which TurboTax is the biggest user. TurboTax operates in a heavily regulated industry, with hundreds of thousands of integration partners. So TurboTax is anything but simple.
We support a peak of more than 42,000 transactions per second (TPS) in production in AWS, across eight production clusters. Our largest production cluster right now has 72 servers in each AWS Region, or 144 across two regions. Cassandra has to process massive amounts of data, such as entitlements, tax returns, filings, user experience data, and everything else needed to support TurboTax.
There’s an operational learning curve with Cassandra, which is why we decided to open source the Pronto automation framework that’s already being “inner-sourced” across Intuit.
It’s actually not easy to maintain Cassandra clusters. A lot of people don’t maintain Cassandra well, and their clusters end up in a bad state. Our automation framework is popular at Intuit because it makes it easier to maintain Cassandra and keep it healthy.
When we first started, we implemented Cassandra for each tax year, which meant each tax year had a new cluster. But keeping seven clusters to cover seven tax years was too expensive and difficult to manage. So we consolidated four years of tax clusters into one.
When we had one year per cluster, it masked a lot of problems. After consolidating, we realized there were a lot of pauses.
I personally went through debugging and optimizing at every level, from the kernel and the JVM up to Cassandra itself. Cassandra doesn’t remove deleted data well. We run production tests every week to make sure services don’t degrade over time, and we have to run the Cassandra garbage collector to reclaim disk space.
Intuit Has Relied on Cassandra’s Scalability and Reliability for Nine Years Now
Despite our challenges over the years, there has been no consideration of dropping Cassandra. It’s been the main constant for us as our core infrastructure for nine years.
Cassandra is a very scalable system. Today our biggest production clusters supporting TurboTax have 72 Cassandra servers per AWS Region, and they have been performing well for us. That’s the biggest reason we’ve stuck with Cassandra all these years, even though we’ve changed a lot of other technology in that time.
The TurboTax ecosystem is very complex, and it’s part of my job to make sure that our team provides persistence services to back the microservices behind almost all of the TurboTax services. We chose Cassandra in the first place for its ability to scale, because these tax services simply cannot have any downtime.
The stability we have today shows that Cassandra has been able to grow with our customer base and services, both of which have been growing ever since. While we’ve had to customize a lot, Cassandra has improved notably over this time, the cloud has become easier, and with the automation framework we created, our day-to-day work has become much easier.
While Cassandra is very scalable and reliable, it’s not optimized out of the box. All of the Cassandra optimization the Data Persistence Platform Team has done is included in the Pronto project, including a self-healing feature (when one server goes down, it automatically brings up another server).
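The self-healing idea can be sketched as a simple reconciliation step: compare the servers that are healthy with the desired cluster size, then decide how many replacements to launch. The Python below is a hypothetical illustration, not Pronto’s code. `Node`, `plan_replacements` and the `healthy` flag are assumed names, and a real system would call a cloud API where this sketch only returns a plan.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one Cassandra server."""
    node_id: str
    healthy: bool


def plan_replacements(nodes: list[Node], desired_count: int) -> dict:
    """Return which nodes to retire and how many new ones to launch."""
    down = [n.node_id for n in nodes if not n.healthy]
    healthy_count = len(nodes) - len(down)
    # Launch enough servers to return to the desired size, never a negative number.
    to_launch = max(desired_count - healthy_count, 0)
    return {"retire": down, "launch": to_launch}


# One of three servers is down, so the plan retires it and launches one replacement.
cluster = [Node("a", True), Node("b", False), Node("c", True)]
plan = plan_replacements(cluster, desired_count=3)
print(plan)
```

Keeping the decision in a pure function like this makes it easy to test separately from whatever actually provisions the servers.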
Cassandra has remained stable throughout a lot of transition and customization.
Because Cassandra’s data model is flexible, clients can define custom schemas, which gives some guard-railed flexibility to clients that use the Intuit system built on top of Cassandra. Everything is stored as entity relationships in the system. Clients are then required to define their schemas to use that system.
Internally, we support over 500 schemas, implemented using only eight Cassandra tables. Our initial schema implementation on Cassandra was not scalable. It took a year to implement the new schema, but that schema change is abstracted from clients. They didn’t know we had switched to a new schema.
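To make the idea concrete, here is a minimal Python sketch of how many client-defined schemas could share one generic entity store. It is an illustration under assumptions, not Intuit’s implementation: `ClientSchema`, `EntityStore` and the in-memory dictionary standing in for a shared Cassandra table are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClientSchema:
    """Hypothetical client-defined schema: attribute name -> required Python type."""
    name: str
    attributes: dict


@dataclass
class EntityStore:
    """Hypothetical generic store; one dict stands in for a shared Cassandra table."""
    schemas: dict = field(default_factory=dict)
    rows: dict = field(default_factory=dict)  # (schema name, entity id) -> attributes

    def register(self, schema: ClientSchema) -> None:
        self.schemas[schema.name] = schema

    def put(self, schema_name: str, entity_id: str, attributes: dict) -> None:
        schema = self.schemas[schema_name]  # KeyError if the schema is unknown
        # Guard rails: reject attributes that are undeclared or of the wrong type.
        for key, value in attributes.items():
            expected = schema.attributes.get(key)
            if expected is None:
                raise ValueError(f"{key!r} is not declared in schema {schema_name!r}")
            if not isinstance(value, expected):
                raise TypeError(f"{key!r} must be {expected.__name__}")
        self.rows[(schema_name, entity_id)] = dict(attributes)

    def get(self, schema_name: str, entity_id: str) -> dict:
        return self.rows[(schema_name, entity_id)]


store = EntityStore()
store.register(ClientSchema("tax_return", {"tax_year": int, "status": str}))
store.put("tax_return", "r-1", {"tax_year": 2019, "status": "filed"})
```

Because clients only see `register`, `put` and `get`, the physical layout behind `rows` could change without clients noticing, which is the kind of abstraction described above.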
It also wasn’t easy to migrate from the Intuit data centers to AWS. We implemented our own migration tool, which took a year to code and test, and then it took another two years to migrate the eight clusters.
During this time, both the data center and cloud-based stacks needed to run fully in parallel.
During these transitions, Cassandra had no downtime and no data leaks.
It took a long time to execute and implement it securely. It was quite scary, but the whole thing executed pretty successfully. In terms of performance and scalability, I think Cassandra is quite unique. It’s very stable. Replication is very impressive.
We tested replication in AWS from one region to another, and the latency we measured was about 40 milliseconds.
Cassandra maintains high availability even with stateless microservices active in two regions. It’s very impressive that it can reach eventual consistency so quickly.
Why Intuit Is All-In for Open Source
It wasn’t just the stability and reliability that drew the Data Persistence Platform Team to Cassandra. Nine years ago, we were only considering more affordable open source software in our move from Oracle database management.
Having a healthy and active open source community is important to us. Cassandra is pretty young compared with Oracle, which was established in the 1970s. Being open source has grown the Cassandra community to what it is today much faster than if the project had been kept proprietary.
Why did we open source the Pronto framework? It is the only option to manage Cassandra right now, and we want help improving on it. We are hoping that other people will help maintain it, not just us. This is how we built Cassandra, and I’m sure there are other Cassandra experts that can improve on our framework.
Pronto currently has about ten contributors, not just from the Data Persistence Platform Team but from throughout Intuit, in what we call our “inner-source model.”
At Intuit, a lot of people are using our framework now because it works and makes their lives easier, as Cassandra is used across the company.
Pronto is perfect for companies that want to use Cassandra and manage their own container clusters, and we welcome new contributors.
Feature image via Pixabay.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.