Databases — Finally — Get Containerized
Until now, traditional databases have been difficult to deploy on Kubernetes because they weren’t designed to work in containerized environments. Enterprises were forced to create their own solutions or, more often, rely on the cloud providers’ own cloud database offerings, which locked them into that particular cloud platform.
That’s been changing over the past few months, with increased support for both SQL and NoSQL databases on Kubernetes from the open source community and from vendors offering additional support and management tools.
For the past few years, enterprises have been making dramatic moves to cloud infrastructure. With the COVID-19 pandemic, that flow turned into a tsunami over the past year.
According to a March survey by Flexera, 99% of companies now have a cloud adoption strategy, with 92% opting for a multicloud option with combinations of different public clouds or public and on-premises clouds.
And the percentage of companies spending at least $1 million a month on cloud services is now at 31%, nearly double the 16% in last year’s survey. Half of all workloads are now in the cloud, and enterprises plan to increase that to 57%.
Container adoption, too, has risen with the flood to the cloud. According to a 2020 survey by the Cloud Native Computing Foundation, 92% of the global cloud user community is using containers in production, up from 84% in 2019, and up 300% from the organization’s first survey in 2016.
Kubernetes accounts for much of that container usage, with 83% using Kubernetes as their preferred container platform, up from 78% in 2019.
Data services are right behind. According to the Flexera survey, 87% of companies are either using, experimenting with or planning to use data warehouses or relational databases as a service.
In fact, 46% of organizations’ data is already in public clouds, and that share is expected to grow to 54% in the next 12 months.
Gartner predicts that by 2022, more than 75% of organizations will be running containerized applications in production.
According to the Flexera survey, Docker and Kubernetes account for the bulk of the container tools used. Companies are adopting containers to shorten development cycles, simplify cloud migrations and lower cloud costs.
“Containers are all about horizontal scalability,” said Sam Ramji, chief strategy officer at DataStax. “If we could vertically scale everything, we would still be on mainframes or running giant Java virtual machines.”
In those older approaches, organizations would see months between software updates, if not longer.
“Continuous delivery companies can iterate multiple times a day,” he said. “That’s the first driver for Kubernetes — lean flow, the ability to reduce the cycle time between coming up with an idea and implementing it in production.”
This changes the kind of products you can offer to your customers, said Ramji.
“We were able to generate multiple billion-dollar ideas at Google,” said Ramji, who was once vice president of product management there. “We could try out lots of ideas to see if they worked.”
With this approach to development, he said, “you can satisfy your customers, fight for market share and benefit from unit economics. You can take a million lines of code, break it up into a hundred different microservices with different topologies, and each piece can scale independently.”
But as organizations make the move to the cloud, containers and Kubernetes, data has been a challenge.
Containers favor a “stateless” approach to application development. Individual containers can spin up and down on demand, so applications can’t be designed to run continuously and remember what they’re working on. If a containerized application has a built-in database, that database disappears when the container shuts down and is recreated from scratch when it boots back up again.
“Single-instance databases are kind of going to die in this environment,” Ramji said.
According to a survey released earlier this year by IDC, 47% of microservices rely on databases, and 32% of respondents said database management is one of their top challenges.
“Many databases were not designed to be cloud native, compatible with containers or orchestrated by Kubernetes,” wrote Carl Olofson, IDC’s research vice president for data management software.
To address this problem, organizations traditionally use a separate, external data store. So, for example, if they’re running their containers on Amazon, they’ll use AWS cloud databases. If they’re running on Azure or GCP, they’ll use the cloud databases available on those platforms.
That poses its own challenges, because moving from one cloud to another requires rewriting the application to work in the new cloud environment.
Another approach, when databases are small enough, is to put a complete copy of each database in every container.
That creates management challenges. For example, if one database is updated, the change has to be sent to a backup database, in case the container shuts down, and duplicated to all the other containers running the same database.
Or a database could be sharded. For example, one container can handle requests for customers whose names start with A through M, and a separate container can handle requests for N through Z.
The application now needs to know where to send each request. And if you need to expand the number of containers, you may need to split one of those shards even further, into, say, A through F and G through M.
“And if you change your sharding strategy, you have to change your application,” said Ramji.
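The coupling between the sharding scheme and the application can be sketched in a few lines of Python. This is a minimal illustration only: the shard names and the `route` helper are hypothetical, and production systems typically hash keys rather than split on the first letter.

```python
from bisect import bisect_right

# Hypothetical shard map: (first letter covered, shard name), sorted by letter.
SHARDS = [("A", "shard-a-m"), ("N", "shard-n-z")]

# Splitting A-M into A-F and G-M means shipping a new map to the application.
RESHARDED = [("A", "shard-a-f"), ("G", "shard-g-m"), ("N", "shard-n-z")]


def route(customer_name, shards=SHARDS):
    """Return the shard that owns a customer, based on the first letter."""
    starts = [start for start, _ in shards]
    letter = customer_name[0].upper()
    # Find the last shard whose starting letter is <= the customer's letter.
    index = bisect_right(starts, letter) - 1
    if index < 0:
        raise ValueError(f"no shard covers {customer_name!r}")
    return shards[index][1]


print(route("Garcia"))             # shard-a-m
print(route("Garcia", RESHARDED))  # shard-g-m
```

The same customer lands on a different shard once the map changes, which is why a new sharding strategy usually means a new application release.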
Because of the benefits of moving to a containerized environment, there’s a great deal of interest from enterprises in finding ways to run SQL and NoSQL databases on Kubernetes, he said.
New Approaches to Containerized Data
Over the course of the past year, there have been several projects to adapt databases to work in containerized environments.
For example, companies using Cassandra, a popular NoSQL database, have been collaborating on K8ssandra, an open source project from DataStax.
DataStax first unveiled K8ssandra (pronounced “Kate Sandra”) last November, together with the tooling and dashboards required to run the database in a Kubernetes cluster.
It was built on top of a simpler Kubernetes operator for Cassandra released in the spring of 2020.
K8ssandra is based on DataStax’s own experience with running Astra, its managed cloud data service.
Other databases are also being ported to containers. Cockroach Labs, for example, has been working on bringing its distributed SQL database CockroachDB to Kubernetes.
Meanwhile, PlanetScale uses open source Vitess to horizontally scale MySQL, and also has an operator that lets it work on Kubernetes. The Vitess scaling technology was originally developed at YouTube and now supports Square, Slack, HubSpot and other large Internet sites.
The trick is to provide developers with a data fabric that just works, without forcing developers to struggle with security, auditability or scalability, Ramji said.
That’s true even if developers are only building small-scale applications, with, say, just three nodes.
“You want them [applications] to become popular,” he said. “But once [an application] becomes popular, you don’t want to re-architect the whole thing. You can’t shut it down, so you end up trying to build two parallel systems.”
Companies can avoid that problem by picking a platform that can scale well from the start.
Cassandra, for example, started out at Facebook to power its inbox search feature. It was released as an open source project in 2008. Other companies using it include Instagram, GoDaddy, eBay, Spotify and Netflix. But the single largest deployment is probably at Apple, which is heavily invested in Cassandra. Apple has over three times as many openings for Cassandra-related jobs as it does for HBase, Couchbase and MongoDB combined.
“Apple is reported to run a 200,000 node Cassandra cluster that powers data services on iCloud, including iMessage and many others,” said Ramji.
Cassandra works by automatically sending inbound requests to the least loaded server, he said.
“You can create database clusters that span multiple geographic regions,” he added. “The Facebook inbox, for example, had to be geographically available everywhere.”
Cassandra can default to full copies of the data in every instance, or companies can use intelligent replication and specify which data can go where. Intelligent replication is particularly useful, Ramji said, when there are regulatory requirements about moving sensitive data out of certain regions.
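In Cassandra, this kind of placement is typically declared per keyspace through the `NetworkTopologyStrategy` replication class. The sketch below only builds the CQL statement as a string. The `keyspace_cql` helper, the keyspace name and the datacenter names are assumptions made for illustration.

```python
def keyspace_cql(keyspace, replicas_per_datacenter):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_datacenter maps a datacenter name to the number of copies
    of each row kept there. Datacenters left out of the map hold no copies.
    """
    options = ["'class': 'NetworkTopologyStrategy'"]
    for datacenter, copies in sorted(replicas_per_datacenter.items()):
        options.append(f"'{datacenter}': {int(copies)}")
    body = ", ".join(options)
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{body}}};"


# Hypothetical example: keep regulated data only in a European datacenter.
print(keyspace_cql("eu_customers", {"eu-west": 3}))

# Less sensitive data can be copied to several regions.
print(keyspace_cql("catalog", {"eu-west": 3, "us-east": 3}))
```

Leaving a datacenter out of the map is what keeps that keyspace's data from being stored there.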
“It’s a good fit for Kubernetes because Cassandra knows how to scale itself horizontally,” he said. “No matter how widely you scale your Kubernetes cluster, you can add Cassandra nodes fluidly. But the challenge is to make Cassandra Kubernetes-native.”
That has taken a few years. “Kubernetes is a very hostile environment for databases,” he said.
To start with, Kubernetes is all about stateless applications.
“With Kubernetes, you can stop and start the service at any moment and then pop [it] up somewhere else,” he said. “And you have no memory of the previous service.”
As a result, developers typically keep their data outside the Kubernetes world.
To move data into Kubernetes, the platform first needed to support stateful applications. The solution, StatefulSets, entered beta in release 1.5 in late 2016 and became generally available with Kubernetes 1.9 in December 2017.
The approach quickly became popular. Today, 55% of companies use stateful applications in containers in production, according to the CNCF survey.
“StatefulSets lets you tell Kubernetes, ‘I’m actually a database, so be cool,’” said Ramji.
That means that containers have to be extra careful when shutting down, he said. A database has to take its in-memory writes and commit them to permanent storage.
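As a rough illustration of what that looks like, the Python sketch below assembles a minimal StatefulSet manifest as a dictionary and prints it as JSON, which `kubectl` accepts alongside YAML. It is not what K8ssandra or any operator generates. The image tag, storage size, grace period and the `cassandra_statefulset` helper are assumptions. The `preStop` hook runs `nodetool drain`, the Cassandra command that flushes in-memory writes to disk before the container stops.

```python
import json


def cassandra_statefulset(name="cassandra", replicas=3, storage="10Gi"):
    """Return a minimal apps/v1 StatefulSet manifest as a plain dict."""
    labels = {"app": name}
    container = {
        "name": "cassandra",
        "image": "cassandra:4.0",  # illustrative image tag
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/cassandra"}],
        # Flush in-memory writes to disk before the container is stopped.
        "lifecycle": {"preStop": {"exec": {"command": ["nodetool", "drain"]}}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service giving pods stable names
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Time allowed for a clean shutdown before a forced kill.
                    "terminationGracePeriodSeconds": 120,
                    "containers": [container],
                },
            },
            # Each pod gets its own persistent volume that survives restarts.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


print(json.dumps(cassandra_statefulset(), indent=2))
```

The `volumeClaimTemplates` entry and the stable pod names are what distinguish this from a stateless Deployment: each replica keeps its identity and its disk across restarts.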
Then there are the issues of synchronization and coordination. Cassandra clusters normally communicate with one another using the gossip protocol. That had to change with a move to a containerized environment.
“Cassandra had to stop gossiping among its own nodes and learn to use a protocol in the Kubernetes control plane,” said Ramji.
Finally, running Cassandra traditionally requires some manual management and control functions.
“In order to scale, repair itself, restore and work in the Kubernetes control plane, it had to get radically automated in a way that Cassandra had never been,” he said.
When K8ssandra was first released in November, it was ready to work on Kubernetes. This month, the project is adding out-of-the-box support for all the major cloud providers so it can work with particular flavors of Kubernetes without any extra configuration required.
“We expect that people will run it on Amazon, Google or Red Hat OpenShift,” he said. “We’ve also been able to fix some bugs and dependencies and make the configuration smarter.”
The most important thing about running databases in containers is finding a way to store the data, said Dan Yasny, principal field engineer at MayaData, another company working on deploying Cassandra on Kubernetes.
A storage-area network (SAN) is one approach, but it’s expensive, he said.
“A typical SAN project is six figures out of pocket right then and there,” he said. “And in five years it will be end-of-life, and you end up having to buy a new one.”
Then there are the costs associated with managing the platform, he added. “When you’re spending six figures, you need someone who knows Hyperchannel. It’s not simple.”
With Kubernetes, companies can use local attached storage and can scale by adding more nodes with more disks.
That’s for private cloud deployments. Clouds have their version of local storage as well.
“On Amazon, GCP and Azure, you have instance types with local non-volatile memory,” he said. “A single disk can provide 100,000 operations per second, which is insane. A typical SCSI disk will give you 150 at best. So when you’re in those clouds, and you’re using those instances, you have 60 terabytes on a single virtual machine you can provision. It’s huge and it’s insanely fast.”
The downside, of course, is that it’s ephemeral.
“If you stop a virtual machine and start it again, the disks will be empty,” he said. “Running a database like that sounds insane. But think about the database having multiple nodes, with multiple replicas, and if a single node goes down, it comes back up again.”
With a self-replicating Kubernetes stack, companies don’t have to worry about setting up separate storage for their application, because the databases themselves take care of replication.
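A toy simulation shows why an emptied disk is survivable when every record lives on several nodes. This is a sketch under simplifying assumptions, not how Cassandra places or repairs data: the `Cluster` class, its placement rule and the sample keys are all hypothetical.

```python
class Cluster:
    """Toy cluster in which every key is copied to several nodes."""

    def __init__(self, node_count, replication_factor):
        if replication_factor > node_count:
            raise ValueError("replication factor cannot exceed node count")
        self.nodes = [dict() for _ in range(node_count)]
        self.replication_factor = replication_factor

    def owners(self, key):
        """Pick consecutive nodes, starting at a position derived from the key."""
        start = sum(key.encode()) % len(self.nodes)
        return [(start + i) % len(self.nodes)
                for i in range(self.replication_factor)]

    def write(self, key, value):
        for index in self.owners(key):
            self.nodes[index][key] = value

    def wipe(self, index):
        """Simulate a VM restart that empties the node's local disk."""
        self.nodes[index] = {}

    def rebuild(self, index):
        """Refill a wiped node by copying its keys from surviving replicas."""
        for peer_index, peer in enumerate(self.nodes):
            if peer_index == index:
                continue
            for key, value in peer.items():
                if index in self.owners(key):
                    self.nodes[index][key] = value


cluster = Cluster(node_count=4, replication_factor=3)
for name in ["alice", "bob", "carol"]:
    cluster.write(name, f"profile:{name}")

before = dict(cluster.nodes[1])
cluster.wipe(1)
cluster.rebuild(1)
print(cluster.nodes[1] == before)  # True
```

With a replication factor of one, the same wipe would lose data for good, which is why ephemeral local disks only make sense for databases that replicate on their own.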
And the new container-friendly databases have their own backup solutions, he added.
“You take a snapshot and ship off your current state,” he said. “You can backup just the increments, or the whole thing every time — there are so many possibilities.”
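The difference between shipping the whole state and shipping only the increments can be sketched as follows. Real database backups work on on-disk files rather than Python dictionaries, and all of the function names here are hypothetical.

```python
_MISSING = object()  # sentinel for keys absent from the previous backup


def full_backup(data):
    """Copy the entire current state."""
    return dict(data)


def incremental_backup(data, previous):
    """Record only what changed since `previous`."""
    changed = {key: value for key, value in data.items()
               if previous.get(key, _MISSING) != value}
    deleted = [key for key in previous if key not in data]
    return {"changed": changed, "deleted": deleted}


def restore(base, increments):
    """Rebuild the state from a full backup plus ordered increments."""
    state = dict(base)
    for increment in increments:
        state.update(increment["changed"])
        for key in increment["deleted"]:
            state.pop(key, None)
    return state


state = {"a": 1, "b": 2}
base = full_backup(state)

# The live data keeps changing after the full backup was taken.
state["b"] = 3
state["c"] = 4
del state["a"]

increment = incremental_backup(state, base)
print(increment)  # {'changed': {'b': 3, 'c': 4}, 'deleted': ['a']}
print(restore(base, [increment]) == state)  # True
```

Increments are smaller to ship, but restoring requires the base backup plus every increment taken since, applied in order.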
MayaData helps provide the backup and orchestration capability to Kubernetes-based databases with its OpenEBS Mayastor.
OpenEBS is an open source project backed by MayaData that lets stateful Kubernetes applications access dynamic local persistent volumes or replicated persistent volumes.
It takes care of another piece of the container data puzzle — managing data that is spread across multiple Kubernetes storage environments.
“Sometimes you have separate nodes that have the disks, and the workload is on other nodes,” said Yasny.
Previous solutions could offer replication, snapshots and other features, but suffered when it came to performance, he said.
OpenEBS is a storage orchestrator that can connect to both local- and network-attached storage volumes, he said. It’s the most popular open source storage implementation on Kubernetes and has been around for a few years. Mayastor extends that ability across containers.
“In good lab conditions, we got to just a single digit of percent overhead,” he said. “And without too much tuning or working too hard, we can get to 15 percent overhead.”
In March, MayaData released a benchmarking report in conjunction with Intel about its performance tests.
OpenEBS Mayastor is currently in beta.
The official release date will be determined by the broader community and will be based on criteria such as code stability, test coverage and test results, said Evan Powell, chairman and CEO at MayaData. That could be a few more releases, he said, which would suggest that the project will exit beta within a few months.
Indian ecommerce giant Flipkart is currently moving Cassandra workloads to Kubernetes, using different flavors of OpenEBS.
“They will become one of the largest users of Kubernetes as they scale,” said Powell. “It is an honor to be partnering with them.”
Containers Fuel Agility in a Time of Change
Target has been using the Cassandra database since around 2014. In 2018, the company rolled out individual Cassandra clusters in all its stores and needed those clusters to run in Kubernetes.
That was before there was a K8ssandra project, and Target built this infrastructure from scratch.
According to Daniel Parker, Target’s director of engineering, the first challenge was that when new nodes started up, they had to find other nodes to connect to, and if several new nodes were coming online at once in the same cluster, they had to be able to find each other and cluster together.
Then there were issues with setting up backups that don’t get wiped when a container restarts, and with setting up automated monitoring and alerts.
“We had a lot of hurdles to overcome in deploying Cassandra clusters to all Target stores,” Parker wrote.
But this investment likely paid off, said Patrick McFadin, vice president of developer relations at DataStax.
When the pandemic hit, retail stores around the world had to switch to delivery or curbside pickup.
That meant companies needed to have technology infrastructure in place that let them quickly switch business processes.
“Companies that did not have this in place struggled to adapt. Just look at the changes in retail recently including Gap, JCPenney and Sears,” said McFadin.
Other organizations that require high degrees of scalability or agility include entertainment, healthcare and finance firms; industries like retail and logistics that see a lot of seasonal fluctuation; SaaS vendors; companies deploying 5G and edge computing; and companies deploying new AI applications and automation.
Today, technology and business agility is a matter of survival, he added. And it’s not just the pandemic. Companies are under extreme market stress in multiple areas.
If it’s not the pandemic, it’s something else. A new startup arrives. A competing company decides to aggressively expand into your territory. There’s disruption in supplies or in market demand. Or, with increasing frequency, Amazon decides to enter a new niche and threatens to put all the incumbents there out of business overnight.
“It’s a matter of survival,” said McFadin. “If you don’t move, you are [in] Chapter 11.”
The containerized, agile approach allows for quick upgrades to applications and fast expansion of capacity.
“We can’t go back to the old traditional waterfall methods,” McFadin said.
Avoid Cloud Lock-in
Adapting databases to work natively on Kubernetes also creates an additional benefit: enterprises are no longer locked into their cloud providers.
According to Gartner, once a company deploys an application on a particular cloud platform, it tends to stay there. And once it’s there, it tends to attract other applications and services, a concept often referred to as data gravity.
“This is due to data lakes being hard — and expensive — to port, and therefore end up acting as centers of gravity,” said Gartner analyst Marco Meinardi in a report last fall.
“Look at what the large clouds are trying to do,” said McFadin. “If cloud providers can convince you to use their proprietary database in their cloud, you may never leave. It’s like taking the blue pill. You’re done. You don’t want to be on your cloud provider anymore? Go ahead, move your data, I dare you.”
But the ability to switch providers is what allows enterprises to shop around for the best deals.
“Commoditization is key,” said McFadin. “Commoditization is how they can negotiate prices and get long-term savings. Clouds are not producing a lot of commodity right now, but Kubernetes is forcing them to become a commodity.”
With portable containers, companies can create virtual data centers across multiple public clouds and optimize for price or performance. “And if I’m not getting a very good price, then I can pick it up and move it somewhere else.”
Having support for data on Kubernetes is the last piece of the portability puzzle that companies have been missing, he said.