The September release of Microsoft’s Azure Kubernetes Service includes an interesting new feature: you can stop a cluster when you don’t need it and restart it again when you do — the way you can stop a VM, pause a video or hibernate a laptop. Scaling a cluster to zero still leaves the system pool running (and running up a bill); turning it off stops the control plane and agent nodes completely so there’s no cost, but you don’t need to create the cluster and reinstall images when you want the cluster back.
The new az aks stop and az aks start commands respond to the way customers are turning to cloud services, both for the digital transformation the pandemic demands of organizations and for the cost savings and efficiency they need alongside it, Kubernetes co-founder and Microsoft corporate vice president Brendan Burns told The New Stack.
The AKS team noticed that a lot of customers were deleting their agent node VMs at night or over the weekend to keep service costs down. That meant they were still paying for the control plane, and getting the cluster back wasn’t always straightforward, especially if they were still experimenting with their environment and not ready to automate everything.
“A lot of this is DevTest or batch workloads, and since people actually don’t work 24 hours a day, if you have a DevTest cluster, you can actually just stop it,” Burns said. “A lot of people were doing this in CI/CD and we’re helping them do it more easily through an API. Instead of making them write a script to delete a cluster and create a cluster, just like you’d stop and start a VM rather than deleting and recreating a VM, you can stop the cluster, and then restart it at a later date with all of the state and everything that was already in there.”
That takes advantage of the fact that AKS is already backing up the cluster state for resiliency. “Ultimately, the only state in the Kubernetes system is really the contents of etcd; there are caches but everything else is stateless. So, when you stop a cluster, you’re taking the state of the etcd database, and you’re preserving it out to file. We already do that, because we back it up in case something happens, so we can restore it. This is basically proactively pushing that state file down to storage and then shutting down all the compute resources, so you’re not going to get charged for any of the compute resources that you were using.”
Starting the cluster again reloads the control plane state and brings back the same number of agent nodes, although Burns warned that it’s not immediate, so you may want to schedule the restart to make sure the workload is ready to respond when you need it.
“When you start the cluster all the stuff you’d installed pops right back up, although it takes a couple of seconds to start running. If you want to serve a web request in a few seconds, you’re not going to be able to start your cluster in time to do that. But say you’re using KEDA and event-driven processing to transcode video files that people upload; most of the time people upload videos when they’re awake so there are eight or ten hours when most people are asleep. So, you could stop your cluster during that time and then when the first upload comes in, you could start up your cluster, and just start again, with everything that you need already there so it would immediately go pick up the files, handle the event and do the transcoding.”
Start/stop is ideal for bursty and batch scenarios, as well as workloads that are only used during the workday, like those in call centers and doctors’ offices. “We talked to customers who know that at 9 a.m. the app is going to have 10,000 simultaneous users but at 8:55 there are no users or at most one or two. Or there’s a sporting event where you know people are going to tune in at a specific time. Now, you can start your cluster five minutes before and be ready to go.” It’s also ideal for people who give demos or training sessions and don’t need the cluster running in between.
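The timing in that example can be sketched in a few lines of Python. This is only an illustration, not part of AKS: the cluster and resource group names are hypothetical, and the 9 a.m. to 6 p.m. weekday window and five-minute warm-up are assumptions chosen to match the scenario Burns describes. The function only decides which command a scheduler (a cron job or CI pipeline, say) should run; it does not call Azure itself.

```python
from datetime import datetime, time, timedelta

# Hypothetical names for illustration; substitute your own.
RESOURCE_GROUP = "myResourceGroup"
CLUSTER_NAME = "myAKSCluster"


def az_aks_command(action, resource_group=RESOURCE_GROUP, name=CLUSTER_NAME):
    """Build the argument list for `az aks start` or `az aks stop`."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["az", "aks", action,
            "--resource-group", resource_group,
            "--name", name]


def desired_action(now,
                   opens=time(9, 0),              # assumed start of the workday
                   closes=time(18, 0),            # assumed end of the workday
                   warmup=timedelta(minutes=5)):  # head start before users arrive
    """Return 'start' if the cluster should be up at `now`, else 'stop'."""
    if now.weekday() >= 5:  # Saturday or Sunday: keep the cluster stopped
        return "stop"
    start_at = datetime.combine(now.date(), opens) - warmup
    stop_at = datetime.combine(now.date(), closes)
    return "start" if start_at <= now < stop_at else "stop"


if __name__ == "__main__":
    action = desired_action(datetime.now())
    print(" ".join(az_aks_command(action)))
```

A scheduler could pass the resulting list to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI is installed and logged in. Since a restart is not instant, the warm-up may need to be longer than five minutes for larger clusters.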
The start/stop feature is currently in preview and requires the aks-preview Azure CLI extension (version 0.4.64 or later); you also need to enable the StartStopPreview feature flag on your subscription. Start/stop will still work if you’re using pod disruption budgets to make sure that your application remains highly available even if you need to do frequent upgrades, but the draining process will take longer to complete. It also works with demanding workloads; Microsoft Most Valuable Professional (MVP) Mohammed Darab tried it out with Big Data Clusters (SQL Server, Spark and HDFS containers running on Kubernetes).
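As a rough sketch of those prerequisites, the Python below checks an extension version against the 0.4.64 minimum and lists the Azure CLI commands typically used to turn on a preview feature. Several things here are assumptions rather than details from Microsoft: the version parser expects plain dotted numbers, and the Microsoft.ContainerService namespace is the one AKS preview flags are normally registered under, so check the current documentation before relying on it.

```python
# Minimum aks-preview extension version required for start/stop.
MIN_AKS_PREVIEW = (0, 4, 64)


def parse_version(text):
    """Turn '0.4.64' into (0, 4, 64); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum=MIN_AKS_PREVIEW):
    """Compare numerically, so that '0.4.9' correctly counts as older than '0.4.64'."""
    return parse_version(installed) >= minimum


# Commands to install or update the extension and register the preview flag.
# The namespace is an assumption based on how AKS preview features are usually registered.
SETUP_COMMANDS = [
    ["az", "extension", "add", "--name", "aks-preview"],
    ["az", "extension", "update", "--name", "aks-preview"],
    ["az", "feature", "register",
     "--namespace", "Microsoft.ContainerService",
     "--name", "StartStopPreview"],
    ["az", "provider", "register",
     "--namespace", "Microsoft.ContainerService"],
]

if __name__ == "__main__":
    for command in SETUP_COMMANDS:
        print(" ".join(command))
```

Comparing versions as tuples of integers rather than as strings matters here: as plain text, "0.4.9" would sort after "0.4.64" and wrongly pass the check.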
A Sign of Maturity
Start/stop doesn’t mean going back to creating snowflake clusters or relying on configuration you haven’t documented. The cluster state is only stored for 12 months and won’t be recoverable after that. But it’s a way of getting flexibility and convenience without so much disruption.
“We’d obviously still recommend that people still follow DevOps practices and do infrastructure as code, and still have all that stuff somewhere if they need to restore it,” Burns pointed out. “But just like it’s just easier and more convenient to flip open your laptop when you need it, it’s way more convenient to just flip [your cluster] up.”
Start/stop is a small convenience but it also reflects a more mature Kubernetes user base and much broader adoption. “We’re no longer in the enthusiast space,” Burns said. “We’re very much in a space where people are going to use it because it’s very useful, who are going to use it because it’s easy to use. So we’re going to be adding these sorts of niceties because we’re no longer in a place where people are going to accept rough edges.”