Looking Around Reliability Corners During a Kubernetes Migration
Cloud Native Computing Foundation and Gremlin sponsored this post, in anticipation of KubeCon + CloudNativeCon North America 2020 – Virtual, Nov. 17-20.
Kubernetes has been a major benefit to the productivity of developers and the scalability of our applications. Where containerizing applications gave us scalable architectures ideal for microservice applications, the number of containers quickly grew and orchestrating them got out of hand. Kubernetes was Google’s open source response to the combinatorial explosion of containers being deployed, operated and maintained. It simplified many operational tasks and abstracted away the orchestration details for individual containers. As a result, adoption rates have skyrocketed to 45% of containerized environments.
However, as with all things in tech, the devil is in the details. Kubernetes is an extremely powerful tool, with enough tunability to be great for a vast majority of workloads, but also complex enough to be difficult to get right and to break in unpredictable ways. Where monoliths have one code base to step through, microservices add complex interrelations that are difficult to debug. Coupled with new, unique issues — such as dealing with latency, dependency failure, services scaling out and in rapidly, and pod failure — it’s a lot to test and fix. For companies new to Kubernetes, this can be daunting; and for companies with years of experience, the complexity only continues to grow.
At Gremlin, we work with companies ranging from those early in their cloud native adoption to those on the bleeding edge, where vanilla configurations and core components no longer cut it. For those at the beginning of their journey, a method to build confidence that Kubernetes is deployed correctly drastically speeds up the migration and development process.
Under Armour was able to migrate to Kubernetes four times faster when its engineers applied Chaos Engineering. Meanwhile, Workiva’s orchestration needs outstripped Kubernetes’ native capabilities for years, so the company hand-rolled its own orchestrator. When Kubernetes caught up, Workiva swapped its system for Kubernetes in order to reap the benefits of the Kubernetes roadmap, and applied Chaos Engineering to confirm the new platform was as reliable as its previous orchestrator before switching over.
Remove the Fear of Failure
Chaos Engineering is the methodical practice of adding small amounts of harm to a system in a controlled manner, in order to learn and improve on how a system handles that failure. By injecting a small amount of failure into a system, our teams can be confident that their reliability mechanisms will recover from that failure and that their runbooks will work. This ensures that when inevitable failure happens in the real world, customers will never know.
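To make "controlled manner" concrete, here is a minimal sketch of one experiment loop: check that the system is healthy, inject a small failure, observe, and always roll back. The function names and the three-replica toy system are hypothetical illustrations, not any particular tool's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    steady_after: bool


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one controlled experiment: check, inject, observe, always roll back."""
    before = steady_state()
    if not before:
        # Abort: never add harm to a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    during = False
    try:
        inject()
        during = steady_state()
    finally:
        rollback()  # the failure is removed even if the check raised
    return ExperimentResult(before, during, steady_state())


# Hypothetical system: three replicas, healthy while at least two are up.
replicas = {"a": True, "b": True, "c": True}


def healthy() -> bool:
    return sum(replicas.values()) >= 2


def stop_one() -> None:
    replicas["a"] = False


def restore() -> None:
    replicas["a"] = True


result = run_experiment(healthy, stop_one, restore)
```

If `steady_during` comes back false, the experiment has found a weakness while the blast radius was still one replica, which is the point of keeping the injected failure small.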
Kubernetes brings new paradigms; where before we had to scale our system up or out as a whole, now we can scale individual services out independently. To make it even easier, there are tools that manage Kubernetes — like cloud providers’ managed offerings or OpenShift. However, even in these managed environments, you can’t lift and shift applications into Kubernetes. They must be rearchitected to be stateless and scale well. You are building a complex, distributed system and there are things you must test for, or you will suffer an outage.
With orchestration, we gain higher utilization of our resources by scaling out only what is needed, and this means money saved. However, applications must be architected to handle the ephemeral nature of pods scaling up and down, with peaks and troughs of demand. Single-threaded, single-path services won’t scale and won’t be able to handle outages in dependent services. If one service receives an excess of demand, ensure the service isn’t prohibitively large and slow to boot, which would slow its ability to scale out. Once an application is rewritten, don’t leave its behavior to serendipity or assume Kubernetes manages away the need for thoughtful architecture. To ensure applications can handle scaling, consume enough resources to trigger autoscaling and watch how the system behaves. Then increase resource consumption in individual pods to confirm that other pods sharing the same host aren’t impacted by noisy neighbors.
Another thing that containerizing our application potentially adds is networks where there weren’t any before. Microservices architectures open up polyglot programming (picking the best language for each service based on ease of development or performance), with the caveat that those services must now speak to each other over a network. That means our systems need to be prepared for network problems they didn’t necessarily see before, like latency and lost packets. The best way to test that our services are prepared for latency and packet loss? Test them using injected latency and packet loss early and often.
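As a rough illustration of why boot time matters when scaling out, the sketch below estimates how many requests pile up while new pods are still starting. The model is a deliberate simplification, and the function name, parameters and numbers are all assumptions chosen for the example.

```python
def backlog_during_scale_out(
    demand_rps: float, pods: int, rps_per_pod: float, boot_seconds: float
) -> float:
    """Requests that queue up (or are dropped) while new pods are still booting.

    Assumes demand is constant and that no new capacity arrives until
    boot_seconds have passed.
    """
    shortfall = max(0.0, demand_rps - pods * rps_per_pod)
    return shortfall * boot_seconds


# Same spike, same starting capacity, different boot times.
fast = backlog_during_scale_out(demand_rps=500, pods=4, rps_per_pod=100, boot_seconds=5)
slow = backlog_during_scale_out(demand_rps=500, pods=4, rps_per_pod=100, boot_seconds=90)
```

Under these assumed numbers, the slow-booting service accumulates 18 times the backlog of the fast one before any new pod can help, which is the kind of gap a resource experiment should surface before real traffic does.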
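A minimal, in-process sketch of that idea follows: wrap a call so it is delayed and sometimes dropped, then check that the caller's retry logic copes. Real experiments inject faults at the network layer; the wrapper, the `PacketLost` exception and the retry helper here are hypothetical stand-ins for illustration only.

```python
import time


class PacketLost(Exception):
    """Raised in place of a response to simulate a dropped packet."""


def with_network_faults(call, latency_s, loss_rate, rng, sleep=time.sleep):
    """Wrap `call` so each invocation is delayed and sometimes dropped.

    `rng` is any object with a random() method returning a float in [0, 1),
    such as random.Random(); `sleep` is injectable so tests need not wait.
    """
    def wrapped(*args, **kwargs):
        sleep(latency_s)
        if rng.random() < loss_rate:
            raise PacketLost("simulated packet loss")
        return call(*args, **kwargs)
    return wrapped


def call_with_retries(call, attempts):
    """Retry a call that may lose packets; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except PacketLost:
            if attempt == attempts:
                raise
```

A service client would be wrapped with `with_network_faults` in a test environment and exercised through `call_with_retries`, raising `latency_s` and `loss_rate` gradually to find the point at which the caller stops coping.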
Seeing Past Our Blind Spots
Exposing developers to real-world failures in their code changes how they write code. They are able to take on a new perspective and ask themselves what would happen if their service faced resource constraints, or if the team down the hall had an outage. Developing services independently means the services must truly be decoupled; in other words, one service being up should not require other services being up. Otherwise, we lose the benefit of switching to a containerized, orchestrated environment. Injecting failure into applications alongside developers, as part of a GameDay or as a step in their development pipeline, helps them build that intuition for handling failure and building in resiliency mechanisms.
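One common way to keep a service up when a dependency is down is graceful degradation, sketched below. The product page, the recommendations dependency and the `DependencyDown` exception are hypothetical names used only to show the shape of the pattern.

```python
class DependencyDown(Exception):
    """Raised by a client when the downstream service cannot be reached."""


def product_page(product_id, fetch_recommendations):
    """Render the core page even if the recommendations dependency is down."""
    page = {"product_id": product_id, "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except DependencyDown:
        # Serve the page without the optional section instead of failing.
        page["degraded"] = True
    return page
```

An experiment that takes the dependency offline should then show the `degraded` flag set on responses rather than errors reaching customers.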
Similarly, our operations teams will see the benefit. As we migrate to Kubernetes, our operations teams will be unfamiliar with how to fix issues in this new environment. Don’t wait for customer-impacting issues to train our teams. Block traffic to a critical service and train our teams to build that muscle memory to quickly find and fix issues. Make sure that alerts and dashboards are actionable, then retest to watch for improvement. If our teams take too long to find the issue, it’s time to revisit our monitoring metric choices. Once both developers and operations teams hone their adaptability and reaction times, we can pull the plug on the old system and boot up the new, microservices system with confidence.
The Next Level
We’ve watched many of our customers go through the Kubernetes adoption process. Once they reach a certain scale, they need to customize Kubernetes to their application and usage patterns. Tuning settings and adding or replacing components increase the difficulty of managing and testing applications orchestrated by Kubernetes, but these changes are necessary to meet the scale and performance demands of some organizations. It’s even more critical that these organizations test for failure. As they attempt to differentiate their offering by adding new features and performance, those benefits are lost if customers can’t access their applications.
These customers apply Chaos Engineering as they make their changes to Kubernetes, ensuring the reliability they gained tuning the traditional setup isn’t lost as they tweak their architecture. For example, applications may react completely differently when pods fail with a new load balancer, or when a proxy run by a service mesh is added on. The only way to know is to actually shut down pods and watch the system react, tune the new component, and then confirm the fix.
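The toy simulation below illustrates why the same pod failure can look different after a component changes: a round-robin balancer keeps sending traffic to a stopped pod until its next health check. The balancer class, its check interval and the request counts are all assumptions for illustration, not a model of any real load balancer or service mesh.

```python
class RoundRobinBalancer:
    """Toy balancer that only learns about dead pods at health-check time."""

    def __init__(self, pods):
        self.pods = list(pods)
        self.healthy = set(self.pods)
        self._next = 0

    def health_check(self, is_up):
        self.healthy = {p for p in self.pods if is_up(p)}

    def route(self):
        candidates = [p for p in self.pods if p in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy pods")
        pod = candidates[self._next % len(candidates)]
        self._next += 1
        return pod


def run_pod_failure(balancer, up, victim, requests, check_every):
    """Stop one pod, send traffic, and count requests routed to the dead pod."""
    up[victim] = False
    errors = 0
    for i in range(requests):
        # A health check runs before every check_every-th request.
        if i > 0 and i % check_every == 0:
            balancer.health_check(lambda p: up[p])
        if not up[balancer.route()]:
            errors += 1
    return errors
```

Running the same experiment with different `check_every` values shows how a single tuning knob changes the number of failed requests, which is the "shut down pods, watch, tune, confirm" loop in miniature.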
Leap Ahead with Confidence
If done right, the benefits of Kubernetes are felt by the entire business. Companies are reaching application scales never before seen, with feature velocity at a rate outstripping older models. With Chaos Engineering, the reliability of these scaling applications can keep up; in fact, performing this testing while setting up the new environment will speed up the process, as there will be less guesswork on what dials to tune, and leave more time to build the products customers demand.
To learn more about Kubernetes and other cloud native technologies, consider coming to KubeCon + CloudNativeCon North America 2020, Nov. 17-20, virtually.
The Cloud Native Computing Foundation is a sponsor of The New Stack.