Smart Workloads: The Bridge to Self-Managing Systems
Congrats on containerizing your workloads and orchestrating them with Kubernetes! Welcome to a new realm of portability and elasticity, and it’s just the beginning. Developers dream of applications that can run anywhere without modification. That dream begins with Kubernetes, but it extends to serverless, which promises that developers can deploy and run functions without having to manage the infrastructure. Behind that dream are the all-powerful business needs (rapidly evolving, always escalating) pushing developers and their applications into the new frontiers of IoT, edge computing, and beyond. But as the lines blur between the real world and the digital world, the key technologies that make this possible create new challenges for IT Operations.
Containerizing services makes it possible to stretch them across multiple locations or clouds. Instead of having to write an entire business workflow in one application platform, a developer can now create and integrate many services across different clouds. Now you can access data in one cloud, use serverless functions that provide event triggering in another, and consume machine learning, business analytics, text-to-speech or what-have-you in yet another. This is why cloud providers are finding that their competitive advantage lies not in elastic compute, storage, or networking resources, but in the services that run on top of them.
Developers just want to focus on building the applications that differentiate and drive the business forward – and they have a whole new set of building blocks to do it. A few of those new building blocks include:
- Microservices: Speed up development by allowing teams to work in parallel, building and deploying the decoupled services that make up a single application
- Containers: Build once, run anywhere; again, speeding up development
- Kubernetes: Deploy and run those workloads anywhere; elastically scaling containerized infrastructure
- Functions: Function platforms such as AWS Lambda, Knative, or OpenWhisk; more abstraction to allow the developer to focus even more on the application
- Data: Lots and lots of data… because data…
- Service mesh: A service mesh is needed to connect it all together, ensuring secure communication between services
- Function gateways: Gateways such as Gloo provide the ability to integrate legacy services that are not ready to containerize, but are a necessary part of the overall business transaction.
- And so much more!
While developers and DevOps teams are leveraging an ever-growing swath of building blocks to build and deploy agile and elastic services, operators are applying the same ol’ methods to a transformed landscape. Does this really make sense?
New Dev, Old Ops
IT operations has always had the mandate to assure application performance while minimizing cost and staying compliant with business policies. With containerized and now serverless applications, that mandate has evolved slightly: Ensure service level objectives (SLOs) while abiding by business constraints (budgets, data locality, security, etc.).
How are they managing this today? With monitoring and process automation via scripting, policies, and thresholds.
Let’s see what this means for Kubernetes. In Kubernetes you have to define, operate, and manage scaling thresholds, aka “autoscaling.” If you’re operating across multiple clouds, you have to do the same for each individual cloud where your services are deployed. So, what are some of the questions a Kubernetes operator needs to ask in order to make the right decisions?
- What are the pod’s requests and limits? (aka How much CPU and memory is it guaranteed, and how much is it allowed?)
- Are the containers in the pod configured correctly? Better get it right, because you’re going to autoscale that configuration quite a bit.
- When does the pod need to scale out to ensure peak demand is met? When should it scale back?
- Can I reschedule a pod to avoid resource fragmentation? What about preventing congestion due to noisy neighbors? Today that’s often done by killing the pod and spinning up a new one, allowing the native scheduler to find available capacity… but that disrupts service for the end user!
- Where should my pods run? Closer to the end user? Closer to the services to which they are connected? Today that’s hard-coded via node labels, and it’s often a best guess.
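To make the “when to scale out, when to scale back” question concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The function name, parameter names, and sample numbers are illustrative assumptions and not part of any Kubernetes client library; the 0.1 tolerance mirrors the documented default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the HPA scaling rule (names here are hypothetical)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The operator-defined bounds always win.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 pods averaging 20% CPU against a 60% target: scale back.
print(desired_replicas(4, 20, 60))
```

Every input to this function (the target, the bounds, the tolerance) is a threshold that someone has to choose and keep correct, per workload and per cluster, which is exactly the operational burden the questions above describe.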
What about serverless? For one thing, no function is ever really “serverless.” Serverless just means it’s not the developer’s problem. The functions are running on a platform that you, the operator, need to manage. Even the functions have a resource configuration that you need to manage. What about data services? Data services, whether on-premises or a managed public cloud offering such as RDS, should have the right size or template to handle peak demand without overspending.
With this dynamic complexity at multiple layers of the stack and across heterogeneous clouds and components, how do you assure the performance of the business transaction from end to end? Today’s approach of individually managing pods, services, and functions is laborious and disconnected. It cannot scale.
What Makes a Workload Smart?
Operators deserve a new approach. In the industry, we talk a lot about automation. And we’ve come a long way by automating processes. But, for the reasons noted above, it’s no longer good enough to rely on thresholds, scripting, and policies. It’s too much work! And, more importantly, it is beyond human scale. Modern distributed applications, and the cloud and containerized multicloud infrastructure they run on, are too complex and too dynamic for people to continuously assure performance.
If workloads can make their own resource decisions, you don’t have to make those decisions anymore. If workloads self-manage, you spend your time launching new services, improving existing processes, or learning new skills. You innovate.
So, what does a smart workload require?
- Full-stack visibility: The workload has to understand what’s in your environment, the interdependencies, and the tradeoffs. The last two are critical. It’s not enough to pull together monitoring data from different systems and consolidate it into one view. The workload needs to understand the relationships between containerized services and platforms, functions, and legacy virtualized systems, across any cloud. Will a change here cause a change there? If I scale out a pod, is there enough capacity on the node? Is there enough capacity on the server? These are all important questions that only full-stack visibility can answer.
- Real-time analytics: Remember my point about dynamic complexity beyond human scale? Ensuring SLOs in this new landscape requires continuous, dynamically adjusting resource management. In other words, it requires 24/7 decision-making about when to scale, by how much; when to scale back; where to place a workload; how to configure (or re-configure) a workload—all based on real-time resource needs.
- Automatable actions: You can have the most powerful analytics in the world telling you exactly what to do, but if the system can’t execute those actions, then what’s the point?
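The full-stack questions in the list above (“is there enough capacity on the node? on the server?”) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Layer` class, the `first_bottleneck` helper, and the millicore figures are all hypothetical, and a real system would pull these numbers from the platforms it manages and consider more than one resource.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One layer of the stack (hypothetical model): a node, a server, etc."""
    name: str
    capacity_millicores: int
    used_millicores: int

    def headroom(self):
        return self.capacity_millicores - self.used_millicores


def first_bottleneck(request_millicores, stack):
    """Return the name of the first layer that cannot absorb the request,
    or None if every layer in the stack has enough headroom."""
    for layer in stack:
        if layer.headroom() < request_millicores:
            return layer.name
    return None


# A pod's node looks fine in isolation, but the server underneath is nearly full.
stack = [
    Layer("node", capacity_millicores=4000, used_millicores=3200),
    Layer("server", capacity_millicores=16000, used_millicores=15500),
]
print(first_bottleneck(500, stack))  # None: both layers can take 500m
print(first_bottleneck(600, stack))  # server: the node could, the server cannot
```

The second call is the interesting one: a check that only looked at the node would approve the scale-out, and the congestion would surface one layer down.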
How to make this all work? The right abstraction. It has to be simple enough to be applicable to any new technology and elegant enough to capture what’s really going on—and there’s a lot going on. Observe the fundamentals: applications have resource needs that are met from a limited pool of resources. The constraints may be budget, capacity, or compliance rules. There is demand and there is supply. These problems need to be looked at as an interconnected supply chain that services the workload—whether it’s a container or a function or (heaven forbid!) a VM.
Pumpkin Spiced Latte, anyone?
Can actions in the environment be abstracted to economic decisions? A service relies on a pod which relies on a node which relies on a server, and so on, to function. Sounds like a supply chain to me.
What if that service was coffee and the pod a Starbucks? You and your friends have a compulsive obsession with Pumpkin Spiced Lattes. So does everyone else in town. Demand increases. So many lattes to be made, not enough baristas. So Starbucks deploys a new pod—I mean “store”—in the same neighborhood. This neighborhood is called Worker Node 2. It’s a popular place, lots of other shops (you know, “pods”) pop up. But, as demand increases so do the rent prices in Worker Node 2. At some point, it doesn’t make sense for your favorite Starbucks to stay in the neighborhood—so it moves to the Worker Node 3 neighborhood a couple blocks over. Better rent prices there.
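For readers who prefer code to coffee, the rent analogy can be sketched as a toy market in Python. Everything here is an assumption made for illustration: the pricing curve (price rises steeply as a node approaches full utilization), the `price` and `cheapest_node` helpers, and the node figures. It is one plausible way to express the idea, not a description of any particular product’s algorithm.

```python
def price(utilization):
    """Hypothetical rent curve: cheap when a node is idle, steep when it is busy."""
    if utilization >= 1.0:
        return float("inf")  # no room at any price
    return 1.0 / (1.0 - utilization) ** 2


def cheapest_node(nodes, pod_demand):
    """Quote each node at the utilization it would have after taking the pod,
    and return the name of the node with the lowest quote."""
    quotes = {
        name: price((used + pod_demand) / capacity)
        for name, (used, capacity) in nodes.items()
    }
    return min(quotes, key=quotes.get)


# (used CPU cores, total CPU cores) per neighborhood
nodes = {
    "worker-node-2": (7.0, 8.0),  # popular, crowded
    "worker-node-3": (3.0, 8.0),  # a couple of blocks over
}
print(cheapest_node(nodes, pod_demand=1.0))  # worker-node-3
```

The appeal of this kind of abstraction is that the same comparison works whether the “shop” is a container, a function, or a VM, since each one is simply a buyer looking for the cheapest supplier that can meet its demand.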
Have I lost you? If so, I’m sorry. We can chat over lattes at KubeCon. The bottom line is that there are ways to enable software to make decisions, to do the things that we can’t and really don’t want to do, like continuous resource management. The world is automating, and that’s a good thing because we can elevate people into roles where they can innovate and think creatively to solve bigger problems.
Feature image via Pixabay.