Platform Engineering: Challenges and Solutions
Hybrid- and multicloud can break the common-sense ergonomics and economics driving the rise of platform engineering. Here are some patterns useful for meeting these new challenges.
Much has been written over the past two years about platform engineering. This is a job concerned with designing, building, maintaining and extending a self-service, automation and abstraction framework that makes platform components (e.g., Kubernetes, databases and services provided by underlying cloud frameworks) consumable, performant and manageable by service developers, application developers and site reliability engineers (SREs). The illustration below shows where platform engineering fits in the job description “stack.”
Is the platform engineering job really required, and is it distinct from other jobs (e.g., DevOps, SRE)? Absolutely. The rise of container orchestration on robust — but for newcomers, complicated — platforms like Kubernetes creates massive opportunities.
Opportunities also arise from the increasing popularity of Kubernetes as the universal platform for applications and as the pave-the-world framework for underlying infrastructure and services (plus tools, standards and best practices):
- Clean separation of concerns: Platform engineers can create an upward-facing API that lets service engineers consume and use NoSQL databases such as Cassandra or high-cardinality time-series databases and visualization frameworks (e.g., Amazon OpenSearch; you thought I was going to say Elastic, right?). In the process, they construct APIs that let application developers consume and use services in fully abstracted ways. They can help SREs create operational APIs and automation to marshal, deploy and monitor these services on or behind/underneath K8s. They can then let actual Cassandra experts tweak and tune the Kubernetes operators that maintain, monitor and scale the specific components and/or services themselves.
- Deep standardization of tooling and best practices: In this Kubernetes-centric environment, clean separation of concerns is more about actual (business-meaningful, explicable to laypeople) concerns and way less about tooling silos. Everyone can use pretty much the same tools, and there can be lots of sharing, standardization, policy, etc., that:
  - Makes the work of one group findable and, in general terms, comprehensible to other groups. Read: easier collaboration, faster personnel onboarding, less confusion in crises, faster times-to-resolution.
  - Limits the blast radius of weirdness. Sure, the Cassandra experts can break a Cassandra (hopefully only on staging). But if they’re coloring within lines established by the platform engineers, nothing else should fail. And new application pods that need Cassandra should find a healthy one.
- Abstraction and outward simplification: In this model, APIs and automation hide complexity and efface minor differences between functionally/semantically equivalent services within and around the platform, while further APIs determine how these services work and how they are exposed to developers. Life gets simpler: portability increases at the application layer and in application-lifecycle management tooling, and potential lock-in factors are reduced.
- Kubernetes-ification of services: Kubernetes provides a model for declarative configuration, strong contracts about how basic things will work, growing standards for integrating third-party solutions (e.g., load balancing, DNS, ingress, CNI and CSI), plus plenty of ways to customize around this model. Platform engineers and service engineers should, in many cases, be able to collaborate efficiently so all platform components and provided services work “just like Kubernetes.”
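To make the "upward-facing API" idea concrete, here is a minimal Python sketch of one way such an abstraction could look. Every name in it (`DatabaseBinding`, `DatabaseProvisioner`, the two backend classes, `request_database`) is hypothetical and invented for illustration, as are the hostnames and the choice of using the application name as a namespace. It is not the API of any real product, only a picture of how two functionally equivalent backends can sit behind one call.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class DatabaseBinding:
    """What an application receives: connection facts, no provider details."""
    host: str
    port: int
    keyspace: str


class DatabaseProvisioner(Protocol):
    """Hypothetical contract that every backend-specific implementation satisfies."""

    def provision(self, app_name: str) -> DatabaseBinding: ...


class InClusterCassandra:
    """Hypothetical backend: Cassandra run by an operator inside the cluster."""

    def provision(self, app_name: str) -> DatabaseBinding:
        # Assumes a Service named "cassandra" in a namespace named after the app.
        host = f"cassandra.{app_name}.svc.cluster.local"
        return DatabaseBinding(host, 9042, app_name.replace("-", "_"))


class ManagedCassandra:
    """Hypothetical backend: a provider-hosted, Cassandra-compatible service."""

    def __init__(self, endpoint: str, port: int = 9142) -> None:
        self.endpoint = endpoint
        self.port = port

    def provision(self, app_name: str) -> DatabaseBinding:
        return DatabaseBinding(self.endpoint, self.port, app_name.replace("-", "_"))


def request_database(provisioner: DatabaseProvisioner, app_name: str) -> DatabaseBinding:
    """The upward-facing call: identical for developers on every backend."""
    return provisioner.provision(app_name)
```

An application team would call `request_database(...)` the same way everywhere. Which provisioner gets passed in is a platform decision, and the Cassandra experts can change what happens inside either backend without touching application code.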
But There’s a Problem
The model described above is full of promise, but it’s only straightforward and efficient when an organization can make hard choices and place heavy bets on a particular spin and configuration of Kubernetes, and on a particular environment in which Kubernetes can run. If you make those choices, platform engineers can then work in a finite domain and solve each problem once. Anything after that is about optimization and new features (i.e., building value).
Sadly, however, the result of making these choices is often some form of lock-in. “Kubernetes and what it runs on” is the biggest part of “the platform.” Changing the platform itself can be complex, time-consuming and expensive. Think about running a fully platform-engineered Brand A Kubernetes setup on an in-house IaaS data center — say, VMware. And now you want to switch to hybrid cloud and build the same thing on AWS, providing all the same platform services and using those to support the same service-level APIs for devs.
It’s doable, sure. But before the job is done, before your applications are trivially portable from the VMware environment to the AWS environment, you’ll end up a) learning more about AWS than you ever wanted to learn, and b) touching every part of your platform and service engineering stack. Add another public cloud, a bare-metal cloud and an edge server model, and your platform and service engineers are learning new ways to do what should be commodity things (or hiring more and more folks with infrastructure-specific skills), and solving the same problems again and again. Each time they solve them, that’s another codebase, toolkit, procedural model and maybe monitoring framework that everyone needs to learn, work within and maintain.
As your multicloud builds out, this all eats time, increases risk (more divergent, platform-specific automation tooling and code always means more risk), potentially increases attack surface (more “unknown unknowns”), and locks down the agenda of platform engineering to fundamental enablement, rather than adding strategic value and helping everyone above them in the stack ship code faster.
Plenty of Mediocre Solutions
Of course, DIY-oriented end users, cloud service providers, vendors, etc., have stepped up to offer solutions to parts of this growing snarl of problems. Some examples:
“We have the technology!” — Given the wide availability and elegance of multicloud-capable declarative and hybrid declarative/procedural deployment tooling (waves to Pulumi!), plenty of people feel capable of building out a container multicloud without external or product support. There are 99 problems latent here. The biggest is that this approach demands exhaustive learning, or hiring of expertise, not just about Kubernetes but about underlying infrastructure specifics. DIY multicloud tends to drag platform engineers down into the underlying infrastructure instead of letting them focus on making life easier for service engineers. It works best (arguably it only works) when the platform engineering stack is shallow and simple. In fact, DIY approaches reinforce a misunderstanding about the role of platform engineering: that it’s really just infrastructure engineering for K8s.
“Who needs platform engineering? Here’s your platform!” — Covering up Kubernetes with a heavy PaaS layer seems to obviate the need for platform engineers entirely. If PaaS is run as an application on K8s, it creates a higher-order platform for certain kinds of routine development, e.g., web apps, and arguably saves both platform and service engineers some work hosting and providing services for these apps — a reasonable goal. But unless PaaS is run as an application on K8s, PaaS-oriented, “Kubernetes under the hood” solutions tend to force several kinds of lock-in:
- Apps get locked into the proprietary platform.
- The platform is locked to the vendor’s spin of K8s, and often the OS.
- If you want multicloud, the big challenge of moving a complex platform stack from one infrastructure to another still remains. Yes, the vendor will help you do this for underlying platforms they support, but don’t expect to move faster than the vendor’s roadmap for enabling providers.
“Put our cloud on your premises!” — This is the rationale behind Amazon Outposts and similar offerings that support a private cloud experience hybridized with your public cloud estate. Convenient? Maybe. CapEx to OpEx? Sure. But it works against commodification of public cloud services and makes switching cloud providers close to impossible. Prepare to go very, very deep into your provider’s One True Way. Plus, it tends to lock you into narrow, vendor-supported options for on-premises hardware and networking setups.
“Let us deploy that for you!” — There are also plenty of solutions out there to deploy consistent Kubernetes over multiple infrastructures. However, these tend to limit operational support of the underlying infrastructures: you may be able to deploy K8s easily on cloud X but not scale it, while easy scaling is available only on cloud Y. They also may or may not give simple, nuanced access to specific cloud-provider services that offer substantial value. In fact, these products seem to be designed to commodify public cloud and infrastructure, which is the wrong approach. Also, they may not support bare-metal data centers or edge servers at all, meaning you get to pretend you’re platform engineers while stressing over how to PXE-boot a bunch of blades, which is not a platform engineering job in the modern vision of the role.
What Real Solutions Look Like
Solving these problems demands a different kind of solution and potentially a different kind of vendor partnership. What do real solutions look like?
They focus on Kubernetes — Kubernetes continues evolving at lightspeed into a complete solution that developers and operators are eager to learn, one whose abstractions and functionality are meaningful and widely understood. If you want PaaS or serverless, that’s great. Good solutions exist that run well on Kubernetes, and platform and service engineers can build upon their value in lots of ways, but there’s no great value in hiding Kubernetes’ elegance under a proprietary facade due to fear, uncertainty and doubt.
On the contrary, distraction from Kubernetes itself will slow you down. A big part of the strategic value of platform engineering does depend on developing expertise in “the platform,” and driving innovation around leveraging that platform. This is valid and beneficial, so long as the platform (i.e., Kubernetes) is open, unconstrained and thus able to benefit from Cloud Native Computing Foundation ecosystem-wide innovation. As to making it consumable — that’s where platform and service engineers want to be focusing effort. Go build secure software supply chains! Go curate container images and Helm charts for dozens of useful components!
They make Kubernetes production-ready — While platform engineers should concern themselves with Kubernetes, a safe way to offload responsibility and reduce risk is to leave selection and integration of critical, best-of-breed extensions up to a vendor. It should go without saying that before platform engineering really takes off, you need a well-structured Kubernetes cluster model with first-class ingress, networking, perhaps distributed storage and fundamental monitoring/observability in a set of, shall we say, “opinionated but flexible” configurations that serve your application use cases. You need a cluster model that includes required elements and core functionality like a hardened container runtime and private registry. Ideally, this cluster model should be as host OS- and environment/infrastructure-agnostic as possible, drawing a clear line between “the platform” and “stuff under and around the platform” that enables portability across infrastructures and helps the solution offer “single panes of glass” into cluster availability and performance.
The goal of the whole exercise, in fact, is to find solutions that can deliver and lifecycle-manage this kind of feature-complete, hardened, secure, but still upstream-centric, standard Kubernetes on any infrastructure. That’s what ultimately lets your developers and DevOps folks trivially port applications, and service automation, from private cloud to public cloud to bare metal to edge.
They focus on operationalizing Kubernetes — Cluster configurations need to be planned for smooth and risk-free operations. For Kubernetes, that mostly means insisting on, or at least strongly incentivizing, creation of clusters with highly available manager node sets and sufficient worker capacity (of each mission-critical node type, e.g., Windows workers, Linux workers and GPU nodes) to enable self-healing, rolling updates and other processes to proceed expeditiously without taking APIs or applications offline.
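As a rough illustration of what "insisting on" such configurations could mean in practice, the Python sketch below checks a proposed cluster plan before anything is created. `ClusterPlan`, `plan_problems` and the `surge_nodes` headroom parameter are hypothetical names invented here. The manager rule assumes a quorum-based control plane, where three members are the minimum that survives the loss of one and an even member count adds no extra failure tolerance.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterPlan:
    """Hypothetical description of a cluster someone wants to create."""
    managers: int
    # node type -> (nodes planned, nodes needed to carry peak load)
    worker_pools: dict[str, tuple[int, int]] = field(default_factory=dict)


def plan_problems(plan: ClusterPlan, surge_nodes: int = 1) -> list[str]:
    """Return human-readable reasons the plan is not ready for production."""
    problems = []
    if plan.managers < 3:
        problems.append(
            f"{plan.managers} manager(s): need at least 3 to survive losing one"
        )
    elif plan.managers % 2 == 0:
        problems.append(
            f"{plan.managers} managers: an even count adds no failure tolerance"
        )
    # Each pool needs headroom so nodes can be drained without losing capacity.
    for node_type, (planned, needed) in sorted(plan.worker_pools.items()):
        if planned < needed + surge_nodes:
            problems.append(
                f"{node_type}: {planned} planned, but {needed + surge_nodes} "
                "needed to drain nodes during rolling updates"
            )
    return problems
```

A self-service portal could refuse to create a cluster, or simply warn, whenever `plan_problems(...)` returns anything. Which of those is right is a policy question for the platform team.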
They exploit Kubernetes to manage infrastructure — It’s a good sign when vendors drink their own Kool-Aid (or kombucha) and use Kubernetes objects and best practices to configure, deploy, reconverge, scale and manage whatever infrastructures host your clusters. This is less a statement of religion and more an acknowledgment that Kubernetes provides methods and behaviors for making and keeping things right (dynamically) that are just more powerful than old-school automation tools, though there’s nothing wrong with having an operator call Ansible, either. For platform engineers, this layer of interlinked cluster and infrastructure automation is, or should be, a kind of textbook for how to do platform engineering across infrastructures.
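The "making and keeping things right" behavior mentioned above is usually described as reconciliation: compare declared state with observed state and act only on the difference. The Python sketch below shows that idea on a toy model of worker pools. The function names and pool names are invented for illustration, and a real controller would call infrastructure APIs and handle failures rather than edit a dictionary.

```python
def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return the actions that move observed node counts to the desired ones."""
    actions = []
    for pool in sorted(desired.keys() | observed.keys()):
        want = desired.get(pool, 0)
        have = observed.get(pool, 0)
        if want > have:
            actions.append(("add", pool, want - have))
        elif want < have:
            actions.append(("remove", pool, have - want))
    return actions


def apply_actions(observed: dict[str, int], actions: list[tuple[str, str, int]]) -> dict[str, int]:
    """Simulate the infrastructure carrying out the actions."""
    result = dict(observed)
    for verb, pool, count in actions:
        delta = count if verb == "add" else -count
        result[pool] = result.get(pool, 0) + delta
        if result[pool] == 0:
            del result[pool]  # an empty pool no longer exists
    return result


desired = {"linux-workers": 5, "gpu-workers": 2}
observed = {"linux-workers": 3, "windows-workers": 1}

actions = reconcile(desired, observed)
observed = apply_actions(observed, actions)
```

Running `reconcile` again after the actions are applied returns an empty list. That property, where repeating the loop is harmless once the world matches the declaration, is what separates this style from a one-shot provisioning script.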
They use this to provide a simple, cloud-native experience (one WebUI, one API) that lets you deploy, scale, observe and lifecycle-manage Kubernetes across all clusters and infrastructures — Put simply, this is the part of platform engineering that delivers the least value for your organization, requires the most time and expertise, and incurs the greatest risk. You want to outsource this to a responsible vendor with a deep bench, and get:
- Simple, couple-of-clicks, self-service cluster deployment and couple-more-clicks scaling on any infrastructure, delivering consistent, complete clusters that are ready for work and enable really trivial portability of apps and application-lifecycle automation around your multicloud.
- Easy, non-disruptive, low-risk cluster upgrades, ideally delivered as continuous updates that take minimal effort to evaluate and apply, without taking apps or cluster APIs offline.
- Easy integration of notifications, corporate directory and other centralized facilities with cluster management, reducing operational overhead.
- Most important: one API for doing a ton of strategic platform engineering work. That means one REST API that lets you deploy, upgrade and scale clusters across all your infrastructures in a simple, dependable way.
Given that singular API, you and your platform engineering team get to focus on, for example, building out efficient, standardized, cross-platform ways of helping your developers use blue/green or canary deployments. Which is what actually helps your business ship software faster.
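To show why a single API matters for automation, here is a small Python sketch that builds scaling requests for a fleet of clusters on different infrastructures. The endpoint path, the request body and the `FLEET` inventory are all made up for illustration, since the article does not describe any specific vendor's API. The sketch only constructs requests and sends nothing over the network.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiRequest:
    method: str
    path: str
    body: str


def scale_request(cluster: str, workers: int) -> ApiRequest:
    """Build the (hypothetical) request that sets a cluster's worker count."""
    if workers < 1:
        raise ValueError("a cluster needs at least one worker")
    body = json.dumps({"workers": workers}, sort_keys=True)
    return ApiRequest("PATCH", f"/v1/clusters/{cluster}", body)


# Hypothetical inventory: cluster name -> infrastructure it runs on.
FLEET = {
    "payments-prod": "aws",
    "payments-dr": "vmware",
    "store-edge-17": "bare-metal",
}


def scale_fleet(fleet: dict[str, str], workers: int) -> list[ApiRequest]:
    """Same request shape for every cluster, whatever it runs on."""
    return [scale_request(name, workers) for name in sorted(fleet)]
```

Nothing in `scale_request` looks at the infrastructure. Under this assumption, AWS, VMware and bare metal differ only as inventory metadata, which is what frees platform engineers to spend their time on deployment strategies instead of provider-specific plumbing.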