Cloud Migration and Platform Engineering at Large Organizations
As larger organizations transition to the cloud, platform engineering presents itself as an interesting way to facilitate that move — and hopefully anticipate and address up front the cognitive load that this migration often brings.
During a keynote at All Day DevOps, Yonit Gruber-Hazani, a customer engineer at Google Cloud, reflected on her experience in the public sector and how a platform-as-a-product mindset has helped organizations make the move. In this piece, we summarize some of the patterns and anti-patterns of platform engineering that she’s observed — and where they stay the same and where they differ depending on the size of your company.
Why Platform Engineering?
Imagine you have a new developer joining your team, Gruber-Hazani said to kick off her talk. “Let’s say we give them a new laptop and we say ‘Here is a new empty laptop, go ahead and find the operating system you want and install it, and do the security hardening of your laptop, set up your networking, install your antivirus and your IDE [integrated development environment] and go and add yourself to the active directory or to any user management system in our company. And then you can start developing’.”
Of course, we don’t do this. The IT department paves the way with a laptop that has all of this installed and ready to go.
“We give them everything they need to focus and start their journey as a developer in the company,” she said.
This is the anti-pattern that often triggers platform engineering. We would never make folks set up their own payroll and HR processes, so why do we give them so many choices for what Syntasso’s Abby Bangser called “non-differential but not unimportant work”? The platform team aims to resolve this via software developer enablement, by paving a Yellow Brick Road from onboarding all the way through to releasing. Development teams can still be autonomous in how they create value for the end user, but they don’t need to know about all things cloud, operations, security, et cetera.
This out-of-the-box kind of setup is the same thing you want to apply to onboarding new applications, new developers and new application teams, Gruber-Hazani said, “to give them the best development experience when they start developing applications in our production system.”
That’s why large organizations shouldn’t present the full Google Cloud landscape — and especially not the massive open source cloud native landscape — to developers or even teams to pick and choose what they want to use.
“We don’t want to do that because then we get a jungle because the DevOps team will then have to manage everything they choose, and we’ll need to support it, and we’ll have to maintain it and do monitoring,” she said. “And we don’t want to give the developer all this background noise. What they want to do is work on the application and the business logic behind it.”
So, why platform engineering? To reduce developer cognitive load, allowing them to reach a flow state that lets them just focus on building their applications — all with the help of a platform.
Developer Experience and Ever-Increasing Cognitive Load
Developer experience, or DevEx, as Gruber-Hazani defines it, “is why we come to work every morning. It’s why we keep going to work year after year. It defines how happy we feel in our work life, how accomplished we feel, and how helpful to others we feel at the end of the day. It’s what makes us human and able to work.”
DevEx in turn affects:
- Happiness at work
She explained the three DevEx metrics to measure productivity:
- Flow state: Getting “in the zone” is when a developer is fully involved in and enjoying their work. It takes an average of 23 minutes to get into that flow, which can be quickly interrupted by things like waiting for someone else to do something.
- Feedback loops: The response, and the response time, to something a dev does. Gruber-Hazani gave the examples of waiting for code to recompile or waiting to gain access to another virtual machine. Code reviews are an important feedback loop to close. In fact, the 2023 State of DevOps Report (referred to as the DORA report) found that teams with faster code reviews have 50% higher software delivery performance.
- Cognitive load: “The mental processing required to perform a task,” she explained. “And it can stop developers from creating and delivering value.” Poor documentation and manual, precarious steps add to cognitive load and burnout. By contrast, this year’s DORA report found that high-quality docs drive a 25% increase in performance.
Last year’s Stack Overflow Developer Survey found that 62% of all respondents spent more than 30 minutes a day searching for answers or solutions to problems, while 25% spent more than an hour a day. This avoidable frustration both breaks flow state and increases cognitive load, making documentation an essential part of any platform strategy at companies of all sizes.
What Is Platform Engineering?
If a platform is inherently different at each organization, what common patterns unite the sociotechnical discipline of platform engineering? Gruber-Hazani outlined:
- Repeatable tooling and models, like Infrastructure as Code and reusable CI/CD pipelines.
- Self-service via APIs.
- A one-to-many “vending machine of services,” especially at larger orgs where a platform must service 50 to 100 dev teams.
- Paved roads within guardrails.
- Organizationally specific workflows.
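To make the “vending machine” and “paved roads within guardrails” ideas concrete, here is a minimal sketch of a self-service request checked against guardrails before anything is provisioned. It is illustrative only and not from the talk: the template names, allowed regions, replica limit and the `Platform` class are all hypothetical, and a real platform would call Infrastructure as Code tooling rather than append to a list.

```python
from dataclasses import dataclass, field

# Hypothetical guardrails; real values would come from organizational policy.
ALLOWED_REGIONS = {"europe-west1", "us-central1"}
MAX_REPLICAS = 10


@dataclass
class ServiceRequest:
    team: str
    template: str
    region: str
    replicas: int = 1


@dataclass
class Platform:
    # Catalog of paved-road templates (names are made up for illustration).
    catalog: dict = field(default_factory=lambda: {
        "web-service": "container + load balancer + CI/CD pipeline",
        "batch-job": "scheduled container + logging",
    })
    provisioned: list = field(default_factory=list)

    def violations(self, req):
        """Return a list of guardrail problems; empty means the request is fine."""
        problems = []
        if req.template not in self.catalog:
            problems.append(f"unknown template: {req.template}")
        if req.region not in ALLOWED_REGIONS:
            problems.append(f"region not allowed: {req.region}")
        if not 1 <= req.replicas <= MAX_REPLICAS:
            problems.append(f"replicas must be between 1 and {MAX_REPLICAS}")
        return problems

    def request(self, req):
        """Self-service entry point: provision only if every guardrail passes."""
        problems = self.violations(req)
        if problems:
            return {"status": "rejected", "problems": problems}
        self.provisioned.append(req)  # stand-in for real provisioning
        return {"status": "provisioned", "service": f"{req.team}-{req.template}"}


platform = Platform()
ok = platform.request(ServiceRequest("payments", "web-service", "europe-west1", 3))
bad = platform.request(ServiceRequest("payments", "mainframe", "mars-north1", 50))
```

The point of the design is that a developer chooses from a short catalog instead of the whole cloud landscape, and a rejected request explains which guardrail it hit rather than failing silently.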
In her work over the last couple of years with Google Cloud customers in the public sector, Gruber-Hazani has found that the majority of applications need many of the same guardrails:
- Security, authentication and authorization
- Connections to backend systems (like structured databases)
- Graceful degradation and failover
- Load shedding and throttling
- Common shared libraries
- Testing infrastructure
- Release infrastructure
All this can typically be taken care of within a platform or a set of platforms, to help platform teams meet the goals of better serving developers by:
- Taming technical complexity
- Reducing costs
- Increasing development speed
- Enforcing safety and security guardrails
- Reducing cognitive load, toil and burnout
This is achieved by paving that golden path to production, with best practices and security guardrails. This is often wrapped in a service catalog, or more usually a series of service catalogs for different developer groups, that’s abstracted behind an internal developer portal like Backstage, or paired with a continuous delivery platform like Spinnaker.
“We want the developers to work safer and better and faster.” — Yonit Gruber-Hazani, Google
Another way that platform engineering differentiates itself from the top-down platforms of before is the platform-as-a-product mindset, which, Gruber-Hazani said, makes every platform engineer at least a part-time product manager. This requires every platform team to balance:
- Business goals
- Internal customer needs
- Product and team capabilities
By doing this, development teams understand how they can deliver value faster, while the business gains insight into the often ample cost center that is engineering.
Talk to Your Users
A common anti-pattern in platform engineering stems from the fact that platform engineers are engineers too. That means they are liable to think they know best, but a platform-as-a-product mindset relies on talking to your internal users. Indeed, this year’s DORA report found that teams that focus on users have a 40% increase in performance, and that holds whether those users are the external paying kind or your internal colleague-customers.
As you treat your platform as a product, you must build incrementally, constantly asking for feedback from your internal devs. Gruber-Hazani said this involves asking developers what products, tools and languages they like using and also which they want to use — which in turn not only helps with DevEx but supports recruitment and retention. Ask your devs if they want to move to containers or what kind of monitoring they’d like to see in the golden path.
That golden path must be paved with a continuous conversation with your developer customers, Gruber-Hazani said, and must include:
- User groups
- Usage reports
- Bug reporting
- Site reliability engineering (SRE)
- New features
Platform Engineering at Larger Organizations
But these are patterns common to most platform strategies. What makes it different at larger organizations?
The more complex your organization, the more complex your cloud cost. This is where FinOps becomes a must-have for cloud migration, so that you can track and optimize who is spending what in the cloud. This complexity increases in larger orgs and especially in the public sector, which features intense procurement processes involving hardware, servers and licensing. Large-scale FinOps requires budgeting and alerts. This also leads to different decisions. Gruber-Hazani gave the example of how moving to a virtual machine can be a way to reduce costs versus the cloud, or how teams can use spot machines for stateless work.
“Usually large organizations that are coming from the data center period, they still have those practices in place,” she explained, so start with FinOps early but meet finance where they are at.
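As a rough illustration of what “budgeting and alerts” can mean in practice, the sketch below totals spend per team and flags teams that have crossed a budget threshold or have no budget at all. Everything in it is an assumption for the example: the row format, the team names and the 50/90/100% thresholds are hypothetical, and real data would come from your cloud provider’s billing export.

```python
from collections import defaultdict


def spend_by_team(billing_rows):
    """Sum cost per team from rows shaped like {"team": str, "cost": float}."""
    totals = defaultdict(float)
    for row in billing_rows:
        totals[row["team"]] += row["cost"]
    return dict(totals)


def budget_alerts(totals, budgets, thresholds=(50, 90, 100)):
    """Return (team, message) pairs for teams over a threshold or without a budget.

    Thresholds are whole percentages of the team's budget (hypothetical defaults).
    """
    alerts = []
    for team, spent in sorted(totals.items()):
        budget = budgets.get(team)
        if budget is None:
            alerts.append((team, "no budget set"))
            continue
        crossed = [t for t in thresholds if spent * 100 >= t * budget]
        if crossed:
            alerts.append((team, f"{max(crossed)}% of budget reached"))
    return alerts


# Made-up billing rows and budgets for illustration.
rows = [
    {"team": "payments", "cost": 60.0},
    {"team": "payments", "cost": 35.0},
    {"team": "search", "cost": 10.0},
    {"team": "sandbox", "cost": 5.0},
]
budgets = {"payments": 100.0, "search": 100.0}

totals = spend_by_team(rows)
alerts = budget_alerts(totals, budgets)
```

Flagging spend that has no budget attached is a deliberate choice here: unattributed cost is usually the first thing a finance team coming from data center procurement will ask about.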
Another anti-pattern she has uncovered is kicking off your golden path effort with the most complex systems.
“Starting with updating the simpler systems with less dependencies will allow us far better outcomes than starting with the most complex system with the most dependencies,” Gruber-Hazani said. It also allows for early wins that nurture internal platform advocates.
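One simple way to act on this advice is to rank candidate systems by how many dependencies they have and start from the top of the list. The sketch below assumes you already have a dependency inventory. The system names are invented, and counting direct dependencies is only a rough proxy for complexity.

```python
def migration_order(dependencies):
    """Return system names, fewest direct dependencies first (ties by name)."""
    return sorted(dependencies, key=lambda name: (len(dependencies[name]), name))


# Hypothetical inventory: system name -> systems it depends on.
systems = {
    "billing": ["ledger", "auth", "mainframe"],
    "status-page": [],
    "intranet-wiki": ["auth"],
}

order = migration_order(systems)
```

With this inventory, the status page would be migrated first and billing last, which is the kind of early win the quote describes.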
This is where Team Topologies’ concept of the thinnest viable platform comes in: you build in smaller increments with tight feedback loops to make sure you are building, and maintaining, what your users actually want, without creating something overly complex too soon that just becomes more technical debt.
It’s also important to take baby steps as you build your golden path because she’s found that larger organizations often adopt high-value services too early, when there isn’t enough resilience or confidence in the platform.
Then, as your platform maturity progresses and you enter Day 2, Gruber-Hazani said, new platform team requirements arise:
- SRE on-call rotations and post-mortems
- Incident management
- Better CI/CD pipelines
- Infrastructure as Code
- More training for platform maintainers and users
- Cross-organizational architectural standards
- Budget strategy
- FinOps planning and forecasting
- Security throughout
Finally, Gruber-Hazani emphasized that knowledge reduces fear, so invest a lot of time into cultivating psychological safety by always giving your users insight into where your platform strategy is headed and how you can help them get there.
“We’re not replacing people,” she emphasized, because automation, especially at larger organizations, can trigger fear of job loss. This new golden path to the cloud must come with training for the new skills needed.
And, no matter the size of your org, continuously gather feedback on what your internal customers think.