Adobe’s Internal Developer Platform Journey and Lessons

In Adobe's journey toward an internal developer platform, a number of themes have emerged. A look at how the company has dealt with them.
Sep 7th, 2022 12:33pm by Srinivas Peri

In the last 25-plus years I’ve created software components and distributed service development frameworks, and built and led teams. Most recently I have been driving developer productivity for Adobe’s service development, deployment and management systems. My team presented on Argo-based service delivery at last year’s ArgoCon and blogged about how we are powering a cloud native transformation at Adobe.

Through it all, but especially in the context of platform development, a number of themes have emerged — cautionary tales if you will — of abstraction, customization, support, over-specialization, over-planning and platform team hubris.

Pure Abstractions are a Trap

Srinivas Peri
Srinivas is director of Ethos, the internal developer platform for cloud platform engineering at Adobe. In his 19 years at Adobe, he has moved from tool engineer responsible for releasing one core component to the owner of a daily release system for more than 70 Creative Suite components. He led the creation of the early deployment system for Shared Cloud and Creative Cloud, then moved on to building cloud services and cloud service frameworks, and finally to Ethos. His responsibilities have grown to include leadership of the entire Developer Productivity and Growth engineering group, delivering capabilities for service frameworks, provisioning, CI/CD, observability, diagnostics and developer support.

In the early days of the cloud, every team at Adobe had their own cloud accounts, their own deployment systems and their own wildly differing levels of maturity. It quickly became obvious that we should standardize, so the key problems of cost savings, compliance, security and reliability could be solved once for the benefit of all.

When we started this journey in 2016, Kubernetes was in its early days, not yet ready to support Adobe’s cloud offerings at scale. The best alternative was Mesos, but even then we knew we were in a changing landscape.

So rather than expose our users to the raw platform, we created an abstraction: a “service spec.” The service spec described everything about how a service should be provisioned and deployed. Custom in-house software then transformed the service spec into the necessary primitives at deployment time, and our platform took off, quickly growing to support over 1,000 services and developers.

But as we grew in scale and needs, our homegrown solution on top of Mesos was starting to struggle, and Kubernetes had matured. It was time for a change. Here’s where our abstraction, our service spec, saved the day.

We built some custom migration tools and were able to move all those running services from Mesos clusters to Kubernetes clusters without downtime — our new backend just translated the service spec into Kubernetes configurations, and with a few minor hiccups everything worked. Mesos clusters were shut down and cost savings were celebrated.
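The idea of one spec feeding interchangeable backends can be sketched roughly as follows. This is a minimal illustration, not Adobe's actual service spec or tooling: the `ServiceSpec` fields and both renderer functions are hypothetical, and the Mesos-side output only imitates a Marathon-style app definition, since the article does not describe the real format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical, heavily simplified service spec (all fields are assumptions)."""
    name: str
    image: str
    replicas: int
    port: int


def render_mesos_style(spec: ServiceSpec) -> dict:
    """Render the spec for a Mesos-era backend (illustrative Marathon-like shape)."""
    return {
        "id": f"/{spec.name}",
        "instances": spec.replicas,
        "container": {"type": "DOCKER", "docker": {"image": spec.image}},
    }


def render_kubernetes(spec: ServiceSpec) -> dict:
    """Render the same spec as a Kubernetes apps/v1 Deployment manifest."""
    labels = {"app": spec.name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec.name, "labels": dict(labels)},
        "spec": {
            "replicas": spec.replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": spec.name,
                            "image": spec.image,
                            "ports": [{"containerPort": spec.port}],
                        }
                    ]
                },
            },
        },
    }


# The service owner's input never changes; only the backend does.
spec = ServiceSpec(
    name="thumbnailer",
    image="registry.example/thumbnailer:1.4",
    replicas=3,
    port=8080,
)
mesos_app = render_mesos_style(spec)
k8s_deployment = render_kubernetes(spec)
```

Because service owners only ever edit the spec, swapping `render_mesos_style` for `render_kubernetes` is invisible to them, which is the property that made the migration possible.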

A few more years down the road and things weren’t quite so rosy. Our service spec allowed for a very simple “paved path” approach suitable for basic apps serving REST APIs or workers, and we were supporting nearly 4,000 of these! And as more and more teams at Adobe came on board looking for the platform cost savings and those guarantees of security, compliance and reliability, their requirements were more and more varied and didn’t quite fit the mold of what our abstraction could provide.

The biggest and most skilled teams were able to build custom solutions directly on our managed clusters, not using our abstraction so they could leverage the full freedom and power of Kubernetes. Of course, they also had to take full responsibility for deployment systems and shoulder greater burdens around security and reliability. And even those who were reasonably satisfied with our canned solution were agitating for more power and more flexibility.

This is where we began to realize we were trapped by our own abstraction. Teams could not use parts of the abstraction and bypass it where they needed to customize the canned solution directly. We were offering an “all or nothing” solution that was not scalable.
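To make the missing piece concrete, here is a minimal sketch of what a partial bypass could look like: the paved path generates a manifest, and a team overrides only the fields it needs. This is not something the platform described here offered; `deep_merge`, the manifest fragment and the override are all hypothetical.

```python
import copy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Output of the paved path (hypothetical, trimmed to a few fields).
generated = {
    "kind": "Deployment",
    "spec": {"replicas": 3, "strategy": {"type": "RollingUpdate"}},
}

# A team that needs one thing the spec cannot express overrides only that field.
team_overrides = {"spec": {"strategy": {"type": "Recreate"}}}

final_manifest = deep_merge(generated, team_overrides)
```

Everything the team did not touch, such as the replica count, still comes from the paved path. Note that in this sketch lists and scalar values are replaced wholesale rather than merged.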

The Temptation of Homegrown Software Has a Cost

Our custom solutions had grown complex over time, and the effort to add features, as well as keep up with new developments in Kubernetes, was consuming more and more of the team’s time, leaving little room for innovation. We were falling behind. And worse, we hadn’t built for extensibility. Even with the best intentions, it wasn’t practical for our users to contribute changes they needed when we weren’t able to prioritize them.

Fanatic Support Drives Massive Adoption, but Consumes Platform Team Resources

At every stage (first Mesos, then Kubernetes, and everything we do next) we’ve had to struggle with institutional inertia and the (not unreasonable) fear of change. We can promise great things, but how do we get our users to come on board, to make the switch from a working but outdated solution to something newer and better while addressing their valid concerns about the risk of any change to a production service? There are always a few brave early adopters excited about the technology, but the other 90% are the big challenge.

This is when you need to know your organization. Identify a few key players who are struggling with the current system. Maybe they’re on the paved path, but they can’t grow without more freedom. Maybe they’ve been maintaining a custom solution, but are tired of the operational management required there. Look at the teams that are struggling the most and who are also working on key business needs for the company. Then work with them.

Take some of the engineers who have been building the new solution and send them out to join those client teams. Bring the expertise, train by example and do whatever it takes to make them successful. Roll the lessons learned back into the product. Those success stories will go viral around the company and resistance to change will begin to erode.

This is a fantastic way to start, but it isn’t sustainable. It doesn’t scale. There will always be a few business-critical projects or teams that deserve “white glove” support, but we’ll always have a small number of engineers building the platform. Without some careful management, support can expand to consume all their time. And realistically, we’re not going to hire a bevy of support engineers to do it for them.

So self-service becomes key. Smart error messages that include troubleshooting links. Intelligent agents that can search available documentation to answer questions. Certification and recognition programs for community “champions” who can solve problems for their groups. Automation at every level. We apply every multiplier we can, and then for the few support tickets that make it to the team, we reward quick resolution and client satisfaction.
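As one small example of the self-service idea, an error message can carry its own troubleshooting link. The sketch below is illustrative only: the `PlatformError` class, the error code, the quota check and the `docs.example.internal` URL are all made up for the example.

```python
class PlatformError(Exception):
    """Error that says what went wrong and where to read about fixing it."""

    def __init__(
        self,
        code: str,
        summary: str,
        docs_base: str = "https://docs.example.internal/troubleshooting",
    ):
        self.code = code
        self.summary = summary
        # Each error code maps to its own troubleshooting page (hypothetical URL scheme).
        self.help_url = f"{docs_base}/{code.lower()}"
        super().__init__(f"[{code}] {summary}\nTroubleshooting: {self.help_url}")


def check_replicas(requested: int, quota: int) -> None:
    """Hypothetical deploy-time check that fails with a self-service message."""
    if requested > quota:
        raise PlatformError(
            "QUOTA-EXCEEDED",
            f"requested {requested} replicas but the namespace quota is {quota}",
        )


try:
    check_replicas(requested=12, quota=10)
except PlatformError as err:
    message = str(err)
    print(message)
```

The point of the design is that the user sees the next step in the same place they see the failure, before a support ticket is ever opened.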

Developers Need to Invest in Both Breadth and Depth

Whatever solutions we build, we need subject matter experts (SMEs) to create and maintain them. But another trend we’ve fought with over the years is over-specialization. Developers find their niche, they become experts in their component and don’t worry about anything else — they become siloed.

This has some benefits, but brings a lot of issues as well, especially in the context of a platform. Experts often demonstrate very high productivity in their areas of focus. But they’re more likely to face integration issues with systems they don’t understand. And they become single points of failure for the organization should they leave.

We’ve tackled this in three ways. We ensure that every team member is an expert user of the platform through training and daily use for our own deliverables — we eat our own dog food. We then assign every team member to a support rotation, where they triage incoming issues for the whole platform. And finally, we encourage mobility. We regularly shift developers between components based on their desires and the organization’s needs.

We do all this with care, however. It’s still critical that developers have the focus to get their work done. Spreading support rotations across the whole team means those rotations don’t come up too often, and component shifts are relatively infrequent for any individual developer. This lets us create our “T-shaped” developers, with a breadth of knowledge across the whole platform and depth in a few components.
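The arithmetic behind spreading the rotation is simple: with one person on support per week, a team of N sees each engineer on rotation once every N weeks. A minimal sketch, assuming a weekly rotation (the article does not state the actual cadence) and made-up names:

```python
from datetime import date, timedelta


def weekly_rotation(team: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one person per week, cycling through the whole team in order."""
    return [
        (start + timedelta(weeks=i), team[i % len(team)])
        for i in range(weeks)
    ]


team = ["ana", "bo", "chen", "dee", "eli", "fay"]  # hypothetical names
schedule = weekly_rotation(team, start=date(2022, 9, 5), weeks=12)

# With six people, each engineer is on support only one week in six.
shifts_for_ana = [week for week, person in schedule if person == "ana"]
```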

Perfection Is the Enemy of Progress

Another thing we’ve had to juggle over the years is short-term wins vs. long-term planning. Especially when starting something new, something risky, the organization tends to shift toward caution over time. Planning groups are formed. Specs are written in exhaustive detail and reviewed endlessly. Schedules extend…

There’s obviously a better way: “be agile,” “fail early,” all that industry wisdom. But it’s easy to slip into modes of operation that discount it, because resources are always constrained and we don’t want to commit to things without some certainty.

There’s no magic solution, but we found that if we can spot when we’re getting stuck and flip our mode, we get great results. Don’t write that perfect spec. Take a developer (or two, or three) and give them that skunk-works project, remove all distractions and let them innovate for a few weeks. Evaluate their proof of concept by working closely with selected lighthouse users, and if it’s promising, fold it back into the rest of the team and the roadmap. Take those insights and write a light “one-page spec” to be sure the vision is clear. Iterate and deliver the value that the platform’s clients are depending on.

The Platform Team Will Become a Bottleneck

Another natural trend in any platform team is to lose sight of our clients over time. Phrases like “no snowflakes” and “your use case doesn’t fit” start to seem natural. But if the platform team wanders off into the tall grass chasing their perfect solution, they become irrelevant. When all we tell our users is “no,” our users will start looking for other solutions. And then those initial multipliers that come with having a platform, which are key benefits to the company — cost savings, compliance, security and reliability — are lost. And that’s how platforms die.

We avoid this by staying in touch with senior leadership and with key clients — especially new, business-critical initiatives at the company. We make sure the platform evolves to meet their needs. The platform must be general enough to make developers’ lives easy and flexible enough to support all the required use cases. It’s not easy; it takes continuous re-invention.

What’s Next

That’s been our journey — from an idea that we could do things better, to massive scale and continuous adaptation. And now we’re reinventing ourselves again. We were at that inflection point where our abstraction was too limiting, maintaining our custom solutions was consuming the team, and our users wanted more.

So we thought about the things we’ve learned — abstraction is good, but inflexible abstraction can be a trap; building solutions in-house allows total freedom, but that freedom comes at the cost of ongoing maintenance — and we started looking for a new solution. Something that could split the difference between our canned “paved path” solution and raw Kubernetes.

So we spun up that skunk-works project and sent a few developers off to learn everything they could about the open source CNCF projects and Argo in particular, and how we could apply them to our world. And we struck gold!

We’re trading the heavy abstraction of our service spec for the light abstraction of Helm charts. We’re trading our suite of custom components for Argo. We’re designing for extensibility.

We’re including our key lighthouse users in the development process from the beginning, combining the powers of open source and inner source. And we’re once again building custom conversion tools — this time to translate our old service spec into Helm charts and Argo templates that our users can then customize at will, unlocking our platform for all of Adobe.
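A conversion tool of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the actual converter: the service-spec fields, the chart value names and the repository URL are hypothetical, and "Argo templates" is interpreted here as an Argo CD `Application` manifest. Output is serialized as JSON, which YAML tooling generally accepts, to keep the example free of third-party dependencies.

```python
import json


def to_helm_values(service_spec: dict) -> dict:
    """Map hypothetical service-spec fields onto values for a shared Helm chart."""
    return {
        "replicaCount": service_spec["replicas"],
        "image": {"repository": service_spec["image"], "tag": service_spec["tag"]},
        "service": {"port": service_spec["port"]},
    }


def to_argo_application(service_spec: dict, repo_url: str, chart_path: str) -> dict:
    """Build an Argo CD Application manifest that deploys the chart from Git."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": service_spec["name"], "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": chart_path,
                "targetRevision": "HEAD",
                "helm": {"valueFiles": ["values.yaml"]},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": service_spec["name"],
            },
        },
    }


# Hypothetical legacy spec, as the old platform might have stored it.
old_spec = {
    "name": "thumbnailer",
    "image": "registry.example/thumbnailer",
    "tag": "1.4",
    "replicas": 2,
    "port": 8080,
}

values = to_helm_values(old_spec)
application = to_argo_application(
    old_spec,
    repo_url="https://git.example.internal/thumbnailer/deploy.git",  # made-up URL
    chart_path="chart",
)

values_json = json.dumps(values, indent=2)
application_json = json.dumps(application, indent=2)
```

Once generated, the values file and the `Application` belong to the service team, who can edit them like any other chart. That is the difference from the old spec, where the generated output was hidden from users.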

But we’re not stopping there. Our next challenge is to take developer productivity up another level for our users by creating tighter integrations with things like cloud providers, monitoring solutions, observability, distributed tracing and more. We’re starting to integrate all of Adobe’s disparate internal offerings into an internal developer platform, or IDP.

With all this in place we’re providing a better platform at lower cost, with higher flexibility and lower friction for our users. Everybody wins.

Closing Notes

I’ll be a keynote speaker at ArgoCon, taking place Sept. 19-21 in Mountain View, California, and I encourage anyone interested to register and join Adobe’s IDP Workshop Session.

In this workshop, held in the spirit of open community discussion, we’ll share our journey and brainstorm with you and other community thought leaders on making an IDP a reality, based on Argo and other open source projects.

We look forward to your active participation and hearing your thoughts and opinions on these topics.
