How Intuit’s Platform Engineering Team Chose an App Definition
DETROIT — When Intuit’s platform engineering team set out to choose an application definition, it ultimately decided it required a data-centric approach.
Supporting more than 5,000 developers building products like TurboTax, QuickBooks, Credit Karma and Mailchimp, the team was looking to increase developer velocity and self-service by abstracting away the complexities of its underlying Kubernetes platform.
Initially, its deployment pipeline looked like this: developers maintained Kustomize base and overlay configs in their deployment repos, which were then picked up by Argo CD, Intuit’s primary tool for deploying manifests into its production Kubernetes clusters. Applications are isolated using namespaces.
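The talk did not show Intuit’s actual repo layout, but a minimal Kustomize base-and-overlay arrangement of the kind described might look like this (file paths and contents are illustrative):

```yaml
# base/kustomization.yaml -- shared defaults for a service
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/prod/kustomization.yaml -- production overrides layered on the base
resources:
  - ../../base
patches:
  - path: replicas-patch.yaml   # e.g. raise the replica count for prod
```

Argo CD would be pointed at a path like `overlays/prod` and render the merged result into the service’s namespace in the production cluster.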
But that created a number of problems:
No. 1 — Kubernetes and cloud complexities are exposed directly to application developers. An application developer routinely has to worry about questions like: How do I set horizontal scaling with minReplicas and maxReplicas? Is 15 seconds too low for a handshake interval on my application load balancer Ingress object? What’s the right maxUnavailable for my pod disruption budget? What are good CPU and memory limits for my service? What quota should I set for my application’s namespace?
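Concretely, answering those questions means owning raw Kubernetes objects like the following (the names and values here are illustrative, not Intuit’s defaults):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3          # is this enough headroom for failover?
  maxReplicas: 15         # what ceiling is safe for the cluster and the budget?
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service
spec:
  maxUnavailable: 1       # the "right" value depends on the replica count above
  selector:
    matchLabels:
      app: my-service
```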
So now they are worrying not just about building their Node.js or Java application, but also about the Kubernetes objects and the cloud complexities, Ragunathan explained.
No. 2 — Kubernetes deprecations are exposed. The developer now has to understand that Ingress v1beta1, for example, can be deployed on a Kubernetes 1.21 cluster just fine, but on Kubernetes 1.22 it’s going to break. The platform team works closely with application developers to migrate them from a deprecated set of APIs to a new set, but this is one more thing that causes friction and reduces developer velocity.
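The Ingress change is a good illustration of the burden: the v1beta1 form below is accepted on a 1.21 cluster but removed in 1.22, and the `networking.k8s.io/v1` replacement restructures the backend field. Service name and port here are placeholders:

```yaml
# Accepted on Kubernetes 1.21, removed in 1.22:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-service
spec:
  rules:
    - http:
        paths:
          - path: /
            backend:
              serviceName: my-service
              servicePort: 80
---
# The networking.k8s.io/v1 replacement; note the reshaped backend
# and the now-required pathType field:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
```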
No. 3 — Lack of operational input in the application definition. How do I enable high availability for my service using an application definition? How can I enable active disaster recovery? What if some of my services need external traffic? How do I specify that?
“We wanted a desired target state where application developer specifies the application intent. And that’s it. We do some magic behind the scenes and that gets deployed into a Kubernetes cluster to the right cloud resources,” Ragunathan said.
An application definition is an operational runbook that describes in code everything an application needs to be built, run, and managed.
“We wanted to take a methodical approach in understanding how we can actually solve the problems [with] existing tools that would also fit with our Intuit toolchain and our use cases,” she said.
“Our main requirements for the app spec needed to be application-centric; there shouldn’t be any leakage of cloud or Kubernetes resources into the application specification. And it had to meet the deployment as well as the operational needs of the application,” she said.
“The two choices we had were the Open Application Model, which suited our needs pretty well, or we could go with a templating-style model where you had to provide a bunch of input parameters. But there was also a lot of abstraction leaked into the application spec. So it was easy for us to go with an OAM-style application specification.”
At a high level, the developer should be able to describe their intent: “This is the image that I want; here are my sizing needs, both horizontal and vertical. And I had a way to override these traits, depending on my environment, and be able to generate the Kubernetes resources.”
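An OAM-style spec expressing that intent, as implemented for example by KubeVela’s `Application` resource, might look like the sketch below; the component name, image and trait values are hypothetical:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-service
spec:
  components:
    - name: my-service
      type: webservice            # component type supplied by the platform
      properties:
        image: example.registry/my-service:1.4.2
        cpu: "0.5"                # vertical sizing
        memory: 512Mi
      traits:
        - type: scaler            # horizontal sizing, overridable per environment
          properties:
            replicas: 3
```

No Deployment, HPA or Ingress appears in the spec; the platform’s component and trait definitions generate those Kubernetes resources behind the scenes.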
After studying the pros and cons of Helm, Kustomize, KubeVela, and Crossplane, they sent teams of developers off to create proofs of concept for each one to evaluate the underlying utilities and fit for the organization.
“We had multiple camps of developers, and we found a lot to like in each of the solutions. And all the developers came back advocating for the solution that they POC’d. So we really needed a better way of getting results and comparing them against each other,” Downey said.
So the idea emerged for a more data-centric approach. The platform team sent out a survey to its developer groups, with the different aspects weighted according to company priorities: time to market, learning curve and effort required to implement.
They looked at aspects such as controllers, implementing logic, templating, scalability, technical fit and flexibility. In that way, they narrowed the choices down to Helm and Kustomize.
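The weighted comparison described above boils down to a simple computation. The sketch below shows the idea; the criteria, weights and per-tool scores are invented for illustration and are not Intuit’s actual survey data:

```python
# Weights reflect hypothetical company priorities (must sum to 1.0 here).
weights = {"time_to_market": 0.4, "learning_curve": 0.3, "implementation_effort": 0.3}

# Hypothetical 1-5 survey scores per tool and criterion.
scores = {
    "Helm":       {"time_to_market": 5, "learning_curve": 3, "implementation_effort": 3},
    "Kustomize":  {"time_to_market": 4, "learning_curve": 4, "implementation_effort": 3},
    "KubeVela":   {"time_to_market": 3, "learning_curve": 2, "implementation_effort": 2},
    "Crossplane": {"time_to_market": 2, "learning_curve": 3, "implementation_effort": 2},
}

def weighted_score(tool_scores: dict) -> float:
    """Sum of score * weight across all criteria."""
    return sum(tool_scores[c] * w for c, w in weights.items())

# Rank tools from highest to lowest weighted score.
ranked = sorted(scores, key=lambda tool: weighted_score(scores[tool]), reverse=True)
print(ranked)
```

With these invented numbers the raw ranking favors Helm with Kustomize close behind, which mirrors the narrowing-down the article describes; the qualitative analysis then breaks the tie.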
“If we just pick the raw numbers, we would go with Helm. But we really wanted to analyze in a qualitative way why we would choose one solution or the other,” Downey said.
So again, they assembled a team and asked, “What’s the learning curve for Helm or Kustomize? How much effort is there?” The company was already using Kustomize, so there was no learning curve there.
“This is really what’s kind of the key difference here. We had already adopted Kustomize; our entire control plane and CI/CD pipeline is all based on Kustomize today. … Doing all the templates in Helm would be a very large effort, or some other efforts would be kind of adopting Helm packages and proselytizing [them] across the [organization] as per our needs,” he said.
Though Kustomize’s plugin mechanism is still in alpha status with sparse documentation, the platform team put more weight on time to market and the effort required. So in the end, it chose Kustomize for its plugin-based validators, its support for declarative specification and its GitOps compatibility.
“That’s very important for us; we want one source of truth,” Downey said.
Ragunathan presented a demo showing that the application developer only has to work with the app’s YAML file to make changes to vertical or horizontal sizing. The rest is taken care of by the deployment pipeline, and the complexities of the cloud and Kubernetes are abstracted away.
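A simplified app spec of the kind demoed might look like this; the field names and values are hypothetical, not Intuit’s actual schema:

```yaml
# The only file the application developer edits; the pipeline expands it
# into Deployment, HPA, PDB and Ingress objects behind the scenes.
name: my-service
image: example.registry/my-service:1.4.3
scaling:
  horizontal:
    min: 3          # change these two lines to resize horizontally
    max: 20
  vertical:
    cpu: "1"        # change these to resize vertically
    memory: 1Gi
```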
Using a methodical data-driven approach can lead you to pick a solution that works for you, Ragunathan said, adding that abstracting the complexities of Kubernetes away from developers is doable.
“When we started the talk, we talked a lot about the velocity of the developers, but using an application abstraction also helps the platform teams revise much faster because then they don’t have to worry about writing technical service bulletins to make sure that the developer teams are migrating to a new solution. We can roll out a new service mesh or a new CRI [container runtime interface] or CSI [container storage interface] without having to expose the developers directly to it.
“So speed to benefit is what mattered to us. Know what matters to your organization and work on finding the next solution,” she said.