Vanguard’s Iterative Enterprise SRE Transformation

With $8 trillion in assets under management, global asset manager Vanguard is the largest provider of mutual funds and the second-largest provider of exchange-traded funds. It’s also not-so-secretly a tech company, with 7,000 of its 17,000-person staff working in the technology division.
Just seven years ago, Vanguard's services were hosted exclusively in a private data center, and nearly all of its applications were monolithic. The engineering team had tightly controlled deployments with quarterly releases. There was no observability, and no way to confirm that systems were behaving the way the team hoped. The alerting system was centrally owned, which meant developers had to file tickets to add or update alerts. Not surprisingly, Dev and Ops were completely siloed from each other.
At this week's DevOps Enterprise Summit (DOES) Europe, Vanguard talked about how it moved from that traditional architecture to running the majority of its workloads in the cloud, adopted site reliability engineering and even built its own customer-facing SaaS.
How Vanguard Kicked off Its Move to the Cloud
Senior SRE Coach Christina Yakomin, who has been at Vanguard since the start of its app modernization, said her employer is unique, particularly for financial services. In its 47 years, it has never been brick and mortar, just over the phone and online, making technology a top priority. But that doesn't mean it was easy to keep up with the times by going distributed and cloud native.
It's incredibly difficult to "lift and shift" a monolithic application to the cloud, she said, so, slowly but surely, they carved out microservices. The Vanguard team started by running on an internal private cloud platform-as-a-service. Even with just this move, Yakomin said, the regression test cycle shrank significantly, and the engineering division added a dedicated test automation role. By adding a new pipeline, they were able to improve deployment frequency and generate test evidence as part of their CI/CD flow.
From there, they were ready to lift and shift the underlying PaaS to the public cloud.
“This made the lives of the microservices development teams significantly easier, but it was incredibly hard on our operations team,” Yakomin said. “But it left us with an unnecessary abstraction layer…overcomplicating the environment and causing more problems than it solved.”
How Vanguard Embraced Cloud Native Chaos Engineering
While this move wasn’t perfect, it did give them optionality for the first time, so they could adopt cloud native solutions, opting for a combination of Amazon Web Services offerings:
- Amazon ECS — very similar to what they had before, but without needing a team dedicated to PaaS maintenance
- AWS Lambda — which they found better and cheaper for other event-driven workloads and microservices that were accessed infrequently
- Amazon EKS — for their managed Kubernetes on AWS
“Most product teams were fairly excited to take on accountability to their systems, it was new to them,” Yakomin said. Product teams became responsible for testing application code and configurations.
They also adopted the common physical engineering practice of failure modes and effects analysis. Originally from the U.S. military, FMEA focuses on:
- Failure modes — potential or real ways in which something might fail
- Effects analysis — understanding the consequences of each failure
This is applied by walking through the system architecture and forming hypotheses about how components will react to stress and failure, Yakomin explained. Then they integrated chaos engineering and performance testing to test each hypothesis, creating and leveraging self-service platforms. New team rituals included:
- Chaos game days – teams developed hypotheses on system resilience and expected behavior, caused crashes, and validated scaling and self-healing, with just a couple of surprises (a minimal fault-injection sketch follows this list).
- Chaos fire drills – designed to test for the unknowns. They were trying out a new observability tool, Honeycomb, and wanted to inject novel failures to put the tool to the test. This worked out so well that they recorded these tests and now use them as part of their on-call training.
- Break testing their self-hosted CI/CD pipeline – They observed recurring instability at high-traffic times, which was also when developers were working and naturally deploying. They created several dummy builds and deploy plans that generated a significant volume of logs during execution. Yakomin explained in a follow-up Slack chat that they ran these off-hours, recreating the peak traffic behavior. They then migrated to a database instance better optimized for disk input/output operations, eliminating the issue before developers encountered it the next day.
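To make the game-day ritual concrete, here is a minimal fault-injection sketch. This is an illustration rather than Vanguard's actual tooling: it uses boto3 to stop one random Amazon ECS task, the kind of induced crash a team might run to verify that the scheduler launches a replacement and the service self-heals. The cluster and service names are hypothetical.

```python
import random
import boto3  # AWS SDK for Python

# Hypothetical names -- substitute your own cluster and service.
CLUSTER = "prod-cluster"
SERVICE = "example-api"

ecs = boto3.client("ecs")

def kill_random_task() -> str:
    """Stop one running task to simulate an instance failure.

    FMEA-style hypothesis: ECS notices the missing task and launches a
    replacement back up to the service's desired count, with no SLO breach.
    """
    tasks = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]
    if not tasks:
        raise RuntimeError("no running tasks found, nothing to break")
    victim = random.choice(tasks)
    ecs.stop_task(cluster=CLUSTER, task=victim, reason="chaos game day drill")
    return victim

def running_vs_desired() -> tuple:
    """Check whether the service has healed back to its desired task count."""
    svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
    return svc["runningCount"], svc["desiredCount"]

if __name__ == "__main__":
    print("stopped:", kill_random_task())
    print("running/desired:", running_vs_desired())
```

In a real game day, the team would state the expected recovery time up front, watch the dashboards while the task is down, and record any surprise as a finding.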
The Role of SRE in Vanguard’s Move to the Cloud
Site reliability engineering is quickly becoming a necessity for the move to the cloud. While monolithic applications lack speed, agility and team autonomy, they are undeniably more predictable and easier to monitor and observe than distributed systems in the cloud. Some organizations, like Google, have specialist SRE teams that support development teams. Others embed SREs directly in the product teams. In a follow-up conversation in the DevOps Enterprise Summit's Slack community, Yakomin explained how Vanguard uses a mix of both.
“Sometimes a product team has both Application Engineers and Site Reliability Engineers. Application Engineers focus on feature delivery, while Site Reliability Engineers focus on availability, resilience, and other non-functional requirements. And both are responsible for ensuring everything they do is aligned with security controls, of course. In this scenario, they typically share on-call.”
She further explained that sometimes SREs are found “engaged by the DevOps teams’ on-calls through their own SRE on-call rotation — or 24/7 staffed SRE support team — as an escalation path.”
Before embracing site reliability engineering, Vanguard's availability measurement was binary: is a service up or down? That framing, of course, runs headlong into the impossibility of 100% uptime.
Vanguard engineering started to embrace service-level indicators (SLIs), service-level objectives (SLOs) and error budgets, which allowed the SRE team to see what percentage of requests were at risk. Suddenly they were able to weigh availability against feature trade-offs. If a feature is a high priority, they can accept 99.5% or even 99% availability. Or, if availability is key, developers are told they have to follow a slower release plan.
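As a rough illustration of the arithmetic behind an error budget (not Vanguard's actual tooling), here is a short Python sketch: given an SLO target and a request count for the window, the budget is the number of failed requests the service can absorb before the objective is breached.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for one SLO window.

    slo_target: e.g. 0.995 for a 99.5% availability objective.
    """
    allowed_failures = total_requests * (1 - slo_target)  # total budget, in requests
    remaining = allowed_failures - failed_requests        # budget left to "spend" on releases
    consumed_pct = (100 * failed_requests / allowed_failures) if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_budget": remaining,
        "budget_consumed_pct": consumed_pct,
    }

# A 99.5% SLO over 1,000,000 requests allows 5,000 failures; 3,200 observed
# failures leaves 1,800 requests of budget, so feature work can continue.
print(error_budget(0.995, 1_000_000, 3_200))
```

Once the remaining budget approaches zero, the trade-off Yakomin describes kicks in: slow the release plan until reliability recovers.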
Vanguard added an SRE coaching team, which Yakomin is part of. These SRE coaches help validate tools, create self-study curricula on reliability and DevOps best practices, and act as advisors to the SRE leads. The coaches work with the various SRE leads aligned with groups of related products, and the product teams themselves may or may not have a full-time SRE practitioner embedded. All DevOps engineers share responsibility for managing the alert portfolio and making sure the SLOs are being met, Yakomin clarified.
She admitted that striking this balance is a continuing challenge. “Every hour I take away from an engineer to put them through a training course while they could’ve been delivering new features” is an immediate concern, but that training will increase release cadence and reliability later on.
The Vanguard SRE coaching team is now looking for ways to better demonstrate its impact, while also contending with folks who are used to expecting uptime all the time, so, Yakomin said, “they may see the impact of my work as challenging.”
One thing is for sure: SRE is one of Vanguard's fastest-growing roles as the practice scales across the company, which tracks, since it's often listed among the most in-demand tech roles. Vanguard is pursuing a mix of upskilling existing engineers and hiring externally.
Vanguard’s DevOps Successes
Vanguard also started out with alert-only visibility, with operations building alert consoles that showed the current status of various alerts. Yakomin said that worked for a while but was insufficient.
Once they developed a central microservices platform, dashboards were built at the team level “because we simply weren’t going to be able to track centrally all the microservices.” She said the teams loved this and requested the addition of cloud logs and metrics. As time went on, this customization matured to the point that teams could be trained and then given dashboard clones to tweak for their own needs.
“As we saw this grow in scale, we saw some really positive outcomes and some unexpected consequences,” Yakomin said. Vanguard engineering quickly embraced the agility of creating new alerts and dashboards, and teams were able to leverage data for decision-making, even to roll back releases earlier. They also saw an increased focus on production support, with teams taking more ownership of what they were releasing.
On the other hand, they started getting dashboard clutter, tracking more than anyone could realistically keep up with. Teams were alerting on the wrong things or on too many things, which can quickly turn into alert fatigue and ignored alerts. She said they learned that “just because you can do everything in your log aggregation tool, doesn’t mean you should,” as it can cause costs to rise exponentially while degrading performance.
Vanguard engineering started instrumenting services with metrics and traces, better leveraging Amazon CloudWatch for monitoring and Honeycomb.io for observability, and standardizing on OpenTelemetry.
“This allows us to make the investment for the future but avoid vendor lock-in — we’re not the only ones standardizing around this framework. It looks like a lot of the industry is as well” sold on the OpenTelemetry framework, Yakomin told the DOES audience. All projects that have adopted OTel can leverage centrally created libraries to extract common fields of interest.
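The talk doesn't show Vanguard's internal libraries, but as a hedged sketch of what shared OpenTelemetry instrumentation can look like in Python, the snippet below wires up a tracer and stamps every span with a hypothetical set of common fields so that teams query consistent attributes.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider. A real deployment would export to an OpenTelemetry
# collector (and on to CloudWatch, Honeycomb, etc.) rather than the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example-service")  # hypothetical service name

# Stand-in for the "common fields of interest" a shared library might stamp
# onto every span so dashboards and queries stay consistent across teams.
COMMON_FIELDS = {"team": "example-team", "environment": "prod"}

def traced(name: str):
    """Decorator that opens a span and attaches the common fields."""
    def wrapper(fn):
        def inner(*args, **kwargs):
            with tracer.start_as_current_span(name) as span:
                for key, value in COMMON_FIELDS.items():
                    span.set_attribute(key, value)
                return fn(*args, **kwargs)
        return inner
    return wrapper

@traced("get_account_balance")
def get_account_balance(account_id: str) -> float:
    return 100.0  # placeholder business logic

get_account_balance("12345")
```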
The most notable DevOps success surely came in 2019 with the creation of a cloud native, client-facing app to serve half of Vanguard's business: external financial and wealth advisors. It became Vanguard's first multitenant, AWS-backed application, a retirement planner built 100% in the cloud, with the engineers actively applying FMEA, observability and more. It also brought business and technology together to decide not only what happens if something fails, but also how to communicate that to all stakeholders.
As Gene Kim, founder of DOES event host IT Revolution, pointed out in the live event Slack chat: “As the investment industry races to lower costs, this is an important initiative for Vanguard.” Site reliability engineering is a long-term investment, but so far, that investment seems worth it across industries.