Seth Vargo is a developer relations engineer at Google. He previously held software-development roles at HashiCorp, Chef Software, CustomInk and a few Pittsburgh-based startups. Passionate about reducing inequality in technology, Vargo is also the author of “Learning Chef.”
In this interview, we discuss Google’s site reliability engineering (SRE), Kubernetes hype and what to focus on when deploying reliable software on a massive scale.
You spend a lot of time explaining and preaching the difference between the roles of an SRE, a DevOps engineer and a systems administrator. On the Google SRE landing page one can read, “SRE is what you get when you treat operations as if it’s a software problem,” and “Our job is a combination not found elsewhere in the industry.” Is SRE simply a Google thing? How does SRE manifest itself at other companies (if at all)?
Everyone has their magic number. For me, it’s 25. I think any [software] company with more than 25 employees probably needs SRE support. There is a point in a company’s life cycle where it becomes someone’s full-time job to maintain uptime and availability. So, when you’re a small startup (I’ve worked for many of them) you’re doing some operations or doing some sysadmin work, development or marketing — you’re kind of doing it all. As the organization scales, there’s a point at which you have to specialize, where you can’t be a generalist anymore. Obviously, that’s very true at big companies like you mentioned, Facebook and Google. But it’s also true even in midsize startups — once you hit that 25–50 person mark, you have to have someone who focuses on marketing; you have to have someone who focuses on uptime and availability. When you have two or three full-time people dedicated to uptime and availability, you have to start thinking about something like SRE. When it’s only one person, they own the whole thing, and they don’t have to collaborate with stakeholders as much because they are the primary decision maker. But when you start getting two, three, five or 5,000 — that’s when you need to have a framework, and that’s where something like SRE can help.
In the companies you worked for before Google, was it natural for engineering and ops teams to combine forces with other teams like product and sales?
In really small companies, everyone does everything. When you start getting into 20–30 person companies, DevOps almost occurs naturally. There’s almost forced collaboration because there’s no historical knowledge — especially in a startup, as you’re doing everything for the first time. As the company grows it becomes more important to adopt things like automation and codification, and there becomes a point where focusing on reliability and your customer is more important than delivering the next piece of functionality.
You also have to remember that a lot of startups are focused on revenue-driven development or user-acquisition-driven development, because they got some funding and they have that much money to build something or prove something. Then they either need to become profitable or they need to show that they’re worthy of getting more money — particularly in the U.S., but that’s pretty common everywhere. When a startup is bootstrapped, making that money last is important. If you’re a bootstrapped startup you might not be able to afford a full-time SRE because you don’t have enough users to actually warrant it.
I always like to ask the question, “If you went down overnight in your users’ timezone, how many of your users would notice?” If Facebook or Google go down in the middle of the night, people notice because they are global companies. But if all your users are in one timezone and you do maintenance at 4:00 in the morning when everyone’s asleep, how many of your users are going to file a support ticket? If the answer to that is fewer than one percent of your user base, then you’re probably not in a position where you need a full-time dedicated SRE. During working hours, people are available to resolve incidents, so you don’t need someone on call at night. Once your service scales or once you have an SLA that says, “we’ll be available 100 percent of the time,” that’s where you have to introduce these types of SRE roles into the organization.
During your career, you’ve worked for the tech leaders when it comes to DevOps: HashiCorp, Chef Software and now Google. I work at Semaphore where we often emphasize that we “optimize for happiness” and we hear from our clients about how a flexible and configurable CI/CD pipeline improves the lives of whole teams, and is an actual enabler. What else is important to focus on when dealing with the steps in the software development cycle known as testing/deployment/maintenance?
I think the biggest thing is “observability.” Traditional monitoring, logging and alerting are based on “write some logs and collect some metrics.” What we’re seeing now in the industry is a standardization around observability with things like OpenCensus where, especially in microservices, it’s not enough just to throw some log messages around. When a user makes a request to your service, that may hit 10 different backend services that have to do different operations. Imagine you’re running an e-commerce website. The user comes to the homepage, which needs to hit the order recommendation system and the authentication system to log them in, which in turn needs to pull up their user preferences and pick their most recently purchased items. That might involve 30, 40 or even 100 microservices. You need to measure the latency between all of them. You might need to measure individual function calls in those microservices to figure out where a performance issue is happening, and when the response renders back to the user, you need to measure how long that’s impacting the user.
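To make the idea of measuring latency per backend call concrete, here is a minimal sketch using only the Python standard library. It is not OpenCensus and not anything described in the interview: the `span` helper, the in-memory `spans` list and the service names are hypothetical, and the `time.sleep` calls merely stand in for real backend requests.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store of (span name, duration in seconds).
# A real system would export these to a tracing backend instead.
spans = []


@contextmanager
def span(name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))


def handle_homepage():
    # Service names are illustrative; the sleeps simulate backend latency.
    with span("homepage"):
        with span("authentication"):
            time.sleep(0.01)
        with span("recommendations"):
            time.sleep(0.05)


def slowest_child(recorded, root="homepage"):
    """Return the name of the slowest span other than the root span."""
    children = [s for s in recorded if s[0] != root]
    return max(children, key=lambda s: s[1])[0]


if __name__ == "__main__":
    handle_homepage()
    for name, seconds in spans:
        print(f"{name}: {seconds * 1000:.1f} ms")
    print("slowest backend call:", slowest_child(spans))
```

With one timing per call, the question "where is the performance issue?" becomes a lookup over recorded spans rather than a search through log lines.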
This is where it’s not enough just to throw a couple of log lines in a database anymore, and we need a standard metric. We need a standard way to talk about observability, and then we need to put that in a system that can quickly do filtering and analytics, so that if there’s a support ticket, or if we start hitting those SLI and SLA barriers, it’s very clear that this is where the system is broken. If you have an error budget of five minutes, it can’t take you 10 minutes to find the problem; if it takes you 10 minutes, you’ve dramatically exceeded your error budget. Whenever one of those alarms fires when you’ve exceeded your SLA, you need to be able to drill down very quickly to find the root causes, so that you can quickly recover, because otherwise, you’re just burning through your error budget.
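The error budget arithmetic in this answer can be sketched in a few lines. This is an illustration under stated assumptions, not Google's tooling: the function names are hypothetical, the 30-day window is an assumed default, and the budget is simply the downtime that an availability target leaves over. A 99.99 percent target over 30 days allows roughly 4.3 minutes of downtime, close to the five-minute budget in the example, so a single 10-minute investigation overspends it.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability target.

    slo_target is a fraction, for example 0.9999 for 99.99 percent.
    The 30-day default window is an assumption for illustration.
    """
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Minutes of budget left; a negative value means it is overspent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.9999)
    print(f"budget: {budget:.2f} minutes")
    print(f"after a 10-minute outage: {budget_remaining(0.9999, 10):.2f} minutes")
```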
So observability is also about this feedback loop, that you get info quickly and you feel safe with the things that you do with your systems?
Yes. Canary testing, blue-green deployments and A/B testing all tie into that. Imagine you’re pushing a new feature out and you’re afraid because it’s maybe a schema migration or introducing some new functionality that might overload the database. In such cases you can do a blue-green deployment where only 10 percent of your users get that functionality, and then you can use the observability to see that this is an active experiment. If it’s only increasing our database usage by one percent, then rolling it out to everyone will only increase the total by about another nine percent, so we’re probably good to continue rolling this experiment out. But if you don’t have that observability, doing that blue-green deployment doesn’t really help you.
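The rollout arithmetic in this answer can be written out as a small calculation. The sketch below assumes that database load grows linearly with the share of users who have the feature, which is a simplification, and the function names are hypothetical. With 10 percent of users causing a one percent increase, a full rollout projects to about a 10 percent increase in total, which is about nine percentage points more than what the experiment already adds.

```python
def projected_total_increase(rollout_fraction, observed_increase):
    """Extrapolate a load increase seen in a partial rollout to all users.

    Assumes load scales linearly with the fraction of users exposed.
    Both arguments are fractions, for example 0.10 and 0.01.
    """
    if not 0 < rollout_fraction <= 1:
        raise ValueError("rollout_fraction must be in (0, 1]")
    return observed_increase / rollout_fraction


def additional_increase(rollout_fraction, observed_increase):
    """Extra load still to come when going from the partial to the full rollout."""
    total = projected_total_increase(rollout_fraction, observed_increase)
    return total - observed_increase


if __name__ == "__main__":
    total = projected_total_increase(0.10, 0.01)
    extra = additional_increase(0.10, 0.01)
    print(f"projected total increase: {total:.0%}")
    print(f"still to come: {extra:.0%}")
```

Without the observed increase from the partial rollout, there is nothing to extrapolate from, which is the point the answer makes about observability.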
I think the third thing a lot of customers take for granted is social media. There have been situations where social media will alert us to an issue before our own monitoring does. That’s usually a case of something that we weren’t monitoring or observing…
Or usually your marketing team is doing that and they assume that if something’s down, the infrastructure team already knows and will take care of it?
I’ll give you a really good example that happened recently at Google in 2018. An Internet service provider accidentally caused a BGP leak. Anyone who resolved through that Internet service provider could not access Gmail, Google Drive or Cloud services. Google doesn’t control downstream ISPs, but that misconfiguration caused a number of users around the world and in different geographies to be unable to access Google services. While Google has internal monitoring and alerting for such incidents, social media made the scale and impact of the incident much clearer.
That’s interesting. So how long did it take you to actually find out the cause of it?
We identified the cause very quickly. All we had to do was trace the actual packets and see where they were failing to reach our servers. Especially for small and midsize companies that are just getting started, you don’t know what to observe; you don’t know what to monitor. Social media is actually a really great way for you to build a brand and for your users to reach you, but it’s also a great way for your users to tell you when something’s broken. All your uptime may be great, but if your CDN is down, your users are having a bad experience, and social media is a great way for them to tell you that.
In your speech from 2015 in Krakow, you discuss whether DevOps is a fad or a buzzword exploited by marketing and HR. While doing this you refer to the Gartner Hype Cycle. Having this graph in mind, where do you think we can put Kubernetes in 2019?
It’s a tough one. I think for large enterprises, Kubernetes is really at the slope of enlightenment right now. They’ve gone through the pain of customizing it and installing it, and updating their applications. There’s another part of Kubernetes, which is that it’s not just enough to have Kubernetes; you also have to have what we call “cloud native applications.” You have to have applications that behave well in a containerized world. I think large enterprises are really starting to see the benefits of this now. They’ve spent the past two or three years building cloud-native applications (or rewriting existing applications to be cloud native) and really understanding Kubernetes at scale. I think small and midsize startups are still at the peak of inflated expectations, and about to hit the trough of disillusionment.
I saw this amazing tweet where there’s a little toy truck on the back of a huge flatbed truck and it says “This is my blog on Kubernetes.” I think it’s really accurate. I think there are a lot of startups that are looking at this, and they think they need Kubernetes. You’re not at a scale yet where it makes sense. If you have two microservices, Kubernetes brings along 100. You’ve increased your number of services by 2,000 percent for no real benefit (n.b. exaggeration). If you look at the Gartner Hype Cycle, I would say the large companies are getting a lot of benefit out of it, and that small and midsize companies are slowly realizing that the burden of maintaining a cluster isn’t necessarily worth the uptime and auto-restarting, and all the stuff that comes with it.
This goes back to our earlier conversation, too: When does it make sense to have an SRE? I think Kubernetes doesn’t make sense until it also makes sense to have dedicated SREs. Someone has to keep Kubernetes up and running. If you’re using a cloud provider (something that’s managed and has uptime), I think that pushes Kubernetes a little bit to the right, even if you’re a small startup. It’s also worth noting that the costs of those services are high. The average Kubernetes cluster is at minimum $300 a month across all the clouds for a small company, and you could run on a VM for a third of that cost. Again, going back to budgeting and funding, there are benefits, but you really have to figure out whether it is worth it.
Being a developer advocate, you talk to a lot of practitioners, but I’m sure you also get a lot of feedback from senior software developers and people working in business-related roles. What are the most common concerns when it comes to software deliverability, maintenance and scalability that you hear about from them?
Product teams generally have a difficult time understanding reliability. Many product owners tend to have a “where’s my feature?” attitude. Bug fixes and reliability improvements are not very high on their radar. This is where error budgets and the SRE model come in. If product owners had their way, you would just ship your features all the time. But the SREs need to be able to push back and say, “The last 10 features you delivered were great, but they have some bugs. If you keep delivering features that have bugs, eventually the system is going to become too unreliable.”