Here’s a typical, if nasty, outage:
The network goes down for a few seconds. Each client application goes into offline mode and queues up customer requests, and when everything is back online, all the requests flood in at once. Even though you’ve set up your application code using a serverless service to handle bursty traffic from clients, this traffic spike is bigger than usual. Serverless auto-scaling responds by doubling instances, but the database isn’t configured to handle this many open connections. This begins a cascading failure: the database starts throwing errors, which causes each failing request to be retried, which causes more auto-scaling, which…
The rest of your day is spent resizing instances, reconfiguring the database, and frantically adding an exponential backoff feature to your clients to avoid DoSing yourself next time. With luck, traffic will slow down enough during off-hours for everything to return to normal.
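For readers who haven't written one, here is a minimal sketch of what that client-side backoff might look like. It assumes the client can retry a failed request and uses the common "full jitter" variant, where each wait is a random fraction of an exponentially growing ceiling so that clients coming back online together don't retry in lockstep. The names `backoff_delay` and `send_with_backoff`, and the default values, are illustrative and not taken from any particular client library.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a randomized wait (seconds) before retry number `attempt`.

    The ceiling doubles with each attempt and is clamped to `cap`;
    the actual wait is a random fraction of that ceiling ("full jitter").
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling


def send_with_backoff(send, max_attempts=6, sleep=time.sleep, rng=random.random):
    """Call `send()` until it succeeds or `max_attempts` is exhausted.

    `send` is a hypothetical zero-argument callable that raises
    ConnectionError when the backend is unavailable.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            # Give up after the final attempt instead of sleeping again.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter matters as much as the exponent here: without it, every client that queued requests during the same network blip would retry at the same moments and recreate the spike.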
I support blameless postmortems, but in these situations, I tend to ask myself whether this is my fault or a consequence of our stack. Earlier in my career, I would blame myself — I should’ve been more familiar with the code, responded faster, had playbooks ready, and it wouldn’t have turned out like this.
After going in-house to work on cloud platforms, I’ve become convinced that some percentage of outages is a natural function of your choice of infrastructure provider.
In this outage, the database threw errors because it couldn’t handle the scale. But I would suggest the real cause was the design philosophy behind serverless: It offers functions-as-a-service with unlimited scalability, but it’s up to you to make sure everything it touches can handle its scale.
In essence, serverless gives you enough freedom to screw things up but not enough flexibility to prevent these errors. It doesn’t just run.
These design choices are particularly frustrating when viewed in terms of the alternatives to serverless. When a team picks serverless for their application code, they’re making a strong statement about their priorities for their service and technology strategy — they’d like to focus on business logic and hand over the rest to their cloud provider.
This choice to opt for more managed infrastructure can be represented on a spectrum with other common choices for new projects: Kubernetes, and some flavor of PaaS, such as Heroku or Amazon Web Services’ Elastic Beanstalk.
The subtle but sinister problem is that each of these options is implemented with similar complexity under the hood. Serverless is an abstraction, and no matter how simple it appears, issues can arise from anywhere in the underlying infrastructure. Serverless asks the least when setting it up, but when something goes wrong, it demands the most.
Kubernetes knows what it is and it doesn’t care if you think it’s complicated. You choose Kubernetes for its flexibility, and if you need any help managing it, vendors can jump in.
Serverless is on the opposite side of the spectrum, and its tradeoffs are only worthwhile in isolation. If serverless took a stance on its surrounding ecosystem, it could’ve prevented this outage. What if it came packaged with an elastic, globally accessible storage layer? What if its load balancers rate-limited the incoming requests after errors began? What if it could reconfigure the number of database connections opened per instance?
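To make the last of those questions concrete, here is a rough sketch of the arithmetic involved. It assumes you know the database's connection limit and can put an upper bound on how many instances the platform will run at once. The function name `pool_size_per_instance` and the `reserved` headroom for admin or migration sessions are hypothetical choices for illustration.

```python
def pool_size_per_instance(db_max_connections, max_instances, reserved=5):
    """Return how many connections each instance may open.

    Divides the database's connection limit (minus some reserved
    headroom) evenly across the largest instance count you expect.
    """
    usable = db_max_connections - reserved
    if usable < max_instances:
        # Even one connection per instance would exceed the limit.
        raise ValueError(
            "instance ceiling exceeds usable connections; "
            "lower the ceiling or put a connection pooler in front"
        )
    return usable // max_instances


# Example: a database allowing 100 connections, at most 20 instances.
print(pool_size_per_instance(100, 20))  # 4
```

The catch is the one this article is about: the calculation only holds if something enforces the instance ceiling, and today that something is you.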
It doesn’t matter which; it should just be something instead of nothing. Serverless is so close to handling all production DevOps work. What if it went the last mile?