How Cold Starts Impact Serverless Performance

4 Sep 2018 6:00am, by Sam Goldstein
Sam Goldstein is the vice president of product and engineering at Stackery. He is an engineering team leader with a history of building great products and high-output teams. Before joining Stackery, Sam led software development teams at New Relic covering infrastructure, agents and browser monitoring.

Stackery sponsored this post.

There are many reasons to go serverless, but there are also many reasons not to. For example, in some cases it’s not the cheapest alternative. You also often can’t move all your services to it. And the fact that physical and virtual hosting has been tried and tested for decades as a more than viable hosting option dissuades many from betting even part of the farm on a new trend.

At the end of the day, the best virtualized environment option should be:

  • faster to deploy;
  • easier to scale;
  • simple to replicate.

However, the benefits of virtualization will also come with a loss of control. This isn’t a side effect or a “bug” with a single product. The same system that didn’t require you to configure your queueing or routing logic is sometimes going to do things with queueing and routing that aren’t what you’d like.

Cracking Open a Cold Start

Serverless isn’t magic. The code you write still runs on a Linux server somewhere, and not every Lambda on every server can always be in that server’s memory, ready to go. At some point, a Lambda deployment that isn’t seeing frequent use will get unloaded from memory, or the virtual server hosting the code will get shut down.

When your idle Lambda gets a request, the code has to be loaded first, so response time can be expected to increase. In some cases, this difference can be extreme: a simple Lambda that responds in under 20ms can take 800ms to respond if it’s been idle for more than an hour.
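One way to see this in your own function is to have the handler report whether it was the first call in a freshly loaded container. The sketch below is a minimal illustration, not anything taken from AWS: it relies only on the fact that module-level Python code runs once per load, and the names `handler`, `_is_cold` and `_CONTAINER_STARTED_AT` are hypothetical.

```python
import time

# Module-level code runs once, when the code is first loaded into a
# container. That load is the "cold start" described above.
_CONTAINER_STARTED_AT = time.time()
_is_cold = True


def handler(event, context):
    """Hypothetical handler that reports whether this call was a cold start."""
    global _is_cold
    was_cold = _is_cold
    _is_cold = False  # every later call served by this container is warm
    return {
        "cold_start": was_cold,
        "container_age_seconds": round(time.time() - _CONTAINER_STARTED_AT, 3),
    }
```

Logging the `cold_start` flag alongside response times gives a rough picture of how often the slow path is actually hit.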

It gets worse: in an excellent exploration by Yan Cui, it is shown that every simultaneous invocation of a Lambda has its own cold start overhead. This means that far from being a problem mainly for infrequently used functions, it might get worse for a function that sees large sudden surges.
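To make the surge case concrete, here is a toy model of it. It assumes each in-flight request needs its own instance and reuses the 20ms and 800ms figures from the example above. It is a simplification for illustration only, not a description of how the platform actually schedules work, and `burst_latencies` is a made-up helper.

```python
WARM_MS = 20   # warm response time from the example above
COLD_MS = 800  # cold response time from the example above


def burst_latencies(concurrent_requests, warm_instances):
    """Toy model: each simultaneous request needs its own instance.

    Requests beyond the number of already-warm instances each pay
    their own cold start.
    """
    warm = min(concurrent_requests, warm_instances)
    cold = concurrent_requests - warm
    return [WARM_MS] * warm + [COLD_MS] * cold
```

Under these assumptions, a burst of 10 simultaneous requests hitting a function with a single warm instance yields one fast response and nine slow ones, which is why a busy function is not automatically safe.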

The most frequent reason for using serverless is that it gets a new feature to market faster. If cold starts slow a service down when it is either seldom used or seeing a sudden surge in traffic, doesn’t that mean that Lambdas are just bad all around?

Possible Solutions (That You Maybe Shouldn’t Try)

  1. Change Languages

More than one person has observed that Node.js Lambdas usually start faster than those in other languages, probably because the implementation is more mature. But weren’t we getting into Lambdas because we could develop and deploy services faster? I don’t think I’m in the minority when I say changing languages greatly impacts my workflow.

  2. Change Memory Allocation

There’s some evidence that increasing memory allocation improves cold start performance and lengthens the time before a Lambda goes cold. But don’t we want to scale our Lambdas based on traffic? If we optimize for the rare cold start, won’t we be spending more than we should?
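A rough back-of-the-envelope calculation shows why. Lambda compute is billed by memory allocated multiplied by execution time, so the sketch below estimates compute cost for a given memory setting. The per-GB-second rate is an assumption to check against current AWS pricing, per-request fees and the free tier are ignored, and the function name is hypothetical. It also assumes execution time stays the same as memory grows, which may not hold in practice.

```python
# Assumed rate in USD per GB-second; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_compute_cost(memory_mb, avg_duration_ms, invocations):
    """Estimate compute cost only (no request fees, no free tier)."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND


# Same workload at two memory settings: 1 million calls of 20ms each.
small = monthly_compute_cost(256, 20, 1_000_000)
large = monthly_compute_cost(1024, 20, 1_000_000)
print(f"256MB: ${small:.4f}  1024MB: ${large:.4f}")
```

With duration held constant, quadrupling memory quadruples the compute bill for every invocation, warm or cold, just to speed up the rare cold one.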

  3. Ping the Lambda

There’s an easy brute-force way to prevent cold starts: never let your Lambda sleep. This needn’t mean you’re getting billed for 100 percent uptime, since you can send periodic “pings” and still get the benefit. But even without paying for 100 percent uptime, we’re still spending more, and there doesn’t seem to be a way to cover the case of simultaneous requests, where every new instance has its own cold start.
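For illustration, a ping-aware handler might look something like the sketch below. It assumes that a scheduled trigger you set up yourself sends an event containing a `warmup` key. That key, `do_real_work` and the response shapes are all hypothetical, not part of any AWS API.

```python
def do_real_work(event):
    """Placeholder for the function's actual business logic."""
    return {"statusCode": 200, "body": f"Hello, {event.get('name', 'world')}"}


def handler(event, context):
    # "warmup" is a hypothetical marker sent by our own scheduled ping.
    # Returning early keeps the billed duration of each ping small.
    if isinstance(event, dict) and event.get("warmup"):
        return {"warmed": True}
    return do_real_work(event)
```

Note that a single ping only keeps a single instance warm, so this does nothing for the simultaneous-request case.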

Fear of the Unknown

The fear that these figures evoke is a fear of what we can’t control. AWS is perfectly generous with the information offered by X-Ray, so it’s not like we can’t see the problem. But while these demo stats look terrifying, production stats just don’t bear this theory out.

Experiments Don’t Beat Experience

Thousands of products use Lambdas in some way, and many of those are fledgling products that don’t see consistent traffic. How does that square with these performance problems?

Over on New Relic’s blog we see that production performance numbers just don’t bear out this trend, with cold starts occasionally taking less time than warm functions.

Why? One possible answer is that, because Lambda’s servers aren’t open source, we can only observe how “cold starts” are timed, not how they’re engineered behind the scenes. But a more general answer that doesn’t rely on the unknown is that, in production environments, other concerns vastly outweigh cold starts as a cause of latency.

Database query time, routing and queuing all contribute to latency. Many of those are things we can significantly improve with architecture and query optimization. Given what we see in production, it makes zero sense to focus on problems we can’t control rather than ones we can.
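Finding out which of these dominates in your own function doesn’t take much. The sketch below times individual sections of a handler using only the standard library. The `timed` helper, `fake_db_query` and its simulated 10ms delay are stand-ins for illustration, not a real database client or monitoring API.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = (time.perf_counter() - start) * 1000.0


def fake_db_query():
    """Stand-in for a real query; sleeps to simulate database latency."""
    time.sleep(0.01)
    return ["a", "b"]


def handler(event, context):
    timings = {}
    with timed("db_query", timings):
        rows = fake_db_query()
    with timed("render", timings):
        body = ", ".join(rows)
    return {"body": body, "timings_ms": timings}
```

Comparing numbers like these with how often cold starts actually occur shows where optimization effort is likely to pay off.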
