How Cold Starts Impact Serverless Performance
Stackery sponsored this post.
There are many reasons to go serverless, and many reasons not to. In some cases, for example, it isn't the cheapest alternative. You also often can't move all of your services to it. And the fact that physical and virtual hosting have been tried and tested for decades as more than just a viable server hosting alternative dissuades many from betting even part of the farm on a new trend for their services.
At the end of the day, the best virtualized-environment option should be:
- faster to deploy;
- easier to scale;
- simple to replicate.
However, the benefits of virtualization also come with a loss of control. This isn't a side effect or a "bug" in a single product: the same system that didn't require you to configure your queueing or routing logic will sometimes do things with queueing and routing that aren't what you'd like.
Cracking Open a Cold Start
Serverless isn't magic. The code you write still runs on a Linux server somewhere, and not every Lambda can sit in every server's memory, ready to go. At some point, a Lambda deployment that isn't seeing frequent use will be unloaded from memory, or the virtual server running the code will be shut down.
When your idle Lambda gets a request, the code has to be loaded first, and response time increases accordingly. In some cases, this difference can be extreme: a simple Lambda that responds in under 20ms can take 800ms to respond if it's been idle for more than an hour.
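One way to see this effect in your own metrics is to flag cold starts from inside the handler: code at module scope runs once per container, so a module-level variable survives warm invocations. Here's a minimal Python sketch; the handler name and the returned field names are illustrative, not from the article:

```python
import time

# Module-level code runs exactly once per container, at cold start.
COLD_START = True
INIT_TIME = time.time()

def handler(event, context):
    """Report whether this invocation hit a cold or a warm container."""
    global COLD_START
    was_cold = COLD_START
    COLD_START = False  # every later call in this container is "warm"
    return {
        "cold_start": was_cold,
        "container_age_s": round(time.time() - INIT_TIME, 3),
    }
```

Logging a flag like this alongside response time makes the 20ms-vs-800ms gap visible in production traffic rather than only in a synthetic benchmark.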
It gets worse: in an excellent exploration, Yan Cui shows that every simultaneous invocation of a Lambda incurs its own cold start overhead. Far from being a problem mainly for infrequently used functions, cold starts can hit hardest on a function that sees large sudden surges.
The most frequent reason for using serverless is that it gets a new feature to market faster. If cold starts slow down when a service is either seldom used or seeing a sudden surge in traffic, doesn’t that mean that Lambdas are just bad all around?
Possible Solutions (That You Maybe Shouldn’t Try)
- Change Languages
More than one person has observed that Node.js Lambdas usually start faster than those in other languages, probably because the implementation is more mature. But weren’t we getting into Lambdas because we could develop and deploy services faster? I don’t think I’m in the minority when I say changing languages greatly impacts my workflow.
- Change Memory Allocation
There’s some evidence that increasing memory allocation improves cold start performance and the time before a Lambda goes cold. But don’t we want to scale our Lambdas based on traffic? If we optimize for the rare cold start, won’t we be spending more than we should?
- Ping the Lambda
There's an easy brute-force way to prevent cold starts: never let your Lambda sleep. This doesn't mean getting billed for 100 percent uptime, since periodic "pings" are enough to keep the function warm. But even without paying for 100 percent uptime, we're still spending more, and pings don't seem to cover the simultaneous-request case, where every new concurrent instance has its own cold start.
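The keep-warm ping is usually just a scheduled event (for example, an EventBridge rule firing every few minutes) that the handler recognizes and short-circuits, so ping invocations stay cheap. A hedged Python sketch; the `"warmer"` event key is a convention we choose ourselves, not an AWS field:

```python
def handler(event, context):
    # Scheduled keep-warm pings carry a marker we set ourselves;
    # real traffic never includes it, so we can bail out immediately.
    if isinstance(event, dict) and event.get("warmer"):
        return {"warmed": True}

    # ... normal request handling goes here ...
    return {"statusCode": 200, "body": "hello"}
```

Note the limitation the article points out: a burst of N simultaneous requests still spins up N containers, and the ping only kept one of them warm.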
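As for the memory-allocation option above, the allocation is a single function setting. Assuming the AWS CLI and a function named `my-function` (illustrative), raising it looks like this:

```shell
# Raise the function's memory allocation; Lambda scales CPU with memory,
# which is why benchmarks correlate larger sizes with shorter cold starts.
aws lambda update-function-configuration \
  --function-name my-function \
  --memory-size 1024
```

The tradeoff stands: you pay for that allocation on every invocation, warm or cold, to optimize for the rare cold one.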
Fear of the Unknown
The fear that these figures invoke is a fear of what we can't control. AWS is perfectly generous with the information offered by X-Ray, so it's not as if we can't see the problem. But while these demo stats look terrifying, production stats just don't hold up this theory.
Experiments Don’t Beat Experience
Thousands of products use Lambdas in some way, and many of them are fledgling products that don't see consistent traffic. How does that square with these performance problems?
Over on New Relic's blog, we see that production performance numbers just don't bear out this trend, with cold starts occasionally taking less time than warm functions.
Why? One possible answer is that because Lambda's servers aren't open source, we can only observe how "cold starts" are timed, not how they're engineered behind the scenes. But a more general answer, one that doesn't rely on the unknown, is that in production environments, other concerns vastly outweigh cold starts as a cause of latency.
Database query time, routing, and queuing all contribute to latency, and many of those are things we can significantly improve with architecture and query optimization. Given what we see in production, it makes zero sense to focus on problems we can't control rather than ones we can.
Feature image via Pixabay.