(Editor’s Note: This story has been altered from the original posted version, with some material removed for further review of technical accuracy.)
Engineers and DevOps leads are sharing the cost implications and performance degradation they are seeing on in-house and cloud-based systems as a result of the extra CPU load introduced by the Meltdown security patches.
TNS analyst Lawrence Hecht’s round-up last week of data from Sysdig, AWS and Red Hat pointed to some early impacts that may become an ongoing concern. The Meltdown security patches will continue to require higher CPU workloads to address the vulnerability in Intel and other processing chips. After all, the chips themselves aren’t patched: the security patches must work around the hardware, which means application and infrastructure operational expenses jump to a new normal.
Grab, a fast-growing South-East Asian taxi, delivery and payment provider, is one of the latest to document the impacts of Meltdown on its infrastructure. Its stateless Elastic Compute Cloud (EC2) instances were relatively safe, since Grab merely terminated existing instances and spun up new ones. Its ElastiCache and Redis instances, however, were seeing steep spikes in CPU utilization, as Sysdig also reported.
Grab was in a race against the clock to manage these spikes: its peak traffic occurs on Fridays, so it needed to resolve any potential performance problems ahead of its busiest customer demand period. As AWS began performing rolling patches to ElastiCache nodes, Grab’s engineers could watch CPU usage spike, see failover triggered to a new node, and then watch again as that new node was patched in turn, forcing a new CPU spike. That insight let Grab’s engineers introduce more Redis clusters with additional shards to better spread the load, resulting in average CPU usage peaking at around an optimal 24 to 30 percent.
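To illustrate why adding shards lowers the load on any single node, here is a minimal sketch in Python. It is not Grab’s implementation: the key names, the shard counts, and the use of CRC32 as the hash are all assumptions made for illustration (Redis Cluster itself maps keys to hash slots with a CRC16-based scheme).

```python
import zlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def load_per_shard(keys, num_shards):
    """Count how many keys land on each shard."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


# Hypothetical key set standing in for cached objects.
keys = [f"booking:{i}" for i in range(10_000)]

for n in (2, 4, 8):
    loads = load_per_shard(keys, n)
    # The busiest shard handles fewer keys as the shard count grows.
    print(f"{n} shards -> busiest shard holds {max(loads)} keys")
```

The same total work is divided among more nodes, so a fixed per-request CPU overhead from a patch weighs less on each one.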
The experiences of Grab and others are demonstrating that Meltdown’s impacts vary depending on the type of workload being carried out. But could serverless workloads be more immune to performance (and price) spiking than other architecture arrangements?
Amazon Web Services has been releasing news updates on its security patches for Meltdown and Spectre almost daily since January 3. By January 4, all instances of AWS infrastructure running Lambda functions, the core of AWS’ serverless offering, had been patched, requiring no action by end users.
For serverless users, this was a vastly different scenario from the one facing cloud users who were managing their own cloud infrastructure, including on EC2 (and who were faced with the tedious task of reviewing whether they had automatic updates enabled on Windows instances, for example). In the main, DevOps engineers were encouraged to patch their instance operating systems, but, as for Grab’s engineers, a more thorough analysis of performance impacts across all regular and business-critical workloads could be required.
Serverless industry experts suggest there could be some slight serverless impacts from Meltdown. When Lambda spins up an instance, which is essentially a container, that process could take a noticeable performance hit, some speculate. There is a security separation between Amazon’s space, the kernel space, and the end user’s workspace, so the security patch could result in cold starts of new Lambda functions running at a slower pace.
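Teams that want to check whether cold starts have slowed down can tag each invocation as cold or warm and compare durations before and after the patch. The following is a minimal sketch, assuming a Python Lambda function; the handler name and the returned field names are hypothetical. It relies on the fact that module-level code runs once per execution environment, while the handler runs on every invocation.

```python
import time

# Module-level code runs once, when a new execution environment is initialized.
_is_cold = True
_init_time = time.monotonic()


def handler(event, context):
    """Report whether this invocation was the first in its environment."""
    global _is_cold
    was_cold = _is_cold
    _is_cold = False  # every later invocation in this environment is warm
    return {
        "cold_start": was_cold,
        # Time elapsed since the environment was initialized.
        "seconds_since_init": round(time.monotonic() - _init_time, 3),
    }
```

Logging the `cold_start` flag alongside the invocation duration makes it possible to compare cold and warm latency separately, rather than relying on an overall average that hides the difference.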