A Trick to Reduce Processing Time on AWS Lambda from 5 Minutes to 300 Milliseconds
At the beginning of 2016, Jean Lescure, Senior Software Engineer and Architect at Gorilla Logic, watched a 3GB file containing five million rows of data churn through Amazon Web Services’ Lambda serverless computing service. He knew that the operation, as it stood then, wouldn’t scale to larger files, and wondered if he could get it to run faster. By the Stream Conference, held in September in San Francisco, Lescure had dropped the time to 300 milliseconds. For Gorilla Logic’s client, a large aerospace company with a throughput of 2 petabytes of data per year, that’s nothing short of astonishing. Can others replicate his success?
They can, according to Lescure, speaking at the conference. He started his talk by launching the demo from an app on his phone. It consisted of two apps running side by side, each generating five million random rows for a file in an AWS S3 bucket. He noted that one app had already completed the task and went on to explain the use cases for this hack.
Lescure’s approach works especially well for data migrations, he said, because you can get them done really quickly. Spin up a Lambda client that streams row by row and doesn’t lock your application. This means users can still access your app while you are migrating data in the background.
Or you can use it for tedious processing, like receiving invoices or other data in a Google Drive. Or you can use it for ‘neural computing’ analysis, image processing, and high-availability, on-demand computing.
OK, but how?
So Much Data
Lescure discovered this hack while working with a telecom client, another large user of data. Lescure explained that he works as a full stack developer on AWS, with the Ruby on Rails and Node.js skills needed to give clients swift access to data.
The second app finally completed its task. Lescure pointed out the elapsed five minutes and continued.
In approaching the company’s data requirements, he decided to go with streaming technologies. In the first iteration of his demo app, the data is streamed from the S3 (Simple Storage Service) bucket into AWS Lambda. But the more data you have in your bucket, the costlier it gets. The second iteration dropped the time from five minutes to thirty seconds by streaming data directly into Lambda, then sending the output row by row, with no S3 in the middle.
He used the streaming capabilities built into Node.js and Ruby. It’s basically about opening input and output ports to allow bytes to run from end to end without any middleware, he explained. In this case, the middleware is the Lambda app, but there is no cost of writing to disk because it runs only in memory.
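The row-by-row idea can be illustrated with a minimal sketch. It is written in Python rather than the Node.js and Ruby used in the talk, and it is not Lescure’s code: the function names, the in-memory stand-ins for the input and output ports, and the upper-casing step are all assumptions made for illustration. The point is only that each row flows from source to sink without the whole file ever being held in memory or written to disk.

```python
import io
from typing import BinaryIO, Iterator


def read_rows(source: BinaryIO) -> Iterator[bytes]:
    """Yield one row at a time; only the current row is held in memory."""
    for line in source:
        yield line.rstrip(b"\r\n")


def transform(rows: Iterator[bytes]) -> Iterator[bytes]:
    """Hypothetical per-row processing step: upper-case every row."""
    for row in rows:
        yield row.upper()


def stream_rows(source: BinaryIO, sink: BinaryIO) -> int:
    """Pipe rows from source to sink and return the number of rows written."""
    count = 0
    for row in transform(read_rows(source)):
        sink.write(row + b"\n")
        count += 1
    return count


# In-memory stand-ins for the real input and output ports (assumption).
source = io.BytesIO(b"id,name\n1,ada\n2,grace\n")
sink = io.BytesIO()
rows_written = stream_rows(source, sink)
```

In a real function the source and sink would be network streams rather than `io.BytesIO` objects; the generator chain itself would not need to change.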
After this startling improvement, he decided to optimize each and every step, further cutting the processing time.
Getting to 300 Milliseconds
In testing, Lescure found that uncompressing files was one of the more costly parts of the process. By simply removing compression, he got 80 percent of the way to the 300-millisecond mark.
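To see where that cost comes from, consider the hedged Python sketch below. It is not taken from the talk; the function names and the sample payload are assumptions. Both readers yield identical rows, but the gzip path has to pass every byte through a decompressor first, which is the step the uncompressed path skips.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def rows_from_plain(stream: BinaryIO) -> Iterator[bytes]:
    """Read rows straight off the stream, with no extra processing step."""
    for line in stream:
        yield line.rstrip(b"\n")


def rows_from_gzip(stream: BinaryIO) -> Iterator[bytes]:
    """Read the same rows, but every byte is decompressed before use."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as unzipped:
        for line in unzipped:
            yield line.rstrip(b"\n")


# Hypothetical sample data: 1,000 small rows.
payload = b"".join(b"%d,sample\n" % i for i in range(1000))
plain_rows = list(rows_from_plain(io.BytesIO(payload)))
gzip_rows = list(rows_from_gzip(io.BytesIO(gzip.compress(payload))))
```

How much time the decompression step costs depends on the data and the runtime, so the sketch makes no timing claim of its own.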
Of course, the idea needs to be sold to a client who believes the files need to reside in the S3 bucket to safeguard against loss in transit. He explained that they could have as much redundancy on the database as needed, but if the client later needed redundancy in the S3 bucket, another Lambda instance could be spun up to compress the files and send them back to S3.
By moving the workload around a little, he could bring the processing time down further.
Lescure explained that when you do an insert on a regular database, it will check the schema using extremely optimized algorithms and pieces of code. But the database instances that Amazon spins up are not optimized for computing, so doing any analysis, especially on the schema side of things, will generate a cost in performance. That’s where Lambda can save the day, offering the ability to do schema validation much more rapidly, with a few extra instances.
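A minimal Python sketch of validating rows in the function before they reach the database might look like this. The schema, column names, and sample rows are hypothetical, and the article does not describe how Lescure’s validation was actually implemented.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical schema: column name -> converter that raises ValueError on bad input.
SCHEMA: Dict[str, Callable[[str], object]] = {
    "id": int,
    "amount": float,
    "currency": str,
}


def validate_row(line: str) -> Dict[str, object]:
    """Convert one comma-separated row to typed fields, or raise ValueError."""
    fields = line.split(",")
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {
        name: convert(value)
        for (name, convert), value in zip(SCHEMA.items(), fields)
    }


def partition(lines: Iterable[str]) -> Tuple[List[Dict[str, object]], List[str]]:
    """Split rows into validated records and rejected raw lines."""
    valid: List[Dict[str, object]] = []
    rejected: List[str] = []
    for line in lines:
        try:
            valid.append(validate_row(line))
        except ValueError:
            rejected.append(line)
    return valid, rejected


valid, rejected = partition(["1,9.99,USD", "two,5.00,EUR", "3,1.50"])
```

Only the rows in `valid` would then be sent on for insertion, so the database spends less effort rejecting malformed input.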
By reducing the computing power on the database side and upping the computing power on the Lambda side, Lescure was able to cut processing time to under the one-second mark, even when managing gigabytes of data.
It was that simple.