Presto’s New Foundation Signals Growth for the Big Data SQL Engine
Presto, the open source SQL query engine that touts the ability to tap into data anywhere, now has a foundation.
It’s part of an effort to show that the project has truly grown into an international community beyond a single corporate interest and will be around for the future, according to Justin Borgman, co-founder and CEO of Starburst Data, which supports Presto.
“From the beginning, we stressed the importance of code quality, architectural extensibility, and open collaboration with the community,” said Martin Traverso, who created Presto at Facebook along with Dain Sundstrom and David Phillips.
“With the rapid expansion of both the Presto user base and Presto developer community over the last several years, establishing a non-profit to institutionalize these values is the next logical step to ensure that this project stands the test of time.”
The three creators, as well as the team from Starburst Data, will be the founding members. Along with Borgman, Starburst’s founders (Kamil Bajda-Pawlikowski, Matt Fuller and Wojciech Biela) came from Teradata, which had acquired their SQL-on-Hadoop company Hadapt in 2014. They founded Boston-based Starburst Data in 2017.
Java-based Presto was created to be a faster alternative to Hive, which was used to run SQL queries against Hadoop. Presto, however, could also query across Hive and MySQL, of which Facebook was also a large-scale user.
Since being open sourced in 2013, it has gained the ability to query data sources both on-prem and in the cloud, including HDFS, Amazon S3, Kafka, Cassandra, Postgres, Oracle and Redis.
Presto, which can run full ANSI SQL, is designed for high performance, high concurrency and low latency. Airbnb, Netflix, Treasure Data and Uber were early adopters and contributors, and are among the companies running hundreds of nodes against petabytes of data.
Part of the performance gain comes from not using MapReduce, which writes intermediate results back to disk. Instead, Presto compiles parts of the query on the fly and processes data in memory, an approach that comes with limited fault tolerance, Treasure Data warns.
Presto separates compute and storage.
“Presto thinks of databases or other places you store data as simply storage, rather than being its own database,” Borgman said.
One of Starburst’s clients, a media company, stores viewing data in Hadoop and billing data in Teradata. Presto sees those as just two places where data is stored and, as an abstraction layer above these different data sources, allows users to query them and join across them, Borgman said.
“It could be Hadoop, a traditional database like Oracle or Teradata or moving to the cloud and query data in S3,” he said.
“I think over time S3 is becoming the new data lake. It has been Hadoop, but now S3 or the equivalent — Blob Storage on Microsoft, Google Cloud Storage on Google — these are becoming the low-cost place to let your data live. And if you can query the data there without having to load it into some other platform, that’s going to save you time and money. I think that’s a big motivator for why Presto has taken off.”
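The cross-source join Borgman describes for the media company can be illustrated with a small runnable analogy. In this sketch, SQLite stands in for Presto: two separate in-memory databases play the roles of the Hadoop and Teradata systems, and a single SQL statement joins across them. The table names, column names and sample rows are hypothetical. In Presto itself, the prefixes would be catalog and schema names configured for each data source rather than attached SQLite databases.

```python
import sqlite3

# SQLite is only a stand-in for Presto here. Two separate in-memory
# databases play the roles of the two storage systems in the example.
conn = sqlite3.connect(":memory:")                      # "viewing" side
conn.execute("ATTACH DATABASE ':memory:' AS billing")   # "billing" side

# Hypothetical tables and sample rows, one table per database.
conn.execute("CREATE TABLE main.viewing (customer_id INTEGER, minutes INTEGER)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.viewing VALUES (?, ?)",
                 [(1, 120), (1, 30), (2, 45)])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 9.99), (2, 14.99)])

# One query joins data that lives in two different databases.
rows = conn.execute("""
    SELECT v.customer_id, SUM(v.minutes) AS total_minutes, b.amount
    FROM main.viewing AS v
    JOIN billing.invoices AS b ON b.customer_id = v.customer_id
    GROUP BY v.customer_id, b.amount
    ORDER BY v.customer_id
""").fetchall()

print(rows)  # [(1, 150, 9.99), (2, 45, 14.99)]
```

The point of the analogy is that the person writing the query addresses each source by a prefix and never moves the data into a single system first.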
Presto works with any flavor of Hadoop — or without it. Kubernetes could further simplify things with Presto and other data technologies, Iguazio Chief Technology Officer Yaron Haviv recently wrote for The New Stack.
Like other relational database systems, it can be visualized as one coordinator node working in sync with multiple worker nodes. Its metadata API, data location API and data stream API enable it to connect across multiple storage sources.
Through these APIs, it asks the data source for the list of tables or columns, data types and the location of the data so it can be assigned to workers that will execute the work in parallel.
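A toy model may make the division of labor easier to picture. The sketch below is not Presto's actual connector interface: the class `ToyConnector` and its method names are hypothetical, chosen only to mirror the three roles described above (metadata, data location, data stream). Threads stand in for worker nodes, and the "coordinator" is a plain function that hands out chunks of a table and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyConnector:
    """Hypothetical connector; method names are illustrative, not Presto's."""

    def __init__(self, tables):
        # tables maps a name to {"columns": {...}, "splits": [[row, ...], ...]}
        self._tables = tables

    def list_tables(self):
        # Metadata role: which tables exist.
        return sorted(self._tables)

    def list_columns(self, table):
        # Metadata role: column names and data types.
        return dict(self._tables[table]["columns"])

    def list_splits(self, table):
        # Data location role: ids of chunks that can be read independently.
        return list(range(len(self._tables[table]["splits"])))

    def read_split(self, table, split_id):
        # Data stream role: the rows of one chunk.
        return iter(self._tables[table]["splits"][split_id])


def worker_sum(connector, table, column, split_id):
    # What one worker does: read its chunk and compute a partial result.
    return sum(row[column] for row in connector.read_split(table, split_id))


def coordinator_sum(connector, table, column, workers=3):
    # The coordinator checks metadata, assigns chunks, and merges partials.
    if column not in connector.list_columns(table):
        raise KeyError(column)
    split_ids = connector.list_splits(table)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(
            lambda s: worker_sum(connector, table, column, s), split_ids))
    return sum(partials)


connector = ToyConnector({
    "views": {
        "columns": {"customer_id": "bigint", "minutes": "bigint"},
        "splits": [
            [{"customer_id": 1, "minutes": 120}, {"customer_id": 2, "minutes": 45}],
            [{"customer_id": 1, "minutes": 30}],
            [{"customer_id": 3, "minutes": 5}],
        ],
    }
})

print(connector.list_tables())                          # ['views']
print(coordinator_sum(connector, "views", "minutes"))   # 200
```

Because the coordinator only talks to the connector through these few methods, the same coordinator code would work unchanged against any storage system that can answer the same questions.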
Visualization and other SQL tools can be added on top of it.
Its built-in functionality includes regular expression functions, lambda expressions and functions, and geospatial functions. It can handle complex data types including JSON, array, map and row/struct.
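For a sense of what these features look like in practice, the snippet below collects a few sample Presto SQL statements as Python strings. Nothing here connects to a cluster; the strings are only printed. The literal values are made up for illustration, and the function names should be checked against the documentation for the Presto version in use.

```python
# Sample Presto SQL statements, stored as plain strings for illustration.
# The literal values are made up; nothing is sent to a Presto cluster.
EXAMPLES = {
    "regular expression":
        "SELECT regexp_like('presto-215', '^presto-[0-9]+$')",
    "lambda over an array":
        "SELECT transform(ARRAY[1, 2, 3], x -> x * 2)",
    "json":
        """SELECT json_extract_scalar('{"city": "Boston"}', '$.city')""",
    "map":
        "SELECT element_at(MAP(ARRAY['a', 'b'], ARRAY[1, 2]), 'a')",
    "row":
        "SELECT CAST(ROW(1, 'views') AS ROW(id BIGINT, name VARCHAR))",
    "geospatial":
        "SELECT ST_Distance(ST_Point(0, 0), ST_Point(3, 4))",
}

for feature, sql in EXAMPLES.items():
    print(f"{feature}: {sql}")
```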
Much of Starburst’s work these days is helping customers use Presto across a mixture of on-prem and cloud data, Borgman said.
He pointed to Snowflake as its closest competitor on cloud data, but said users have to load data into Snowflake in its own proprietary format. Presto queries data directly in open source formats, he said.
AWS also makes Presto, the project, available on its platform as part of its EMR (Elastic MapReduce) offering.
The most recent developments in Presto include a cost-based optimizer, which uses properties such as CPU cost, memory requirements and network bandwidth usage to determine how queries can be executed as fast as possible with the fewest resources, as well as role-based access controls and security enhancements. It’s available now on Azure as well as AWS.
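The general idea behind a cost-based optimizer can be sketched in a few lines. The example below is a toy, not Presto's optimizer: the `PlanEstimate` class, the weights and all the numbers are hypothetical. It scores each candidate plan by combining estimated CPU, memory and network costs and picks the cheapest one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    # Hypothetical cost estimates for one way of executing a query.
    name: str
    cpu: float
    memory: float
    network: float


def total_cost(plan, w_cpu=1.0, w_mem=1.0, w_net=1.0):
    # Weighted sum of the three estimates; the weights are assumptions.
    return w_cpu * plan.cpu + w_mem * plan.memory + w_net * plan.network


def choose_plan(plans, **weights):
    # Pick the candidate with the lowest combined cost.
    return min(plans, key=lambda p: total_cost(p, **weights))


candidates = [
    PlanEstimate("broadcast the small table", cpu=40, memory=30, network=10),
    PlanEstimate("repartition both tables", cpu=40, memory=10, network=60),
]

best = choose_plan(candidates)
print(best.name)  # broadcast the small table

# If memory is treated as five times more expensive, the choice flips.
best_low_memory = choose_plan(candidates, w_mem=5.0)
print(best_low_memory.name)  # repartition both tables
```

The second call shows why the estimates matter: the same two plans rank differently once one resource is scarcer than the others.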
Feature Image: “King’s Cross Station Facelift – Jan 2014 – Waiting for a Train” by Gareth Williams. Licensed under CC BY-SA 2.0.