Presto for Big Data SQL: Challenges, Considerations and Cloud Solutions
SQL is the lingua franca for data practitioners and analysts, so even as newer analytical systems built on data lakes emerged, a SQL interface was a must. Meta (formerly Facebook) developed Hive as a SQL interface on top of the first generation of data lakes built on Hadoop. But Hive had its shortcomings: its batch-oriented nature was not conducive to the human-scale interactivity needed for dashboards and ad hoc analysis.
Hence, Presto was built on an in-memory architecture to meet the low-latency needs of these interactive and ad hoc use cases. Presto also made improvements in other areas, including standardizing on an ANSI SQL interface (versus a SQL dialect like HiveQL) and a pluggable connector architecture for directly connecting to other data sources, allowing for more flexibility and federated workloads.
Today, Presto is used heavily at some of the most successful internet-scale companies, including Meta (holding 300PB of data, 1,000 daily active users and 30,000 queries per day), Uber (50PB, with 7,000 weekly active users and 100 million queries per day) and ByteDance (1 million queries per day).
The Presto project has been made part of the Linux Foundation, ensuring open governance and community stewardship.
The adoption and performance of Presto have made query engines and data stacks built around query engines, like Lakehouses, a viable — and arguably more flexible — alternative to all-in-one SQL solutions for big data.
In this article, we will learn more about the technical aspects of Presto and how cloud solutions like Ahana Cloud make it possible for data teams and organizations to leverage it.
Let’s briefly dive into the Presto architecture and system.
Coordinator and Workers
Presto is a distributed system, which allows it to scale with data sizes. Like many distributed compute systems, Presto separates the activity of planning work from the actual processing of the work itself. In Presto, processes are separated into two server types — Coordinator and Worker. A Presto cluster has a single Coordinator and one or more Workers. It is not uncommon for large clusters to have hundreds of Workers. Recently, the Presto community has introduced multiple, disaggregated Coordinators, allowing a single cluster to scale out to an even larger number of Workers and improving cluster availability.
Presto is designed to support data source connectors (versus only being tied to a data lake). A connector provides a table-based data source model similar to what you find in relational database management systems (RDBMS). In Presto, a connector provides a three-level hierarchy — catalog, schema and table — to expose data source data as tables, regardless of whether the underlying data source is natively relational (tabular) or not (e.g., NoSQL). A connector implementation handles the details of converting the native data source format to a uniform in-memory table format for the Presto query engine; the end user simply works with data through SQL in terms of tables, columns and rows.
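As a sketch of how the catalog-schema-table hierarchy enables federation, suppose a cluster has a Hive catalog named `hive` and a MySQL catalog named `mysql` configured (all catalog, schema, table and column names here are hypothetical). A single query can then join across both data sources:

```sql
-- Illustrative federated join across two catalogs;
-- the names are assumptions, not built-in defaults.
SELECT o.order_id, c.customer_name
FROM hive.sales.orders AS o          -- table in the data lake (Hive catalog)
JOIN mysql.crm.customers AS c        -- table in an operational MySQL database
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2022-01-01';
```

To the user, both sources look like ordinary tables; the connectors handle the translation to each source's native format.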
Configuration and Administration
Presto is configured through a small set of properly formatted and placed configuration files, covering node properties, JVM settings, server configuration and per-catalog connector properties.
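As an illustrative sketch, a minimal Coordinator configuration might look like the following (the hostnames and values are assumptions chosen for the example, not prescriptive defaults):

```properties
# etc/config.properties -- minimal Coordinator configuration (illustrative values)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080

# etc/catalog/hive.properties -- registers a catalog named "hive"
# backed by a Hive Metastore (the metastore URI is hypothetical)
# connector.name=hive-hadoop2
# hive.metastore.uri=thrift://metastore.example.com:9083
```

A Worker's configuration is similar, with `coordinator=false` and the same `discovery.uri` pointing back at the Coordinator so it can join the cluster.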
Presto provides a basic access control framework at two levels — system access control and connector access control. System access control primarily deals with who has access to what catalog and to what degree (e.g., read-only). Connector access control is specific to each connector and is handled by the connector implementation; it exposes data access controls via familiar GRANT and REVOKE SQL semantics. For finer-grained data access control, Presto can integrate with third-party data authorization frameworks, such as Apache Ranger and AWS Lake Formation.
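For connectors that implement access control, the familiar SQL semantics look like the following sketch (the table and user names are hypothetical, and privilege support and exact syntax depend on the connector implementation and Presto version):

```sql
-- Hypothetical example; whether these statements succeed
-- depends on the connector's access control implementation.
GRANT SELECT ON hive.sales.orders TO USER analyst;
REVOKE SELECT ON hive.sales.orders FROM USER analyst;
```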
Presto also provides a mechanism — called resource groups — to place limits on resource usage and enforce queueing policies, such as limits on the number of concurrently running queries, CPU time and memory usage. These policies can be defined against users, client source and query type (e.g., SELECT versus INSERT versus DDL).
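As a sketch, a file-based resource group configuration expresses these policies as JSON: groups define limits, and selectors map queries to groups by user, source or query type. The group names and limit values below are illustrative assumptions:

```json
{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "subGroups": [
        {
          "name": "adhoc",
          "softMemoryLimit": "50%",
          "hardConcurrencyLimit": 20,
          "maxQueued": 200
        }
      ]
    }
  ],
  "selectors": [
    { "user": "analyst.*", "queryType": "SELECT", "group": "global.adhoc" },
    { "group": "global" }
  ]
}
```

Here, SELECT queries from users matching `analyst.*` are queued into the more constrained `global.adhoc` subgroup, while everything else falls through to `global`.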
Getting Started with Presto
There are several ways to get started and deploy Presto. For direct local installation, you can grab the tarball and/or JAR for any release (see release notes) and follow the instructions to deploy those artifacts. For a more packaged deployment, you can experiment with a Presto Docker image. Finally, if you don’t want to worry about managing infrastructure at all, there is a free-forever Presto managed service community edition provided by Ahana. After you’ve deployed Presto, you can connect to the cluster via JDBC or the Presto CLI and query your data sources via standard ANSI SQL syntax.
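As a sketch (the server address, catalog and schema here are assumptions about your deployment), connecting with the Presto CLI and issuing a query looks like:

```shell
# Assumes a running cluster at localhost:8080 and a configured "hive" catalog.
./presto-cli --server localhost:8080 --catalog hive --schema default

# Inside the CLI session, standard ANSI SQL works, e.g.:
#   presto:default> SELECT COUNT(*) FROM orders;
```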
Challenge 1: Managing and Configuring Distributed Infrastructure
Getting started with Presto is one thing, but managing and configuring a distributed system like Presto for production workloads on an ongoing basis can be challenging, particularly in on-premises environments. Even with the cloud, there can be friction in infrastructure and configuration.
On the infrastructure side, you’ll want to take advantage of any elasticity and specialized compute resources. For example, you may want to scale the number of Workers out during periods of heavy usage and back in during periods of low usage, to optimally utilize compute resources and manage costs. You may also want to fine-tune hardware choices to a workload’s SLA and predictability for optimal price performance, such as leveraging spot instances and ARM-based processors.
Even once the infrastructure is managed, there is complexity in configuring Presto itself. Presto exposes hundreds of configuration parameters for tuning. These parameters provide flexibility and are a natural offshoot of open source development, but they can quickly lead to a management headache.
When it comes to managing Presto infrastructure and configuration in the cloud, you’ll want to consider what degree of control you want or need and how much of that control can be traded for simplicity. On one end of the spectrum, you can go serverless (e.g., Amazon Athena), abstracting away much of the underlying infrastructure and exposing a simple query interface and connectivity — query-as-a-service, if you will. This can work well when getting started, but over time, the opaqueness (i.e., limited visibility and control) can prevent the advanced tuning required for better price performance. Further, in some cases, quota limits or multitenant sharing of compute resources can result in inconsistent latency, where the same query on the same data can take longer in subsequent submissions.
On the other end of the spectrum (e.g., Amazon EMR), you can fully manage the infrastructure yourself and go with a more do-it-yourself approach. You will have full control. But, in this scenario, you’ll need to understand and manage any ongoing tuning (e.g., configurations) and operations. All this is exacerbated when looking at several clusters, and given the importance of tuning, it is quite common to have multiple clusters designed for different workloads.
Finally, a managed service (e.g., Ahana Cloud) can offer a balance between control and simplicity between both ends of the spectrum, where the most significant configurations are exposed in an easy-to-use interface. This can be the right choice for teams that outgrow the limitations of serverless options, but do not want to or have the expertise to completely control the infrastructure and configurations. No matter what deployment style or offering you choose, you also need to consider new software versions and upgrades for new functionality, stability and fixes — both for Presto itself and any underlying dependencies (e.g., Kubernetes cluster).
One final point is around data privacy and the risk of data exfiltration. For infrastructure that touches data, it’s important to understand how the data flows and the potential points of risk — not only for an organization’s own data, but for any data that might put its customers at risk. For example, Amazon EMR and the Ahana Cloud managed service deliberately deploy all the Presto cluster infrastructure into a customer’s account and network for this reason.
Challenge 2: Coordinating Disaggregated Services
As a query engine, Presto is only one piece of a broader stack of disaggregated services required for SQL workloads on top of a data lake or federated across data sources. A metadata catalog is required to allow Presto to determine which files to properly scan for a given query; further, the metadata catalog can contain valuable data statistics that are crucial for optimal query performance. Another important piece is fine-grained data access control, governing which users have access to which table, columns and rows.
Ahana Cloud not only makes it easy to deploy Presto but also to integrate with related services. In the case of the metadata catalog, Ahana Cloud can integrate with an open option like the Hive Metastore or a cloud native one, like AWS Glue. In fact, for convenience, Ahana Cloud provides a built-in Hive Metastore, as well. For data access control, again, Ahana Cloud provides easy integration to open and cloud native options, such as Apache Ranger and AWS Lake Formation, respectively.
Challenge 3: Price Performance
After all the infrastructure, configuration and deployment is said and done, there must be a sustainable cost-to-value trajectory. With ever more data, SQL compute costs — already the dominant cost driver in a big data SQL stack — will continue to grow. In the cloud, data lakes built on object stores offer scalable, resilient, low-cost file storage, in contrast to expensive specialized hardware. Well-designed and well-priced query engine services can deliver superior price performance compared to non-data-lake alternatives.
Our solution, Ahana Cloud, can offer at least three-times-better price performance than alternatives, thanks to sophisticated multilevel caching, intelligent autoscaling and the ability to run on price-performance-optimized CPUs. This will improve further with ongoing work on more efficient native execution.
In this post, we introduced you to Presto, a little bit of how it works, and some of the challenges of deploying and managing a Presto service. The cloud helps to alleviate many of the pain points compared to on-premises, and there are several Presto solutions for the cloud. Nonetheless, there are still challenges. Ahana Cloud is a cloud native managed service for Presto that aims to provide simplified management and configuration without limiting the most important aspects of control to deliver best-in-class price performance for large-scale SQL production workloads on the data lake.