From Big to Fast: Presto Continues to Shine for Cloud Data Lake Analytics
Over the last three decades, various technologies have been developed and applied for big data analytics on structured and unstructured business data. Because today most companies store data on different platforms in multiple locations with various data formats, these large and diverse data sets often stymie the ability to capture real-time opportunity and extract actionable data insights. Enterprise organization data from those multiple systems traditionally went through an Extract, Transform and Load (ETL) process to transform the data from its operational state into a state that’s better for querying, and then the data was often moved into the enterprise data warehouse.
The ETL process created a big problem for users like data engineers, data scientists, and system administrators as the ETL jobs for data pipelining and data movement can be very time consuming and difficult to manage when working with multiple databases and datastores. Hence, demand of analytics-as-a-service has been emerging as a new model to meet the business requirements for both archived and real-time data.
A federated data computing framework that allows users to retrieve and analyze data from multiple data sources and compute in a centralized platform is a better solution for fast data analytics in today’s data environments.
Presto, an open source platform, was originally designed to replace Hive, a batch approach to SQL on Hadoop and was built with higher performance and more interactivity compared with Apache Hive. The concept of Presto was to support a MPP (Massive Parallel Processing) framework to compute large scale data, so the architectural model is designed to support disaggregation of compute and storage and process real-time and high performing data analytics. Presto is not supposed to store data, instead the data sources are accessed via various connectors.
After years of development, the latest version of PrestoDB supports SQL-on-Anything and is an interactive query framework that can fit in any enterprise architecture as an in-memory query engine.
How Does Presto Work?
Presto is based upon a standard MPP database architecture, which enables horizontal scalability and the ability to process large amounts of data. Presto’s in-memory capabilities allow for interactive querying across various platforms of data sources. In order to access data at different locations, Presto is designed to be extensible with a pluggable architecture. Many components can be added to Presto to extend it further using this architecture, including connectors and security integrations.
The Presto cluster is a query engine that runs a single-server process on each instance, or node. It consists of two types of service processes: a Coordinator node and a Worker node. The Coordinator node’s main purpose is to receive SQL statements from the users, parse the SQL statements, generate a query plan, and schedule tasks to dispatch across Worker nodes. Meanwhile, the Worker node may communicate with other Worker nodes and execute the task from the query plan from the Coordinator, which is fragmented for distributed processing.
The Presto Coordinator is a single node deployed to manage the cluster. The coordinator allows users to submit queries via Presto CLI, applications using JDBC or ODBC drivers, or other available client API libraries of connections. The Coordinator is also responsible to talk to Workers to get update status, assign tasks, and send the output result sets back to the users. All communication is done through the RESTful API by Coordinator’s StatementResource class.
Inside a Presto cluster, there may be one Coordinator node with multiple Worker nodes. If the Coordinator is a leader, the Worker nodes are followers. Each Worker node stays alive as a process of a service that listens to the Coordinator for task executions and actual compute. The Worker will periodically send a heartbeat to the Discovery Server via RESTful API to signal the server with its health status of whether or not the worker is online or offline. This lets the Coordinator know from the Discovery Server which Worker nodes are available to dispatch tasks when the user submits a query.
The logical implementation of Presto is shown below. There are seven basic steps for running a query:
- User submits a query from client API to Presto coordinator via HTTP protocol.
- The Coordinator receives the SQL statement, which is in textual format, follows a series of steps to parse, analyze and create a logical plan for execution using an internal data structure called the Presto query plan. After the query plan is optimized, there are three internal classes including query execution, stage execution, and task distribution, that are generated accordingly so that the Coordinator can create HTTP tasks depending on data locality property.
- Depending on where the data is located, the Coordinator generates tasks and dispatches to the designated Worker nodes to process through HttpRemoteTask’s HttpClient. HttpClient creates or updates the task’s request and the TaskResource on the local data Worker node provides a RESTful API. The TaskResource takes the request and either starts a SqlTaskExecution object on the corresponding Worker node or updates the object’s Split.
- Upstream task reads data from the corresponding connector.
- Downstream task consumes output buffers from upstream’s task and starts to compute for data processing within its stage on all Worker nodes. Presto is a memory computing engine, so the memory management must be refined to ensure the orderly and smooth execution of query, and some cases such as starvation and deadlock occur. Because the worker node is designed for pure in-memory computing, there will be no data spill to the disk when there is not enough memory. Therefore, the query could fail due to Out-Of-Memory. However, the latest version of Presto supports disk spills which is an option for users to tune but not recommended due to the high latency cost it will bring if the switch is turned on.
- Once the Coordinator dispatches the task across Worker nodes, it continuously listens to retrieve task’s computing results within the final stage.
- After the client submits a SQL statement, the client continuously listens to retrieve final result sets from the Coordinator. These result sets are streamed back to the client piece by piece using HTTP protocol when the outputs are available.
Presto doesn’t use MapReduce. It computes through a custom query and execution engine. All of its query processing is in memory, which is one of the main reasons for its high performance.
Presto’s Use Cases
Presto is a distributed SQL engine for data analysis on data warehouses and other disparate data sources. It can achieve excellent performance for real-time or quasi-real-time analytic computing. Queries run with response times from millisecond to seconds. For a complex query with the right configuration, runtime can finish within the unit of minutes vs. hours or days if running on the Hive system. With its federated architecture, Presto is a proven technology that is most suitable for the following application scenarios:
- Replace Hive queries for better performance. Presto’s execution model is a pure memory MPP model, which is at least 10 times faster than the MapReduce model of disk shuffle used by Hive.
- Unified SQL execution engine. Presto is compatible with the ANSI SQL standard and can connect to multiple RDBMS and data warehouse data sources, using the same SQL syntax and SQL functions on these data sources.
- Bring SQL execution capabilities to storage systems that do not have SQL execution capabilities. For example, Presto can bring SQL execution capabilities to HBase, Elasticsearch, and Kafka, and even local files, memory, JMX, and HTTP interfaces.
- Construct a virtual unified data warehouse with federated querying of multiple data sources. If the data sources that need to be accessed are scattered in different RDBMS, data warehouses, and even other Remote Procedure Call (RPC) systems, Presto can directly associate these data sources together for analysis (SQL Join), without the need to copy data from the data source, and without needing to centralize it in one location.
- Data migration and ETL tools. Presto can connect to multiple data sources, plus it has a wealth of SQL functions and UDFs, which can conveniently help data engineers pull (E), transform (T), and load (L) data from one data source to another data source.
A Popular Data Lake Analytic Engine
First of all, Presto adopts a full memory computing model with excellent performance, which is especially suitable for ad hoc query, data exploration, BI reporting and dashboarding, lightweight ETL and other business scenarios.
Secondly, unlike other engines that only support partial SQL semantics, Presto supports complete SQL semantics, so you don’t have to worry about any requirements that Presto can’t express. Furthermore, Presto has a very convenient plug-in mechanism, you can add your own plug-ins without changing the kernel. In theory, you can use Presto to connect any data source to meet your various business scenarios.
Finally, Presto has a very active community. As part of the Linux Foundation’s Presto Foundation, many large enterprise companies in addition to Facebook such as Twitter, Uber, Amazon Athena, and Alibaba embrace Presto’s data lake analytic capability to develop features using Presto’s codebase to support large scale, high volume OLAP transactions on top of their own data federation system. Based on the above advantages, Presto is a proven technology to provide cloud data lake analysis as the underlying analytics engine.
The priority design of data lake, through opening the underlying file storage, brings maximum flexibility to the data into the lake. The data entering the data lake can be structured, semi-structured, or even completely unstructured raw logs. In addition, open storage also brings more flexibility to the upper-level engines. Various engines can read and write the data stored in the data lake according to their own scenarios, and only need to follow the relatively loose compatibility conventions (such loose conventions will have hidden challenges, which will be mentioned later). But at the same time, file system direct access makes many higher-level functions difficult to implement. For example, fine-grained (less than file granularity) permission management, unified file management and read-write interface upgrade are also very difficult (each access file engine needs to be upgraded before the upgrade is completed).
A Fast SQL Engine
Presto also features performant SQL processing. Here are a few reasons why:
- Presto supports standard ANSI SQL, including complex queries, aggregation, join, and window functions. As the substitutes of Hive and Pig (Hive and Pig complete HDFS data query through MapReduce pipeline), Presto does not store data itself, but can access multiple data sources, and supports cascading queries across data sources.
- YARN is a general resource management system. However, no matter what kind of engine Hive uses when executing SQL, such as MR and TEZ, each executing operator runs in the YARN container, and the performance of YARN pulling up the container is particularly low (second level). It’s like an application pulling up a process and turning on multithreading. The thread is more lightweight, and the speed of starting the thread is faster and the acceleration is more obvious with simple operation; however, the startup process is much more cumbersome, and it is easy to be restricted by the operating system. Presto scheduling uses threads, not processes.
- Presto’s Coordinator/Worker architecture is more like Spark standalone mode, which is only completed in two processes and services. However, Spark focuses more on the dependency relationship between SparkRDD, and stage failure and linear recovery lead to higher overhead. Spark input also directly relies on Hadoop input format API, which makes SparkSQL unable to transmit SQL optimization details to inputformat at runtime. Presto discards Hadoop inputformat, but adopts similar data partition technology. After SQL is parsed, it can generate a tuple domain from where conditions pass to the connector. The connector can use a certain degree of index push down according to the data sources according to the metastore data, and greatly reduce the data scanning interval and the amount of data involved in calculation.
- Presto is completely memory-based parallel computing. Unlike Hive MR/ TEZ which needs to write intermediate data to disk or Spark which needs to write overflow data to disk, Presto completely assumes that data can be effectively put into memory. Furthermore, thanks to Presto’s pipelined job computing capability, the data displayed can be returned immediately by analyzing the execution plan of SQL. While this gives users a very fast “false impression”, this “illusion” is also justifiable. Even if we extract a large amount of data from a result, we also traverse the cursor. When we traverse to that location, the subsequent result data has been continuously calculated, which does not affect our results.
In many scenarios, Presto’s ad-hoc query runtime is expected to be 10 times faster than Hive in seconds or minutes. It supports multiple data sources, such as Hive, Kafka, MySQL, MongoDB, Redis, JMX, and more. As an open source distributed SQL query engine, Presto is a proven analytic framework to quickly analyze queries for any size of data. It supports both non-relational and relational data sources. Supported non-relational data sources include Hadoop distributed file system (HDFS), Amazon S3, Cassandra, MongoDB, and HBase. Furthermore, Presto supports JDBC / ODBC connection, ANSI SQL, window function, join, aggregation, complex query, etc. These key features are the founding keystones of building a cloud-based data lake analytics.
The Linux Foundation is a sponsor of The New Stack.
Feature image via Pixabay.