Data / Observability / Contributed

Run Optimized Geospatial Queries with Trino

8 Apr 2022 10:00am, by
Guy Mast
Guy Mast is a Product Manager at Varada, an autonomous data lake acceleration platform, where he works on combining Varada's indexing technology with the Trino OSS project.

The Trino open source distributed query engine is known as a choice for running ad-hoc analysis where there’s no need to model the data and it can be easily sliced and diced. As such, Trino can be also leveraged for running geospatial workloads atop different data sources.

When it comes to running geospatial queries, Trino is OpenGIS compliant and supports a wealth of geospatial-specific functions using WKT/WKB vector data formats. Its diverse array of functions gives you the ability to unify and join geospatial data from multiple sources. For example, you can join points of interest by using Trino’s Postgres connector with events tables stored on S3 by using the Hive connector. 

In this post, we’ll walk you through two methods for running geospatial queries on the data lake using Trino’s Hive connector, explore some optimizations to help you accelerate and improve the interactivity for your geospatial queries.

Deep-Dive: Geospatial Analysis Methods with Trino

Leveraging Trino’s native Geospatial support and Bing tiles system are two methods for running interactive geospatial queries for common use cases. In this section, we’ll demonstrate these two methods by querying the trips_data table (4.4 billion rows) and the places table (10 rows) from the public ride-sharing dataset by using Trino’s Hive connector.

Places table locations:

Here is a list with some common geospatial analysis use-cases we came across:

  1. Getting the details of events near points of interest, for example, trips in a ride-sharing app passing by various distances from a billboard or a store — by using Trino’s great_circle_distance function 
  2. Intersecting events (such as taxi trips) with a well-defined polygon (such as neighborhoods) — by using Trino’s ST_Contains function 
  3. Finding out if a polygon contains a specific  point, for example, if a trip started or ended in a specific area of the city — by using Trino’s ST_Contains function

Method 1 – Leveraging Trino’s Geospatial Support

Trino offers a wide range of geospatial functions that can be used out of the box.

The first use case might require running queries that get all the riders that were up to some distance from the points of interest (restaurants, billboard, etc.).

Below you will find such a query that joins the trips_data table and the places table, and counts the number of riders in trips that passed up to 500 million from the points of interest stored in the places table.

Finding Adjacent Geographical Data Points

Query #1

Finding Data Points in Defined Geographical Locations

Another example for using the Geospatial functions in Trino is running queries related to use-case 2 in which we look up for events that took place in a well-defined area. Below you kind find such a query that counts the number of drivers that drove in a specific area:

Query #2 

In the above query the Geospatial functions used in the predicate were not pushed down to the Hive connector, resulting in a massive data read. In order to mitigate this, it’s possible to build an additional bounding box by using the lon and lat columns as suggested in query #2.1 or to use the Bing tiles system as suggested in query #4.

Building Additional Bounding Box

Query #2.1 

Here we built a bounding box using the lat, lon columns which include the polygon of interest. The filters on these columns are being pushed down to the Hive connector, which results in reading less data from S3, compared to query #2.

Although we saw a performance improvement by adding the bounding box predicate both on the Hive and Varada connector, it’s important to note that dynamically building a bounding box is not always straightforward.

Method 2 – Using Bing Tiles

Bing tiles are a way to define map areas at different resolutions:

They are stored as “quadkeys”:

  • Quadkey = “03” defines a tile at zoom level 2, tile side length is ~7,000 kilometers
  • Quadkey = “03023323” is zoom level 8, which side length is ~60 kilometers at the equator 
  • We used level 19, which side length is ~75 meters

Note that a prefix of a tile quadkey is always a larger tile that includes that tile at a lower resolution.

Building Bing Tile Using Trino’s CTAS Command

In order to utilize the Bing tile segmentation, we can either create the Bing tile during the query runtime (as we did in query #4) or by using an ETL/ELT procedure that will add a new quadkey column to both the trips_data and the places table. In this article, we used Trino’s CTAS command for the ELT procedure described here.

Once the quadkey column is in place we can now JOIN the places table and the table trips_data table by the quakey column.

CTAS for creating the trips_data_bing table

The instructions for creating the trips_data and places table can be found here.

Finding adjacent geographical data points using Bing tiles

Below in Query #3 we implement the same logic as query #1 and utilize the Bing tiles for doing the join between the places and trips table.

Query #3

This query joins the table by the quadkey column, which significantly reduces the number of rows read and boosts the query performance thanks to the dynamic filtering optimization.

Finding Data Points in Defined Geographical Locations Using Bing Tiles

Bing tiles can also be leveraged as can be seen below in query #4, which implements the same login we saw on query #2:

Query #4

When creating the Bing tile during query execution, execution time was slower for this query compared to the equivalent query #2, which might suggest that for this type of queries creating the Bing tile during query execution might not always be beneficial. That being said, when we ran the same experiment using the Varada Trino Connector we did see a significant improvement.

The Results 

After running the above queries we can then compile the data into a mini benchmark that details the queries’ runtime and showing the improvement factor after implementing the suggested optimization. Here are the results of a mini benchmark data, using 4Xi3.4xlarage machines on Amazon Web Services.

Optimization improves performance by up to four times.

Conclusion 

Running these queries underlines the fact that you can use Trino to run geospatial queries on your data lake thanks to its out-of-the-box support for geospatial functions. 

However, by simply applying the suggested methods above you can improve your geospatial queries performance by up to X4.

You can leverage your investment in Trino to run geospatial analysis today on the Hive connector by applying any of the methods and optimizations described in this article. And check out this post for results of running the same queries on the Community Edition of Varada’s Trino connector.

Feature image via Pixabay.