
Accelerating SQL Queries on a Modern Real-Time Database

A Trino connector enables SQL access to real-time data and allows data architects and developers to expand the fast analytics data accessible from Trino.
Nov 3rd, 2022 10:48am
Featured image via Pixabay.

When deploying large-scale real-time applications across a wide range of verticals, businesses require “as it happens” visibility into these systems, sometimes in near-real time, through notifications, ad-hoc queries, dashboards, and reports.

SQL is broadly used as a data access language for analytics, and Trino provides a powerful engine for SQL access to multiple data sources. A Trino connector enables SQL access to real-time data through Trino and, more broadly, allows data architects and developers to expand the fast analytics data accessible from Trino.

For analytics use cases, you can synchronize transactional and operational data from globally distributed clusters to a system of record (SOR) or analytics store in near-real time using an advanced cross-datacenter replication (XDR) protocol.

Enabling Fast Queries at Scale

A modern real-time database provides fast access to large volumes of data with massive parallelism.

Large Volume of Fast Storage

Modern databases have a cluster architecture spanning multiple nodes for scale, performance, and reliability. High density of fast storage is achieved through solid-state disks (SSDs). Hybrid memory architecture (HMA) stores indexes and data in dynamic random-access memory (DRAM), SSD, and other devices to provide cost-effective fast storage capacity.

Low-Latency Access

Disk (SSD) reads and writes are optimized for latency and throughput.

Indexes play a key role in realizing fast access to data. This requires supporting secondary indexes on integer, string, geospatial, map, and list columns.

Multiple Levels of Parallelism

The thread architecture on a cluster node is optimized to exploit the parallelism of modern multicore processors while minimizing contention and maximizing throughput. Data is distributed uniformly across all nodes to maximize parallelism and throughput. The client library connects directly to individual cluster nodes and processes a request in a single hop: it distributes the request to the nodes that hold the data, where it is processed in parallel, and assembles the results.
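
As a rough illustration of this single-hop, fan-out pattern, the sketch below simulates a client that queries several nodes in parallel and merges their partial results. The node names and data are invented for the example; a real client library would talk to actual cluster nodes over the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated cluster: each node holds a slice of the data.
NODES = {
    "node-a": [1, 4, 7],
    "node-b": [2, 5, 8],
    "node-c": [3, 6, 9],
}

def query_node(node_id, predicate):
    # Each node scans only its local records.
    return [r for r in NODES[node_id] if predicate(r)]

def parallel_query(predicate):
    # One request per node, processed concurrently in a single hop;
    # the client then assembles the partial result sets.
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        parts = pool.map(lambda n: query_node(n, predicate), NODES)
    return sorted(r for part in parts for r in part)

print(parallel_query(lambda r: r % 2 == 0))  # → [2, 4, 6, 8]
```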

Fine-Grained Sub-Queries

Ideally, you want to distribute records in uniform partitions, and allow separate sub-queries over them for maximum parallelism. In other words, a query is split into independent parallel sub-queries over one or more partitions, for the needed parallelism to match the required throughput. Further, each data partition can be subdivided into N sub-partitions by adding the modulo filter expression digest % N == i for 0 <= i < N, where digest is the hashed key of the record.

Since digest is held in memory with other record metadata, the filter expression evaluation for a record’s membership in a sub-partition requires no access to data on the SSD. Therefore, a sub-partition query reads only the data in its sub-partition, minimizing the SSD reads across the multiple sub-partitions. This sub-partitioning scheme allows for an arbitrary number of parallel streams.
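
The sub-partitioning scheme above can be sketched in a few lines of Python. Here the record digest is simulated with a SHA-1 hash of the key, and N = 4 is an arbitrary choice; the point is that digest % N assigns every record to exactly one of N mutually exclusive, collectively exhaustive sub-partitions.

```python
import hashlib

N = 4  # number of parallel streams (arbitrary for this sketch)

def digest(key):
    # Stand-in for the database's record digest: a hash of the key.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

keys = [f"record-{i}" for i in range(1000)]
sub_partitions = {i: [] for i in range(N)}
for key in keys:
    # digest % N == i selects sub-partition i.
    sub_partitions[digest(key) % N].append(key)

# Every record lands in exactly one sub-partition, so the N parallel
# streams together cover the data with no overlap.
assert sum(len(v) for v in sub_partitions.values()) == len(keys)
print([len(v) for v in sub_partitions.values()])  # roughly uniform sizes
```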

Using this scheme, a large number of parallel tasks in Trino worker nodes can uniformly split the data for processing via an equal number of mutually exclusive and collectively exhaustive splits or streams using partition queries in combination with the modulo filter expression. The appropriate data scale, throughput, and response time can be achieved by adjusting the cluster size as well as the number of attached SSD devices per node.

Stream Access

Query results can be retrieved and processed in a stream of smaller chunks by repeatedly asking for a specific number of remaining records.
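
A minimal sketch of this chunked retrieval, with a plain Python list standing in for the database cursor: the caller repeatedly asks for the next batch of remaining records instead of materializing the whole result set at once.

```python
def stream_results(records, chunk_size):
    # Yield the query result as a stream of fixed-size chunks.
    remaining = iter(records)
    while True:
        chunk = []
        for _ in range(chunk_size):
            try:
                chunk.append(next(remaining))
            except StopIteration:
                break
        if not chunk:
            return
        yield chunk

for chunk in stream_results(range(10), chunk_size=4):
    print(chunk)  # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]
```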

The Trino Connector

The Aerospike Trino Connector enables access to real-time data through Trino for analytics use cases such as ad-hoc SQL queries, reports, and dashboards. Data in multiple clusters can be queried together using Trino’s data federation, which also makes it possible to merge Aerospike data with data from other sources.
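
A hypothetical federated query through Trino might look like the following. The catalog, schema, and table names are invented for illustration, and the connection code, which uses the community trino Python client and needs a live coordinator, is left commented out.

```python
# Illustrative federated query joining Aerospike data with a table
# from another catalog via Trino's data federation. All names below
# are assumptions for the example, not from a real deployment.
query = """
SELECT u.user_id, u.segment, e.event_count
FROM aerospike.app.users AS u
JOIN postgres.public.event_counts AS e
  ON u.user_id = e.user_id
WHERE u.segment = 'premium'
"""

# Submitting it would look roughly like this with the `trino` Python
# client (requires a running Trino coordinator, hence commented out):
# import trino
# conn = trino.dbapi.connect(host="localhost", port=8080,
#                            user="analyst", catalog="aerospike")
# rows = conn.cursor().execute(query).fetchall()

print(query.strip().splitlines()[0])  # → SELECT u.user_id, u.segment, e.event_count
```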

Additional details about the Trino Connector can be found in the blog posts “Deploy Aerospike and Trino based analytics platform using Docker” and “Aerospike Trino Connector – Chapter Two.”

Starburst is a SQL-based MPP (massively parallel processing) query engine based on Trino that you can run on a single machine or a cluster of machines, on-premises or in the cloud. The blog post “Analyze Data with Aerospike and Starburst Anywhere” describes how to use Starburst Enterprise. The recently released Aerospike SQL Powered by Starburst 1.1.0 supports Starburst Enterprise Platform (SEP), which is “a fully supported, enterprise-grade distribution of Trino that adds integrations, improves performance, provides security, and makes it easy to deploy, configure, and manage your clusters.”

Other SQL Options

Applications can access the data through SQL in a few other ways.

Spark

You can use Spark SQL to manipulate Aerospike data on the Spark platform. Aerospike Spark Connector provides parallel access to the Aerospike cluster from Spark.

Spark SQL merges two abstractions: resilient distributed datasets (RDDs) and relational tables, and is used to manipulate and process data in RDDs. Find examples of importing and storing Aerospike data to and from RDDs in these tutorials.

More details on the Spark Connector are available in the blog posts “Using Aerospike Connect for Spark” and “Accelerate Spark queries with Predicate Pushdown using Aerospike.”

JDBC

Application developers can use simple SQL to access Aerospike data with the community-contributed JDBC Connector. Please read more details in the blog post “Introducing Aerospike JDBC Driver.”

Aerospike API

Applications requiring the full functionality of the Aerospike API can use the SQL patterns available in the APIs to implement specific SQL CRUD operations easily.
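
As a toy illustration of the SQL-pattern idea, the sketch below maps primary-key CRUD statements onto single key-value operations, with an in-memory dict standing in for a real Aerospike client; the function names and record shapes are invented for the example.

```python
# Each SQL CRUD statement keyed on a primary key corresponds to one
# key-value operation against the store.
store = {}

def sql_insert(key, record):   # INSERT INTO t VALUES (...)
    store[key] = dict(record)

def sql_select(key):           # SELECT * FROM t WHERE pk = key
    return store.get(key)

def sql_update(key, changes):  # UPDATE t SET ... WHERE pk = key
    store[key].update(changes)

def sql_delete(key):           # DELETE FROM t WHERE pk = key
    store.pop(key, None)

sql_insert("user:1", {"name": "Ada", "plan": "free"})
sql_update("user:1", {"plan": "premium"})
print(sql_select("user:1"))  # → {'name': 'Ada', 'plan': 'premium'}
sql_delete("user:1")
print(sql_select("user:1"))  # → None
```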

Aerospike Database’s fast high-capacity storage and parallel processing aligns with Trino’s distributed SQL query engine to accelerate query processing over large data sets. Aerospike’s hybrid memory architecture (HMA) leverages SSDs along with DRAM to greatly expand fast storage capacity in its cluster. Further, Aerospike distributes data and processing over a large number of partitions, providing a high degree of parallelism. For Trino queries over data in Aerospike, the result is accelerated performance at scale.

TNS owner Insight Partners is an investor in: Pragma, Docker, Real.