
Kubernetes-Run Analytics at the Edge: Postgres, Kafka, Debezium

1 Jul 2021 4:00am, by
Jonathan S. Katz
Jonathan S. Katz is the Director of Cloud Engineering & Growth at Crunchy Data, the leading provider of trusted open source PostgreSQL technology, and is responsible for its cloud offerings, including the PostgreSQL Operator. Jonathan is also responsible for advocacy and other governance efforts of the PostgreSQL Global Development Group and is a board member of the nonprofit United States PostgreSQL Association. Jonathan enjoys building applications with PostgreSQL and revels in showing off all of the wonderful features of PostgreSQL. Prior to Crunchy Data, Jonathan was CTO at VenueBook, and before that, VP of Technology at Paperless Post. At both companies Jonathan developed robust platforms using PostgreSQL, taking advantage of its many features, from complex data types to its ability to stream logical changes. Jonathan graduated from Tufts University with a B.S. in Computer Science and a B.A. in Mathematics.

You don’t have to look very far to be reminded that macro trends like cloud computing, the Internet of Things (IoT), and machine learning/AI are all driving the creation of applications and compute requirements that span from data centers to public clouds to connected devices. Users now describe this ability to extend data sources to applications and users across that spectrum as “edge computing.”

Edge computing helps businesses become more proactive and dynamic by placing applications and processing power closer to the devices and sensors that create or consume data, enabling them to gather, analyze and turn large dataflows into actionable insights, faster.  Industries everywhere are discovering the transformational opportunities that bringing edge computing together with cloud native applications and cloud operational models can create for their business.

Fortunately, open source software is particularly well suited to address the inherent flexibility required by edge computing use cases.

In particular, PostgreSQL, Apache Kafka and the open source change data capture software Debezium can be deployed using Kubernetes Operators to provide a cloud native data analytics solution that can be used for a variety of use cases and across hybrid cloud environments, including the data center, public cloud infrastructure, and the edge.

Data Analytics at the Edge

One powerful application of the potential for cloud native data analytics services built from PostgreSQL and Kubernetes is spatial data analysis of connected vehicles.

Connected vehicles, whether used by ride-hailing or delivery services, both produce vast arrays of sensor data and benefit from the rich analytics that result from aggregating and processing this data. These connected vehicles, and the supporting fleet management systems found in the manufacturing and logistics ecosystem, stand to benefit from cloud native data analytics services.

The raw data generated by fleets of moving objects fitted with sensors (aka “telematics”) is voluminous and fast-changing. It also contains many analytical insights that can be mined using systems that combine stream processing, to extract data of interest, with edge databases, to collate and analyze those data.

PostgreSQL, Apache Kafka and Debezium can be deployed using Kubernetes Operators to provide a cloud native data analytics solution that can be used for a variety of use cases and across hybrid cloud environments, including the data center, public cloud infrastructure, and the edge.

A standard architecture takes a stream of raw locations and corrects them to the known travel paths of interest: a street network. Particular pieces of the data can then be drawn out and generalized. For example, a hard “Z spike” in an accelerometer from one car could be a glitch, but if clustered with spikes from multiple cars at the same location, it could indicate a pothole requiring maintenance. The spikes can be extracted from the main stream and analyzed in a separate edge analytics database.
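The pothole example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Reading` type, the spike threshold, the grid size and the minimum vehicle count are all hypothetical, and a real deployment would more likely run a spatial clustering query inside PostGIS than bucket points on a grid.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    vehicle_id: str
    lon: float
    lat: float
    z_accel: float  # vertical acceleration (hypothetical unit: g)


def grid_cell(lon, lat, cell_deg=0.0001):
    """Bucket a coordinate into a coarse grid cell (roughly 11 m of latitude)."""
    return (round(lon / cell_deg), round(lat / cell_deg))


def pothole_candidates(readings, spike_threshold=2.5, min_vehicles=3):
    """Return grid cells where at least `min_vehicles` distinct vehicles spiked."""
    vehicles_by_cell = defaultdict(set)
    for r in readings:
        if abs(r.z_accel) >= spike_threshold:
            vehicles_by_cell[grid_cell(r.lon, r.lat)].add(r.vehicle_id)
    return {
        cell: ids
        for cell, ids in vehicles_by_cell.items()
        if len(ids) >= min_vehicles
    }
```

Counting distinct vehicles rather than raw spikes is what separates a single faulty sensor from a feature of the road itself.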

Similar systems can take aggregates to analyze large-scale patterns, like commuting movements, or small-scale phenomena like persistent under-speed areas of roadway. The key is moving the mid-process data out of the (huge, difficult to differentiate) main data stream and into edge databases for interactive exploration and visualization.
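As a rough sketch of the under-speed case, the aggregation might look like the following. The inputs are assumptions made for illustration: observations are taken to be already matched to street segments, and the 60% ratio and minimum sample count are arbitrary.

```python
from collections import defaultdict
from statistics import mean


def under_speed_segments(observations, speed_limits, ratio=0.6, min_samples=3):
    """Flag road segments whose mean observed speed is below `ratio` of the limit.

    observations: iterable of (segment_id, speed_kmh) pairs, already matched
    to the street network. speed_limits: dict of segment_id -> limit in km/h.
    """
    speeds = defaultdict(list)
    for segment_id, speed_kmh in observations:
        speeds[segment_id].append(speed_kmh)

    flagged = {}
    for segment_id, samples in speeds.items():
        limit = speed_limits.get(segment_id)
        if limit is None or len(samples) < min_samples:
            continue  # not enough evidence to call the slowdown persistent
        average = mean(samples)
        if average < ratio * limit:
            flagged[segment_id] = average
    return flagged
```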

The combination of open source Crunchy PostgreSQL for Kubernetes, Apache Kafka and Debezium on Kubernetes provides a data analytic pipeline from these connected cars at the edge to the public cloud or the private cloud data center.

Benefits of Change Data Capture

Microservices teams gain agility by avoiding dependencies such as shared database tiers or common access models, but they still need access to each other’s data. One popular solution to this information sharing challenge is for each microservices team to maintain an intermediate store of its choice and populate it with replicated data owned by other teams.

Change data capture (CDC) is a well-established software design pattern for a system that monitors and captures the changes in data so that other software can respond to those changes. CDC captures row-level changes from database tables and passes corresponding change events to a data streaming bus. Applications can read these change event streams and access these change events in the order in which they occurred. Thus, change data capture helps to bridge traditional data stores and new cloud native event-driven architectures.
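To make the idea of ordered, row-level change events concrete, here is a minimal sketch of a consumer that keeps a local copy of a table up to date. The events are loosely modeled on Debezium’s envelope (`op`, `before`, `after`); the `id` and `status` columns are hypothetical, and real events carry additional metadata that is omitted here.

```python
def apply_change(replica, event):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":                # delete
        replica.pop(event["before"]["id"], None)
    return replica


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "idle"}},
    {"op": "u", "before": {"id": 1, "status": "idle"},
     "after": {"id": 1, "status": "moving"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "idle"}},
    {"op": "d", "before": {"id": 2, "status": "idle"}, "after": None},
]

replica = {}
for event in events:               # order matters: apply in commit order
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'status': 'moving'}}
```

Because the events arrive in the order in which they occurred, replaying them always yields the same final state as the source table.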

Debezium is a set of distributed services that captures row-level changes in databases so that applications can see and respond to those changes. Debezium is built upon the Apache Kafka project and uses Kafka to transport the changes from one system to another.

Historically, data was kept in a monolithic datastore. Newer systems are trending towards microservices where the processing of data is broken up into smaller discrete tasks. The challenge at that point is making sure that each microservice has an up-to-date copy of the data. CDC shines at this as it:

● Uses the write-ahead logs to track the changes
● Uses the datastore to manage the changes, so data is not lost if a consumer is offline
● Pushes changes immediately
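The second point is the least obvious, so here is a toy model of it. This is not how PostgreSQL implements its write-ahead log; `ChangeLog` is a hypothetical stand-in that only illustrates the contract: the datastore retains every change until the consumer confirms it, so a consumer that goes offline picks up exactly where it left off.

```python
class ChangeLog:
    """Toy stand-in for a write-ahead log that retains entries until confirmed."""

    def __init__(self):
        self._entries = []        # list of (position, change)
        self._next_position = 1
        self.confirmed = 0        # highest position the consumer has acknowledged

    def append(self, change):
        position = self._next_position
        self._entries.append((position, change))
        self._next_position += 1
        return position

    def pending(self):
        """Changes the consumer has not yet acknowledged, in commit order."""
        return [(p, c) for p, c in self._entries if p > self.confirmed]

    def confirm(self, position):
        """Acknowledge everything up to `position`; older entries are discarded."""
        self.confirmed = max(self.confirmed, position)
        self._entries = [(p, c) for p, c in self._entries if p > self.confirmed]


log = ChangeLog()
log.append("INSERT trip 1")
log.append("UPDATE trip 1")
for position, _change in log.pending():   # consumer is online and keeps up
    log.confirm(position)

log.append("INSERT trip 2")               # consumer is offline for these
log.append("DELETE trip 1")
missed = [change for _position, change in log.pending()]
print(missed)  # ['INSERT trip 2', 'DELETE trip 1']
```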

The most interesting aspect of Debezium is that, at its core, it uses the database’s own CDC facilities to capture the data and push it into Kafka. The source PostgreSQL database remains untouched in the sense that we don’t have to add triggers or log tables. In fact, PostgreSQL has had built-in CDC facilities, in the form of logical decoding, since version 9.4, released in 2014. This is a huge advantage, as triggers and log tables degrade performance. In addition, PostgreSQL manages the changes in such a way that they are not lost during a restart or outage.
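In practice, pointing Debezium at a PostgreSQL database is mostly a matter of configuration. The sketch below builds the JSON payload that would be registered with Kafka Connect. The property names follow Debezium’s PostgreSQL connector as of its 1.x releases, so check them against the version you run; the host, database, table and slot names are placeholders invented for this example.

```python
import json


def debezium_postgres_connector(name, host, dbname, user, password, tables,
                                server_name="fleet", slot_name="debezium"):
    """Build a Kafka Connect registration payload for Debezium's PostgreSQL connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",   # logical decoding plugin built into PostgreSQL 10+
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Logical name used as the topic prefix in Debezium 1.x;
            # newer releases use "topic.prefix" instead.
            "database.server.name": server_name,
            "slot.name": slot_name,      # replication slot that retains unread changes
            "table.include.list": ",".join(tables),
        },
    }


payload = debezium_postgres_connector(
    name="fleet-connector",              # hypothetical connector name
    host="postgres.example.internal",    # hypothetical hostname
    dbname="fleet",
    user="debezium",
    password="change-me",
    tables=["public.trips", "public.sensor_events"],
)
print(json.dumps(payload, indent=2))
```

The resulting document would be sent to the Kafka Connect REST interface to create the connector. Note that nothing here adds triggers or log tables to the source database; the only footprint is the replication slot.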

This makes the system much more flexible. If you want to add a new microservice, simply subscribe to the topic in Kafka that is pertinent to the service.
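The reason this is so cheap is that each subscriber tracks its own position in the topic. The following toy model is not a Kafka client; `Topic` is a hypothetical in-memory class, and the topic name simply imitates Debezium’s server.schema.table naming. It shows a new service joining late without disturbing an existing one.

```python
class Topic:
    """Toy in-memory stand-in for a Kafka topic with per-group offsets."""

    def __init__(self, name):
        self.name = name
        self._log = []       # retained messages, oldest first
        self._offsets = {}   # consumer group -> next offset to read

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, group, from_beginning=True):
        """Register a consumer group without touching any other group's offset."""
        self._offsets.setdefault(group, 0 if from_beginning else len(self._log))

    def poll(self, group):
        """Return the unread messages for `group` and advance its offset."""
        start = self._offsets[group]
        self._offsets[group] = len(self._log)
        return self._log[start:]


topic = Topic("fleet.public.trips")
topic.subscribe("billing")
topic.publish("trip 1 started")
topic.publish("trip 1 ended")
first_batch = topic.poll("billing")      # billing reads both messages

topic.subscribe("maintenance")           # a new microservice joins later
topic.publish("trip 2 started")

print(topic.poll("billing"))             # ['trip 2 started']
print(topic.poll("maintenance"))         # ['trip 1 started', 'trip 1 ended', 'trip 2 started']
```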

Edge Data Analytics Toolbox

Implementing databases and data analytics within cloud native applications involves several steps and tools, from data ingestion and preliminary storage to data preparation and storage for analysis. An open, adaptable architecture will help you execute this process more effectively. This architecture requires several key technologies. Container and Kubernetes platforms provide a consistent foundation for deploying databases, data analytics tools, and cloud native applications across infrastructure, as well as self-service capabilities for developers and integrated compute acceleration.

PostgreSQL, Apache Kafka and Debezium can be deployed using Kubernetes Operators to provide a cloud native data analytics solution that can be used for a variety of use cases and across hybrid cloud environments (including the data center, public cloud infrastructure, and the edge) for all stages of cloud native application development and deployment.

PostgreSQL is a powerful, open source object-relational database system with more than 25 years of active development and a strong development community. While Postgres is known for and widely popular as a transactional database due to its SQL compliance, reliability, data integrity and ease of use, it can also be extended to perform advanced data analytics. When combined with PostGIS, the geospatial extender for PostgreSQL, PostgreSQL users are able to perform powerful spatial analytics within the PostgreSQL database. That said, in order to benefit from the rich analytic capability of Postgres for edge applications, it is necessary to bring the data from the edge to PostgreSQL.

Crunchy PostgreSQL for Kubernetes allows enterprises to deploy production-ready, trusted open source PostgreSQL on Kubernetes. Powered by the open source PGO, the Postgres Operator from Crunchy Data, Crunchy PostgreSQL for Kubernetes provides, in a turnkey manner, the essential features for running PostgreSQL at an enterprise level. These include provisioning, high availability, disaster recovery (backups & restores), monitoring, advanced security controls, and more.

Apache Kafka has become the streaming technology of choice for replicating data between microservices. Microservices teams prize Kafka for its performance, scalability, and ability to replay streams, which lets them reset their intermediate stores to any point in time. This is a benefit not just to microservices teams but also to a large range of use cases, including website activity tracking, metrics and log aggregation, stream processing, event sourcing, and Internet of Things (IoT) telemetry. As more applications move to Kubernetes, it is increasingly important to be able to run the communication infrastructure on the same platform. Kubernetes, as a highly scalable platform, is a natural fit for messaging technologies such as Kafka. The Strimzi Kafka Operator makes running and managing Apache Kafka Kubernetes native, using operators that simplify the deployment, configuration, management, and use of Apache Kafka on Kubernetes.
