
Building Large-Scale Real-Time JSON Applications

What to look for in a real-time data platform and the core JSON functionality required for ease of development.
Sep 9th, 2022 10:55am by Neel Phadnis
Feature image via Pixabay.

“Real-time describes various operations or processes that respond to inputs reliably within a specified time interval.”— Wikipedia

Neel Phadnis
Neel is the director of developer engagement at Aerospike. He is a technologist with leadership experience in building innovative products and bringing them to market.  He has held senior engineering management roles at Tealeaf, Efficient Frontier, AOL and Netscape. Neel holds an MBA from Boston University, Master’s in computer science from the University of Wisconsin-Madison, and B.Tech. from Indian Institute of Technology, Madras.

Real-time data must be processed soon after it is generated; otherwise, its value is diminished. Real-time applications must respond within a tight timeframe, or the user experience and business results suffer. They must therefore have reliable, fast access to all data, real-time or otherwise.

The number of real-time interactions between people and devices continues to grow. Leveraging real-time data still provides a competitive edge in some areas, while in others it is simply expected. Up-to-the-moment, relevant information delivers the best possible customer experience and business decisions.

Much of the data today is generated, transferred, stored and consumed in the JSON format, including real-time data such as feeds from IoT sensors and social networks, and prior data such as user profiles and product catalogs.

JSON data is therefore ubiquitous and growing in use. The best possible real-time decisions, increasingly driven by artificial intelligence/machine learning (AI/ML) algorithms, depend on continually updated, massive data sets.

Overview

This article discusses the database perspective on building large-scale real-time JSON applications and touches upon the following key topics:

  • What to look for in a real-time data platform.
  • How to organize JSON documents for speed at scale.
  • The core JSON functionality required for ease of development.

Database for Large-Scale JSON Applications

The key requirements in a database to build such applications are described below.

Reliably Fast Random Access at Scale

A reliably fast response time for read and write operations, at any scale and under any read-write workload mix, is required to meet the real-time contract. This is delivered through:

  • Fast and uniform hash-based data distribution to all nodes for optimal resource utilization
  • Cost-effective storage of indexes and data in an optimal mix of DRAM, SSD and other devices
  • Optimized processing of writes and garbage collection for predictable response
  • One-hop access to all data from the application
  • A smart client that handles cluster transitions and data movements transparently
  • Primary and secondary indexes for fast access
  • Async and background processing modes for greater efficiency
  • Multi-op requests to perform many single-record operations atomically in one request
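The first two points above can be sketched in a few lines. The following is an illustrative, pure-Python model of uniform hash-based data distribution, not the platform's actual implementation: a record key is hashed to a fixed number of partitions, and a partition table maps partitions to nodes (the digest function, partition count and node-assignment scheme here are all assumptions for illustration).

```python
import hashlib

NUM_PARTITIONS = 4096  # illustrative fixed partition count

def partition_for_key(set_name: str, user_key: str) -> int:
    """Map a record key to a partition via a uniform hash digest."""
    digest = hashlib.sha256(f"{set_name}:{user_key}".encode()).digest()
    # Use the low bytes of the digest to pick a partition uniformly.
    return int.from_bytes(digest[:4], "little") % NUM_PARTITIONS

def node_for_partition(partition_id: int, nodes: list) -> str:
    """Trivial partition-to-node map; a real cluster maintains a
    replicated partition table that the smart client caches."""
    return nodes[partition_id % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
pid = partition_for_key("events", "event-123")
owner = node_for_partition(pid, nodes)
```

Because the client can compute the partition (and thus the owning node) from the key alone, every record is reachable in one network hop.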

Fast Ingest Rate

The database must support fast ingestion speeds so that surges in real-time data feeds do not overwhelm the system or result in data loss.

The database must support batch operations for read, write, delete and user-defined logic, so that ingest can achieve the necessary high throughput.
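As a minimal sketch of the batching idea (the class and its interface are hypothetical, not a real client API), an ingester can buffer incoming records and flush them in fixed-size batches, so each flush becomes one batch request rather than many single-record writes:

```python
from typing import Callable, List

class BatchIngester:
    """Buffer incoming records and flush them in batches.
    Illustrative only; a real client would issue one batch-write
    request per flush."""

    def __init__(self, flush_fn: Callable[[List[dict]], None], batch_size: int = 100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer: List[dict] = []

    def ingest(self, record: dict) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
ing = BatchIngester(batches.append, batch_size=3)
for i in range(7):
    ing.ingest({"event-id": i})
ing.flush()  # drain the final partial batch
```

Seven records flushed with a batch size of three yield batches of 3, 3 and 1.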

Fast Queries

The database must handle concurrent queries over large data efficiently and provide various indexes and preferably granular control over parallel processing of queries.

Convenient JSONPath-Based Access

A JSONPath-based document API offers a convenient way to access and modify specific elements within a document, as discussed below.

Rich Document Functionality

JSON documents are stored in the database using data types that must offer rich functionality, as described below.

Efficient Storage and Transfer

The internal format must allow the documents to be stored and transferred efficiently.

Rich API

The API must support complex processing on the server side to avoid retrieving data to the client side.

Well Integrated with Other Performance Features

The document data types must be well integrated into various performance features, including server-side processing, batch requests, multi-op requests and secondary indexes.

  • Documents can be used in expressions that offer efficient server-side execution.
  • Batch requests can be performed on multiple documents.
  • Multi-op requests allow many operations on one document to be performed in one request. For instance, in the same request, you can add items to a JSON array, sort it, get its new size and top N items.
  • Document elements at any nested level can be indexed for fast and convenient access, as described further below.
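The multi-op example above (append, sort, size, top N in one request) can be modeled in pure Python. This is a sketch of the semantics only, with hypothetical operation names, not the real client API: all operations apply to the record under one lock, mimicking a single atomic request that returns every result in one round trip.

```python
import threading

class Record:
    """Sketch of multi-op semantics: several operations on one record
    applied atomically, with all results returned together."""

    def __init__(self, bins: dict):
        self.bins = bins
        self._lock = threading.Lock()

    def operate(self, ops: list) -> list:
        results = []
        with self._lock:  # all ops apply under one lock, like one atomic request
            for op, *args in ops:
                results.append(op(self.bins, *args))
        return results

# Hypothetical list operations, named loosely after common CDT ops:
def list_append(bins, name, items):
    bins[name].extend(items)

def list_sort(bins, name):
    bins[name].sort()

def list_size(bins, name):
    return len(bins[name])

def list_top_n(bins, name, n):
    return bins[name][-n:][::-1]  # largest values first

rec = Record({"scores": [5, 1]})
_, _, size, top2 = rec.operate([
    (list_append, "scores", [9, 3]),
    (list_sort, "scores"),
    (list_size, "scores"),
    (list_top_n, "scores", 2),
])
```

After the single request, the array is sorted in place and the size and top-two values come back together, with no intermediate round trips.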

Synchronizing Data with Other Systems

The database must offer control over efficiently replicating all or a subset of the data to other clusters through cross-data-center replication (XDR). Edge-core synchronization is often necessary for collecting real-time data as well as delivering real-time user experience at the edge. As described below, various connectors facilitate convenient and fast synchronization with other systems.

Easy Integration with Real-Time Data Streams

The database must provide streaming connectors to integrate with the standard streaming platforms like Kafka, Pulsar and JMS and allow CDC streams to be delivered to any HTTP endpoint.

Fast Access from Data Processing and Analytics Platforms

The database must provide fast access from platforms such as Spark and Presto (Trino) to enable analytics, AI/ML, and other processing on the respective platforms.

Organizing for Scale and Speed

A critical part of building large-scale JSON applications is to ensure the JSON objects are organized efficiently in the database for optimal storage and access.

Documents may be organized in the database in one or more dedicated sets (tables), over one or more namespaces (databases) to reflect ingest, access and removal patterns. Multiple documents may be grouped and stored in one record either in separate bins (columns) or as sub-documents in a container group document.

Record keys are constructed as a combination of the collection-id and the group-id to provide fast logical access as well as group-oriented enumeration of documents. For example, the ticker data for a stock can be organized in multiple records with keys consisting of the stock symbol (collection-id) + date (group-id). Multiple documents can be accessed using either a scan with a filter expression (predicate), a query on a secondary index, or both.
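The ticker example can be made concrete with a small sketch. The key format below (collection-id and group-id joined with a separator) is an illustrative assumption, not a prescribed scheme:

```python
def record_key(collection_id: str, group_id: str) -> str:
    """Compose a record key from collection-id and group-id
    (hypothetical 'collection:group' format)."""
    return f"{collection_id}:{group_id}"

# Ticker data for one stock, one record per trading day:
key = record_key("AAPL", "2022-09-09")

# Group-oriented enumeration: known group-ids yield direct record keys,
# so a week of ticker records needs no scan.
week = [record_key("AAPL", d)
        for d in ("2022-09-05", "2022-09-06", "2022-09-07")]
```

Because each key is computable from the stock symbol and date, the application gets direct record access without a query.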

A filter expression tests the values and properties of elements in the JSON document: for example, whether an array is larger than a certain size, or whether a value is present in a sub-tree. A secondary index defined on a basic or collection type provides the fast value-based queries described below.

Example: Real-Time Events Data

Real-time event streams can be ingested and stored in the database as JSON documents. To allow access by event-id and timestamp, they can be organized as follows:
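One plausible layout stores one event per record, with the event-id incorporated into the record key; all field names below are illustrative assumptions:

```python
# One event per record; the record key incorporates the event-id.
event_doc = {
    "event-id": "evt-100234",
    "timestamp": 1662717300,  # integer epoch seconds, indexable for range queries
    "type": "sensor-reading",
    "payload": {"sensor-id": "s-17", "value": 23.4},
}
```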


Event-id-based document access becomes simple record access by incorporating the event-id in the record key. Exact-match or range queries on timestamp are possible by defining an integer index on that field.

For greater scalability, multiple event objects can be grouped in a single document:
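A grouped layout might look like the following, where the record key is the group-id (here assumed to be sensor-id plus day; the field names and id format are illustrative):

```python
# Many events grouped into one record; the record key is the group-id.
grouped_doc = {
    "group-id": "s-17:2022-09-09",  # e.g., sensor-id + day
    "events": [
        {"id": "s-17:2022-09-09:0001", "timestamp": 1662717300, "value": 23.4},
        {"id": "s-17:2022-09-09:0002", "timestamp": 1662717360, "value": 23.9},
    ],
}
```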


The event-id contains the group-id and an event-num that is unique within the group. The group-id, which identifies the record, can be a time-period identifier covering all events in the record, such as the day, week or month of the year, or another logical identifier for all the record's events, such as the sensor-id.

To access an event directly by its event-id, the group-id is extracted from the event-id, the record is accessed by group-id, and a JSONPath query is then issued on the matching id field. Exact-match or range queries on timestamp can be performed by creating an integer index on the respective fields in the record.
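That access path can be sketched as follows. This is a pure-Python stand-in for the database and JSONPath query (the id format "group-id:event-num" and field names are assumptions for illustration):

```python
def get_event(store: dict, event_id: str):
    """Direct access by event-id: extract the group-id, fetch the record
    by key, then select the matching element, as a JSONPath query such as
    $.events[?(@.id == <event_id>)] would."""
    group_id = event_id.rsplit(":", 1)[0]  # assumed "<group-id>:<event-num>" format
    record = store.get(group_id)
    if record is None:
        return None
    return next((e for e in record["events"] if e["id"] == event_id), None)

# A toy in-memory "database" keyed by group-id:
store = {
    "s-17:2022-09-09": {
        "events": [
            {"id": "s-17:2022-09-09:0001", "timestamp": 1662717300},
            {"id": "s-17:2022-09-09:0002", "timestamp": 1662717360},
        ]
    }
}
evt = get_event(store, "s-17:2022-09-09:0002")
```

The point of the scheme is that both steps are cheap: the record fetch is a single key lookup, and the element selection happens server-side against one record.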

CRUD Operations with JSONPath

The database must provide the ability to store JSON documents, retrieve them and perform CRUD operations (create, read, update and delete) on elements identified by JSONPath.

Query JSON Documents

The database must provide fast and efficient queries over documents. Performance-sensitive operations may also demand control over the parallel processing of queries.

In Aerospike Database 6.1+, any JSON element can be indexed to support exact match and range queries. Parallel partition-grained secondary index queries are available to boost throughput in large-scale applications in 6.0 and later releases. Find more details on indexing JSON documents in the blog post “Query JSON Documents Faster” and code examples in the tutorial on CDT Indexing.

Real-time large-scale JSON applications need reliable, fast access to data, high ingest rates, powerful queries, rich document functionality, scalability with no practical limit, always-on operation and integration with streaming and analytical platforms. They need all this at a low cost.

The Aerospike Real-time Data Platform provides all of this functionality, making it a good choice for building such applications. Aerospike’s collection data types (CDTs) provide powerful capabilities to model, organize and query a large JSON document store. Visit the tutorials and code sandbox on the Developer Hub to explore the platform’s capabilities, and play with the Document API and query capabilities for JSON.
