Building Large-Scale Real-Time JSON Applications
“Real-time describes various operations or processes that respond to inputs reliably within a specified time interval.”— Wikipedia
Real-time data must be processed soon after it is generated; otherwise, its value is diminished. Real-time applications must respond within a tight timeframe, or the user experience and business results are impaired. Such applications need reliably fast access to all data, real-time or otherwise.
The number of real-time interactions between people and devices continues to grow. Leveraging real-time data is still a competitive edge in some areas, while in others its use is simply expected. Up-to-the-moment, relevant information is expected to deliver the best possible customer experiences and business decisions.
Much of today's data is generated, transferred, stored, and consumed in the JSON format, including real-time data such as feeds from IoT sensors and social networks, and historical data such as user profiles and product catalogs. JSON data is therefore ubiquitous and growing in use. The best possible real-time decisions, increasingly driven by artificial intelligence/machine learning (AI/ML) algorithms, will depend on continually updated, massive data sets.
This article discusses the database perspective on building large-scale real-time JSON applications and touches upon the following key topics:
- What to look for in a real-time data platform.
- How to organize JSON documents for speed at scale.
- The core JSON functionality required for ease of development.
Database for Large-Scale JSON Applications
The key requirements in a database to build such applications are described below.
Reliably Fast Random Access at Scale
Reliably fast response times for read and write operations, at any scale and under any read-write workload mix, are required to meet the real-time contract. This is delivered through:
- Fast and uniform hash-based data distribution to all nodes for optimal resource utilization
- Cost-effective storage of indexes and data in an optimal mix of DRAM, SSD and other devices
- Optimized processing of writes and garbage collection for predictable response
- One-hop access to all data from the application
- A smart client that handles cluster transitions and data movements transparently
- Primary and secondary indexes for fast access
- Async and background processing modes for greater efficiency
- Multi-op requests to perform many single-record operations atomically in one request
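The effect of fast, uniform hash-based distribution can be pictured with a small sketch. The partition count and hash function below are illustrative (Aerospike, for instance, uses a fixed partition scheme internally); the point is that a uniform hash of the record key spreads data evenly across nodes with a single deterministic lookup, enabling one-hop access.

```python
import hashlib

N_PARTITIONS = 4096  # illustrative fixed partition count

def partition_for(key: str) -> int:
    """Map a record key to a partition via a uniform hash (sketch, not a real client)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS

def node_for(key: str, nodes: list[str]) -> str:
    """Assign the key's partition to a node; real systems consult a partition map."""
    return nodes[partition_for(key) % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
counts = {n: 0 for n in nodes}
for i in range(10_000):
    counts[node_for(f"user:{i}", nodes)] += 1
print(counts)  # roughly uniform spread across the three nodes
```

Because the mapping is deterministic, a smart client can compute the owning node locally and reach any record in one network hop.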
Fast Ingest Rate
The database must support fast ingestion speeds so that surges in real-time data feeds do not overwhelm the system or result in data loss.
The database must support batch operations for read, write, delete and user-defined logic, so that ingest can achieve the necessary high throughput.
The database must handle concurrent queries over large data efficiently and provide various indexes and preferably granular control over parallel processing of queries.
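The batching idea behind high-throughput ingest can be sketched as follows. The helper below is illustrative, not a client API: it groups an incoming event stream into fixed-size batches, so each batch can be handed to the database's batch-write call as one round trip instead of one request per record.

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(events: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Group a stream of events into fixed-size batches for batch writes."""
    it = iter(events)
    while batch := list(islice(it, size)):
        yield batch

# A real ingest loop would submit each batch via the database's batch API;
# here we only count the round trips saved versus one request per record.
events = ({"id": i, "temp": 20 + i % 5} for i in range(1_000))
batches = list(batched(events, 100))
print(len(batches))  # 10 round trips instead of 1,000
```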
Convenient JSONPath-Based Access
A JSONPath-based document API offers a convenient way to access and modify specific elements within a document, as discussed below.
Rich Document Functionality
The data types in which JSON documents are stored in the database must offer rich functionality, as described below.
Efficient Storage and Transfer
The internal format must allow the documents to be stored and transferred efficiently.
The API must support complex processing on the server side to avoid unnecessary data transfer to the client side.
Well Integrated into Other Performance Features
The document data types must be well integrated into various performance features, including server-side processing, batch requests, multi-op requests and secondary indexes.
- Documents can be used in expressions that offer efficient server-side execution.
- Batch requests can be performed on multiple documents.
- Multi-op requests allow many operations on one document to be performed in one request. For instance, in the same request, you can add items to a JSON array, sort it, get its new size and top N items.
- Document elements at any nested level can be indexed for fast and convenient access, as described further below.
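The multi-op example above can be sketched in plain Python. The helper below is illustrative, not an actual client API; in a real database these operations would execute atomically on the server in a single request.

```python
def multi_op(doc: dict, new_scores: list[int], top_n: int) -> dict:
    """Apply several operations to one document as a unit: append items,
    sort the array, then read back its new size and top-N items."""
    scores = doc["scores"]
    scores.extend(new_scores)        # op 1: add items to the JSON array
    scores.sort(reverse=True)        # op 2: sort it
    return {"size": len(scores),     # op 3: get its new size
            "top": scores[:top_n]}   # op 4: get its top N items

doc = {"scores": [70, 95]}
print(multi_op(doc, [88, 60], top_n=3))  # {'size': 4, 'top': [95, 88, 70]}
```

Bundling the four operations into one request avoids three extra round trips and removes any window in which another writer could observe a partially updated array.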
Synchronizing Data with Other Systems
The database must offer control over efficiently replicating all or a subset of the data to other clusters through cross-data-center replication (XDR). Edge-core synchronization is often necessary for collecting real-time data as well as delivering real-time user experience at the edge. As described below, various connectors facilitate convenient and fast synchronization with other systems.
Easy Integration with Real-Time Data Streams
The database must provide streaming connectors to integrate with standard streaming platforms such as Kafka, Pulsar and JMS, and allow change data capture (CDC) streams to be delivered to any HTTP endpoint.
Fast Access from Data Processing and Analytics Platforms
Similarly, connectors to data processing and analytics platforms must enable fast, parallel reads and writes at scale.
Organizing for Scale and Speed
A critical part of building large-scale JSON applications is to ensure the JSON objects are organized efficiently in the database for optimal storage and access.
Documents may be organized in the database in one or more dedicated sets (tables), over one or more namespaces (databases) to reflect ingest, access and removal patterns. Multiple documents may be grouped and stored in one record either in separate bins (columns) or as sub-documents in a container group document.
Record keys are constructed as a combination of the collection-id and the group-id to provide fast logical access as well as group-oriented enumeration of documents. For example, the ticker data for a stock can be organized in multiple records with keys consisting of the stock symbol (collection-id) + date (group-id). Multiple documents can be accessed using either a scan with a filter expression (predicate), a query on a secondary index, or both.
A filter expression tests the values and properties of elements in the JSON document: for example, whether an array exceeds a certain size or whether a value is present in a sub-tree. A secondary index defined on a basic or collection type provides fast value-based queries, described below.
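The stock-ticker key scheme can be sketched as follows. The namespace and set names ("market", "ticks") are made up for illustration; the point is that combining the collection-id (symbol) and group-id (date) in the key gives direct logical access to any day's record and easy enumeration across a group.

```python
from datetime import date

def ticker_key(symbol: str, day: date) -> tuple[str, str, str]:
    """Build a record key from collection-id (symbol) + group-id (date),
    so one stock's ticks for one day live in one record."""
    return ("market", "ticks", f"{symbol}:{day.isoformat()}")

# Group-oriented enumeration: keys for one symbol across a date range
keys = [ticker_key("ACME", date(2023, 5, d)) for d in (1, 2, 3)]
print(keys[0])  # ('market', 'ticks', 'ACME:2023-05-01')
```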
Example: Real-Time Events Data
Real-time event streams can be ingested and stored in the database as JSON documents. To allow access by event-id and timestamp, they can be organized as follows:
Record key: (namespace, set, <event-id>)
Event-id-based document access is simple record access, with the event-id incorporated in the record key. Exact-match and range queries on timestamp are possible by defining an integer index on the timestamp field.
For greater scalability, multiple event objects can be grouped in a single document:
Record key: (namespace, set, <group-id>)
Event id: <group-id, event-num>
The id field combines the group-id with an event-num that is unique within the group. The group-id, which identifies the record, can be a time-period identifier, such as the day, week, or month covering all events in the record, or another logical identifier shared by all events in the record, such as the sensor-id.
To access an event directly by its event-id, the group-id is extracted from the event-id, the record is accessed by the group-id, and a JSONPath query is then issued on the matching id field. Exact-match and range queries on timestamp can be performed by creating an integer index on the respective fields in the record.
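The grouped layout can be sketched in plain Python. The event-id format (`<group-id>:<event-num>`) and the in-memory `store` dict are illustrative stand-ins for real record keys and database records; the sketch shows how the group-id embedded in an event-id leads to the right record, after which the event is matched on its id field.

```python
def split_event_id(event_id: str) -> tuple[str, int]:
    """An event-id of the form '<group-id>:<event-num>' embeds its group-id,
    so the record key can be derived from the event-id alone."""
    group_id, num = event_id.rsplit(":", 1)
    return group_id, int(num)

# One record per group; each record holds all of that group's events
store = {
    "sensor-7:2023-05-01": {
        "events": [
            {"id": "sensor-7:2023-05-01:0", "ts": 1682899200, "temp": 21.5},
            {"id": "sensor-7:2023-05-01:1", "ts": 1682902800, "temp": 22.1},
        ]
    }
}

def get_event(event_id: str) -> dict:
    group_id, _ = split_event_id(event_id)
    record = store[group_id]                  # record access by group-id
    return next(e for e in record["events"]   # then a JSONPath-style match
                if e["id"] == event_id)       # on the id field

print(get_event("sensor-7:2023-05-01:1")["temp"])  # 22.1
```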
CRUD Operations with JSONPath
The database must provide the ability to store JSON documents, retrieve them and perform CRUD operations (create, read, update and delete) on elements identified by JSONPath.
More details can be found in the document API documentation.
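The shape of JSONPath-based CRUD can be illustrated with a minimal evaluator. The functions below handle only a tiny subset of JSONPath (dotted names and numeric indexes) and are not a real document API, which supports the full syntax and executes these operations on the server.

```python
import re

def _tokens(path: str):
    # Parse a minimal JSONPath subset: '$.name', '.child', '[index]'
    return re.findall(r"\.([A-Za-z_]\w*)|\[(\d+)\]", path)

def json_get(doc, path: str):
    """Read the element at a JSONPath-like path."""
    node = doc
    for name, index in _tokens(path):
        node = node[name] if name else node[int(index)]
    return node

def json_put(doc, path: str, value) -> None:
    """Update the element at a JSONPath-like path in place."""
    toks = _tokens(path)
    node = doc
    for name, index in toks[:-1]:
        node = node[name] if name else node[int(index)]
    name, index = toks[-1]
    if name:
        node[name] = value
    else:
        node[int(index)] = value

doc = {"customer": {"name": "Ada", "orders": [{"total": 40}, {"total": 25}]}}
print(json_get(doc, "$.customer.orders[1].total"))  # 25
json_put(doc, "$.customer.orders[1].total", 30)     # update one nested element
print(json_get(doc, "$.customer.orders[1].total"))  # 30
```

In a real document store, the path is sent with the request and the read or update happens server-side, so only the addressed element, not the whole document, crosses the network.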
Query JSON Documents
The database must provide fast and efficient queries over documents. Performance-sensitive operations may also demand control over the parallel processing of queries.
In Aerospike Database 6.1+, any JSON element can be indexed to support exact match and range queries. Parallel partition-grained secondary index queries are available to boost throughput in large-scale applications in 6.0 and later releases. Find more details on indexing JSON documents in the blog post “Query JSON Documents Faster” and code examples in the tutorial on CDT Indexing.
Real-time large-scale JSON applications need reliable, fast access to data, high ingest rates, powerful queries, rich document functionality, scalability with no practical limit, always-on operation and integration with streaming and analytical platforms. They need all this at a low cost.
The Aerospike Real-time Data Platform provides all of this functionality, making it a good choice for building such applications. Aerospike's collection data types (CDTs) provide powerful capabilities to model, organize and query a large JSON document store. Visit the tutorials and code sandbox on the Developer Hub to explore the platform's capabilities, and play with the Document API and query capabilities for JSON.