Why Fast JSON Analytics Is Foundational for Modern Web Apps
Having worked on Azure Cosmos DB and being a founding member of Azure Synapse Analytics, I have had the chance to work deeply across transactional and analytics workloads. Customers increasingly need analytics capabilities such as aggregates and joins over collections — and the only solution is often for them to ETL (extract, transform, load) the data in a different data format and system.
Below I will share some scenarios that make analytics over JSON necessary and explain how a data platform can best support analytics in modern applications with JSON.
The History of JSON with Analytics
It wasn’t until the rise of big data and the need for more flexible data structures that JSON started gaining traction in the analytics space. Most of these workloads were optimized for time series (especially with IoT, clickstream and telemetry) and search, which are niche compared to the whole analytics market. Running joins between two collections (or indexes) was very tedious, and data modeling for your application would be key to avoid any join operations.
There has undoubtedly been a rise in JSON to support various use cases and needs across several verticals like Internet of Things (IoT), finance and marketing. In fact, it has become table stakes for many of the industry’s leading databases, including SingleStoreDB, to offer built-in support for JSON data and enable users to perform analytics on both structured and semi-structured data.
Now, some databases have provided a superior experience and performance over JSON than others, and I am placing a large bet that support for JSON for transactional and analytical workloads is the future — especially when it comes to modern web apps.
Requirements for Modern Web Applications
There are a few pillars that are important, in my opinion, for modern applications:
Real-Time Decision Making
I don’t think all experiences need to be in real time within an application, but there are always places that require real time. For example:
- Retail app: Making a purchase recommendation right after you’ve picked a product, or for inventory management
- Fintech app: The ability to execute a transaction as soon as a condition is met
- AdTech app: The ability to advertise a product at the right time and right place
- IoT app: Event-based triggers for alerts, notifications and resolution
Adaptive and Intelligent
To provide a rich and intelligent experience, the application needs to incorporate a wide range of data not residing within it. This is why databases will need to integrate with an ecosystem of data sources and data formats with fast and reliable ingest performance. Integrating machine learning (ML) models requires broad extensibility and the ability to scale those models, usually in real time, over a large amount of data.
Our tutorials and webinars around image recognition by ingesting vectorized pictures and matching them is such a use case. Early on, we made a strong bet on WASM support, and WASM is expanding more than just for web applications use cases. Additionally, the ability for the database to vectorize JSON data will be powerful, as we’ve entered the GPT era.
Analytics and Insights
There are various kinds of analytics that you want to surface within your application. Oftentimes, 360 experiences within your product requires your application to run aggregate or joins over collections/tables. Internally, the team working on the application might want to drive analytics on raw data for surfacing issues, track engagement or conversion.
Since JSON is a dominant format within web applications, being able to support those scenarios with limited friction and cost requires your application to rely on a database that has strong transactional and analytical support over JSON.
In the past, we would say that running online analytical processing (OLAP) workloads or ad hoc queries within the online transactional processing (OLTP databases) was an anti-pattern. With SingleStoreDB — and thanks to the separation of storage and compute — users can simply spin up a new compute pool as read only and attach that database to it.
With built-in workload management and resource governor capabilities, and our very fast analytical query execution, it’s possible to do analytics and transactions concurrently on the same system.
Features to Support JSON Analytics
While JSON will never outperform parquet and columnstore formats for analytics, the need to execute analytics and artificial intelligence (AI) workloads within the same database has become important because it’s the only path for providing a real-time experience. Introducing data movements and transformations will introduce latency and complexity in providing an interactive experience within your application.
The following diagram is a short overview of SingleStoreDB capabilities and layers between the data API, the compute layer (including Pipelines for fast ingest from a variety of sources and data formats) and the storage (database).
Data API Interoperability over the Same JSON Data
A database should be versatile with regard to its developer interface around JSON. Today, even for multi-API databases, you need to pick one data API. A modern database should provide a rich and interoperable data API experience to meet developers and data engineers where they are comfortable.
Teams working on the same application will have different ways to interact with the data. One team might prefer MongoDB and the other team might prefer SQL, but they should not have to deal with siloed database systems.
Interoperability means that if I create a MongoDB table through a Mongo API, I should be able to read that table with SQL and vice-versa. When you think about a universal store, you should also think about it at the data API layer.
Isolation of Compute Workloads
If your application has a mix of operational and analytical workloads, you would try to isolate some of the spiky and resource-intensive analytical queries, especially around JSON. Your application should be able to leverage multiple isolated compute pools (we call them workspaces at SingleStore) to make sure that spiky analytics workloads do not affect the rest of your application.
SingleStoreDB lets you create a read/write workspace and many other isolated workspaces (read only) over the same data to run analytics. This approach differs from lakehouses and data warehouses because data lands first into memory, and is then saved in object store later. This helps provide low-latency performance across transactional and analytical workloads.
Core Database Engine
That’s where it gets even more interesting. Our database engine gives fast JSON analytics performance in the following areas:
- We derive the schema from the JSON document and store that schema in a columnstore format for fast read.
- At the segment level: If you filter your data with a
WHEREclause, we only scan the right segments. Internally we implemented that using row-level encoding.
- At the document level: If you only look for one field, whether it’s in an array or a nested field, we only scan for that field. Internally we use seekable JSON, which is on by default in our 8.0 release.
- Tune performance by combining JSON and relational in one table: Sometimes you need faster performance than you can get by just using the JSON document. With SingleStore, you can add a column next to your JSON column that is automatically extracted from the JSON, calculated and persisted as new data is coming in. This lets you shift some processing into “normal” SQL if needed for some scenarios, such as improved performance or ease of querying. We call it a persistent computed column and, you just need to use the following syntax within a table called
exampleCollectionthat has a JSON column called
jsoncolumnthat includes a field called
alter table exampleCollection
add column Code as jsoncolumn::Code persisted text;
addcolumn Code as jsoncolumn::Code persisted text;
Finally, besides choosing between rowstore and universal store (hybrid rowstore + columnstore), you can take advantage of all the available types of indexes to enhance your application performance further:
- Hash index
- Full-text index
- Geospatial index
- Skiplist index (Rowstore table only)
Again I love Parquet, Delta Lake or Iceberg for running large-scale analytics for many business intelligence and artificial intelligence scenarios. However for web applications, JSON is the most important format, and being able to have good analytics and machine learning performance where the data is born is key.
The trend of bringing analytics and operational data together is already happening. Many players in the NoSQL database space are trying to provide analytical capabilities. The reverse is also happening among OLAP players. The issue is that they move the data from one place to the other, which introduces latency.
When you look at our workload mix by type of operations, you can see that Singlestore already provides these capabilities. Note that select can be both a point-read query or an analytical query.