While Apache Cassandra is great at ingesting data, it has a strict query pattern that limits developers who are used to relational databases. This can extend development cycles and time to market.
Here, we’ll walk through storage-attached indexing, a way to simplify and speed up the data-modeling process on Cassandra.
Indexing in Distributed Systems
Indexing is the process of associating a search key with the location of a corresponding data record. If you’ve ever worked with a database, you’ve used a primary index where you set up a primary key for a data table. Indexing is done automatically with primary keys, and data lookup is fast and efficient.
But when we go beyond the primary index into nonprimary key queries, things start to change. A relational database will do the lookup, but Cassandra returns an error because the data layout in distributed systems is very different. Not only that, but Cassandra specifically disallows nonprimary key queries at scale.
Because a main index organizes data based on a primary key, the table may be organized as a B+ tree. In a B+ tree, leaf nodes store rows, rather than a grid with rows and columns. For both relational and Cassandra databases, truly efficient querying is supported on primary keys.
Secondary indexes help with querying nonprimary-key columns, but a secondary index is generally less efficient than a main index. Again, this is true for both relational and Cassandra databases. Since developers still needed to run nonprimary key queries, several secondary indexing practices emerged.
Limitations of Secondary Indexing
To understand secondary index limitations, let’s take a closer look at how they work in comparison to Cassandra tables and materialized views.
Tables and materialized views are examples of distributed indexing. Based on a partition key, table or view data structure is distributed across all nodes in a cluster. When retrieving data using a partition key, Cassandra knows exactly which replica nodes may contain the result. For example, in a 100-node cluster with a replication factor of five, a query needs at most five replica nodes and one coordinator node.
In contrast, secondary indexes are examples of local indexing. A secondary index is represented by many independent data structures, indexing data stored on each node. When retrieving data using only an indexed column, Cassandra has no way to determine which nodes may have necessary data. Therefore, it has to query all nodes in a cluster. For example, given a 100-node cluster with any replication factor, all 100 nodes have to search their local index data structures. This does not scale well and can get costly.
Therefore, for real-time transactional queries, you should only use a secondary index when a partition key is also known. This makes your query retrieve rows from a known partition based on an indexed column. In this case, Cassandra takes advantage of both distributed and local indexing.
For expensive analytical queries that retrieve a large subset of table rows based on a low-cardinality column, secondary indexes are also beneficial for distributing processing across all nodes in a cluster. These types of queries generally run via Spark Cassandra Connector, where retrieved data is further processed using Apache Spark. Note, however, that Apache Solr-based search indexes perform substantially better than secondary indexes in this use case.
Secondary Indexing in Cassandra
Secondary indexes have been around since the beginning of Cassandra to search column data and run arbitrary WHERE clauses. Although it has a very user-friendly syntax, it would reindex all the columns and tables whenever there was a write to the database on Cassandra. This made it hard for enterprises to manage their databases.
After a while, Apple introduced Storage-Attached Secondary Index (SASI) for tokenized text searching with fast range scans and in-memory indexing. But SASI blew up disk usage with inefficient indexing, and it was difficult for general users to get the hang of it since its syntax was for a specific use case.
Along came DataStax Enterprise (DSE) Search integrating the Solr-Lucene combination, which is a powerful search engine and indexing setup for relational or distributed databases. Historically, Lucene is the gold standard for indexing text data. The two-for-one combination allows DSE Search to provide full text searching, efficient indexing and in-memory indexing. (Here’s a full breakdown of different secondary indexes available for Cassandra.)
The Solution: Storage-Attached Indexes (SAI)
DataStax’s Storage-Attached Indexing (SAI) addresses the shortcomings of secondary indexing practices by providing an efficient secondary index for Astra DB and Cassandra. SAI has the user-friendly syntax used in secondary indexing and in-memory indexing like SASI.
You can use SAI to add column-level indexes to any column and almost any Cassandra data type including text, numeric and collection types. This functionality enables you to filter queries using Cassandra Query Language (CQL) equality, range (numeric only), and CONTAINs semantics. SAI provides more functionality compared to Cassandra secondary indexes, is faster at writes compared to any Cassandra or DSE Search index and uses significantly less disk space. Figure 1 illustrates the density expectations on the disk of SAI, SASI and Legacy 2i.
You can now run powerful queries at scale beyond primary keys on Cassandra just like you would on relational databases. Figure 2 shows examples of storage-attached indexes on Cassandra that you later on create in hands-on exercises.
How Does This Change Your Data Model?
Storage-Attached Indexing simplifies and speeds up the data modeling process on Cassandra. You can now index columns before denormalization for your new data models — just start with a single table and build indexes from there.
Cassandra performs best when the data model is denormalized. But for those of us who come from the relational world, we are used to running flexible queries on a relational database. With Storage-Attached Indexing, you add more flexibility to your Cassandra data model, taking the mental load off denormalization and decreasing the amounts of tables that you have to deal with.
However, if you have tight service-level agreement requirements that call for your response times to be within single-digit milliseconds, you’ll still want to denormalize your database. Optimizing for denormalization is optimizing for performance, like optimizing for a Porsche or Ferrari. We give a more comprehensive breakdown of the denormalization and data modeling process here, or read through our documentation and FAQs on SAI.
DataStax is also actively incorporating SAI with Cassandra. Reach out to us at [email protected] to get updated news on this integration.
Hands-on Exercise: Getting Started with SAI
Let’s go ahead and set up Astra DB to create a fully-realized Cassandra database and query outside of primary keys through SAI. Follow the step-by-step instructions in this video and GitHub to create and configure your Astra database.
Next, simulate a use case of a client data model that a dentist might use on Cassandra. Here’s a breakdown of the steps:
- Navigate to the CQL Console on your Astra DB and log in to the database.
- Describe all the keyspaces in your database.
- Create a “clients” table and insert some data.
- Verify that data exists.
- Create indexes on three fields — first name, last name and birthday.
- Execute queries using our indexes.
By this stage, you would have queried against a combination of string and date fields using exact matches, multiple string cases and date ranges. By adding an index on three fields, we significantly expanded the flexibility of our data model. Thinking ahead, let’s add another index to support a new data model requirement for finding clients based on their next appointment.
Hands-on Exercise: IoT Sensor Data Model
Afterward, you’re ready for an actual IoT use case. In this exercise, you’ll help an organization supercharge their Cassandra data model and make it simple and easy to query beyond primary keys. Watch this video for instructions and visit GitHub for the codes. Here are the steps:
- Navigate to the Studio console on your Astra DB and import a preconditioned Studio Notebook.
- Go to “Section 3. IoT sensor data model use case” to continue.
We hope that you’ve gained a clearer picture of Storage-Attached Indexing and how it superpowers the efficiency of Cassandra data models.