What Is a Column Database and When Should You Use One?
There are a lot of different options when it comes to choosing a database for your application. A common discussion seems to be the high-level SQL vs. NoSQL database argument of whether data should be stored in a relational database or in a NoSQL alternative like key-value, document or graph databases.
Another option is to flip things entirely by going with a column database. In this article, you will learn why it might make sense to make that choice and some of the strengths and weaknesses of column databases.
What Is a Column Database?
As the name suggests, a column (or columnar) database stores data organized into columns instead of rows on disk. Using an IoT sensor as an example, a row-based database would store the data on disk like this:
A column database would organize the same data so that each column value is stored next to each other sequentially on a disk:
The main advantage of a columnar database is that it can significantly reduce the amount of disk space required to store the data due to improved compression ratios. In addition, columnar databases are much faster at processing analytic-type queries than traditional row-based databases.
Why Are Column Databases Good for Analytics Workloads?
So how exactly does changing the storage format of data result in improved performance? There are several factors at play that result in column databases being able to provide orders of magnitude better performance for online analytical processing (OLAP) workloads, compared to traditional databases.
The first reason is improved compression. This is due to column databases being able to use optimal compression algorithms for each data type because each column is the same type of data rather than a row of mixed data types. This not only reduces storage cost on disks but improves performance because fewer disk seeks are needed, and more data can fit into RAM.
Another way column databases improve performance is by actually storing multiple different versions of the same column under the hood, which are sorted into different orders to allow for faster filtering and selections for certain queries.
There are a number of other ways that column databases are able to squeeze out improved performance. Here are a few examples:
- Adaptive indexing as data is being queried
- Vectorized processing
- Optimized and more efficient joining of columns
- Late materialization of columns
The results of these optimizations can be seen in this visualization for testing done as part of a paper that analyzes column database performance. Based on the TPC-H data-warehousing benchmark dataset, the column database performed about 10 times faster compared to a conventional row-structured database.
The chart shows which optimizations have the biggest impact on performance for the column database, with late materialization giving the biggest boost in this case.
Column Database Performance Trade-offs
As with almost everything in computer science, trade-offs are also made when it comes to column database performance. They are optimized for analytics workloads and, by design, aren’t going to be a good fit for traditional online transaction processing (OLTP) workloads that relational databases were designed for.
The biggest performance sacrifices will be in cases where you attempt to update specific data points or write individual data points. With column databases, you want to insert data in batches when possible.
Column databases will also take a hit on read queries, where you grab all the data that would be in a row as if you were using a relational database. Because each column has to be reconstructed to create the entire row, performance takes a hit.
Column Database Use Cases
Column databases are a good fit for any situation where you want to analyze large amounts of data. Let’s take a look at some common use cases.
Column databases are perfect for analyzing sales data because they allow you to divide the information in a variety of ways. This can help you to identify trends and patterns that you might not have otherwise seen. For example, you can use column databases to compare sales figures for different products over time by location, brand or product category.
Application Performance Monitoring
Application performance monitoring is another common situation to use column databases to help improve software reliability and performance. By tracking and analyzing data about how an application is performing, you can identify issues before they cause the application to crash or slow down. This can help you to avoid downtime and to ensure that your users are getting the best possible experience. By using a column database, you can store more granular data for deeper insights while reducing costs due to superior data compression.
More and more internet-connected devices are being deployed, and many of them are collecting data that is used for analytics workloads. Column databases can be used to store this data for real-time alerting and also for generating forecasts to improve efficiency in many different industries.
Specialized Column Database Examples
So far we’ve looked at column databases and their benefits at a high level. While all column databases broadly share the same traits and can be used effectively as general-purpose data warehouses or data lakes, in this section we will look at how they can be tuned and optimized for even more specific performance characteristics.
InfluxDB IOx is an open source columnar storage engine for InfluxDB that is optimized for working with time series data. Time series data presents a unique challenge in terms of performance because users want to query the data in ways that strain a database from opposite sides of the performance spectrum:
- Analytics-type queries that only require a few columns of data, like getting the average temperature from a sensor over the past week.
- Relational-type queries where the user wants all available information from many different sensors for the past 5 minutes.
Time series data also tend to arrive in massive volumes, which require fast ingestion capabilities so data can be indexed and queried quickly for real-time monitoring and alerting. In addition, many users want to be able to store this data long term for historical analysis and forecasting without breaking the bank.
InfluxDB allows users to get the best of both worlds by managing the life cycle of data and moving it between hot and cold storage to get fast performance on historical data while also keeping storage costs down by using cheaper object storage for data that isn’t being actively queried.
A key part of accomplishing this was done by building on and contributing to projects like Apache Arrow, DataFusion and Parquet. Arrow allows data to remain compressed in a columnar format while in memory and being moved between different parts of the database. Parquet is used for efficient persistent storage, and DataFusion provides high performance querying and SQL support. A number of other major projects and vendors are building on Parquet and Arrow as well, which also enables integration and compatibility with the broader big data ecosystem.
Apache Druid is a real-time database with an underlying column data structure. Druid works for typical data warehouse-type workloads like you would expect from a columnar database, but it also places a priority on low latency response times and is often used for things like interactive user interfaces.
DuckDB is an in-process database designed for OLAP workloads, essentially designed to be SQLite for analytics. DuckDB uses columnar vectorized processing to run SQL queries with great performance and can be embedded into applications easily. The primary selling point of DuckDB is that it is easy to set up and get run locally in a world where most OLAP databases are either cloud hosted or require an extensive installation process.
Choosing the Right Tool for the Job
At the end of the day, technology decisions come down to what makes sense for your use case. Choosing a database is probably one of the most important choices when it comes to building your application for the long term, so it makes sense to know all the options available and pick the one that fits best.
If you are working with large amounts of data that will primarily be used for analytics, a column database might be a good option.