Time series data is special, not just in the unique data that it captures but also in the ways we interact with that data. Maybe you’re starting to use time series data from the sensors in your company’s thermostats (to finally prove that Dad is turning down the temperature at night), or maybe you’re analyzing historical data to make predictions about market prices. You’re crushing it.
But with new types of data come new responsibilities. Time series data is evanescent and voluminous, which is to say, it comes and goes quickly and in great numbers. That calls for different storage and retrieval considerations than other types of data do. If you want to retrieve a user from a table in a relational database, you can query by any number of attributes in your schema: ID, last name, first name, favorite member of Earth, Wind, & Fire. If you want to know exactly when your drone (alias: Skynosaur) sent its coordinates home, you can do that, too. But not without some trade-offs.
When to Use a Time Series Database
Lots of companies and individuals store their time series data in other types of databases (relational, NoSQL) successfully. If you’re one of those, you’re happy, and you have no current issues, far be it from me to demand you change. You do you.
However, there are definite benefits to using a database designed for your time series data.
Scalability is one of those magical words that we hear often and that is only sometimes used correctly. The general problem with time series and scale outside of a time series database is this: if Skynosaur reports its position every five seconds and flies for 1,500 hours (the minimum flight time the FAA requires for an airline transport pilot certificate), we’ve already reached over a million data points for one device. The makers of Skynosaur (Skynosaurus Rex, Inc.) could have thousands of devices sending data home. Querying by timestamp would involve millions of rows of data in a relational database.
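As a rough check on that number, here is a minimal sketch of the arithmetic. The five-second reporting interval is Skynosaur's, as described in this article; the fleet size of 1,000 is purely an assumption for illustration.

```python
# Back-of-the-envelope count of data points for one drone and for a fleet.
FLIGHT_HOURS = 1_500          # flight time from the example above
REPORT_INTERVAL_SECONDS = 5   # Skynosaur's reporting interval
FLEET_SIZE = 1_000            # assumed fleet size, for illustration only

points_per_device = FLIGHT_HOURS * 3600 // REPORT_INTERVAL_SECONDS
points_per_fleet = points_per_device * FLEET_SIZE

print(f"{points_per_device:,} points per device")  # 1,080,000 points per device
print(f"{points_per_fleet:,} points per fleet")    # 1,080,000,000 points per fleet
```

One device clears a million points; a modest fleet lands in the billions.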
People often claim that SQL databases don’t scale well while NoSQL databases do, but it was easier for me to understand in terms of ACID versus BASE. To unfairly summarize, ACID-compliant databases are concerned with guaranteeing validity: transactions should be atomic, consistent, isolated and durable. The BASE model (basically available, soft state, eventually consistent) allows us to give up some of the ACID principles for the sake of speed, or scale, or whatever we want to prioritize. To decide which system works, we need to establish the main purpose of our database.
If we don’t care about durable data, we can acknowledge writes without flushing them to disk (meaning the most recent data might not survive a crash or reboot). If we don’t care about strict isolation, we can shorten the duration that data sets are locked. Time series databases balance the ACID/BASE relationship by offering principles that suit time series data.
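The durability trade-off can be sketched in a few lines. This is a toy illustration in plain Python, not how any particular database implements its write path; `write_points` and the sample readings are hypothetical names and data invented for the example.

```python
import os
import tempfile

def write_points(path, points, durable):
    """Append points to a file; force each one to disk only if durable=True."""
    syncs = 0
    with open(path, "a", encoding="utf-8") as f:
        for timestamp, value in points:
            f.write(f"{timestamp},{value}\n")
            if durable:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to disk
                syncs += 1
    return syncs

# Made-up readings: (seconds since takeoff, altitude in meters).
points = [(t, 100.0 + t) for t in range(0, 50, 5)]

with tempfile.TemporaryDirectory() as tmp:
    fast_syncs = write_points(os.path.join(tmp, "fast.csv"), points, durable=False)
    safe_syncs = write_points(os.path.join(tmp, "safe.csv"), points, durable=True)

print(fast_syncs, safe_syncs)  # 0 10
```

The non-durable path skips ten forced disk syncs for ten points. That is where the extra write throughput comes from, and it is also why points still sitting in a buffer can vanish if the machine loses power.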
For example, time series data is more valuable as a whole than as individual points, so the database knows it can sacrifice durability for the sake of a higher number of writes. Skynosaur sends data home every five seconds, so if we lost some data points in 1,500 hours of flight time, our overall trends would still be intact.
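To see why a few lost points rarely matter, here is a small sketch under stated assumptions: the altitude signal is synthetic (a made-up sine wave), and the loss pattern (every 50th reading) is chosen only to keep the example deterministic.

```python
import math

# Synthetic altitude readings, one every 5 seconds for an hour (made-up signal).
readings = [100.0 + 10.0 * math.sin(t / 600.0) for t in range(0, 3600, 5)]

# Simulate loss by dropping every 50th reading (about 2% of the data).
survivors = [r for i, r in enumerate(readings) if i % 50 != 0]

full_mean = sum(readings) / len(readings)
lossy_mean = sum(survivors) / len(survivors)

print(len(readings), len(survivors))  # 720 705
print(abs(full_mean - lossy_mean))    # a small fraction of a meter
```

The average barely moves even though 15 readings are gone, which is the sense in which the whole is worth more than any individual point.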
Scalability, in this case, means that a time series database specializes in a higher number of writes with eventual consistency, even across distributed storage, and that specialty means less worry for the people who care about that data.
If all of our data lived in a secure, durable black box, we could breathe easy. But how we access the data can be just as important as its storage. Every database has its own query language, designed to access the contents as efficiently as possible. Keep that in mind because, as we mentioned earlier, time series data is special. It’s a double rainbow with a timestamp.
Think of the army of Skynosaurs sending data to Skynosaurus Rex headquarters again. There are millions of data points to search, but now we have a query language that is built for the task at hand: not to view data as it relates to other pieces of the schema, but to view data in the context of time in order to aggregate, set windows or see trends. This isn’t about whether other databases are capable of doing such a thing; it’s about how we choose to spend our resources.
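For a feel of what "set windows" and "aggregate" mean, here is a minimal sketch in plain Python. It is not any database's query language; `mean_by_window` and the sample readings are hypothetical, and a time series database would do this work for you, closer to the storage.

```python
from collections import defaultdict

def mean_by_window(points, window_seconds):
    """Group (timestamp, value) points into fixed windows and average each."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Made-up altitude readings: (seconds since takeoff, meters).
points = [(0, 100.0), (5, 102.0), (10, 104.0), (15, 110.0), (20, 112.0), (25, 114.0)]

averages = mean_by_window(points, window_seconds=15)
print(averages)  # {0: 102.0, 15: 112.0}
```

Six raw points collapse into two averages, one per 15-second window. Scale that up to millions of points and the appeal of a query language that treats time windows as a first-class idea becomes clearer.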
Database architecture is about trade-offs and priorities. Do you need speed or accuracy or volume or predefined schemas? The proof is in the benchmarks. Measure everything. Don’t choose a tool or a product—choose a solution to your problem. Specialty tools are made for special problems, so time series databases are optimized for time series problems.
InfluxData is a sponsor of The New Stack.
Feature art by Katy Farmer.