Making It Easier to Build Apps with Time Series Data
Time series data isn’t a new challenge for software developers, but the exponential increase in time series data in recent years (with no signs of slowing down) certainly makes it more complicated.
If information is power, then every bit and byte of data is like a tiny flake of digital gold. Many companies have had access to time series data, but never really understood how to leverage it effectively. That’s all changing.
In simple terms, time series data refers to any data point with a timestamp: weather information, stock prices, and even something like ocean tide schedules. To constitute a series, the data points must come from the same source and track sequentially over time. We can easily visualize this data as a line chart or graph, though it could appear in a table or in other forms.
Software developers have been working with time series data in their applications for decades. But working with this kind of data effectively requires a variety of things, including a database set up to handle it. For example, a database that continually overwrites the previous data point when a new one comes in is problematic for developers trying to use time series data.
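To make the overwrite problem concrete, here is a minimal sketch comparing a store that keeps only the latest value with one that appends every point. The `water_heater` source name and the readings are hypothetical, made up for illustration.

```python
from datetime import datetime, timezone

# Hypothetical readings from one sensor: (timestamp, value) pairs.
readings = [
    (datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc), 48.5),
    (datetime(2023, 1, 1, 12, 1, tzinfo=timezone.utc), 49.0),
    (datetime(2023, 1, 1, 12, 2, tzinfo=timezone.utc), 51.2),
]

latest_only = {}  # overwriting store: one slot per source, history is lost
series = {}       # append-only store: every point is kept, in time order

for ts, value in readings:
    latest_only["water_heater"] = (ts, value)
    series.setdefault("water_heater", []).append((ts, value))

print(len(series["water_heater"]))     # all three points retained
print(latest_only["water_heater"][1])  # only the most recent value survives
```

Only the append-only structure still holds a series that can be charted or compared over time.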
The big difference today is that there’s been a surge in time series data and it just keeps growing, driven by widespread instrumentation and mainstream adoption of the Internet of Things (IoT) and edge computing, with millions of devices and sensors coming online.
While an organization may certainly find value in historical data, it’s usually the most recent data points that are most critical to daily operations. That’s true in a wide range of scenarios: application monitoring, a proximity sensor in an autonomous vehicle, a gas sensor in an industrial setting, or a connected home appliance.
“If I’ve got thermostat information coming from my water heater, knowing the average temperature from a year ago is mildly interesting,” Barbara Nelson, vice president of engineering at InfluxData, which makes the time series platform InfluxDB, told The New Stack. “Knowing what the temperature of my water heater is right now is potentially very relevant, especially if it’s overheating and I need to take some action. Knowing how much the temperature has changed over the past five minutes is also potentially very relevant.”
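The two questions Nelson describes (what is the value right now, and how much has it changed over the past five minutes) can be sketched in a few lines. This is an illustrative sketch only: the function name `latest_and_change` and the sample temperatures are assumptions, not part of any InfluxData product.

```python
from datetime import datetime, timedelta, timezone

def latest_and_change(points, window=timedelta(minutes=5)):
    """Return (latest value, change across the window ending at the latest point).

    `points` is a non-empty list of (timestamp, value) tuples sorted by timestamp.
    """
    latest_ts, latest_value = points[-1]
    cutoff = latest_ts - window
    # Keep only the points that fall inside the window.
    in_window = [(ts, v) for ts, v in points if ts >= cutoff]
    oldest_value = in_window[0][1]
    return latest_value, latest_value - oldest_value

# Hypothetical water heater temperatures, one reading per minute.
start = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
temps = [60.0, 60.5, 61.0, 63.0, 66.0, 70.0, 75.0]
points = [(start + timedelta(minutes=i), t) for i, t in enumerate(temps)]

current, change = latest_and_change(points)
print(current, change)
```

Both answers depend on the full recent history being available, not just the last value written.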
The value and relevance of time series data correspond to when a sensor records a particular data point. So working with these massive amounts of data efficiently, quickly and in real time means that companies and developers need the right tools.
Devs must often build and integrate applications that consume time series data using a diverse mix of environments/clouds, languages, frameworks and tooling. That can be a lot harder than it first appears.
Working with Time Series Data: 4 Key Issues
There are several key challenges developers and other IT pros encounter when working with time series data, according to Kesara Kudalugodaarachchi, R&D software development senior director at PTC ThingWorx, an industrial IoT platform. (The company is a technology partner of InfluxData.)
Among the biggest challenges:
- Time series data is big. The amount of data that accumulates over a given time period can be huge, especially when a long-term data retention policy exists. “As the amount of data grows, working with that data becomes increasingly difficult due to the increase in query response times,” Kudalugodaarachchi said.
- Data storage gets expensive. As the volume of time series data grows, storing it becomes costly and purging old data becomes an operational headache.
- Analyzing time series “slices” is too time-consuming. A lot of the value in time series data is tied to the ability to process and analyze it quickly. That’s tricky when working with “time slices” of a large data volume, such as calculating average hourly temperature. “Performing this kind of aggregation at the application level requires fetching the data into the application layer over the network, which is an expensive operation,” Kudalugodaarachchi said.
- Data persistence is hard to scale. Finally, he noted that time series data persistence requires high throughput in IoT environments, one of the major categories of time series use cases, where hundreds of thousands of connected devices may insert hundreds of thousands of database records per second. That is hard to scale, especially with relational database management systems.
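The "time slice" issue in the list above can be illustrated with a minimal sketch of an hourly average computed at the application level. The function name `hourly_average` and the sample readings are hypothetical. The point to notice is that every raw data point has to be in application memory before the averaging can start.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def hourly_average(points):
    """Average (timestamp, value) points per hour, entirely in application code."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Truncate each timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

# Hypothetical data: one reading every 15 minutes for two hours.
start = datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)
points = [(start + timedelta(minutes=15 * i), 20.0 + i) for i in range(8)]

averages = hourly_average(points)
for hour, avg in averages.items():
    print(hour.isoformat(), avg)
```

With eight points this is trivial. With months of readings from thousands of devices, fetching the raw data over the network becomes the dominant cost.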
All of these factors, Kudalugodaarachchi said, can significantly hamper developer experience, even when dev teams come up with reasonable homegrown or point solutions on their own. And “expensive,” “difficult” and “time-consuming” aren’t exactly sought-after traits in the business world.
Balancing Business and Developer Needs
This creates a layered problem: Organizations need to derive business value from time series data, but they also need to ensure a high-quality developer experience — or risk recruiting and retention problems.
If your devs have to build a database from scratch, or retrofit another solution, to support time series data in their applications, or if they otherwise have to tackle the challenges listed previously on their own, your developer experience is almost certainly below par.
The “good” news here is that some of these developer experience issues aren’t necessarily unique to time series data.
“Tool integration and usage is always a challenge and time series data applications are no exception,” said Yugal Joshi, partner at Everest Group, a tech analyst firm, where he leads digital, cloud and application services research. “However, this challenge isn’t worse in this scenario.”
But the challenge is still real: For example, integrated development environments (IDEs) or other platforms may not always support the libraries required to build software for time series data use cases, according to Joshi.
Nelson, of InfluxData, noted that developer experience with time series data — and therefore the business value of applications that rely on time series data — is often hampered today by time-consuming manual effort around data ingestion, clunky workarounds and integrations, data formatting issues, and other problems. Native data formats from an edge device, for example, don’t necessarily match how you would expect that data to appear in a database.
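As a rough illustration of that format mismatch, the sketch below flattens one device payload into rows that a database could store. The payload shape, its field names (`dev`, `t`, `readings`) and the `normalize` function are all hypothetical. Real devices and real ingestion tools will differ.

```python
from datetime import datetime, timezone

def normalize(payload):
    """Flatten one hypothetical device payload into (timestamp, source, field, value) rows."""
    # The device reports epoch milliseconds; convert to a timezone-aware UTC datetime.
    ts = datetime.fromtimestamp(payload["t"] / 1000, tz=timezone.utc)
    rows = []
    for field, raw in payload["readings"].items():
        # The device sends numbers as strings; store them as floats.
        rows.append((ts, payload["dev"], field, float(raw)))
    return rows

payload = {
    "dev": "heater-7",
    "t": 1672574400000,
    "readings": {"temp_c": "61.5", "pressure_kpa": "101"},
}
rows = normalize(payload)
for row in rows:
    print(row)
```

Writing and maintaining this kind of glue code for every device type is the manual effort the paragraph above refers to.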
Because so much software today depends on other software — especially open source — to run properly, applications that consume time series data (and the tools that devs use to build them) need to be able to integrate quickly and securely. Otherwise, the bottlenecks will start bulging.
“Given time series data software is heavily used for forecasting, financial insights, anomaly detection, supply chain, and detecting trends, it needs to allow developers to combine different open source modules seamlessly and securely,” Joshi said. “This is needed, otherwise developers will spend way too much time in creating these modules.”
The challenges will get heavier, too, as organizations across a broad swath of industries develop new or growing uses for time series data. They’re also compounded by increasingly complex use cases.
“Now time series software needs to support natural language processing, artificial neural networks, and advanced pattern recognition,” Joshi said. “These are very complex domains and developers need significant tools, pre-made templates and tested modules to deliver their software on time.”
Finding a Solution
The solution to the broader set of challenges with time series data starts with recognizing, rather than papering over, the problems facing a particular team or organization and then identifying the causes so that you can alleviate them.
As with many other dimensions of modern software development, “time to value” (which is ultimately what the C-suite cares about) and developer experience (which is absolutely critical to that time to value) depend on an environment or platform that consistently prioritizes automation, interoperability, integration, speed, and flexibility. Throw in openness — as a counterweight to vendor lock-in — for good measure, too.
Companies with considerable time series data use cases, Kudalugodaarachchi said, may benefit from using a data store that specializes in time series data, because it is designed around the challenges described previously. For example, it can perform those time-slice aggregations in the database itself (instead of the application layer) to deliver near-real-time response rates.
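To show what aggregating in the database rather than the application layer looks like, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in chosen because it runs anywhere; it is not a time series database, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, temp REAL)")
rows = [
    ("2023-01-01 00:00:00", 20.0),
    ("2023-01-01 00:30:00", 22.0),
    ("2023-01-01 01:00:00", 30.0),
    ("2023-01-01 01:30:00", 34.0),
]
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

# The database groups by hour and returns one row per hour,
# so the raw points never have to travel to the application.
hourly = conn.execute(
    "SELECT strftime('%Y-%m-%d %H:00', ts) AS hour, AVG(temp) "
    "FROM readings GROUP BY hour ORDER BY hour"
).fetchall()

for hour, avg in hourly:
    print(hour, avg)
```

The application receives two rows instead of four here. At production volumes, the same pattern returns a handful of aggregates instead of millions of raw points.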
This means developers build quickly and see results faster — without spending countless hours on database administration or other operational headaches outside of their wheelhouse.
This is essentially what InfluxData built with InfluxDB, a developer-oriented platform for building and scaling time series applications that addresses the common challenges teams face with ingestion, real-time analytics, storage, and more.
For any company with significant time series data use cases, Nelson stressed the need for a time series solution that “meets developers where they are.”
At InfluxData, she said, “we focus on all the different ways to help a developer get their data into the time series database, and manage it within the database. The developer typically has a need to store a lot of time-series data, to use it as part of their application. They want to be able to analyze and be able to act on it. They don’t want to spend all their energy just trying to figure out how to get their data in the right format, store it efficiently, and manage it over time.”
Significant attention has been given to how data is ingested and how developers query it. Users of InfluxData’s solution can use SQL, InfluxQL (the Influx query language), or Flux, which enables users to not only query the data but act on it.
Data query results are presented in a user-friendly fashion. For instance, Nelson said, the web UI “enables you to just basically point and click, select your measurements, select your time range, and we will show you the results in any one of a variety of different graphical formats.”
The company also maintains an open source project, Telegraf, an agent that sits at the edge. It has more than 300 different plugins that can fit to the protocols for pulling data from edge devices. Then, Nelson said, it can write “to wherever the time series destination would be, with transformation, batching and retry logic, all in that agent.” This makes for a much more resilient data pipeline.
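The batching and retry logic Nelson mentions can be sketched generically. The code below is not Telegraf's implementation or API. The helper names (`make_batches`, `write_with_retry`) and the `flaky_send` destination are hypothetical, meant only to show why this logic makes a pipeline more resilient.

```python
import time

def make_batches(points, size):
    """Yield successive lists of at most `size` points."""
    for i in range(0, len(points), size):
        yield points[i:i + size]

def write_with_retry(send, batch, retries=3, delay=0.0):
    """Call send(batch); on ConnectionError, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            send(batch)
            return attempt  # number of retries that were needed
        except ConnectionError:
            if attempt == retries:
                raise  # give up after the final attempt
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff

# Hypothetical destination that fails on the first call, then succeeds.
received = []
calls = {"n": 0}

def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("destination unavailable")
    received.extend(batch)

for batch in make_batches(list(range(5)), size=2):
    write_with_retry(flaky_send, batch)

print(received, calls["n"])
```

Despite the first write failing, every point arrives, and the destination sees a few batched writes rather than one request per point.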
InfluxDB is also designed to help developers manage their data over time. Left unchecked, old time series data can fill up disk space.
“When you first define a bucket, which is the term we use for the database that is storing this data, you define a retention policy,” Nelson said. “You can say, ‘Hey, data in this bucket, I want to keep it for six months.’ And then we take care of auto-purging it over time, so you don’t run into the situation where your disks are filling.”
She summarized, “We make it very easy for you to do a combination of retention policy enforcement and downsampling. So that you can basically have all of this data flowing in on an ongoing basis, and not risk that, at some point, you’re suddenly going to find that you have a lot of ancient data that you have to clean up and deal with. We take care of that for you.”
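For readers unfamiliar with the two ideas, here is a minimal sketch of retention enforcement and downsampling over an in-memory list. It is not how InfluxDB implements either feature. The function names, the 180-day stand-in for "six months" and the sample data are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def enforce_retention(points, now, keep):
    """Drop (timestamp, value) points older than `keep` relative to `now`."""
    return [(ts, v) for ts, v in points if now - ts <= keep]

def downsample(points, width):
    """Replace points with one mean value per fixed window of length `width`."""
    seconds = width.total_seconds()
    buckets = {}
    for ts, v in points:
        # Align each timestamp to the start of its window.
        window_start = datetime.fromtimestamp(
            ts.timestamp() // seconds * seconds, tz=timezone.utc
        )
        buckets.setdefault(window_start, []).append(v)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]

now = datetime(2023, 7, 1, tzinfo=timezone.utc)
recent = datetime(2023, 6, 30, 23, 0, tzinfo=timezone.utc)
points = [(datetime(2022, 12, 1, tzinfo=timezone.utc), 45.0)] + [
    (recent + timedelta(minutes=5 * i), 50.0 + 2 * i) for i in range(4)
]

kept = enforce_retention(points, now, keep=timedelta(days=180))
downsampled = downsample(kept, width=timedelta(minutes=10))
print(len(kept), downsampled)
```

Retention removes the stale December reading, and downsampling reduces the four recent readings to two 10-minute averages, which is the kind of housekeeping a managed platform handles automatically.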
The InfluxData approach, Nelson said, comes from looking “holistically at the day in the life of the developer who is trying to build an application that has time series data. … we’ve looked at well, what are all the things they’re likely to need to do? And how do we make that easier for them?
“We don’t want to say to the developer that they need to make all these changes to match our world. No. We will develop capabilities to match your world.”