Best Practices to Build IoT Analytics
Today, Internet of Things (IoT) data, or sensor data, is all around us. Industry analysts project that the number of connected devices worldwide will reach 30.9 billion by 2025, up from 12.7 billion in 2021.
IoT data has special characteristics, so you have to plan how to store and manage it to protect the bottom line. Making the wrong choices about storage and tooling can complicate data analysis and increase costs.
A single IoT sensor sends, on average, a data point per second. That totals 86,400 data points in a single day. And some sensors generate data as often as every nanosecond, which significantly increases that daily total.
Most IoT use cases don’t rely on a single sensor either. If you have several hundred sensors, all generating data at these rates, then we’re talking about a lot of data. You could have millions of data points to analyze in a single day, so you need to ensure that your system can handle time series workloads of this size. If your storage is inefficient, your queries are slow to return, or your analysis and visualization tools aren’t configured for this type of data, then you’re in for a bad time.
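To put rough numbers on this, here is a back-of-the-envelope calculation in Python. The fleet size of 300 sensors is an illustrative assumption, not a figure from any particular deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def points_per_day(sensors: int, points_per_second: float = 1.0) -> int:
    """Estimate how many data points a fleet of sensors produces per day."""
    return int(sensors * points_per_second * SECONDS_PER_DAY)


single_sensor = points_per_day(1)   # one sensor at one point per second
fleet = points_per_day(300)         # assumed fleet of 300 sensors

print(f"single sensor: {single_sensor:,} points/day")
print(f"300 sensors:   {fleet:,} points/day")
```

Even at one point per second, a few hundred sensors already land in the tens of millions of points per day.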
In this article, I will go over six best practices for building efficient and scalable IoT analytics.
1. Start Your Storage Right
Virtually all IoT data is time series data. Therefore, consider storing your IoT data in a time series database. As a purpose-built solution for time series workloads, it provides the best performance for this kind of data. IoT data generally has the same four components. The first is simply the name of what you’re tracking. We can call that a measurement, and that may be temperature, pressure, device state or anything else. Next are tags. You may want to use tags to add context to your data. Think about tags like metadata for the actual values you’re collecting. The values themselves, which are typically numeric but don’t have to be, we can call fields. And the last component is a timestamp that indicates when the measurement occurred.
Knowing the shape and structure of our data makes it easier to work with when it’s in the database. So what is a time series database? It’s a database designed to store these data values (like metrics, events, logs and traces) and query them based on time. Compare this to a non-time series database, where you would typically query on an ID, a value type or a combination of the two. In a time series database, queries are organized primarily around time. As a result, you can easily see data from the past hour, the past 24 hours and any other interval for which you have data. A popular time series database is InfluxDB, which is available both as a cloud service and as open source software.
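To make the idea of querying on time concrete, here is a conceptual sketch in plain Python. It is not how any specific database implements range queries; it only shows why keeping data ordered by timestamp makes "the past hour" cheap to find. The query_range function and the sample readings are made up for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def query_range(timestamps, values, start, stop):
    """Return (timestamp, value) pairs with start <= timestamp < stop.

    Assumes timestamps is sorted in ascending order.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_left(timestamps, stop)
    return list(zip(timestamps[lo:hi], values[lo:hi]))


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Six readings, 30 minutes apart, from 3 hours ago up to 30 minutes ago
timestamps = [now - timedelta(minutes=m) for m in range(180, 0, -30)]
values = [20.0, 20.5, 21.0, 21.5, 22.0, 22.5]

last_hour = query_range(timestamps, values, now - timedelta(hours=1), now)
for ts, value in last_hour:
    print(ts.isoformat(), value)
```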
2. High-Volume Ingestion
Time series data workloads tend to be large, fast and constant. That means you need an efficient method to get your data into your database. For that we can look at a tool like Telegraf, an open source collection agent that runs as a background service and gathers time series metrics at a regular interval. It has more than 300 plugins available for popular time series data sources, including IoT devices, as well as more general plugins like execd, which you can use with a variety of data sources.
Depending on the database you choose to work with, other data ingest options may include client libraries, which allow you to write data using a language of your choice. For instance, Python is a common option for this type of tool. It’s important that these client libraries come from your database vendor or project, so you know they can handle the ingest stream.
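One common way to cope with a large, constant stream is to buffer points and send them in batches rather than one request per point. The sketch below illustrates that idea only. BatchWriter and its send callback are hypothetical names, not part of any real client library; in practice, send would wrap whatever write call your database's own client provides.

```python
from typing import Callable, List


class BatchWriter:
    """Buffer lines and hand them to `send` in fixed-size batches."""

    def __init__(self, send: Callable[[List[str]], None], batch_size: int = 500):
        self._send = send
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def write(self, line: str) -> None:
        self._buffer.append(line)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, even if the batch is not full."""
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []  # start a fresh list; the sent batch is untouched


# Stand-in for a real network call: just collect the batches in memory
sent_batches: List[List[str]] = []
writer = BatchWriter(send=sent_batches.append, batch_size=3)

for i in range(7):
    writer.write(f"temperature,device=sensor-01 value={20 + i}i {i}")
writer.flush()  # flush the final partial batch

print([len(batch) for batch in sent_batches])
```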
3. Cleaning the Data
You have three options when it comes to cleaning your data: You can clean it before you store it, after it’s in your database or inside your analytics tools. Cleaning up data before storage can be as simple as having full control over the data you send to storage and dropping data you deem unnecessary. Oftentimes, however, the data you receive is proprietary, and you do not get to choose which values you receive.
For example, my light sensor sends extra device tags that I don’t need, and occasionally, if a light source is suddenly lost, it sends strange, erroneous values, like 0. For those cases, I need to clean up my data after storing it. In a database like InfluxDB, I can easily store my raw data in one data bucket and my cleaned data in another. Then I can use the clean data bucket to feed my analytics tools. There’s no need to worry about cleaning data in the tools, where the changes wouldn’t necessarily replicate back to the database. If you wait until the data hits your analytics tools to clean it, that can use more resources and affect performance.
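The raw-bucket and clean-bucket pattern described above can be sketched in plain Python. The tag names, the lux field and the rule that a reading of 0 is erroneous are assumptions taken from the light sensor example, and the "buckets" here are just in-memory lists rather than real database buckets.

```python
from typing import Any, Dict, List

KEEP_TAGS = {"device", "room"}  # assumed: the only tags worth keeping


def clean(raw_points: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return cleaned copies of the points; the raw points are left as-is."""
    cleaned = []
    for point in raw_points:
        # Assumed rule from the example: a lux reading of 0 is erroneous
        if point["fields"].get("lux") == 0:
            continue
        cleaned.append({
            **point,
            "tags": {k: v for k, v in point["tags"].items() if k in KEEP_TAGS},
        })
    return cleaned


raw_bucket = [
    {"measurement": "light", "time": 1, "fields": {"lux": 310.0},
     "tags": {"device": "sensor-01", "room": "lab", "firmware": "1.2"}},
    {"measurement": "light", "time": 2, "fields": {"lux": 0},
     "tags": {"device": "sensor-01", "room": "lab", "firmware": "1.2"}},
    {"measurement": "light", "time": 3, "fields": {"lux": 305.0},
     "tags": {"device": "sensor-01", "room": "lab", "firmware": "1.2"}},
]

clean_bucket = clean(raw_bucket)
print(len(raw_bucket), "raw points ->", len(clean_bucket), "clean points")
```

Keeping the raw data untouched means the cleaning rules can be changed and rerun later without losing anything.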
4. The Power of Downsampling
Cleaning and downsampling data are not the same. Downsampling is aggregating the data based on time. For example, dropping a device ID from your measurement is cleaning, while deriving the mean value for the last five minutes is downsampling. Downsampling is a powerful tool in that, like cleaning data, it can save you storage costs and make the data easier and faster to work with.
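As a minimal sketch of the five-minute mean mentioned above, the Python below groups readings into fixed five-minute windows and averages each one. The downsample_mean function and the sample readings are illustrative; a time series database would normally do this for you with a built-in aggregate.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 5 * 60  # five-minute windows


def downsample_mean(points, window=WINDOW_SECONDS):
    """Aggregate (unix_seconds, value) pairs into per-window means.

    Returns a list of (window_start, mean_value), sorted by window start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window].append(value)  # align to window start
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


points = [(0, 20.0), (60, 22.0), (299, 24.0), (300, 30.0), (540, 32.0)]
downsampled = downsample_mean(points)
print(downsampled)  # five raw points become two windowed means
```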
In some cases, you can downsample data before storing it in its permanent database, for example, if you know that you don’t need the fine-grained data from your IoT sensors. You can also use downsampling to compare data patterns, like finding the average temperature across the hours of the day on different days or devices. The most common use for downsampling is to aggregate old data.
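The hour-of-day comparison can be sketched the same way: group readings by the hour in which they occurred, regardless of the day, and average each group. The hourly_profile function and the sample temperatures are illustrative assumptions, and hours are taken in UTC for simplicity.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

DAY = 86_400  # seconds in a day


def hourly_profile(points):
    """Average (unix_seconds, value) pairs by UTC hour of day."""
    by_hour = defaultdict(list)
    for ts, value in points:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        by_hour[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(by_hour.items())}


points = [
    (9 * 3600, 18.0),         # day 1, 09:00 UTC
    (9 * 3600 + DAY, 20.0),   # day 2, 09:00 UTC
    (14 * 3600, 25.0),        # day 1, 14:00 UTC
    (14 * 3600 + DAY, 27.0),  # day 2, 14:00 UTC
]

profile = hourly_profile(points)
for hour, avg in profile.items():
    print(f"{hour:02d}:00  {avg}")
```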
You monitor your IoT devices in real time, but what do you do with old data once new data arrives? Downsampling takes high-granularity data and makes it less granular by applying means, medians and other aggregate operations. This preserves the shape of your historical data so you can still do historical comparisons and anomaly detection while reducing storage space.
5. Real-Time Monitoring
When it comes to analyzing your data, you can either compare it to historical data to find anomalies, or you can check it against thresholds that you set. Regardless of your monitoring style, it’s important to do so in real time so that you can use the incoming data to make quick decisions and take fast action. The primary approaches for real-time monitoring include using a built-in option in your database, real-time monitoring tools or a combo of the two.
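Both monitoring styles can be sketched in a few lines of Python. The fixed limits and the sample history are made up, and the z-score test used for the historical comparison is just one simple choice that I am assuming here, not a method prescribed by any particular tool.

```python
from statistics import mean, stdev


def breaches_threshold(value, low, high):
    """Fixed-limit check: True when the value falls outside [low, high]."""
    return not (low <= value <= high)


def is_anomaly(value, history, z_limit=3.0):
    """Historical check: True when the value sits more than z_limit
    standard deviations away from the mean of past readings."""
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return value != mean(history)
    return abs(value - mean(history)) / spread > z_limit


history = [21.0, 21.4, 20.8, 21.1, 21.2]  # recent temperature readings

for reading in (21.3, 25.0):
    print(
        reading,
        "threshold breach:", breaches_threshold(reading, low=15.0, high=24.0),
        "anomaly:", is_anomaly(reading, history),
    )
```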
Regardless of the approach you choose, it’s critical for queries to have quick response times and minimal lag because the longer it takes for your data to reach your tools, the less real time it becomes. Telegraf offers output plugins to various real-time monitoring solutions. Telegraf is designed to work with time series data and is optimized for InfluxDB. So if you want to optimize data transport, you might want to consider that combination.
6. Historical Aggregation and Cold Storage
When your data is no longer relevant in real time, it’s common to continue to use it for historical data analysis. You might also want to store older data, whether raw or downsampled, in more efficient cold storage or a data lake. A time series database that excels at ingesting and working with real-time data also needs to be a good place to store your data long term.
Some duplication of data across locations is almost inevitable, but outside of backups, of course, the less of it you have, the better. In the near future, InfluxDB will offer a dedicated cold storage solution. In the meantime, you can always use Telegraf output plugins to send your data to other cold storage solutions.
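The split between data that stays "hot" for real-time work and data that moves to cold storage is usually decided by age. Here is a minimal sketch of that decision in Python. The seven-day hot window and the split_by_age function are assumptions for illustration, and the actual move to cold storage is left out.

```python
HOT_WINDOW_SECONDS = 7 * 24 * 60 * 60  # assumed: keep the last 7 days hot


def split_by_age(points, now, hot_seconds=HOT_WINDOW_SECONDS):
    """Partition (unix_seconds, value) pairs into (hot, cold) lists.

    Points at or after the cutoff stay hot; older points go cold.
    Each point lands in exactly one list, so nothing is duplicated.
    """
    cutoff = now - hot_seconds
    hot, cold = [], []
    for point in points:
        (hot if point[0] >= cutoff else cold).append(point)
    return hot, cold


now = 1_000_000
points = [(100_000, 1.0), (395_200, 2.0), (900_000, 3.0)]

hot, cold = split_by_age(points, now)
print("hot:", hot)
print("cold:", cold)
```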
When working with IoT data, it’s important to use the right tools, from storage to analytics to visualization. Selecting the tools that best fit your IoT data and workloads at the outset will make your job easier and faster in the long run.