
Best Practices: Collect and Query Data from Multiple Sources

Tips to unlock the true value of data and gain actionable insights to drive business growth and innovation.
Aug 3rd, 2023 8:24am

In today’s data-driven world, the ability to collect and query data from multiple sources has become vital. With the rise of the Internet of Things (IoT), cloud computing and distributed systems, organizations face the challenge of handling diverse data streams effectively. It’s now common to run multiple databases and data storage options, and for many large companies, the days of storing everything in a single database are in the past.

It is crucial to implement best practices for efficient data collection and querying to maximize your data stores’ potential. This includes optimizing data ingestion pipelines, designing appropriate schema structures and using advanced querying techniques. On top of this, you need data stores that are flexible when you query data back out and that are compatible with other data stores.

By adhering to these best practices, organizations can unlock the true value of their data and gain actionable insights to drive business growth and innovation. One such option is InfluxDB, a powerful time series database that provides a robust solution for managing and analyzing time-stamped data, allowing organizations to make informed decisions based on real-time insights.

Understanding Different Data Sources

When it comes to data collection, it is crucial to explore different data sources and understand their unique characteristics. This involves identifying the types of data available, their formats and the potential challenges associated with each source. After identifying the data sources, selecting the appropriate data ingestion methods becomes essential. This may mean using APIs, Telegraf plugins or batch writes, depending on the specific requirements and constraints of the data sources.

Data volume and ingestion speed are essential considerations, especially with IoT data. Ensuring data integrity and consistency throughout the collection process is of utmost importance. So, too, is having backup plans for data loss, stream corruption and storage at the edge. This involves implementing robust mechanisms to handle errors, to handle duplicate or missing data and to validate the accuracy of the collected data. Proper data tagging and organization strategies are also essential to efficient data management and retrieval. By tagging data with relevant metadata and organizing it in a structured manner, it becomes easier to search, filter and analyze effectively.

It’s helpful to note here that most data storage solutions come with their own recommendations for how to get data into the system. For InfluxDB, we always suggest Telegraf, our open source data-ingestion agent. Or, for language-specific needs, we suggest our client libraries written in Go, Java, Python, C# and JavaScript. The important takeaway here is to go with recommended and well-documented tools. While it might be tempting to use a tool you are already familiar with, if it’s not recommended, you might miss out on built-in mechanisms for handling the problems described above.
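As a hedged sketch, here is roughly what a write looks like with the InfluxDB 3 Python client library; the host, token, database, measurement and tag names are all placeholders for this example:

```python
from influxdb_client_3 import InfluxDBClient3, Point

# Placeholder connection details for a hypothetical InfluxDB instance.
client = InfluxDBClient3(
    host="example-region.influxdata.com",
    token="MY_TOKEN",
    database="iot_sensors",
)

# One sensor reading, tagged so it can be filtered later.
point = (
    Point("sensor_data")
    .tag("device_id", "DEVICE-001")
    .field("temperature", 21.7)
)
client.write(record=point)
```

A client library like this handles details such as serializing the point into line protocol and surfacing write errors, which is exactly the kind of problem-handling you give up when you bypass the recommended tooling.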

Effective Data Modeling

Effective data modeling is a crucial aspect of building robust and scalable data systems. It involves understanding the structure and relationships of data entities and designing schemas that facilitate efficient data storage, retrieval and analysis. A well-designed data model provides clarity, consistency and integrity to the data, ensuring its accuracy and reliability. The most important piece when dealing with multiple data sources is determining your “connector”: the field that links records across your data sets.

For example, let’s look at a generator that has two separate data sets: one in a SQL database storing the unit stats and one in InfluxDB holding real-time data about the battery capacity. You might need to identify a faulty generator and its owner based on these two data sets. It might seem like common sense to have some kind of shared ID between them, but when you first model your data, the concern is usually less about being able to combine data sets and more about the main data use case and removing unnecessary data. Two other questions to ask are: How unique is your connector, and how easy will it be to store? In this example, the device sending real-time battery data might not have easy access to a serial number; that might need to be a hard-coded value added to all data collected from the generator.
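To make the connector idea concrete, here is a minimal sketch; the measurement, tag and constant below are assumptions for illustration, as is the matching serial_number column assumed on the SQL side:

```python
from influxdb_client_3 import Point

# Hard-coded on the device: the connector value that the SQL
# unit-stats table is assumed to also store in a serial_number column.
SERIAL_NUMBER = "GEN-00042"

def battery_point(battery_level: float) -> Point:
    """Build a battery reading that always carries the connector tag."""
    return (
        Point("battery")
        .tag("serial_number", SERIAL_NUMBER)
        .field("battery_level", battery_level)
    )
```

Because every point carries the same serial number that the SQL database stores, query results from either system can always be matched back to the other.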

Furthermore, as data evolves over time and variations occur, it becomes essential to employ strategies to handle these changes effectively. This may involve techniques such as versioning, migration scripts or implementing dynamic schema designs to accommodate new data attributes or modify existing ones.

For example, if our generator adds new data sets, it’s important that we add our original connector to that new data. But what if you are dealing with an existing data set? Then it gets trickier. You might have to go back and retroactively implement your connector. In this example, you might require people to manually enter their serial number in the app where they register their generator and view battery information. This allows you to tag them as the owner, and you can run analysis on their device from a distance to determine if it’s operating within normal range.

Obviously, this is a very simple example, but many companies and industries use this concept. The idea of data living in a vacuum is starting to disappear as many stakeholders expect to access multiple data sources and have an easy way to combine the data sets. So let’s dive into how to combine data sets once you have them, continuing the previous example with InfluxDB and a SQL database, a common use case for combining data.

When it comes to querying your data, and especially when it comes to combining data sets, there are a couple of recommended tools to accomplish this task. First is SQL, which is broadly used to query many data sources, including InfluxDB. And when it comes to data manipulation and analysis, a second tool, Pandas, is useful for flexible and efficient data processing. Pandas is a Python library that is agnostic to the data it accepts, as long as it’s within a Pandas DataFrame. Many data sources document how to convert their data streams into a Pandas DataFrame because it is such a popular tool.

The following code is an example of a SQL query in InfluxDB, which returns the average battery level over the past week for this specific device (via serial number):
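This is a sketch of what that query might look like through the Python client library; the battery measurement, its columns and the serial number are assumed schema details for this example:

```python
from influxdb_client_3 import InfluxDBClient3

# Placeholder connection details for a hypothetical InfluxDB instance.
client = InfluxDBClient3(
    host="example-region.influxdata.com",
    token="MY_TOKEN",
    database="generators",
)

# Average battery level over the past week for one device,
# filtered by the serial_number connector tag.
query = """
SELECT AVG(battery_level) AS avg_battery_level
FROM battery
WHERE serial_number = 'GEN-00042'
  AND time >= now() - INTERVAL '7 days'
"""

df = client.query(query).to_pandas()
```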

This query would happen on the app side. When a user logs in and registers their generator’s serial number, that enables you to store the data with a serial number tag to use for filtering. For the readability of this query, it’s easier to imagine all generator data going into one large database. In reality, it’s more likely that each serial number would have its own data store, especially if you want to offer customers the chance to store their data longer for a fee, which is a common offer for some businesses and use cases, like residential solar panels.

An app developer would likely write several such queries to cover averages for the day and week, and to account for battery usage, battery levels and most recent values, etc. Ultimately, they hope to end up with between 10 and 20 values that they can show to the end user. You can find a list of these functions for InfluxDB here.
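For instance, a single SQL statement can compute several of those summary values at once; this sketch reuses the assumed battery schema and client from above:

```python
# A few of the 10 to 20 summary values, computed in one pass.
summary_query = """
SELECT
  AVG(battery_level) AS avg_battery_level_7d,
  MIN(battery_level) AS min_battery_level_7d,
  MAX(battery_level) AS max_battery_level_7d
FROM battery
WHERE serial_number = 'GEN-00042'
  AND time >= now() - INTERVAL '7 days'
"""

summary_df = client.query(summary_query).to_pandas()
```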

Once they have these values, they can combine them with the customer data housed in their SQL database: things like name, address, etc. They can use the InfluxDB Python client library to combine the two data sets in Pandas.

This is an example of what that join would look like in the end. When it comes to joining, Pandas has a few options. In this example, I’m using an inner join, which keeps only the serial numbers that appear in both data sets. You would probably need to rename some columns, but overall, this query results in a combined data frame that you can then convert as needed for use.
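Here is a minimal sketch of that join; the column names are made up, and the customer rows stand in for what you would normally load from the SQL database (for example, with pandas.read_sql):

```python
import pandas as pd

# Per-device aggregates queried from InfluxDB (as above).
influx_df = pd.DataFrame({
    "serial_number": ["GEN-00042", "GEN-00077"],
    "avg_battery_level_7d": [87.5, 42.1],
})

# Customer records; in practice these would come from the SQL database.
customers_df = pd.DataFrame({
    "serial_number": ["GEN-00042", "GEN-00077"],
    "name": ["A. Owner", "B. Owner"],
    "address": ["123 Main St", "456 Oak Ave"],
})

# Inner join on the shared connector column: the result keeps only
# serial numbers that appear in both data sets.
combined = influx_df.merge(customers_df, on="serial_number", how="inner")
print(combined)
```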

You can imagine how data scientists might use these tools to run anomaly detection on the data sets to identify faulty devices and alert customers to the degradation and needed repairs. If there is a charge for storing data, users can also combine this data with a financial data set to confirm which customers pay for extended storage time and possibly receive extra information. Even in this simple example, there are many stakeholders, and at scale the number of people who need to access and use multiple data sets only expands.

Key Takeaways

With so much data in the world, the notion of storing everything in a single database or data store may seem tempting. (To be clear, you may want to store all of the same type of data, such as time series data, in a single database.) While it can be a viable solution at a small scale, the reality is that both small- and large-scale companies can benefit from the cost savings, efficiency improvements and enhanced user experiences that arise from using multiple data sources.

As industries evolve, engineers must adapt and become proficient in working with multiple data stores, and the ability to seamlessly collect and query data from diverse sources becomes increasingly important. Embracing this approach enables organizations to leverage the full potential of their data and empowers engineers to navigate the ever-expanding landscape of data management with ease.
