8 Real-Time Data Best Practices
More than 300 million terabytes of data are created every day. The next step in unlocking the value from all that data we’re storing is being able to act on it almost instantaneously.
Real-time data analytics is the pathway to the speed and agility you need to respond to change. But real-time data also amplifies the challenges of batch data, only continuously and at terabyte scale.
Then, when you’re making changes or upgrades to real-time environments, “you’re changing the tires on the car, while you’re going down the road,” Ramos Mays, CTO of Semantic-AI, told The New Stack.
How do you know if your organization is ready for that deluge of data and insights? Do you have the power, infrastructure, scalability and standardization to make it happen? Are all of the stakeholders at the planning board? Do you even need real time for all use cases and all datasets?
Before you go all-in on real time and commit to that significant cost, there are a lot of data best practices to evaluate and put in place.
1. Know When to Use Real-Time Data
Just because you can collect real-time data, doesn’t mean you always need it. Your first step should be thinking about your specific needs and what sort of data you’ll require to monitor your business activity and make decisions.
Some use cases, like supply chain logistics, rely on real-time data for real-time reactions, while others simply demand a much slower information velocity and only need analysis on historical data.
Most real-time data best practices come down to understanding your use cases up front because, Mays said, “maintaining a real-time infrastructure and organization has costs that come alongside it. You only need it if you have to react in real time.
“I can have a real-time ingestion of traffic patterns every 15 seconds, but, if the system that’s reading those traffic patterns for me only reads it once a day, as a snapshot of only the latest value, then I don’t need real-time 15-second polling.”
Nor, he added, should he need to support the infrastructure to maintain it.
Most companies, like users of Semantic-AI, an enterprise intelligence platform, need a mix of historical and real-time data; Mays’ company, for instance, is selective about when it does and doesn’t opt for using information collected in real time.
He advises bringing together your stakeholders at the start of your machine learning journey and asking: Do we actually need real-time data, or is near-real-time streaming enough? What’s our plan to react to that data?
Often, you just need to react if there’s a change, so you would batch most of your data, and then go for real time only for critical changes.
“With supply chain, you only need real time if you have to respond in real time,” Mays said. “I don’t need real-time weather if I’m just going to do a historic risk score, [but] if I am going to alert there’s a hurricane through the flight path of your next shipment [and] it’s going to be delayed for 48 hours, you’re reacting in real time.”
2. Keep Data as Lightweight as Possible
Next you need to determine which categories of data actually add value by being in real time in order to keep your components lightweight.
“If I’m tracking planes, I don’t want my live data tracking system to have the flight history and when the tires were last changed,” Mays said. “I want as few bits of information as possible in real time. And then I get the rest of the embellishing information by other calls into the system.”
Real-time data must be designed differently from batch data, he said. Start by thinking about where it will be presented in the end, and then, he recommended, tailor your data and streams to be as close to your display format as possible. Designing this way up front also shapes how the team will respond to changes.
For example, if you’ve got customers and orders, one customer can have multiple orders. “I want to carry just the amount of information in my real-time stream that I need to display to the users,” he said, such as the customer I.D. and order details. Even then, you will likely only show the last few orders in live storage, and then allow customers to search and pull from the archives.
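Carrying only display-level fields, as Mays describes, can be sketched like this (a minimal illustration; the event fields and wire format are assumptions for the example, not an actual Semantic-AI schema):

```python
from dataclasses import dataclass

# Illustrative sketch: the real-time stream event carries only what the
# UI needs to display, not the customer's full history or profile.
@dataclass(frozen=True)
class OrderEvent:
    customer_id: str   # key used to fetch richer data on demand
    order_id: str
    status: str        # e.g. "shipped", "delayed"

def to_wire(event: OrderEvent) -> dict:
    """Serialize the minimal payload for the stream."""
    return {"cid": event.customer_id, "oid": event.order_id, "st": event.status}

event = OrderEvent("C-1001", "O-77", "shipped")
print(to_wire(event))  # everything else comes from separate lookups
```

Anything beyond these fields — flight history, tire changes, full order archives — stays in other systems and is fetched by separate calls only when needed.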
For risk scoring, a ground transportation algorithm needs real-time earthquake information, while the aviation algorithm needs real-time wind speed — it’s rare that they would both need both.
Whenever possible, Mays added, only record deltas, meaning changes in your real-time data. If your algorithm trains on stock prices, but those only change every 18 seconds, you don’t need to poll every quarter second. There’s no need to send those 72 data points across the network when you could send a single message whenever the value changes. That in turn reduces your organizational resource requirements and keeps the focus on the actionable.
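A delta filter of the kind Mays describes takes only a few lines of Python (an illustrative sketch; the polling values are invented):

```python
def delta_filter(readings):
    """Yield a reading only when its value differs from the last one seen.

    Turns a high-frequency polling stream (e.g. every quarter second)
    into a stream of deltas, so unchanged values never cross the network.
    """
    last = object()  # sentinel: nothing seen yet
    for value in readings:
        if value != last:
            yield value
            last = value

# Twelve polls of a price that only changed twice -> three messages sent
polls = [10.0, 10.0, 10.0, 10.5, 10.5, 10.5, 10.5, 10.5, 9.9, 9.9, 9.9, 9.9]
print(list(delta_filter(polls)))  # [10.0, 10.5, 9.9]
```

The same idea applies at the sensor or the ingestion layer; the earlier in the pipeline you drop unchanged values, the less infrastructure you pay for downstream.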
3. Unclog Your Pipes
Your data can be stored in the RAM of your computer, on disk or in the network pipe. Reading and writing everything to the disk is the slowest. So, Mays recommended, if you’re dealing in real-time systems, stay in memory as much as you can.
“You should design your systems, if at all possible, to only need the amount of data to do its thing so that it can fit in memory,” he said, so your real-time memory isn’t held up in writing and reading to the disk.
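One common way to keep a real-time working set bounded in memory is a fixed-size buffer that discards the oldest readings instead of spilling to disk. A Python sketch (the window size is an arbitrary assumption; tune it to your memory budget):

```python
from collections import deque

# Keep only the window of data the system needs to "do its thing,"
# bounded so it always fits in memory. Older readings simply fall off.
WINDOW = 1000  # illustrative size
live_readings: deque = deque(maxlen=WINDOW)

for i in range(5000):          # simulate 5,000 incoming readings
    live_readings.append(float(i))

print(len(live_readings))      # never exceeds WINDOW: 1000
print(live_readings[0])        # oldest retained reading: 4000.0
```

Anything that ages out of the window belongs in archival storage, not in the real-time path.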
“Computer information systems are like plumbing,” Mays said. “Still very mechanical.”
Think of the amount of data as water. The size of the pipes determines how much water you can send through. One stream of water may need to split into five places. Your pipes are your network cables or, inside the machine, the I/O bus that moves data between RAM and the hard disk. The networks are the water company’s mainlines, while the bus inside acts like the connection between the mainlines and the different rooms.
Most of the time, this plumbing just sits there, waiting to be used. You don’t really think about it until you are filling up your bathtub (RAM). If you have a hot water heater (a hard drive), it’ll heat up right away; if it’s coming from your water main (a disk or networking cable), it takes time to heat up. Either way, when you have finished using the water (data) in your bathtub (RAM), it drains away and is gone.
You must have telemetry and monitoring, Mays said, extending the metaphor, because “we also have to do plumbing while the water is flowing a lot of times. And if you have real-time systems and real-time consumers, you have to be able to divert those streams or store them and let it back up or to divert it around a different way,” while fixing it, in order to meet your service-level agreement.
4. Look for Outliers
As senior adviser for the Office of Management, Strategy and Solutions at the U.S. Department of State, Landon Van Dyke oversees the Internet of Things network for the whole department — including, but not limited to, all sensor data, smart metering, air monitoring, and vehicle telematics across offices, embassies and consulates, and residences. Across all resources and monitors, his team exclusively deals in high-frequency, real-time data, maintaining two copies of everything.
He takes a contrary perspective to Mays and shared it with The New Stack. With all data in real time, Van Dyke’s team is able to spot crucial outliers more often, and faster.
“You can probably save a little bit of money if you look at your utility bill at the end of the month,” Van Dyke said, explaining why his team took on its all-real-time strategy to uncover better patterns at a higher frequency. “But it does not give you the fidelity of what was operating at three in the afternoon on a Wednesday, the third week of the month.”
The specificity of energy consumption patterns is necessary to really make a marked difference, he argued. Van Dyke’s team uses that fidelity to identify when things aren’t working or when something can be changed or optimized, like when a diplomat is supposed to be away but the energy usage shows that someone has entered their residence without authorization.
“Real-time data provides you an opportunity for an additional security wrapper around facilities, properties and people, because you understand a little bit more about what is normal operations and what is not normal,” he said. “Not normal is usually what gets people’s attention.”
5. Find Your Baseline
“When people see real-time data, they get excited. They’re like, ‘Hey, I could do so much if I understood this was happening!’ So you end up with a lot of use cases upfront,” Van Dyke observed. “But, most of the time, people aren’t thinking on the backend. Well, what are you doing to ensure that use case is fulfilled?”
Without proper planning upfront, he said, teams are prone to just slap on sensors that produce data every few seconds, connecting them to the internet and to a server somewhere, which starts ingesting the data.
“It can get overwhelming to your system real fast,” Van Dyke said. “If you don’t have somebody manning this data 24/7, the value of having it there is diminished.”
It’s not a great use of anyone’s time to pay people to stare at a screen 24 hours a day, so you need to set up alerts. But, in order to do that, you need to identify what an outlier is.
That’s why, he said, you need to first understand your data and establish your baseline, which could take up to six months, or even longer when data points begin to influence one another, as they do within building automation systems. You also have to manage people’s expectations of the value of machine learning early on.
Once you’ve identified your baseline, you can set the outliers and alerts, and go from there.
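A simple way to turn a baseline into alerts is a standard-deviation threshold. This sketch (the readings and the three-sigma cutoff are illustrative assumptions, not the State Department’s actual method) shows the idea:

```python
import statistics

def build_baseline(history):
    """Compute mean and standard deviation from historical readings."""
    return statistics.mean(history), statistics.stdev(history)

def is_outlier(value, mean, stdev, threshold=3.0):
    """Alert when a reading sits more than `threshold` deviations from baseline."""
    return abs(value - mean) > threshold * stdev

# Baseline built from (simulated) historical energy readings
history = [50.0, 52.0, 49.0, 51.0, 50.5, 48.5, 51.5, 49.5]
mean, stdev = build_baseline(history)

print(is_outlier(50.8, mean, stdev))   # normal usage: False
print(is_outlier(80.0, mean, stdev))   # unexpected occupancy: True
```

Real deployments usually use time-of-day and day-of-week baselines rather than a single global mean, which is part of why establishing the baseline can take months.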
6. Move to a Real-Time Ready Database
Still, at the beginning of this machine learning journey, Van Dyke said, most machines aren’t set up to handle that massive quantity of data. Real-time data easily overwhelms memory.
“Once you get your backend analysis going, it ends up going through a series of models,” Van Dyke said. “Most of the time, you’ll bring the data in. It needs to get cleaned up. It needs to go through a transformation. It needs to be run through an algorithm for cluster analysis or regression models. And it’s gotta do it on the fly, in real time.”
As you move from batch processing to real-time data, he continued, you quickly realize your existing system is not able to accomplish the same activities at a two- to five-second cadence. This inevitably leads to more delays, as your team has to migrate to a faster backend system that’s set up to work in real time.
This is why the department moved over to Kinetica’s real-time analytics database, which, Van Dyke said, has the speed built in to handle running these backend analyses on a series of models, ingesting and cleaning up data, and providing analytics. “Whereas a lot of the other systems out there, they’re just not built for that,” he added. “And they can be easily overwhelmed with real-time data.”
7. Use Standardization to Work with Non-Tech Colleagues
What’s needed now won’t necessarily be in demand in the next five years, Van Dyke predicted.
“For right now, where the industry is, if you really want to do some hardcore analytics, you’re still going to want people that know the coding, and they’re still going to want to have a platform where they can do coding on,” he said.
“And Kinetica can do that.”
He sees a lot of graphical user interfaces cropping up and predicts ownership and understanding of analytics will soon shift to becoming a more cross-functional collaboration. For instance, the subject matter expert (SME) for building analytics may now be the facilities manager, not someone trained in how to code. For now, these knowledge gaps are closed by a lot of handholding between data scientists and SMEs.
Standardization is essential among all stakeholders. Since everything real time is done at a greater scale, you need to know what your format, indexing and keys are well in advance of going down that rabbit hole.
This standardization is no simple feat in an organization as distributed as the U.S. State Department, but its solution can be mimicked in most organizations; finance teams are the most likely to already have a cross-organizational footprint in place. State controls the master dataset, indexes and metadata for naming conventions and domestic agencies, standardizing them across the government based on the Treasury’s codes.
Van Dyke’s team ensured via logical standardization that “no other federal agency should be able to have its own unique code on U.S. embassies and consulates.”
8. Back Up in Real Time
As previously mentioned, the State Department also splits its data into two streams — one for model building and one for archival back-up. This still isn’t common practice in most real-time data-driven organizations, Van Dyke said, but it follows the control versus variable rule of the scientific method.
“You can always go back to that raw data and run the same algorithms that your real-time one is doing — for provenance,” he said. “I can recreate any outcome that my modeling has done, because I have the archived data on the side.” The State Department also uses the archived data for forensics, like finding patterns of motion around the building and then flagging deviations.
Yes, this potentially doubles the cost, but data storage is relatively inexpensive these days, he said.
The department also standardizes ways to reduce metadata repetition. For example, a team may want to capture the speed of a fan in a building, but the metadata for each reading would include the fan’s make and model and the firmware of the fan controller. Van Dyke’s team dramatically reduces that repetition in a table column by leveraging JSON to create nested arrays, which lets the team shrink the data by associating a single note of firmware with all of the speed logs.
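The nesting Van Dyke describes can be sketched in Python (the fan names, fields and values are invented for illustration):

```python
import json

# Flat rows repeat the fan's metadata on every single reading
flat_rows = [
    {"fan": "AHU-3", "make": "Acme", "model": "F200", "firmware": "1.4.2", "rpm": 1180},
    {"fan": "AHU-3", "make": "Acme", "model": "F200", "firmware": "1.4.2", "rpm": 1190},
    {"fan": "AHU-3", "make": "Acme", "model": "F200", "firmware": "1.4.2", "rpm": 1175},
]

# Nested form states the metadata once and attaches an array of readings
nested = {
    "fan": "AHU-3",
    "make": "Acme",
    "model": "F200",
    "firmware": "1.4.2",
    "rpm_log": [row["rpm"] for row in flat_rows],
}

print(len(json.dumps(flat_rows)))  # flat payload repeats metadata per reading
print(len(json.dumps(nested)))     # one metadata note covers all speed logs
```

The savings compound with frequency: at one reading every few seconds, the repeated make/model/firmware columns dwarf the actual measurements.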
It’s not just for real time, but in general, Van Dyke said: “You have to know your naming conventions, know your data, know who your stakeholders are, across silos. Make sure you have all the right people in the room from the beginning.”
Data is a socio-technical game, he noted. “The people that have produced the data are always protective of it. Mostly because they don’t want the data to be misinterpreted. They don’t want it to be misused. And sometimes they don’t want people to realize how many holes are in the data or how incomplete the data [is]. Either way, people have become very protective of their data. And you need to have them at the table at the very beginning.”
In the end, real-time data best practices rely on collaboration across stakeholders and a whole lot of planning upfront.