Bridging Time Series from Edge to Cloud
Consider high-frequency trading firms; you know, the ones you might have read about in Michael Lewis’ “Flash Boys”?
They place trading algorithms physically next to exchanges to execute trades as fast as possible, based on evaluations the algorithms make in real time. Where does such an algorithm come from, and how does it stay up to date with rapidly changing market conditions?
Table that for a minute.
Imagine an ocean wind farm with hundreds of 75-meter, 6-megawatt wind turbines that cost millions of dollars to maintain. How do operators know when they’re having issues without constantly sending engineers into the ocean to check?
Hold on to that one too.
Lastly, consider an application infrastructure case where the critical data comes from hundreds of Kubernetes clusters. A single cluster has its own context in which monitoring dashboards can be run, alerting can be defined and actions can be taken. However, an operations center may need a consolidated overview of the entire Kubernetes fleet, or data science may need global training data for honing anomaly detection.
How do these businesses manage those different “edge” contexts effectively? The answer lies in the relationship between edge data and cloud data. The trading device, wind turbine and Kubernetes cluster are each “edge” assets with respect to the cloud. (I use the term “assets” here as a catchall; what counts as the edge depends on context.)
Edge assets rarely deal with data pertaining to any asset but themselves. To analyze many assets, however, you need to zoom out and consolidate contexts. Sometimes the consolidated context is a factory, data center, satellite or private Kubernetes cluster. Usually, though, it lives in a public cloud presence containing data from assets all over the globe, connected by a WAN (such as the internet).
Tying this back to the examples: algorithm creation, turbine monitoring and central K8s monitoring all happen in the zoomed-out layer that provides the required context.
Cloud data has momentum, but the need for edge data is growing regardless. Application data has immense gravity, and as applications become more distributed, the edge becomes more critical. An edge data presence makes data cheaper, reduces latency and removes the need for an internet connection to make decisions.
The need for cloud data, on the other hand, is wholly different. The cloud is central. It can see all and provide context surrounding devices that otherwise wouldn’t have it, and it can store more data for the more brainy business-level insights.
The relationship between the edge and the cloud hinges on their interdependence. The edge cannot see the forest for the trees, and the cloud can only see what it is given. By leveraging their respective advantages, each can consume and analyze only the data and insights that it needs.
The data at the edge and the data moving to the cloud are time series. Given that — and for the purpose of this post — it’s important I mention some core properties of time-series data:
- It has a half-life. Time-series data points are born with all their value and as time passes, that value begins to diminish as the point tells us less about reality.
- It appends rather than replaces data and does so frequently, which makes it ever-increasing in volume.
- It is often critical and/or sensitive in nature.
Given these properties, businesses often find themselves choosing between the edge and cloud because marrying them is daunting, but it doesn’t need to be.
Edge-Cloud Duality in Action
In High-Frequency Trading
A device is often installed near an exchange and fitted — into its firmware, even — with a ruleset on how to execute trades without human intervention. This allows decisions to be made in microseconds. Make no mistake, however, the device is not the intelligent actor here. The real brainy stuff happens far away from that device, where the algorithms are born. At the edge, trades are made and the data is forwarded. The cloud is where the algorithms are then built and trained because that is where billions of prices across thousands of tickers can be stored, along with other exogenous parameters.
The time-series pipeline here requires that data be as precise and granular as possible at the edge, but since the internet can’t always support such volumes, there must also be a way to reduce that data significantly on its way to the cloud.
- The problem: While these firms might have deep pockets, getting data from the exchanges to the cloud may be hindered not by budget but by bandwidth. Now you need a way to handle high volumes of data for short periods of time. You need enough data to be pushed to the cloud to be able to produce a clear analysis of what is happening on-site — all without breaking the internet. So, what can be done?
- Status quo:
  - Include a custom application at the edge that reduces the data and then forwards it. This is an undertaking and, in this case, often means the code is baked into the edge device where it is hard to reach. If it’s not built in, it may be an app that developers have to host and maintain.
  - Or manually export the data and load it elsewhere. This is too slow for these firms because the cloud is training the algorithms constantly, and they’re also likely running human-readable dashboards from the cloud dataset, which needs to be updated in near real time. Remember that time series has a half-life, and trading firms deal in half-lives akin to those of radioactive isotopes.
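To make the idea of reducing data at the edge concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular firm or product does it: the function name `downsample`, the one-second window and the sample prices are all hypothetical. It collapses raw points into one summary per time window, so only the summaries would need to travel to the cloud.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_s):
    """Reduce raw (timestamp_seconds, value) points to one summary per window.

    Each summary keeps min, max, mean and count so the cloud side can still
    reason about spread and volume without receiving every raw point.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its window.
        window_start = int(ts // window_s) * window_s
        buckets[window_start].append(value)

    return [
        {
            "window_start": start,
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    ]


# Hypothetical raw prices: (timestamp in seconds, price).
raw = [(0.1, 100.0), (0.4, 101.0), (0.9, 99.0), (1.2, 102.0), (1.7, 104.0)]
summaries = downsample(raw, window_s=1)  # five raw points become two summaries
```

The raw points can stay at the edge for local, high-precision decisions, while the cloud receives a fraction of the volume. Which aggregates to keep (and at what window size) is a design choice that depends on what the central analysis actually needs.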
In Wind Farming
In this case, turbines need upkeep before they incur extremely expensive malfunctions. The time-series data they emit hints at future malfunctions, which gives operators the heads-up to physically visit the farm. The turbine health data is no good sitting with the machine when operators are remote, but it is also useful at the machine when operators are there. There needs to be a way to forward the meaningful data for predictive maintenance while also retaining detailed data for forensic analysis on-site.
- The problem: While this scenario shares problems with the trading scenario, it introduces a new one. The access these turbines have to the central network they report over can be extremely unreliable. This means that even if you have a way to constantly forward data from the edge to the cloud, you need a way to intelligently deal with both brief and prolonged connectivity outages.
- Status quo: Turbines emit data when the network allows it. Central monitoring gets intermittent visibility and a lot is missed. Any aggregation of data is either done manually or with software that is hard to update if your aggregation strategies change.
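One common way to cope with an unreliable uplink is a store-and-forward buffer: keep points locally and deliver them in order once the link returns. The sketch below is a simplified illustration, assuming a hypothetical `send` callable that raises `ConnectionError` when the network is down. The class name, the buffer limit and the turbine fields are made up for the example, and a real deployment would likely persist the queue to disk.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Hold points locally and flush them in order when the link is up."""

    def __init__(self, send, max_points=10_000):
        self._send = send  # callable(point); raises ConnectionError on failure
        # Bounded queue: when full, the oldest points are dropped first.
        self._queue = deque(maxlen=max_points)

    def record(self, point):
        self._queue.append(point)

    def flush(self):
        """Try to send everything queued; stop at the first failure.

        Returns the number of points delivered in this attempt.
        """
        delivered = 0
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                break  # link is down; keep the rest for the next attempt
            self._queue.popleft()  # remove only after a successful send
            delivered += 1
        return delivered

    def pending(self):
        return len(self._queue)


# Simulated flaky uplink (hypothetical stand-in for a real network client).
link = {"up": False}
received = []


def send(point):
    if not link["up"]:
        raise ConnectionError("uplink unavailable")
    received.append(point)


buf = StoreAndForwardBuffer(send)
buf.record({"turbine": "t-17", "ts": 1, "vibration": 0.42})
buf.record({"turbine": "t-17", "ts": 2, "vibration": 0.47})
first_attempt = buf.flush()   # link down: nothing delivered, both points kept
link["up"] = True
second_attempt = buf.flush()  # link restored: both points delivered in order
```

Because a point is removed only after it is sent successfully, a dropped connection delays data rather than losing it, up to the buffer limit.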
In Kubernetes
This one is a little different. There is plenty of room for data in enterprise Kubernetes clusters, and even in some “edge K8s” clusters. Here, businesses can do full monitoring, alerting, anomaly detection and predictive analytics without much friction.
- The problem: Federating that data out to a central hub is again constrained by budget and bandwidth. Monitoring each cluster by precise dimensions while also providing a holistic picture and data warehousing is conceptually not too daunting, but given the half-life and volume of time series, doing it practically and flexibly is notoriously difficult.
- Status quo: Databases and other monitoring applications can run at any level of these topologies. Businesses will monitor clusters with these but will choose or emphasize either the edge or the cloud. When they do manage an edge-cloud duality, the data in each is likely not shaped the way they want it, and changing that is rarely feasible.
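As a rough illustration of what a consolidated fleet view involves, the Python sketch below tags each cluster’s already-reduced points with the cluster they came from and then computes per-cluster and fleet-wide averages centrally. The function names, cluster names and the `cpu` metric are assumptions made for the example, not part of any specific monitoring product.

```python
from collections import defaultdict


def tag_with_cluster(cluster, points):
    """Attach the source cluster name so points stay distinguishable centrally."""
    return [{**p, "cluster": cluster} for p in points]


def fleet_overview(tagged_points, metric):
    """Compute a per-cluster average and a fleet-wide average for one metric."""
    per_cluster = defaultdict(list)
    for p in tagged_points:
        per_cluster[p["cluster"]].append(p[metric])

    overview = {c: sum(v) / len(v) for c, v in per_cluster.items()}
    all_values = [x for v in per_cluster.values() for x in v]
    overview["_fleet"] = sum(all_values) / len(all_values)
    return overview


# Hypothetical CPU utilization summaries forwarded by two clusters.
eu = tag_with_cluster("eu-1", [{"ts": 0, "cpu": 0.5}, {"ts": 60, "cpu": 0.75}])
us = tag_with_cluster("us-1", [{"ts": 0, "cpu": 0.25}, {"ts": 60, "cpu": 0.5}])
overview = fleet_overview(eu + us, metric="cpu")
```

Each cluster keeps its own detailed data for local dashboards and alerts, while the central hub only needs the tagged summaries to build the global picture.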
In Other Settings
Applications for edge-cloud duality extend across energy, manufacturing, aerospace and other high-tech industries. Further examples:
- The Google Photos app uses a constantly trained model to categorize photos it deems to be alike. The data used to train that model comes from everyone’s phones, not just yours — and it comes from them all the time.
- Manufacturing plants have operators on the floor who watch expensive machinery very closely. On the other hand, there are often analysts or other engineers at headquarters who want to understand a global view of all factories without breaking the bank or the internet.
The Bottom Line
It’s important to design a data topology that appropriately leverages the edge and the cloud in such a way that both have the data they need to make sense — and that neither has data it doesn’t need or can’t get.
And while there are countless ways the edge and cloud can leverage each other’s strengths to deliver real value to businesses, the trick to this is designing an edge-cloud topology that is practical and effective at the same time. Striking that balance requires technology that accommodates the properties of both the edge and cloud in a way that’s easy to assemble and maintain. Stay tuned for more from InfluxData on this.