The data landscape is replete with vendors, industry pundits, and even journalists espousing the virtues of decoupling storage and compute. After all, the decoupled approach makes it easier to scale compute and storage independently, and it suits both the cloud’s business model and its object storage architectures.
But here’s the part that often goes unsaid: decoupled storage is typically lousy for query performance. If you want queries, especially operational ones, to run fast, placing data near compute, or compute near the data, is usually the better way.
That’s exactly what Ocient has done with the latest release of its eponymous hyperscale data warehouse. It’s architected so that its compute and storage are as co-located as possible, which the company says yields immense performance gains.
This isn’t to say that coupled compute and storage is categorically superior to the decoupled approach. As is often the case, the right choice depends on the particular application. But according to Ocient CEO Chris Gladwin, when those applications involve data at enormous scale, the answer is never truly in doubt.
“If you’re storing a year [of] petabytes of data, and your queries — your business or mission requirements — mean you have to analyze at hyperscale, you’re going to look at a trillion things in order to respond to that query,” Gladwin said. “If you’ve got a compute-separate tier, it’s not going to help you.”
For low-latency responses at this sort of scale, Gladwin says tightly pairing compute and storage produces the best performance. Modern data analysis platforms make a point of relying on what Gladwin termed a “compute-adjacent storage architecture” to deliver these performance gains.
A Question of Scale
Granted, there are many deployments in which decoupling compute from storage is not only viable but the best option. Typically, such applications are less time-sensitive, involve more modest amounts of data, or run on mainstream cloud data warehouses, which may cache the data in compute-adjacent storage tiers anyway.
Expounding further, Gladwin commented, “That’s a great architecture if you have a use case where you have an elastic or intermittent compute load, where it comes and goes. For example, you’re running financials at the end of the week or the month, or if the amount of data you’re analyzing is small relative to the total amount you’re storing.” However, hyperscale applications like digital advertising auctions (which Gladwin estimated occur 10 million times per second) benefit from pairing storage with compute.
The adtech use case involves companies bidding on digital ad placements almost every time users click in browsers. Successful bidding requires analyzing browsing history at immense scale in real time, within time budgets too short for retrieving data from a separate storage layer. It’s better when “the CPU core that’s going to be analyzing data is not separated by a network to the storage; it is in the box,” Gladwin indicated.
Coupling compute and storage reduces latency for this and other use cases, like telemetry data analysis for vehicle fleets. Compared to the decoupled compute and storage approach, these deployments benefit from a warehousing architecture in which “there are four PCI lanes that connect each CPU core to each NVMe drive, [that] can get data off the drive with 1/15th of the latency,” Gladwin revealed.
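That 1/15th figure can be put in rough context with order-of-magnitude access latencies. The numbers below are illustrative assumptions chosen to match the ratio the article cites, not Ocient measurements: a local NVMe read is commonly in the ~100 microsecond range, while fetching the same block over a datacenter network adds round-trip and protocol overhead that can push it toward the millisecond range.

```python
# Rough latency comparison (order-of-magnitude assumptions,
# not vendor measurements).
local_nvme_us = 100       # ~100 µs for a small local NVMe read
network_fetch_us = 1500   # ~1.5 ms for the same read across the network

# How many times slower the remote read is than the local one.
ratio = network_fetch_us / local_nvme_us
print(f"network fetch is ~{ratio:.0f}x slower than local NVMe")
```

Under these assumptions the remote read is about 15x slower, consistent with the latency ratio Gladwin describes; the exact multiple depends heavily on drive, fabric, and protocol.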
Machine Learning Models
Such latency benefits are substantial for the petabytes of data necessary for real-time analysis in hyperscale applications. They also significantly speed up training and retraining machine learning models. The in-database machine learning of premier data analysis platforms delivers considerable performance benefits for updating model inputs to respond to events as they occur. With conventional warehousing approaches, “If you want to do machine learning on a petabyte of data, or 10 petabytes of data… just checking that much data out of your storage library can take an hour,” Gladwin said.
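The hour figure is easy to sanity-check with back-of-envelope arithmetic: moving a petabyte in one hour requires roughly 278 GB/s of sustained aggregate bandwidth, far beyond a single network link. A minimal sketch, with the bandwidth figures as illustrative assumptions:

```python
# Back-of-envelope: how long it takes to move 1 PB at various
# sustained aggregate bandwidths (illustrative figures only).
PETABYTE = 10**15  # bytes

scenarios = {
    "single 10 Gb/s link": 1.25e9,            # bytes/second
    "single 100 Gb/s link": 12.5e9,
    "100 nodes x 10 Gb/s each": 125e9,
    "required to finish in 1 hour": PETABYTE / 3600,
}

for name, bandwidth in scenarios.items():
    hours = PETABYTE / bandwidth / 3600
    print(f"{name:30s} {bandwidth / 1e9:7.1f} GB/s -> {hours:8.2f} h")
```

On a single 10 Gb/s link the transfer takes roughly nine days; even 100 parallel nodes at 10 Gb/s each need over two hours, which is why scanning data in place on local drives changes the picture at this scale.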
Ocient’s hyperscale data warehouse, with in-database machine learning, beats this approach by keeping all compute in process, the company says. “Instead of waiting to download a petabyte of data, we just run it right there,” Gladwin remarked. “Everything that’s there for the database, you utilize the same way [for machine learning].”
The real-world benefits of this approach are tangible. Telcos, for example, typically run policy models that set the next day’s rules for routing traffic. Without coupled compute and storage, “it takes them hours to run that because it’s a giant machine learning model,” Gladwin mentioned. Systems that couple compute and storage, he said, perform these jobs in seconds. That speed is invaluable for rerouting cell tower traffic around late-breaking events involving weather, safety, sports competitions, and more.
A Victorious Combination
Again, common cloud wisdom dictates that decoupled compute and storage is advantageous in a broad array of use cases. Nonetheless, as the hyperscale, real-time analytics use cases above demonstrate, pairing compute and storage often proves critical to performance. Shrewd organizations will avail themselves of these capabilities when their applications mandate them.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Real.