Why Data Management Lives or Dies by the Fight for Namespace Control
Red Hat sponsored this post.
Many organizations want to maintain complete control over the way they store, process and use data — which is why controlling namespaces is so important.
A namespace is the naming structure used to organize and locate the files, folders or objects held in storage. Controlling its own namespaces, rather than relinquishing control to a vendor, gives an organization an easier way to access, store, process and use its data.
In this post, we look at what enterprises get from controlling their own namespaces and the solutions they can deploy to meet their needs.
Flexibility to Decouple or Couple Compute and Storage Resources as Necessary
There’s an argument to be made for abstracting data management from storage resources. However, there’s an equally compelling argument to be made for keeping data and storage tightly coupled in certain situations. The answer: as with most things in life, the best option is moderation.
Think back to the early generations of Hadoop, when compute and storage were tightly integrated. This was great from an overall performance standpoint, but it created economic challenges. When the data grew, so did the need to expand compute resources. That could get expensive and result in waste and inefficiency if companies purchased more than they actually needed in their quest to achieve redundancy for stateful data storage.
The converse is to go completely the other way and decouple everything. But this approach can become unwieldy quickly, creating a nightmare scenario for data managers and developers. There’s also value in keeping data in proximity to compute resources; the closer the data is to those resources, the more likely an organization is to see better performance with its data services.
For most organizations, a moderate approach is the best policy. Some services can be decoupled — for example, those that may not need immediate access to data, or may only require access every so often — whereas others should be kept closely aligned with compute resources (think data services at the edge). Organizations want to be able to choose what’s best for each use case because not every workload is the same.
A Scalable, Easy-to-Manage and Cost-Effective Storage Infrastructure
Data comes in many forms and exists in many places. Some of it’s structured, some of it’s unstructured. Some of it exists in a data lake, some of it resides in databases. The challenge is to bring all of this data together, simplifying access and lowering the total cost of ownership without adding layers to storage architectures or sacrificing security and performance.
Object storage can deliver on these promises. It breaks data down into objects, each of which tightly couples the data with its corresponding metadata. These objects are kept in a flat namespace rather than a traditional file hierarchy, which makes that namespace almost infinitely scalable. When an enterprise owns that vast namespace, it can gain immediate access to whatever data it needs, whenever it needs it.
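To make the flat namespace idea concrete, here is a minimal in-memory sketch. It is a toy illustration only, not how Red Hat Ceph Storage or any real object store is implemented; the class names, keys and metadata fields are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object couples its data with its own metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy store (hypothetical): one flat key space, no directory tree."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # The key is an opaque string; "/" has no special meaning here.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # "Folders" are only a naming convention: filter keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("sensors/site-a/reading-1.json", b'{"temp": 21.5}',
          content_type="application/json", site="site-a")
store.put("sensors/site-b/reading-2.json", b'{"temp": 19.0}',
          content_type="application/json", site="site-b")
store.put("reports/q1.pdf", b"%PDF", content_type="application/pdf")

print(store.list("sensors/"))
print(store.get("reports/q1.pdf").metadata["content_type"])
```

Because every object is reached by a single key lookup rather than by walking a directory tree, adding more objects does not deepen any hierarchy, which is the intuition behind the scalability claim.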
To prove the scalability point, The Evaluator Group ran a test, commissioned by Red Hat, to see whether Red Hat Ceph Storage could scale to 10 billion objects without loss of performance. Ceph passed the test, setting the stage for future tests of hundreds of billions of objects, or even more. Numbers that would have seemed unimaginable a couple of years ago are now a reality. This makes object stores well suited to large enterprises that store vast numbers of files.
The Ability to Perform Real-Time Data Processing and Analysis at the Edge
Another argument against completely abstracted data management is that the approach often requires additional services, which can lead to a more bloated architecture. It also undermines the simplicity that an object store provides, and it’s the opposite of what companies engaged in real-time data analysis at the edge want.
Size is a major factor. An edge server has to be about the size of a pizza box and about as light. It can’t be a heavy, monolithic rack of equipment. Adding more services creates a bigger footprint. That’s no good at the edge, where companies are looking to go much smaller.
Decoupling data and compute also flies in the face of why companies are moving to the edge to begin with. For data services at the edge to be effective, data must be processed as close to the source as possible, reducing latency and bandwidth concerns.
Of course, there are times when this isn’t possible. Deeper learning from the data, for example, may require the power of a central datacenter. In contrast to traditional batch-oriented data processing, which can cause data gravity issues, automated data pipelines can process data at the edge in real time, as it is produced. If further processing is required, the data can then be sent to a core location for additional processing and machine learning.
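The edge-then-core flow described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference design: the threshold, the reading format and the `forward_to_core` callback are all hypothetical stand-ins for whatever a real pipeline would use.

```python
ANOMALY_THRESHOLD = 80.0  # hypothetical cutoff, chosen only for illustration


def process_at_edge(readings, forward_to_core):
    """Handle each reading as it arrives; escalate only what needs the core."""
    summaries = []
    for reading in readings:  # readings may be a live stream or generator
        is_anomaly = reading["value"] > ANOMALY_THRESHOLD
        summaries.append({"sensor": reading["sensor"], "anomaly": is_anomaly})
        if is_anomaly:
            # Only data needing deeper analysis leaves the edge.
            forward_to_core(reading)
    return summaries


core_queue = []  # stand-in for a link to the central datacenter
stream = iter([
    {"sensor": "pump-1", "value": 42.0},
    {"sensor": "pump-2", "value": 97.5},
    {"sensor": "pump-3", "value": 63.1},
])
results = process_at_edge(stream, core_queue.append)

print(len(results), "readings processed at the edge")
print(len(core_queue), "reading(s) forwarded to the core")
```

Most readings are dealt with where they are produced, and only the one that crosses the threshold travels to the core, which is how this kind of pipeline keeps latency and bandwidth use down.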
Open Standards to Put Control Where It Belongs
Again, these are different use cases requiring different approaches. In the absence of a “one size fits all” solution, it makes sense for organizations to control their data and how they use and access it.
But proprietary vendors are loath to let this happen. They want to control their customers’ namespaces. By gaining control of the namespace, they’re also effectively gaining control of their customers. Those customers are now at the mercy of whatever approach the vendor chooses, whether that’s decoupling compute from storage, the occasional price hike or something possibly worse.
Relying on open standards can be a great way to solve these challenges. Open standards take namespace control away from the vendors and put it in the hands of enterprises. They also keep customers from getting locked into a single vendor, which gives the customer greater freedom and flexibility.
Just like it should be.