ClusterHQ is working to build a business model around data containers — portable volumes that can act as persistent, though virtual, storage devices, shared between applications in containerized environments. The company’s persistent container system, Flocker, was one of the first to take advantage of Docker’s extensibility model.
Now the company is looking to build on that business model with the addition of a service for deploying and exchanging data volumes: a way for container environments to manage data containers not too dissimilar to how they manage code containers. Named FlockerHub, this software can store, share and organize your Docker data volumes.
“The use of fli, the FlockerHub Line Interface, is not contingent upon use of Flocker,” explained Mohit Bhatnagar, ClusterHQ’s Vice President of Product, in a discussion with The New Stack. “fli is designed so, if you have access to any Docker data volume that is sitting on a Linux operating system, we will be able to access it. So the cloud extends from the public cloud to the private cloud, actually to developers’ laptops. fli and FlockerHub do not require Flocker.”
Not the Hub for Flocker
The only way to truly comprehend what FlockerHub is and does is to break the association in your mind between FlockerHub and Flocker. In Docker, a data volume is a designated directory that sits outside a container’s union file system, so its contents persist independently of the container’s life cycle. There really isn’t much more specification than just that.
The fli command line tool takes snapshots of data volumes. The data administrator defines when those snapshots are taken, and how often. Any of these snapshots may be mounted to a container as a volume. If you add Flocker to this scenario, it may be used to present such a volume as a persistent data container, and facilitate communication between active code containers and the data volume.
Periodic snapshotting of databases, ClusterHQ Vice President of Marketing Michael Ferranti tells us, becomes useful for testing and for auditing the evolution of both the data stores and the code that accesses them. “You might have people who want to use an earlier version of the database,” Ferranti explained. “Say they got a bug report from a user on a Tuesday at 11:32 p.m., and they want to use the data in that database, at the time of that bug report, because they have a hypothesis that the bug report is somehow related to the state. If they were to use state from the last snapshot, that would actually have changed in some meaningful way, and they would no longer be able to diagnose the bug.”
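To make Ferranti's scenario concrete, here is a minimal Python sketch of picking the snapshot that matches a bug report's timestamp. It is an illustration only and does not use fli or FlockerHub; the six-hour schedule, the dates, and the function name `snapshot_at_or_before` are all assumptions invented for the example.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical schedule: one snapshot every six hours, sorted oldest first.
snapshot_times = [
    datetime(2016, 11, 8, 6, 0),
    datetime(2016, 11, 8, 12, 0),
    datetime(2016, 11, 8, 18, 0),
    datetime(2016, 11, 9, 0, 0),
    datetime(2016, 11, 9, 6, 0),
]


def snapshot_at_or_before(times, moment):
    """Return the most recent snapshot time that is not later than moment."""
    index = bisect_right(times, moment)
    if index == 0:
        raise LookupError("no snapshot exists that early")
    return times[index - 1]


# Hypothetical bug report filed on a Tuesday at 11:32 p.m.
bug_report = datetime(2016, 11, 8, 23, 32)
chosen = snapshot_at_or_before(snapshot_times, bug_report)
latest = snapshot_times[-1]
```

In this sketch `chosen` is the 6 p.m. Tuesday snapshot, while `latest` is from the following morning, which is the state Ferranti warns would no longer reproduce the bug.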
The fli tool can capture deltas of a data volume (just the changes), and the FlockerHub system knows how to catalog these changes so that they appear to the developer as whole, varying versions.
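The general idea of storing only changes while presenting whole versions can be sketched in a few lines of Python. This is a conceptual model, not ClusterHQ's implementation: the volume is assumed to be a simple mapping of file paths to contents, and `make_delta` and `materialize` are hypothetical names.

```python
from typing import Dict, List, Optional

# Assumed model: a volume maps file path -> contents.
# A delta holds only the paths that changed; None marks a deleted path.
Volume = Dict[str, bytes]
Delta = Dict[str, Optional[bytes]]


def make_delta(old: Volume, new: Volume) -> Delta:
    """Record only what differs between two volume states."""
    delta: Delta = {}
    for path, data in new.items():
        if old.get(path) != data:
            delta[path] = data  # added or modified
    for path in old:
        if path not in new:
            delta[path] = None  # deleted
    return delta


def materialize(deltas: List[Delta], version: int) -> Volume:
    """Replay deltas 0..version so the caller sees one whole version."""
    volume: Volume = {}
    for delta in deltas[: version + 1]:
        for path, data in delta.items():
            if data is None:
                volume.pop(path, None)
            else:
                volume[path] = data
    return volume


# Three successive states of a hypothetical volume.
v0 = {"users.db": b"alice"}
v1 = {"users.db": b"alice,bob", "audit.log": b"added bob"}
v2 = {"users.db": b"alice,bob"}

# The catalog stores only the deltas between states.
catalog = [make_delta({}, v0), make_delta(v0, v1), make_delta(v1, v2)]
```

Calling `materialize(catalog, 1)` rebuilds the second state in full, even though the catalog never stored it as a complete copy.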
“We believe that this ability to capture the state of any Docker data volume, and push it to an on-premise VPC repo, will have broad applicability in use cases such as fraud,” Bhatnagar added. Backup and recovery for cross-cloud migrations may be another case.
Not Just ‘GitHub for Data’
Although ClusterHQ is not the first vendor to describe one of its products as “GitHub for data,” there’s one critical distinction which Bhatnagar felt compelled to point out, even at the expense of his company’s own metaphor: GitHub is very developer-centric. Meanwhile, the use cases he envisions for FlockerHub, such as fraud detection and forensic system analysis, move outside the developer’s realm and into the broader IT department.
I asked the ClusterHQ executives how an organization avoids having its versioning system spin out of control, with multiple writes and overwrites leading to an almost unintelligible tangle of potentially equally viable versions. It’s the antithesis of the “single view of the truth” that data warehouse engineers aim for.
ClusterHQ believes the answer comes in the form of access control. Its plan is for the repository to enforce fine-grained access control for all volumes, preventing individuals from posting deltas willy-nilly. Theoretically, you can’t stop any developer from using fli to produce these deltas, but the repository can stop her from posting them and thereby declaring them as valid.
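A rough Python sketch of that distinction follows: anyone can produce a delta locally, but the repository decides whose deltas it accepts. The `Repository` class, its methods, and the user and volume names are hypothetical and say nothing about how FlockerHub actually implements access control.

```python
class Repository:
    """Toy repository that accepts deltas only from authorized users."""

    def __init__(self):
        self._writers = {}   # volume name -> set of users allowed to push
        self._versions = {}  # volume name -> list of accepted deltas

    def grant_push(self, volume, user):
        self._writers.setdefault(volume, set()).add(user)

    def push(self, volume, user, delta):
        """Accept the delta and return its version index, or refuse."""
        if user not in self._writers.get(volume, set()):
            raise PermissionError(f"{user} may not push to {volume}")
        self._versions.setdefault(volume, []).append(delta)
        return len(self._versions[volume]) - 1

    def versions(self, volume):
        return list(self._versions.get(volume, []))


repo = Repository()
repo.grant_push("fraud-env", "analyst")

# An authorized user's delta becomes an official version.
accepted_index = repo.push("fraud-env", "analyst", {"rules.db": b"v2"})

# A developer can still create a delta locally, but the push is refused.
local_delta = {"rules.db": b"experimental"}
try:
    repo.push("fraud-env", "developer", local_delta)
    rejected = False
except PermissionError:
    rejected = True
```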
“You do not want fraud environments to be written by external sources other than what fraud is intended for,” Bhatnagar said. “That is the beauty of the system. Take a fraud environment: At a given point in time, I can take a snapshot of it. I don’t now write to the snapshot; just like git, I can create a branch of it.”
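The snapshot-then-branch pattern Bhatnagar describes can be modeled in Python as a read-only snapshot plus a writable branch. This is a sketch under assumptions, not the fli data model: `Snapshot` and `Branch` are hypothetical classes, and a volume is reduced to a dictionary of file contents.

```python
from types import MappingProxyType


class Snapshot:
    """A frozen view of a volume at one point in time."""

    def __init__(self, files, parent=None):
        # Copy the data, then expose it only through a read-only proxy.
        self._files = MappingProxyType(dict(files))
        self.parent = parent

    @property
    def files(self):
        return self._files

    def branch(self):
        return Branch(self)


class Branch:
    """A writable working copy that starts from a snapshot."""

    def __init__(self, base):
        self.base = base
        self.working = dict(base.files)

    def write(self, path, data):
        self.working[path] = data

    def commit(self):
        # Committing produces a new snapshot; the base is left untouched.
        return Snapshot(self.working, parent=self.base)


fraud_env = Snapshot({"transactions.db": b"day-1"})

investigation = fraud_env.branch()
investigation.write("transactions.db", b"day-1 + annotations")
result = investigation.commit()
```

Writes land on the branch, so `fraud_env.files` still holds the original bytes after the commit, and assigning directly into `fraud_env.files` raises a `TypeError`.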
Ferranti said this branching ability might come in handy for developers performing a schema upgrade for their databases. “A lot of times, there are multiple strategies you could take to make that kind of change,” he said. “One thing that would be useful would be to take a test volume, and with fli, make two copies of it, apply your schema upgrade to both, run tests, and see which one performs better against your test suite. That’s a completely local use case for fli.”
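Ferranti's workflow can be approximated with nothing but the Python standard library. The sketch below stands in for fli with a plain directory copy and uses SQLite as the database; the table layout, the two upgrade strategies, and the single pass/fail check are all assumptions chosen for illustration, and a real comparison would run a full test suite.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def build_test_volume(root: Path) -> Path:
    """Create a directory holding a small SQLite database (the 'test volume')."""
    volume = root / "test-volume"
    volume.mkdir()
    conn = sqlite3.connect(str(volume / "app.db"))
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    conn.commit()
    conn.close()
    return volume


def upgrade_in_place(conn):
    """Strategy A: add the new column to the existing table."""
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT NOT NULL DEFAULT ''")


def upgrade_by_rebuild(conn):
    """Strategy B: build a new table, copy rows over, swap names."""
    conn.execute(
        "CREATE TABLE users_new "
        "(id INTEGER PRIMARY KEY, name TEXT, email TEXT NOT NULL DEFAULT '')"
    )
    conn.execute("INSERT INTO users_new (id, name) SELECT id, name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def passes_tests(conn) -> bool:
    """Stand-in for a test suite: existing rows survive and gain an email."""
    rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
    return rows == [("alice", ""), ("bob", "")]


def try_strategy(volume: Path, root: Path, label: str, upgrade) -> bool:
    """Copy the volume, apply one upgrade strategy to the copy, run the check."""
    copy = root / f"copy-{label}"
    shutil.copytree(volume, copy)
    conn = sqlite3.connect(str(copy / "app.db"))
    try:
        upgrade(conn)
        conn.commit()
        return passes_tests(conn)
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    volume = build_test_volume(root)
    results = {
        "in_place": try_strategy(volume, root, "a", upgrade_in_place),
        "rebuild": try_strategy(volume, root, "b", upgrade_by_rebuild),
    }
    # The original volume is never modified; only the copies are upgraded.
    original = sqlite3.connect(str(volume / "app.db"))
    original_columns = [row[1] for row in original.execute("PRAGMA table_info(users)")]
    original.close()
```

Because each strategy runs against its own copy, a failed or slow upgrade can simply be thrown away without touching the source data.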
As a tool for disaster recovery, Bhatnagar explained, a data volume representing a desired recovery state could be kept in reserve, then restored into an active environment on demand. This feature is being reserved for a Q1 2017 release.
Make Sure Your Halo’s On Straight
Initially, FlockerHub will be made available as a hosted service from ClusterHQ. However, Bhatnagar remarked, “there are definitely going to be instances where customers will not want data to leave their DMZ. That could mean they would want data to remain on-premise, or in their VPC [Amazon Virtual Private Cloud] environment.” To address these customers’ privacy concerns, he said, ClusterHQ will release separate versions of FlockerHub for VPC and on-premise deployment, in Q1 2017.
“We believe the container management problem is a hard problem,” ClusterHQ’s Bhatnagar admitted. “In order to solve a tough, technical problem, normally you have to do it within an agnostic ecosystem. So I’m proud and happy to support that: All the capabilities we talked about are agnostic of orchestration frameworks and of the containerizer layer. It is truly multi-cloud capable; you can deploy fli on any Linux machine, including a laptop. And we absolutely think a cloud definition needs to include developers’ laptops. We are doing this in the true spirit of open source, as well as being agnostic of different infrastructures, while ensuring a high level of integration.”
That’s lofty talk, but as we’ve seen in recent months, there are several emerging contenders for the solution to the containerized data dilemma, including Red Hat’s repurposing of Java middleware, Mesosphere’s “Container 2.0” approach, and EMC’s injection of a container client library into the scheme. We’ll see just how agnostic they all are when they clash against one another.
Docker is a sponsor of The New Stack.
Title image of the inside of an antique strongbox by Techarrow, in the public domain.