
It’s Not Real Engineering Until It’s Boring (to Outsiders)

13 May 2021 6:34am, by Ted Dunning
Ted is chief technology officer at MapR, part of Hewlett Packard Enterprise. He holds a Ph.D. in computer science and is the author of over 10 books on data science. He has over 25 patents in advanced computing and plays the mandolin and guitar, both poorly.

One of my first jobs was as an intern working on a laser fusion project at Los Alamos. It was exciting, as in giant lasers and high voltage. My job was also exciting because we really didn’t know how hard those pellets of deuterium could be smashed or how much neutron flux they would produce. That’s science.

In my current job, we build software that allows our customers’ infrastructure teams to provide services that run for years with no user-apparent bobbles or disruption. Hardware fails. Bugs are found. Yet the users get a stable platform that performs the same way today as it did yesterday. That’s engineering.

Being on either of these teams is exciting. Both involve working with top-flight thinkers solving hard problems — and a bunch of not-so-hard ones as well — with precision and finesse. Working on the science of fusion held the promise of solving global warming and the energy supply problem, but working on the engineering of data is actually feeding and protecting people today.

The key to making this engineering valuable is to make it rock solid and reliable, and to make it essentially invisible to those who use the data. Let’s look at how this is done.

Hiding the Exciting Parts

If you look at data systems all the way down to the bits, the complexity of the computing technology that we use every day is actually stupendous. The intellectual challenge to build all that technology is similarly astonishing and exciting.

The incredible triumph of modern computing is, however, not in that complexity or in that intellectual achievement. Instead, it is in the fact that almost nobody needs to know about most of it. The triumph is that these incredibly complex systems work almost all the time without most people knowing details about what is actually happening.

That is to say, this technology is mostly amazing because it can be boringly, invisibly, incredibly reliable. Boring is good if you want to get on with your life by using technology rather than inventing it.

In computer science, the name for what makes this possible is separation of concerns. This separation applies not only between the engineering of data infrastructure and the use of it, but also to technologies that enable separation of concerns between different teams in your business. The efficient separation of concerns between developers and system administrators or between data scientists and IT teams is of huge importance to the practical use of data, especially at scale.

To help understand what I mean by separation of concerns, consider this simple situation: a system built to ingest data from many locations for central analysis. This system will have code to deal with data ingestion and code to deal with data analytics. The implementation concerns of ingestion should be isolated to the ingestion code. In other words, the analysis code shouldn’t have to change when new ingestion locations are added. Similarly, the ingestion code shouldn’t need to change when new kinds of analysis are done. To function well, these different concerns should be separated as much as possible.
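To make the ingestion/analysis split concrete, here is a minimal sketch (the names `DataStore`, `ingest` and `average` are hypothetical, invented for illustration and not part of any product described here). Both sides depend only on a narrow shared interface, so new sources change only the ingestion code, and new analyses change only the analysis code:

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Record:
    source: str
    value: float

class DataStore:
    """The shared boundary: both sides depend only on this interface."""
    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, record: Record) -> None:
        self._records.append(record)

    def scan(self) -> Iterable[Record]:
        return iter(self._records)

def ingest(store: DataStore, source: str, values: Iterable[float]) -> None:
    # Adding a new ingestion location touches only this side of the boundary.
    for v in values:
        store.append(Record(source, v))

def average(store: DataStore) -> float:
    # Adding a new kind of analysis touches only this side of the boundary.
    records = list(store.scan())
    return sum(r.value for r in records) / len(records)

store = DataStore()
ingest(store, "site-a", [1.0, 2.0, 3.0])
ingest(store, "site-b", [4.0])
print(average(store))  # 2.5
```

Neither function imports the other; the only coupling is the `DataStore` interface, which is exactly the point.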

Furthermore, if the data infrastructure that supports these functions has been engineered well, it should be nearly invisible to both ingestion and analysis teams. This allows each team, including the infrastructure team, to focus on their own specialized work. The exciting stuff should be in what teams are building from data, not in how difficult it is to use the data technologies that make data and computation available.

Real-World Example: Making Data Motion Invisible

A good example of this kind of separation of concerns in more complicated situations is the way that data motion happens in the background with the large-scale file system that I work on. Here’s why that matters in practical terms:

Data on one storage device needs to be on another in a variety of situations we commonly encounter. This need for data motion might be because a user asked for data to be mirrored to another cluster. Alternatively, data motion could be required to recover from a disk, server or network failure. New hardware might have been installed or the read or write operations of one workload have begun to collide with those of another.

Once you start moving data, though, especially when you are moving data for reasons most users don’t need to know about, it’s important to not let that data motion interfere with what users do expect to be happening. This isn’t at all easy since we’re often talking about a system with tens of thousands of read and write operations in flight at one time aimed at many thousands of disks on hundreds of servers.

What we’ve done to protect against interference is extend some of the ideas used to avoid network congestion to the way our data infrastructure handles data motion. The data infrastructure monitors how long each message takes to make a round trip in order to watch and automatically adjust for interference. The really cool thing about this approach is that applying these ideas to globally optimize an ensemble of data transfer processes at scale turns out to be much more effective than it is for the better-known problem of optimizing a network link. The end result is that available network and disk resources can be completely saturated with background transfers, yet operations with higher priority can cut in almost instantaneously when needed. This approach works even when transfers involve substantial latency.
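To show the flavor of that idea, here is a toy additive-increase/multiplicative-decrease controller keyed on round-trip time. This is a sketch of the general technique borrowed from network congestion control, not the actual algorithm in the product; the function name, thresholds and constants are all assumptions for illustration:

```python
def adjust_rate(rate, rtt, baseline_rtt,
                max_rate=100.0, backoff=0.5, step=1.0):
    """AIMD-style control of a background transfer rate, keyed on RTT.

    While round trips stay near the baseline, probe for more bandwidth
    by increasing the rate additively. When RTTs inflate (a sign that
    background motion is interfering with foreground work), back off
    multiplicatively so higher-priority operations can cut in quickly.
    """
    if rtt > 1.5 * baseline_rtt:        # congestion signal: RTTs inflating
        return max(step, rate * backoff)
    return min(max_rate, rate + step)   # headroom: probe for more bandwidth

# Simulate a quiet network, then foreground load inflating RTTs.
rate = 10.0
for rtt in [1.0, 1.0, 1.1]:             # RTTs near the 1.0 ms baseline
    rate = adjust_rate(rate, rtt, baseline_rtt=1.0)
print(rate)                             # grew additively to 13.0
for rtt in [3.0, 4.0]:                  # interference appears
    rate = adjust_rate(rate, rtt, baseline_rtt=1.0)
print(rate)                             # collapsed multiplicatively to 3.25
```

The asymmetry is the key design choice: growth is slow and linear, but retreat is fast and geometric, which is why foreground operations can reclaim the resources almost instantaneously.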

This is tremendously exciting stuff from the internal engineering viewpoint, but what really shakes the ground is that this can all happen without users having to turn any knobs or make any adjustments. The technology for moving data completely recedes from view. Most users don’t even need to know these transfers are happening, except in rare situations where a great mass of hardware has suddenly failed. The application-level concerns of users are completely insulated from the systems-level concerns of maintenance and repair. As a result, system performance is as predictable as possible.

Keep in mind that automated control and prioritization of data transit is only one example of how data infrastructure can be designed and engineered to support separation of concerns on many levels, doing what is needed while receding from view.

Making It Real: A Practical Solution

In the data infrastructure I work on, called HPE Ezmeral Data Fabric, we expose something called a data fabric volume to allow users to manage many aspects of security, compliance and large-scale data motion at a platform level rather than an application level. For most users, almost all of the time, such volumes are indistinguishable from ordinary directories, but they provide the key handle by which data can be managed. Most aspects of management are automated inside the data fabric itself. Embedding these management aspects in the fabric allows them to fade from view, while the functional aspects of working with data that developers and data engineers need to focus on can take on full salience.

To find out more about how data fabric volumes provide separation of concerns, read “What’s Your Superpower for Data Management?” and “How to Discard Data: Solving the Hidden Challenge of Large-Scale Data Deletion.” Data fabric gives you the freedom to make data logistics the kind of boring you want, while we help keep the excitement of building the data fabric to ourselves.

