Reducing Complexity with a Multimodel Database
“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).”
With these words, E.F. Codd (known as “Ted” to his friends) began the seminal paper that begat the “relational wave” that would spend the next 50 years dominating the database landscape.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
When Codd’s paper was published in 1970, data access was in its infancy: Programmers wrote code that accessed flat files or tables and followed “pointers” from a row in one file to a row in a separate file. By introducing a “model” of data that encapsulated the underlying implementation (of how data was stored and retrieved) and putting a domain-specific language (eventually, SQL) in front of that model, the relational approach lifted programmers’ interaction with the database away from the physical details of the data, leaving them free to think at the logical level of their problem, code and application.
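To make the contrast concrete, here is a minimal sketch in Python. The parts and suppliers data, the list-based "flat files" and the function names are all hypothetical, invented for illustration; SQLite stands in for a relational database. The first function follows a stored position from one file into another, so it depends on the physical layout. The second states what it wants and leaves the "how" to the engine.

```python
import sqlite3

# Hypothetical flat "files": each part row stores the position of its
# supplier row in the other file (the "pointer").
suppliers_file = [("Acme",), ("Globex",)]
parts_file = [("bolt", 0), ("nut", 1), ("washer", 0)]


def supplier_of_by_hand(part_name):
    """Scan one file, then follow the stored pointer into the other."""
    for name, supplier_ptr in parts_file:
        if name == part_name:
            return suppliers_file[supplier_ptr][0]
    return None


def build_db():
    """Load the same (hypothetical) data into an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE parts (name TEXT, supplier_id INTEGER)")
    conn.executemany("INSERT INTO suppliers VALUES (?, ?)",
                     [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO parts VALUES (?, ?)",
                     [("bolt", 1), ("nut", 2), ("washer", 1)])
    return conn


def supplier_of_declarative(conn, part_name):
    """Describe the result wanted; the engine decides how to retrieve it."""
    row = conn.execute(
        "SELECT s.name FROM parts p "
        "JOIN suppliers s ON s.id = p.supplier_id "
        "WHERE p.name = ?",
        (part_name,),
    ).fetchone()
    return row[0] if row else None
```

If the rows in the flat files were reordered, every stored pointer in the first version would need to be rewritten. The query in the second version would keep working if the tables were reindexed or stored differently, which is the kind of independence Codd's quotes describe.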
Whether Codd knew this or not, he was tapping into a concept known today as a “complexity budget”: the idea that developers (any organization, really) can keep track of only so much complexity within their projects or organization. When a project reaches the limits of that budget, the system grows too difficult to manage and all sorts of inefficiencies and problems arise: difficulties in upgrading, tracking down bugs, adding new features, refactoring code, the works. Codd’s point, really, was simple: If too much complexity is spent navigating the data “by hand,” there is less available to manage code that captures the complexities of the domain.
Fifty years later, we find ourselves still in the same scenario — needing to think more along logical and conceptual lines rather than the physical details of data. Our projects wrestle with vastly more complex domains than ever before. And while Codd’s model of relational data has served us well for over a half-century, it’s important to understand that the problem, in many respects, is still there — different in detail than the one that Codd sought to solve, but fundamentally the same issue.
Models in Nature
In Codd’s day, data was limited in scope and nature, most of it business transactions of one form or another. Parts had suppliers; manufacturers had locations; customers had orders. Creating a system of relationships between all of these was a large part of the work required by developers.
Fifty years later, however, data has changed. Not only has the amount of data stored by a business exploded by many orders of magnitude, but the shape of the data generated is wildly more irregular than it was in Codd’s day. Or, perhaps fairer to say, we capture more data than we did 50 years ago, and that data comes in all different shapes and sizes: images, audio and video, but also geolocation information, genealogy data, biometrics, and that’s just a start. And developers are expected to be able to weave all of it together into a coherent fabric and present it to end users in a meaningful way. And, oh, by the way, the big launch is next month.
For its time, Codd’s relational model provided developers with exactly that: a way to weave data together into a coherent fabric. But as the data we have to contend with grew and changed, new tactics became necessary, ones that didn’t throw away the relational model but built upon it.
We wrought what we could using the concept of “polyglot persistence,” the idea of bringing disparate data stores together into a distributed system. But as any experienced architect knows all too well, the more distinct nodes a distributed system has, the greater its complexity. And the more complexity we must spend on manually stitching together data from different nodes in the database system, the less we have to spend on the complexity of the domain.
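As a rough illustration of that stitching, the sketch below uses three in-memory stand-ins for separate systems: SQLite for a relational store, a dictionary of JSON strings for a document store, and a plain dictionary for a key/value store. Every name and value here is hypothetical. The point is only that the application code, not the database, ends up owning the join logic and the handling of each store's gaps.

```python
import json
import sqlite3

# Stand-ins for three separate systems; all names and data are hypothetical.
orders_db = sqlite3.connect(":memory:")  # relational store
orders_db.execute(
    "CREATE TABLE orders (id INTEGER, customer_id TEXT, total REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "c1", 40.0), (2, "c1", 60.0), (3, "c2", 15.0)],
)

# Document store: JSON profiles keyed by customer id.
profile_store = {"c1": json.dumps({"name": "Ada", "tier": "gold"})}

# Key/value store: session state keyed by a string.
session_cache = {"session:c1": "active"}


def customer_summary(customer_id):
    """Assemble one answer from three stores, handling each store's gaps by hand."""
    raw = profile_store.get(customer_id)
    profile = json.loads(raw) if raw is not None else {}
    total = orders_db.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()[0]
    return {
        "name": profile.get("name"),
        "order_total": total,
        "session": session_cache.get(f"session:{customer_id}", "none"),
    }
```

Even in this toy form, the function has to know three access styles, three key conventions and three ways for data to be missing. In a real deployment each store would also bring its own client library, connection handling and failure modes, all of which draw on the same complexity budget.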
Nature of Storage
But complexity doesn’t live just in the shape of the data we weave; it also lives in the very places we store it.
What Codd hadn’t considered, largely because it was 50 years too early, is that databases also carry a physical concern of another kind: the servers, the disks on which the data is stored, the network and more. For decades, an organization “owning” a database has meant a non-trivial investment in everything that ownership entails, including the employment of a number of people whose sole purpose is the care and feeding of those machines. These “database administrators” were responsible for machine procurement and maintenance, software upgrades and patches, backups and restorations and more, all before ever touching the relational schema itself.
Like the “physical” details of data access 50 years ago, the physical details of the database’s existence are costly to attend to. Between the money and time spent on the actual maintenance and the opportunity cost of the database being offline and unavailable for use, keeping a non-trivial database up and running can grow quite expensive, and it demands substantial ongoing training and learning from those involved.
By this point, it should be apparent that developers need to aggressively look for ways to reduce accidental and wasteful spending of complexity. We seek this in so many ways; the programming languages we use look for ways to promote encapsulation of algorithms and data, for example, and libraries and services tuck away functionality behind APIs.
Providing a well-encapsulated data strategy in the modern era often means two things: the use of a multimodel database to bring together the differing shapes of data into a single model, and the use of a cloud database provider to significantly reduce the time spent managing the database’s operational needs. Which one you choose is the subject of a different conversation. Just make sure it supports all the models your data needs, in an environment that requires as little management as possible.
Multimodel brings the benefits of polyglot persistence without its disadvantages. Essentially, it does this by combining a document store (JSON documents), a key/value store and other data storage models (which would otherwise be multiple databases) into one database engine that has a common query language and a single API for access. Learn more about Couchbase’s multimodel database here, and try Couchbase for yourself today with our free trial.