Why the Document Model Is More Cost-Efficient Than RDBMS
A relational database management system (RDBMS) is great at answering random questions. In fact, that is why it was invented. A normalized data model represents the lowest common denominator for data. It is agnostic to all access patterns and optimized for none.
The mission of the IBM System R team, creators of arguably the first RDBMS, was to enable users to query their data without having to write complex code requiring detailed knowledge of how their data is physically stored. Edgar Codd, inventor of the RDBMS, made this claim in the opening line of his famous document, “A Relational Model of Data for Large Shared Data Banks”:
“Future users of large data banks must be protected from having to know how the data is organized in the machine.”
The need to support online analytical processing (OLAP) workloads drove this reasoning. Users sometimes need to ask new questions or run complex reports on their data. Before the RDBMS existed, this required software engineering skills and a significant time investment to write the code required to query data stored in a legacy hierarchical management system (HMS). RDBMS increased the velocity of information availability, promising accelerated growth and reduced time to market for new solutions.
The cost of this data flexibility, however, was significant. Critics of the RDBMS quickly pointed out that the time complexity, or the time required to query a normalized data model, was very high compared to HMS. As such, the RDBMS was probably unsuitable for the high-velocity online transaction processing (OLTP) workloads that consume 90% of IT infrastructure. Codd himself recognized the tradeoffs, and his paper acknowledges the time cost of normalization:
“If the strong redundancies in the named set are directly reflected in strong redundancies in the stored set (or if other strong redundancies are introduced into the stored set), then, generally speaking, extra storage space and update time are consumed with a potential drop in query time for some queries and in load on the central processing units.”
This would probably have killed the RDBMS before the concept went beyond prototype if not for Moore’s law. As processor efficiency increased, the perceived cost of the RDBMS decreased. Running OLTP workloads on top of normalized data eventually became feasible from a total cost of ownership (TCO) perspective, and from 1980 to 1985, RDBMS platforms were crowned as the preferred solution for most new enterprise workloads.
As it turns out, Moore’s law is actually a financial equation rather than a physical law. As long as the market will bear the cost of doubling transistor density every two years, it remains valid.
Unfortunately for RDBMS technology, that ceased to be the case around 2013, when the cost of moving to 5-nanometer fabrication for server CPUs proved to be a near-insurmountable barrier to demand. The mobile market adopted 5nm technology as a loss leader, recouping the cost through years of subscription services associated with the mobile device.
However, there was no subscription revenue driver in the server processing space. As a result, manufacturers have been unable to ramp up 5nm CPU production, and per-core server CPU performance has been flattening for almost a decade.
Last February, AMD announced that it is cutting 5nm wafer inventory indefinitely in response to demand for server CPUs that has been weakened by high cost. The reality is that server CPU efficiency might not see another order-of-magnitude improvement without a generational technology shift, which could take years to bring to market.
All this is happening while storage cost continues to plummet. Normalized data models used by RDBMS solutions rely on cheap CPU cycles to enable efficient solutions. NoSQL solutions rely on efficient data models to minimize the amount of CPU required to execute common queries. Oftentimes this is accomplished by denormalizing the data, essentially trading CPU for storage. NoSQL solutions become more and more attractive as CPU efficiency flattens while storage costs continue to fall.
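To make the CPU-for-storage trade concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the customers, orders, and order_items structures and both read functions are illustrative stand-ins built from plain dictionaries and lists, not any database's actual API. The normalized read has to reassemble the order from three places on every request, while the denormalized read fetches one pre-assembled document.

```python
# Hypothetical data: the same customer order stored two ways.

# Normalized: three "tables" linked by keys. Nothing is stored twice.
customers = {1: {"name": "Ada"}}
orders = {100: {"customer_id": 1}}
order_items = [
    {"order_id": 100, "sku": "A-1", "qty": 2},
    {"order_id": 100, "sku": "B-7", "qty": 1},
    {"order_id": 101, "sku": "C-3", "qty": 5},
]


def read_order_normalized(order_id):
    """Reassemble an order at read time by joining the three tables."""
    order = orders[order_id]
    customer = customers[order["customer_id"]]
    # Without an index, this scan touches every row in order_items.
    items = [
        {"sku": row["sku"], "qty": row["qty"]}
        for row in order_items
        if row["order_id"] == order_id
    ]
    return {"order_id": order_id, "customer": customer["name"], "items": items}


# Denormalized: the order is stored in the shape the application reads it.
# The customer name is copied into every order, which costs extra storage.
order_documents = {
    100: {
        "order_id": 100,
        "customer": "Ada",
        "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
    }
}


def read_order_document(order_id):
    """Fetch the pre-assembled order with a single key lookup."""
    return order_documents[order_id]
```

Both functions return the same order. The difference is where the work happens: the normalized version pays for the join on every read, while the document version pays once, in storage, when the order is written.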
The gap between RDBMS and NoSQL has been widening for almost a decade. Fortune 10 companies like Amazon have run the numbers and gone all-in with a NoSQL-first development strategy for all mission-critical services.
A common objection from customers before they try a NoSQL database like MongoDB Atlas is that their developers already know how to use RDBMS, so it is easy for them to “stay the course.” Believe me when I say that nothing is easier than storing your data the way your application actually uses it.
A proper document data model mirrors the objects that the application uses. It stores data using the same data structures already defined in the application code, in containers that mimic the way the data is actually processed. There is no abstraction layer between the application and physical storage, and no added time complexity at query time. The result is less CPU time spent processing the queries that matter.
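As a rough illustration of what mirroring the application's objects means, the sketch below uses Python dataclasses. The Order and LineItem classes and the to_document and from_document helpers are hypothetical names invented for this example, and a plain dictionary stands in for the stored document rather than a real database driver.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LineItem:
    sku: str
    qty: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: List[LineItem] = field(default_factory=list)


def to_document(order: Order) -> dict:
    """Convert the application object into a nested, document-shaped dict."""
    return asdict(order)


def from_document(doc: dict) -> Order:
    """Rebuild the application object from the stored document."""
    return Order(
        order_id=doc["order_id"],
        customer=doc["customer"],
        items=[LineItem(**item) for item in doc["items"]],
    )


order = Order(order_id=100, customer="Ada", items=[LineItem("A-1", 2)])
document = to_document(order)
restored = from_document(document)
```

The stored shape and the in-memory shape are the same nested structure, so there is no join to run and no mapping layer to maintain between them.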
One might say this sounds a bit like hard-coding data structures into storage, like the HMS of yesteryear. So what about those OLAP queries that the RDBMS was designed to support?
MongoDB has always invested in APIs that allow users to run the ad hoc queries required by common enterprise workloads. The recent addition of an SQL-92 compatible API means that Atlas users can now run the business reports they need with the same tooling they have always used, connecting to MongoDB Atlas via ODBC (Open Database Connectivity) just as they would to any RDBMS platform.
Complex SQL queries are expensive. Running them at high velocity means hooking up a firehose to the capex budget. NoSQL databases avoid this problem by optimizing the data model for the high-velocity queries, which are the ones that matter. The impact of this design choice is felt when running OLAP queries, which will always be less efficient when executed on denormalized data.
The good news is nobody really cares if the daily report used to take 5 seconds to run, but now it takes 10. It only runs once a day. Similarly, the data analyst or support engineer running an ad hoc query to answer a question will never notice if they get a result in 10 milliseconds vs. 100ms. The fact is that OLAP query performance almost never matters. We just need to be able to get answers.
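A back-of-the-envelope calculation shows why the trade is lopsided. In the sketch below, the report times (5 seconds before, 10 seconds after) come from the example above. The OLTP figures are purely hypothetical assumptions chosen for illustration: 1,000 queries per second, each saving 1 millisecond under the document model. It also treats elapsed seconds as a stand-in for compute cost, which is a simplification.

```python
SECONDS_PER_DAY = 86_400

# From the example above: the daily report slows from 5 s to 10 s.
report_runs_per_day = 1
report_seconds_before = 5
report_seconds_after = 10
extra_report_seconds = report_runs_per_day * (
    report_seconds_after - report_seconds_before
)

# Hypothetical OLTP workload (assumed numbers, not measurements).
oltp_queries_per_second = 1_000
saved_microseconds_per_query = 1_000  # 1 ms saved per query

# Integer arithmetic keeps the result exact.
saved_oltp_seconds = (
    oltp_queries_per_second * SECONDS_PER_DAY * saved_microseconds_per_query
) // 1_000_000

net_seconds_saved = saved_oltp_seconds - extra_report_seconds
```

Under these assumptions the slower report costs 5 extra seconds per day, while the faster transactional queries save 86,400 seconds per day. Different workload numbers will shift the totals, but the report would have to run thousands of times a day before its cost came close to the savings.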
MongoDB leverages the document data model and the Atlas Developer Data Platform to provide high OLTP performance while also supporting the vast majority of OLAP workloads.