
GraphQL and NoSQL Databases: The Type Mismatch

27 May 2022 7:51am, by Bobbie Cochrane and Herman Camarena

Bobbie Cochrane
Bobbie is an experienced senior research scientist with a demonstrated history of working in the information technology and services industry. She’s a strong research professional skilled in blockchain, scalability, IBM DB2, cloud, computer science and enterprise software.

Typeless systems, such as NoSQL databases, JSON, JavaScript and Python, have gained popularity and are very useful in practice because the world’s data does not conform nicely to a consistent, rigid structure.

Even if it did, it is almost impossible to capture all facets of the data and predict future use of this data.

However, the data is also not completely schemaless, and any application using the data will enforce a well-defined schema.

NoSQL opponents complain that the popularity of NoSQL takes application development back to the hierarchical database days of the ’60s. That complaint ignores the reality that the data in these typeless systems does, in fact, have a consistent structure, and that its schema is well-designed and evolves in an orderly way alongside the application development process.

With the advent of GraphQL, we can introduce a type system over these systems without taking away from the flexibility they introduced, bringing order back into what appears to be disorder.

Let’s walk through this using one of our favorite backend databases, MongoDB.

Suppose we have a GraphQL schema for data that represents a customer:

The above says that a customer has five attributes and that each must be of the type shown. The presence of “!” in GraphQL indicates that the field is non-nullable, akin to NOT NULL in SQL.

Now assume that your MongoDB collection contains the following three JSON-style documents:

As you can see, the data has a fairly consistent structure with a few exceptions. The obvious ones are that each document has differentiating characteristics. John Doe has children, Jane Doe has hobbies, and John Smith has offices. The “joined” property appears to be common, but on closer inspection, we see that one of the documents is missing this field. And finally, we notice that if we just considered Mr. and Mrs. Doe, we might think age could be represented as an integer, but Mr. Smith’s age is clearly not an integer.
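To make these observations concrete, here is a minimal Python sketch that profiles a handful of documents by field presence and value type. The three documents are hypothetical stand-ins: only their shape follows the description above (which extra field each person has, the missing joined, the non-integer age), and every value, along with the function name profile, is invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three documents described in the text.
# Only the shape is taken from the article; all values are invented.
customers = [
    {"name": "John Doe", "age": 40, "joined": "2019-03-01",
     "children": ["Ann", "Bob"]},
    {"name": "Jane Doe", "age": 38, "joined": "2020-07-15",
     "hobbies": [{"name": "chess", "level": "expert"}]},
    {"name": "John Smith", "age": 52.5,
     "offices": ["Austin"]},
]


def profile(docs):
    """Map each field to (number of documents containing it, sorted type names)."""
    counts = defaultdict(int)
    types = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            counts[field] += 1
            types[field].add(type(value).__name__)
    return {field: (counts[field], sorted(types[field])) for field in counts}


summary = profile(customers)
for field, (count, type_names) in summary.items():
    print(f"{field}: in {count}/{len(customers)} documents, types {type_names}")
```

A profile like this surfaces the same three findings as reading the documents by eye: fields that appear in only one document, a field that is almost always present, and a field whose values do not share a single type.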

Herman Camarena
Herman is a principal software engineer at StepZen and has more than 25 years of experience in tech. Herman was a data scientist and senior software engineer at Google. Before that, he served as CTO at both Spribo and QoS Labs and began his journey in tech in senior roles at Apple, WebMethods and BEA Systems.

How would this data fit our GraphQL model? There are obvious issues:

  1. The field joined cannot be marked with ! because it is missing in the last document. Easy fix, right? Declare it as joined “Date” instead of joined “Date!” But what if, out of a million documents, this is the only record that has no value for joined? With that broader perspective, removing the ! now seems like a bad idea, since this anomaly could represent bad data.
  2. The documents have different extra fields. This is a pretty common pattern even when data lives in strictly typed systems like relational database management systems, which is why support for BLOBs (binary large objects), XML and JSON was introduced there. With GraphQL, we could do something like:

    ensuring that all three records can be mapped. Note that the fields children, hobbies and office do not have ! after them, indicating they are optional. (The String! descriptor in the list means that an item in the list cannot be null.) However, this seems suboptimal too; after all, we may get hundreds of such fields over millions of records, and that leads to a bloated schema and extra application logic. We’ll show you a better way below.
  3. Our GraphQL schema assumes type Int for age, which works for Mr. and Mrs. Doe but is not compatible with Mr. Smith’s age. That value would be better handled with type Float, unless it is cast to an Int.

These are just a few simple examples of the issues that arise when marrying a strongly typed system (GraphQL) with a weakly typed one (in this case, JSON).
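The first and third issues can be checked mechanically. The sketch below is one way to do it in Python, assuming a schema that declares name, age and joined as non-null and age as an integer. The EXPECTED table, the violations helper and the sample values are all hypothetical, and dates are treated as plain strings for simplicity.

```python
# Hypothetical expectations mirroring a schema with non-null name, age (Int)
# and joined. Each entry maps: field -> (accepted Python types, is_required).
EXPECTED = {
    "name": ((str,), True),
    "age": ((int,), True),
    "joined": ((str,), True),  # dates are kept as strings in this sketch
}


def violations(doc):
    """List the ways one document fails to fit the declared types."""
    problems = []
    for field, (accepted, required) in EXPECTED.items():
        value = doc.get(field)
        if value is None:
            if required:
                problems.append(f"{field}: missing but declared non-null")
            continue
        # Exact type match, so a float never passes as an Int.
        if type(value) not in accepted:
            problems.append(
                f"{field}: {type(value).__name__} does not match declared type"
            )
    return problems


# Invented sample values; only the shape follows the article.
jane_doe = {"name": "Jane Doe", "age": 38, "joined": "2020-07-15"}
john_smith = {"name": "John Smith", "age": 52.5, "offices": ["Austin"]}

print(violations(jane_doe))
print(violations(john_smith))
```

Running a check like this over a whole collection, rather than three documents, is what tells you whether a missing joined is a one-in-a-million anomaly or a real reason to drop the !.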

However, our team at StepZen has found some excellent solutions using GraphQL for these and other issues.

  1. A GraphQL field can be of type JSON. JSON is not one of GraphQL’s built-in scalars, but it is commonly provided as a custom scalar. That means that while the data within that JSON is opaque to GraphQL, any data that is shaped like JSON can be assigned to that field, preserving the type flexibility of the underlying typeless system. For example, if there were a field:

    then all records could be passed through the GraphQL layer.
  2. A GraphQL type can be a union type. So we could declare:

    and this keeps the representation compact. Furthermore, the records are still strongly typed, except for the field extras, whose elements can each be any one of the three types. Additionally, we can use a fragment with a type condition to select the fields that are specific to one member of the Extra type:

In the above, the GraphQL query is explicitly saying, “If an element of extras is of type hobbies, then return the level, etc.” This allows the GraphQL API to do the right thing to return the data in the right shape, removing the burden from the application.
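For a union to work, the GraphQL layer has to decide which member type each element of extras belongs to before it can apply a type condition. The Python sketch below imitates that decision and the effect of a fragment conditioned on one member. The member type names (Child, Hobby, Office), the discriminating keys and the sample elements are assumptions made for illustration; a real server would do this in its own type resolution logic.

```python
# Hypothetical mapping from a key that identifies an element to the name of
# the union member it belongs to.
DISCRIMINATORS = {"child": "Child", "hobby": "Hobby", "office": "Office"}


def resolve_type(item):
    """Return the union member name for one element of extras."""
    for key, type_name in DISCRIMINATORS.items():
        if key in item:
            return type_name
    raise ValueError(f"cannot resolve union member for {item!r}")


def select_on_hobby(extras):
    """Imitate a fragment with a type condition on Hobby that selects level.

    Elements of other member types yield an empty object, as they would in
    a GraphQL response where no selected field applies.
    """
    return [
        {"level": item["level"]} if resolve_type(item) == "Hobby" else {}
        for item in extras
    ]


# Invented sample elements.
extras = [
    {"child": "Ann"},
    {"hobby": "chess", "level": "expert"},
    {"office": "Austin"},
]

result = select_on_hobby(extras)
print(result)
```

The point of the design is that the branching on type lives in the API layer once, instead of being repeated in every application that reads the data.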

In all of the above, the developer of the GraphQL API is making explicit choices about which types she wants and how closely they match the types of the backends. What if we had the reverse problem? What if we were to inspect the backend and automatically derive the GraphQL types from the underlying data? Such a system would need to:

  1. Sample enough records to see the diversity. MongoDB has a find method that can retrieve multiple documents. REST APIs might have pagination models. SQL systems have limits and offsets. Either way, a dozen or so records is a good sample.
  2. Make some sensible choices. As mentioned above, the data in these systems is not completely schemaless; it carries some light schema. Find the largest common substructures and declare the uncommon ones as a union or as JSON.
  3. Pick the right types and, when in doubt, drop a !.
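The three steps above can be sketched in a few lines of Python. This is an illustration of the idea only, not how JSON2SDL or any StepZen introspector works. The 50% cutoff (MIN_SHARE), the function names, the single catch-all extras field and the sample documents are all assumptions, and the generated schema presumes the server provides a JSON scalar.

```python
SCALARS = {str: "String", bool: "Boolean", int: "Int", float: "Float"}
MIN_SHARE = 0.5  # hypothetical cutoff: rarer fields are folded into extras


def infer_field_type(values, total):
    """Pick a GraphQL type for one field from its sampled non-null values."""
    kinds = {type(v) for v in values}
    if kinds == {int, float}:
        name = "Float"  # widen when integers and non-integers both occur
    elif len(kinds) == 1 and next(iter(kinds)) in SCALARS:
        name = SCALARS[next(iter(kinds))]
    else:
        name = "JSON"  # nested or mixed values stay opaque
    # Keep the ! only if every sampled document carried a value.
    return name + ("!" if len(values) == total else "")


def infer_sdl(type_name, docs):
    """Derive a GraphQL type definition from a sample of documents."""
    seen = {}
    for doc in docs:
        for key, value in doc.items():
            if value is not None:
                seen.setdefault(key, []).append(value)
    lines = [f"type {type_name} {{"]
    has_rare = False
    for key, values in seen.items():
        if len(values) / len(docs) >= MIN_SHARE:
            lines.append(f"  {key}: {infer_field_type(values, len(docs))}")
        else:
            has_rare = True
    if has_rare:
        lines.append("  extras: JSON")
    lines.append("}")
    return "\n".join(lines)


# Invented sample; a real introspector would sample from the backend.
sample = [
    {"name": "John Doe", "age": 40, "joined": "2019-03-01", "children": ["Ann"]},
    {"name": "Jane Doe", "age": 38, "joined": "2020-07-15", "hobbies": ["chess"]},
    {"name": "John Smith", "age": 52.5, "offices": ["Austin"]},
]

sdl = infer_sdl("Customer", sample)
print(sdl)
```

With this sample, name and age keep their !, age is widened to Float, joined loses its ! because one document lacks it, and the one-off fields collapse into a single opaque extras field. Recognizing that joined holds dates rather than arbitrary strings would need an extra parsing step that this sketch leaves out.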

If you check out JSON2SDL, you will find one such implementation built by the company we work for, StepZen. In fact, we have built a set of introspectors that make these kinds of sensible choices and that work against a wide variety of REST, SQL and NoSQL backends. We are currently releasing our REST2GraphQL introspector to the public.

The problem of type mismatches is an important issue to address for GraphQL to ultimately realize its potential. However, GraphQL as a language has many features that make the mismatch problem less severe, and tools such as StepZen’s JSON2SDL make the job more and more automatic. The paradigm mismatch between a NoSQL database such as MongoDB and GraphQL can be easily bridged, and the same is true for other backends like REST and SQL. All signs point to GraphQL becoming the default, strongly typed API layer for accessing all sorts of backends.
