Do We Need Data Normalization Anymore?
A database professor from college taught my class to memorize E. F. Codd’s three normal forms of large, shared data as “the key, the whole key and nothing but the key.” Folks with many different roles in information technology — including developers, database administrators, domain modelers, business stakeholders — routinely come into contact with data normalization. For them, normalization is as normal as breathing.
Yet as the database landscape becomes more diverse and hardware becomes more powerful, we might wonder whether the practice of data normalization is necessary anymore. Should we be fretting over optimizing data storage and querying so that we return the minimum amount of data? If so, do certain data structures make it more vital to solve those problems than others?
What Is Data Normalization?
The process of data normalization optimizes data structures in a relational database to ensure data integrity and query efficiency. It reduces redundancy and improves accuracy by way of a prescribed series of steps to normalize the structure, producing what’s called the “normal forms.” At its core, data normalization helps avoid data anomalies that occur when creating new data (Insert), updating existing data (Update), or deleting data (Delete).
Such anomalies cause challenges for keeping data values in sync (integrity). The normalization steps entail verifying foreign keys (links to related data), separating unrelated entities to other tables and inspecting the rows and columns as a unified data object. While the full list of normalization steps is rather rigorous, we will focus on those most commonly applied in business practice:
- First normal form (1NF)
- Second normal form (2NF)
- Third normal form (3NF)
I recently came across a coffee shop data set that seems to be a good fit for us. Table 1 features data that was from the repository. I have also created a GitHub repository using this sample data, which pertains to sales made to customers for food and beverage items.
What makes this data problematic? If we try to insert a new row into this data, we could inadvertently produce a duplicate customer. Worse yet, to avoid doing that, we would have to gather all the information on customer, products and receipt date/time. If we needed to update or delete the products purchased on the receipt, we would need to sort through each product column in the list to find the value.
First Normal Form — The Key
For the first step in our “key, the whole key and nothing but the key,” a table should have a primary key (a single column or set of columns) that ensures each row is unique. Each column in a row should also contain only a single value – in other words, no nested tables. In a written ledger, some of the columns in Table 1 could be considered business keys or natural keys. But they wouldn’t be primary keys, because none of them is guaranteed to be 1) unique, and 2) present.
Our example data set needs some work to get it to first normal form (1NF). While we can get unique rows with a combination of date/time or maybe date/time/customer, it is often much simpler to reference rows with a generated unique value of some sort. Let’s do that by adding a transaction_id field to our receipt table.
There are also several rows that have more than one item ordered (transaction_id 156 and 199), so a few columns have line items with more than one value. We can correct this by separating the rows with multiple values into separate rows. This results in the 1NF data depicted in Table 2.
With 1NF data, a composite (multicolumn) key uniquely identifies a row by the combination of transaction_id and line_item_id, as a single receipt cannot contain multiple line items numbered No. 1. Table 3 depicts a part of Table 2, boiled down to those primary key values.
Each combination of those two values is unique. We have applied first normal form to our data, but we still have some potential data anomalies. If we wanted to add a new receipt, we might need to create multiple lines, depending on how many line items it contained and duplicate transaction_id, date, time, and other information on each row. Updates and deletes cause similar problems because we would need to ensure we get all the rows affected for data to be consistent.
Now, if we were to use a graph database from the outset, the relationships produced to make our table 1NF would look like Figure 1.
Here, you see three separate objects depicting the customer, the receipt and the product. Line-item information is stored in the relationship between each element of receipt and each associated product.
2nd Normal Form — The Whole Key
Second normal form ensures that each non-key column is fully dependent on the whole key. This is more of a concern for tables with more than one column as the primary key, like our coffee receipt table.
For 2NF, we need to evaluate each non-key field from Table 1 to see if we have any partial dependencies — in other words, whether the non-key column depends on only part of the key, rather than the whole key. Since transaction_id and line_item_id make up our primary key, let’s start with the transaction_date field. The transaction date does depend on the transaction ID, as the same transaction ID could not be used again on another day. However, the transaction date doesn’t depend on the line item ID at all. Line items can be reused across transactions, days and even customers.
OK, so we already found that Table 1 does not follow the second normal form, but what about the customer column? The customer is not dependent on both the transaction ID and line-item ID. If someone gave us a transaction ID, we would know which customer made the purchase, but if we were given a line-item ID, we wouldn’t know which single customer that receipt belonged to. After all, multiple customers could have ordered one, two, six items on their receipts. The customer field is linked to transaction_id (assuming multiple customers cannot split receipts), but the customer is not dependent upon the line item. We need to fix these partial dependencies.
The most direct solution is to create a separate table for order line items, leaving the columns that are only dependent on transaction_id in the receipt table. The updated data in second normal form looks like Tables 4 and 5 below.
Now let’s test that our change fixed the issue and follows second normal form. For our receipt data in Table 4, transaction_id becomes the sole primary key. Transaction date is unique based on the transaction_id, as is transaction_time. This way, there can only be one date and time for a transaction ID.
Orders cannot be placed both in store and outside it, so the value of whether a purchase was made in store is dependent upon transaction_id. Since customers cannot split a receipt, each transaction would also give us a unique customer. Finally, if someone gave us a transaction ID, we could identify a single customer loyalty number that is attached to it.
Next is receipt line Item, in Table 5. Line items are dependent upon the transaction (receipt) with which they are associated, so we retained the transaction ID on our line-item table. The combination of transaction_id and line_item_id becomes our composite key on the line-item table. Product_id and product are determined based on the transaction and line item together. A single transaction ID wouldn’t tell us which product (if the receipt contains multiple products purchased), and a single line item ID wouldn’t tell us which purchase was being referenced (different receipts could order the same products). This means the product_id and product values are dependent on the whole key.
We can also associate a quantity with the transaction_id and line_item_id. Quantities could be the same across receipts or line-item IDs, but the combination of both keys gives us a single value for quantity. We also cannot uniquely identify our unit_price or promo_item_yn column values without both the transaction_id and line_item_id fields together.
3rd Normal Form — And Nothing but the Key
Although Tables 4 and 5 have satisfied second normal form, some data anomalies still exist. If we tried to create a new product for purchase or a new customer, we couldn’t create them in our current tables because we may not have receipts tied to them yet. If we needed to update a product or customer (for instance, in the event of a typo or name change), we would need to update all the line-item rows with those values. If we wanted to delete a product or customer, we couldn’t unless we removed receipts or line items that referenced them.
To solve these problems, we need to move to third normal form (3NF), which ensures non-key fields are dependent on nothing but the key. In other words, they are not dependent on other non-key fields, causing a transitive dependency.
On our receipt table (Table 4), we need to check the non-key fields (everything except transaction_id) to see if the values depend on other non-key fields. The values for transaction_date, transaction_time, and in_store_yn do not change either based on each other, or the customer or loyalty number associated, so they are properly dependent upon nothing but the key.
But what about customer info? The value of loyalty_num could change if the customer changed. For instance, if we needed to delete or update the customer who made the purchase, the loyalty number would also need to be deleted or updated along with it. So, loyalty_num is dependent on customer, which is a non-key field. This means our receipt table is not in 3NF.
What about our receipt line-item table (Table 5)? Quantity, unit price and promo item values don’t vary based on the values of each other, nor on the product information because the three fields state the value of an item at the time of purchase. However, product is dependent on product_id, whose value is subject to change based on which product ID was referenced. So this table also needs some updates to comply with 3NF.
Again, the best method to solve these issues is to pull the related columns to separate tables (presented here as Tables 6, 7, 8, and 9) and leave a reference ID (foreign key) to link the original tables with the new ones. We eliminate data anomalies on insert, update and delete, as well as reduce data redundancy and improve efficiency for storage and querying.
Normalization Outside Relational Databases
It’s the same data we started out with, but in its 3NF form, it looks drastically different. Though the 3NF version will maintain better table integrity across refactoring operations over time, we have burned precious project hours to make ordinary, common-sense data operable in a relational environment.
In everyday life, a single receipt might have sections devoted to transaction info, customer details and purchased products. Back when we started this project, we might have drawn up a whiteboard version of such a receipt that would have mapped all these sections into the data schema. Yet we’ve just seen how data that an ordinary person might initially record in a single ledger becomes disassembled into four entities, each with keys representing dependencies between them.
In our earlier graph model based on the 1NF data in Table 2, we designed these same entities quite naturally, without any painstaking process steps. In the graph context, we eliminated the keys that represent dependencies, along with much of the data integrity concern, because relationships are a built-in component of the graph data model. We also cut primary key worries because we used business keys instead of primary keys generated solely for the purpose of ensuring unique rows.
This data normalization process example was for a small coffee shop. Imagine the enormous complications and effort involved for data related to guiding a self-driving automobile down a highway or planning response to a national/global crisis. Surely, we couldn’t entrust such a critical decision to a data model with time-intensive processes such as normalization. However, both legacy projects and new projects rely on relational database structures every day.
As we continue to interact with various data systems, we will encounter the need for some data normalization. Graph is not, and should not be, the only data solution. However, we can certainly qualify data normalization as a liability of the relational model.
Where From Here
- The 3 Underrated Strengths of a Native Graph Database
- Free, self-paced course on graph data modeling fundamentals from Neo4j GraphAcademy
- Register now for Neo4j NODES 2022 Online Developer Education Summit Nov. 16