Modal Title
Data / Data Science

Is This the End of Data Refactoring?

For “extract/transform/load,” are you transforming your application’s data for the purposes of your applications or the requirements of your database?
Nov 18th, 2022 7:52am by
Featued image for: Is This the End of Data Refactoring?
Image via Pixabay.

In one of my early coding projects, I remember going through requirements meetings, hashing out a data model, drawing up a database design and submitting it to a DBA team for review and approval. There were numerous back-and-forth communications on naming, data types and structure conventions. Weeks later, the tables were created in the development environment so that I could ingest test data, and build and test the code against it.

When requirements change, or when stakeholders’ understanding of the data model evolve, or test data iterations produce different results, or scope-creep get into the project, we would start the data model —> database design —> DBA review/approval —> development creation process all over again.

A project like this isn’t a one-off. Typically, a lot of project hours are burned during each iteration of development, production, as well as maintenance and enhancement. By reducing or eliminating the translation step between a data model and database design, you can dramatically improve time to market and maintenance costs down the road. That’s something I wish I could’ve done before.

What Is Data Refactoring?

The roots of data refactoring likely point back to code refactoring applied to computer programs. As Martin Fowler defines refactoring in his book by that title, it’s “a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.”

But how does refactoring apply to data and databases?

Code refactoring is often done after an initial draft, or when improvements or features need to be made to the code. Any of these changes can affect a project before or after its initial implementation, in either case making code cleaner, more efficient and more maintainable. Some refactorings may also occur because of changes to the data structure supporting the code. Such changes often take place due to a deeper understanding of the data model, as well as additions and subtractions to the collected data.

Refactoring takes place in both the data and data model as well. While code refactoring can happen without data model refactoring, code is nearly always affected by changes to the underlying data model. Data refactoring becomes the source of refactors for the model, storage system and code. Each piece needs careful planning and thorough understanding, which adds considerable project resources.

We see positive benefits from refactoring through better data and code. So isn’t the time spent well worth it? What’s the trouble here?

Looking at the data refactoring component of a project, there are several tasks that accompany it, including:

  • Making changes to the data model to fit the new case, feature, etc.
  • Aligning the data model with the storage technology’s format, such as relational, graph or other. Each type of data storage format comes with its own set of rules, so we often end up making further changes to the data to accommodate the required structure.

This “translation” step between data model and data storage steps is what can be cumbersome, especially when it’s accompanied by change approvals from project managers for properly aligning data structures between the real-world configuration of data and the structured format of the database. Graph databases can shorten or cut the translation phase because they more naturally model data as it exists in the real world.

Data Refactoring for Databases

In a previous article, we used data sets for a coffee shop with sales receipts, products and customers. We can use this same data set to look at data refactoring, but in this instance, focus on a different section of the data: stores and their staff assignments. All of the data, change scripts and more in these examples may be downloaded from this GitHub repository.

Our example data set is in Tables 1 and 2 below. Shop locations include data about the location type, size, address details, phone number and assigned manager. The staff table contains names, positions, start date and assigned shop location for each associate.

Table 1: The shop location table

Table 2: The staff table

A graph representation of this data could look something like Figure 1 below.

Figure 1: Graph representation of shop locations and staff

In this graph model, we have two main entities (nodes): Shop and Staff. The relationships between these nodes tell us how they are connected. Either a staff member is assigned to work at a location or a shop is managed by a particular staff member.

Next, let’s look at how a few specific refactorings would affect each of these models.

Refactoring 1: Adding a New Column

Storing additional information is a common change among database projects. In our coffee shop case, we might want to also track the open date for a shop location (how long a location has been in operation). For a relational format, the change process would involve adding a new column to the table. This likely also entails:

  • Explaining the change to stakeholders.
  • Writing and executing a data definition language (DDL) statement to alter the table structure, or dropping the entire table and rebuilding it with the new column in the DDL.
  • Adding the new data to the population set.
  • Finally, importing the actual data to the table.

Between certain of these tasks, additional change approval steps may be required. This change does not affect the staff table, so no changes would be required there. The result of this refactoring would look like Table 3.

Table 3: Refactor No. 1 for shop location table

Meanwhile, for a graph format, we would need to add a new property to the Shop node. Similar to our relational process above, we would need to explain the change to stakeholders and get any necessary change approvals. However, dropping the data structure, making the addition and setting the new data structure are all eliminated with graph methodology because there is no strict DDL. With graph, the structure is not forced on the data. Rather, the data itself determines the structure and can be adapted when the information represented by the data shifts, as depicted in Figure 2.

Figure 2: Refactor No. 1 for graph

Example scripts for both relational and graph processes are included in our code repository on GitHub.

Refactoring 2: Adding a New Table and Relationship

Next, suppose we have issues with staffing coverage at our locations. We want to retain employee addresses to help us determine who might be able to cover a shift at another location.

While we could store employee addresses directly in the Staff table, addresses are more likely to change than other data, and we might want to keep staff’s personal details separate from their business information. We can create a separate table to store addresses, which means creating a foreign key relationship between the staff row and related address with a new column, as well as statements for the new table structure and data insert.

Table 4: Adding foreign key column address_id to Staff table

Table 5: New staff_address table

For graph, we would need to add the data for a new StaffAddress node and relationship to Staff nodes, as shown in Figure 3. Existing data would not be affected, so we would not need to alter Staff entities in the database.

Figure 3. Refactor No. 2 for graph

Refactoring 3: Adding Data to Existing Tables

For our third and final refactoring, business is booming, and we might want to add newly hired staff members to a new shop location.

With the relational structure, creating a new staff member in the table means adding a shop location assignment to make the row complete. We would likely need some sort of dependency rule (constraint) that ensures we could not insert a value in the Staff table’s location column if it does not exist in the Shop_Location table. If we want to add the new staff member’s address to the Staff_Address table, we’d need to set up the same guardrails in that table, as well. This means any new staff member assigned to a new location requires us to first create the location, and then the staff member, then their address. Doing these steps in the wrong order would result in errors.

Table 6: Shop_Location table with new row

Table 7: Staff table with 2 new staff

Table 8: Staff_Address table with 2 new addresses

For our graph version in Figure 4, we simply need to add the new data. Structure remains the same, and existing data is not affected.

Figure 4. Refactor No. 3 for graph

Graphs Reduce Data Refactoring

You’ve seen how data refactoring affects both relational and graph databases. Relational databases require a more intensive process for making changes because of the separation of table structure and actual data. By contrast, graphs remove the extra translation step between real-world data and database structure because they more naturally model data as it exists in the real world.

This example might have seemed so small-scale as to appear trivial compared to the real world. Yet what happens when you have dozens of shops, thousands of staff members and hundreds of thousands of delivery addresses? Business-critical systems discourage teams from making changes due to the labor involved and potential impacts to their organizations’ workflow. Graphs keep the data and model we already have and alter only what has changed.

Graph refactoring allows businesses to be adaptable and agile, giving them the power to morph as the industry or data around them changes. Moving existing projects to graphs can reduce time to market right now, while improving future maintainability, risk mitigation and additional feature development.

Where to Begin

Group Created with Sketch.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.