
7 Data Modeling Best Practices

By planning how you are going to organize your data, you can improve performance, reduce errors and make analysis easier for everyone.
Jun 22nd, 2023 8:25am

While you can often get away without much planning on smaller projects, when building complex software, planning becomes essential for success. And probably the most important thing you want to get right is the data model for your application.

The best-selling book “Designing Data-Intensive Applications” gives a solid rationale for why data models are so critical: “Data models have such a profound effect not only on how the software is written but also on how we think about the problem that we are solving.”

This quote gets at the root of why we even make software in the first place: to solve problems. And doing that properly requires communication between engineering and business people who know the users and problems that are being solved. From there, software engineers can take that knowledge and transform it into a data model for the software that will help solve the problem.

Why Is Data Modeling Important?

Proper data modeling takes time, so a fair question is whether it's worth all the effort. Let's look at some of the direct and indirect benefits of data modeling.

The most obvious benefit of data modeling is that by planning how you are going to organize your data, you can improve performance, reduce errors and lower the chances of designing things wrong and having to rewrite code. Combined, these benefits should save you time and money in the long run.

A secondary benefit of data modeling is that the end result should make data analysis easier for everyone in your organization, enabling better insights that help your business. Another is that you can choose the best type of database for your workload, which reduces the risk of scaling and performance problems down the road.

Data Modeling Best Practices

1. Keep Business Objectives in Mind

The most important thing when data modeling is to remember how it relates to the overall business. As a software engineer, it can be easy to lose sight of the bigger picture, so work with business analysts and any other relevant stakeholders to ensure the data model reflects reality.

2. Properly Document Your Data Model

Having the best data model in the world is pointless if nobody knows how to use it. Solid documentation lets others know what data is available in the system and makes it easier to use that data to generate valuable insights. Proper documentation also helps onboard new team members faster and makes the data model easier to maintain or modify, because the documentation serves as an up-to-date single source of truth for its current state.

3. Design Your Data Model to Be Adjustable Over Time

You should make sure to design your data model so that it can be updated and modified with minimal difficulty. The odds of nailing things on the first attempt are slim, and even if you do, there are going to be changes in the future that require new types of data to be added over time. So taking this into consideration from the start will save you pain in the future.
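One low-risk way to keep a data model adjustable is to favor changes that add rather than restructure. A minimal sketch, assuming SQLite and a hypothetical sensor table, of adding a new column after the fact:

```python
import sqlite3

# A small sketch of evolving a schema over time: additive changes such as a
# new nullable column are low-risk, which is one reason to prefer designs
# that can grow by addition. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor (sensor_id INTEGER PRIMARY KEY, location TEXT)")

# Later requirement: record each sensor's firmware version.
conn.execute("ALTER TABLE sensor ADD COLUMN firmware_version TEXT")

cols = [r[1] for r in conn.execute("PRAGMA table_info(sensor)")]
print(cols)  # ['sensor_id', 'location', 'firmware_version']
```

Existing rows and queries keep working, while new code can start using the new column immediately.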

4. Have Different Levels of Abstraction for Different People

The standard process for data modeling breaks down into three types: conceptual, logical and physical data modeling. These tiers all have different levels of abstraction that can be helpful for allowing even nontechnical users to understand the data model. They also allow for technology decisions to be isolated from the higher-level conceptual data models, where that complexity isn’t needed. Here’s a bit more detail on the three types of data modeling:

  • Conceptual: Objects are defined in relation to the business itself and how they interact, along with any relevant constraints and business rules.
  • Logical: More detail is added to the objects that were determined at the conceptual stage of data modeling. Logical data models usually include what specific data types and attributes each object should have.
  • Physical: This model defines the actual technical details for storing the data, including database-specific implementation details. Using a relational database as an example, the physical model would include the SQL to create tables and define data types for each column in the tables representing the objects defined during the conceptual design stage.
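To make the physical tier concrete, here is a minimal sketch, assuming SQLite and a hypothetical author/book logical model, of turning logical entities into actual tables with data types, keys and constraints:

```python
import sqlite3

# A minimal physical model: the logical entities "author" and "book"
# (hypothetical examples) become concrete tables with data types, primary
# keys and a foreign key, chosen for a specific database (SQLite here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE book (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(author_id)
);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['author', 'book']
```

The same logical model could produce a very different physical model on another database, which is exactly why the tiers are kept separate.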

5. Choose the Right Data Modeling Technique

There are many different data modeling techniques. Which one you choose should be based on the type of data you are working with and type of workload. The simplest example would be whether you are creating a data model for an end-user application or for data analysts who will be doing mostly analytics queries.

For the first situation, relational data modeling is a common choice. For the second, dimensional modeling might make more sense. There are trade-offs to every decision; the key is to figure out which data model provides the most benefits, with downsides that are mitigated by the nature of your use case.
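To illustrate the contrast, here is a rough sketch of a dimensional model (a star schema) in SQLite; all table and column names are hypothetical:

```python
import sqlite3

# A sketch of dimensional modeling: one central fact table of measurements
# joined to denormalized dimension tables, optimized for analytics queries
# rather than transactional updates. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2023-06-22', 'June')")
conn.execute("INSERT INTO dim_product VALUES (1, 'Sensor', 'IoT')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 10, 250.0)")

# Analytics queries aggregate the fact table, slicing by dimension attributes:
total = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
""").fetchone()
print(total)  # ('IoT', 250.0)
```

An end-user application would more likely use a normalized relational model, trading query convenience for write consistency.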

6. Think about Data Governance and Security

In recent years, data privacy and security have become major topics, with significant regulation put in place as well. As part of your data modeling process, take the time to consider how data will be protected, who will have access to it and your overall data governance strategy.

7. Avoid Premature Optimization

While this advice might seem to go against a lot of what was written above, the idea here is not to overthink things and slow down your development process. Diminishing returns are a thing; what’s important is getting some of the major decisions right and then iterating. Another thing to keep in mind is not to over-engineer your system. Keep things as simple as possible and worry about scaling your system when it becomes an issue.

Data Modeling Tutorial

Now let’s look at how data modeling can be done with a practical example. To make it interesting, we will use time series data because that adds another piece of complexity that needs to be taken into account when designing the data model. For this example, we will use a smart home that collects temperature data to automatically turn the A/C on and off to be more energy efficient.

First let’s define the entities and relationships at the conceptual level. The entities would be the following:

  • Home
  • Room
  • Sensor
  • Temperature
  • Energy consumption

The basic relationships between these entities would be the following:

  • A home contains multiple rooms.
  • Each room contains one or more sensors.
  • A sensor generates multiple temperature readings.
  • Energy consumption is associated with each home.
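The containment relationships above can be sketched as Python dataclasses; a conceptual model is usually a diagram rather than code, and the field names here are illustrative:

```python
from dataclasses import dataclass, field

# A rough sketch of the conceptual smart-home entities and their
# containment relationships. Field names are hypothetical.
@dataclass
class Sensor:
    sensor_id: str

@dataclass
class Room:
    name: str
    sensors: list = field(default_factory=list)  # a room contains one or more sensors

@dataclass
class Home:
    address: str
    rooms: list = field(default_factory=list)    # a home contains multiple rooms

home = Home("123 Main St", rooms=[Room("kitchen", sensors=[Sensor("s-1")])])
print(home.rooms[0].sensors[0].sensor_id)  # s-1
```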

The extra dimension that comes with time series data is how to define relationships between entities whose data change over time: in this case, temperature and energy consumption. Because room temperature and energy consumption are clearly related, this is another type of relationship that could be defined in the data model.

With time series data, you may want to break down your entities into static entities that don’t change (the name of a room) and time-varying entities that do change (temperature). You then need to think about how to model the different types of relationships that can result from this:

  • Relationships between two static entities (room and home)
  • Relationships between time-varying entities and static entities (temperature and room)
  • Relationships between two time-varying entities (temperature and energy consumption)
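One way to express this split at the physical level is to keep static entities in ordinary tables and time-varying data in a readings table keyed by sensor and timestamp. A sketch, assuming SQLite and hypothetical table names:

```python
import sqlite3

# Static entities (room, sensor) versus a time-varying table of temperature
# readings, linked by foreign keys. Names and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room   (room_id INTEGER PRIMARY KEY, name TEXT);      -- static
CREATE TABLE sensor (sensor_id INTEGER PRIMARY KEY,
                     room_id INTEGER REFERENCES room(room_id));    -- static
CREATE TABLE temperature_reading (                                 -- time-varying
    sensor_id INTEGER REFERENCES sensor(sensor_id),
    ts        TEXT,      -- ISO 8601 timestamp
    celsius   REAL,
    PRIMARY KEY (sensor_id, ts)
);
""")
conn.execute("INSERT INTO room VALUES (1, 'kitchen')")
conn.execute("INSERT INTO sensor VALUES (1, 1)")
conn.execute("INSERT INTO temperature_reading VALUES (1, '2023-06-22T08:00:00', 21.5)")

# Joining across the static/time-varying boundary answers questions such as
# "what is the temperature in each room right now?"
row = conn.execute("""
    SELECT r.name, t.celsius FROM temperature_reading t
    JOIN sensor s USING (sensor_id) JOIN room r USING (room_id)
""").fetchone()
print(row)  # ('kitchen', 21.5)
```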

How your data is related and what data your end users will be querying has a significant impact on how you will index your data at the physical level of data modeling, so considering this early is critical for a successful project. If you don’t get your indexes right because you don’t understand the data access patterns of your users, performance will suffer and users won’t be happy.
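As a sketch of matching indexes to access patterns, suppose users mostly ask for recent readings from a single sensor. A composite index on (sensor_id, ts), assuming SQLite and a hypothetical schema, lets the database answer without scanning the whole table:

```python
import sqlite3

# If the dominant query is "recent readings for one sensor", a composite
# index on (sensor_id, ts) matches that access pattern. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, ts TEXT, celsius REAL)")
conn.execute("CREATE INDEX idx_sensor_time ON readings (sensor_id, ts)")

# EXPLAIN QUERY PLAN shows SQLite searching via the index instead of
# performing a full table scan.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT celsius FROM readings
    WHERE sensor_id = 1 AND ts >= '2023-06-22T00:00:00'
""").fetchall()
print(plan[0][-1])  # the plan detail mentions idx_sensor_time
```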


Next Steps

Data modeling is a valuable tool for ensuring you get the most value possible out of your data. Following the best practices covered in this article should set you on the right path for getting started with data modeling on your own projects.

One of the most valuable aspects of data modeling is understanding how your data will be accessed. Once you know that, you can make the right choice for the physical model in terms of which database makes sense for storing your data to optimize for cost and performance. For example, if you are working with time series data like application metrics, IoT sensor data or real-time analytics, a specialized time series database like InfluxDB might make sense. If you are working with highly connected data, a graph database would be ideal. For situations where you need guarantees on data integrity, a relational database with ACID transaction support would be a good choice.

The key takeaway here is that to make educated decisions you need to first put in the work to understand the data generated by your business or application, and data modeling is a crucial part of that process.
