Hey Service, Where Is Your Data?
When you build a large, multiservice application, deciding where you put the data is just as important as determining how you architect the application itself.
An essential but often overlooked aspect of designing service and microservice-based architectures is deciding where your application data resides. Does the data reside with the service? Is it shared with other services? Is it in a shared central database?
Whether you are building a new application or migrating an existing application to a service-based architecture, it is critically important to be mindful of where you store data — and the rest of the application state — within your application or system.
Not all services make use of stored data. Many are stateless: all the data they require to perform their work is passed into the service when it is called, or is referenced from another source, and the service itself maintains no state between calls.
Take, for example, a service that performs a simple mathematical calculation. In this case, a service that takes a pair of Latitude/Longitude coordinates and determines the distance between those two points. A call to this service might look like this:
find-distance?(start:(48.590870, -122.937424),end:(37.333041, -121.932043)) -> (miles)
This service takes two sets of coordinates and converts them into a distance. This service performs calculations, but other than the data passed into the service (the coordinates), no additional data is required. The service does not need to maintain any state information.
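A minimal sketch of such a stateless calculation, assuming the haversine great-circle formula and a mean Earth radius of 3958.8 miles (the article does not say which formula the service uses, so both are assumptions, as is the Python function name):

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for this sketch


def find_distance(start, end):
    """Return the great-circle distance in miles between two (lat, long) pairs.

    Everything the function needs arrives in its arguments; nothing is
    remembered between calls.
    """
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


miles = find_distance((48.590870, -122.937424), (37.333041, -121.932043))
print(f"{miles:.1f} miles")
```

Because the result depends only on the two arguments, any copy of this function on any server gives the same answer for the same request.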
Stateless services offer a huge advantage for scaling. Because they hold no state, it is usually easy to add server capacity to a service, either vertically (larger servers) or horizontally (more servers). A service that does not maintain state gives you maximum flexibility in how and when you scale it.
Additionally, certain caching techniques on the frontend of the service become possible when the cache does not need to concern itself with service state. For a pure calculation, identical requests produce identical results, so a cached response can be reused safely. This caching lets you handle higher scaling requirements with fewer resources.
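As a rough illustration of that caching idea, the sketch below puts Python's `functools.lru_cache` in front of a stateless distance calculation. The in-process cache is only a stand-in for whatever frontend cache a real deployment would use, and the function name, the `compute_count` counter, and the haversine formula are assumptions made for the example.

```python
import math
from functools import lru_cache

compute_count = 0  # how many times the real calculation ran (demonstration only)


@lru_cache(maxsize=10_000)
def find_distance_cached(start, end):
    """Stateless calculation fronted by a cache keyed on its arguments."""
    global compute_count
    compute_count += 1
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))  # 3958.8: assumed Earth radius in miles


start, end = (48.590870, -122.937424), (37.333041, -121.932043)
first = find_distance_cached(start, end)   # computed
second = find_distance_cached(start, end)  # served from the cache
print(compute_count, find_distance_cached.cache_info().hits)  # prints "1 1"
```

The cache never has to be invalidated here, because no stored state can change the answer for a given pair of coordinates.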
Obviously, not all services can be made stateless, but those that can gain a considerable scalability advantage.
A stateful service is one that requires data (application state) to be retained for the life of the application, with multiple requests to the service using that data.
Take, for example, a service that tracks the location of delivery vehicles in a fleet. A call to such a service could be a call that tells the service where a specific vehicle is located, such as:
set-vehicle-location(vehicle-id: 133928, location: (48.590870, -122.937424))
Then, a specific vehicle can be located by requesting its location:
get-vehicle-location(vehicle-id: 133928) -> (lat,long)
Or, find the vehicle that’s closest to a given location:
locate-a-nearby-vehicle(location: (37.483577, -122.225983)) -> (vehicle-id)
This service performs a useful function, but to implement these commands, the service has to maintain data: a list of vehicles and their current locations. This data is stored in a database and used by the service to perform its operations.
This is a stateful service.
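The three calls above could be sketched as follows. This is a hypothetical in-memory version: a dictionary stands in for the database, the class and method names are adapted from the calls shown, the second vehicle id is made up, and the nearest-vehicle search is a simple linear scan with a rough distance estimate rather than anything a production service would use.

```python
import math


class VehicleLocationService:
    """Hypothetical in-memory sketch of the vehicle location service."""

    def __init__(self):
        # The state: vehicle id -> (lat, long). A real service would use a database.
        self._locations = {}

    def set_vehicle_location(self, vehicle_id, location):
        self._locations[vehicle_id] = location

    def get_vehicle_location(self, vehicle_id):
        return self._locations[vehicle_id]

    def locate_a_nearby_vehicle(self, location):
        # Rough equirectangular distance in degrees; good enough to rank candidates.
        def rough_distance(other):
            dlat = other[0] - location[0]
            dlon = (other[1] - location[1]) * math.cos(math.radians(location[0]))
            return math.hypot(dlat, dlon)

        # Raises ValueError if no vehicles have been recorded yet.
        return min(self._locations, key=lambda vid: rough_distance(self._locations[vid]))


service = VehicleLocationService()
service.set_vehicle_location(133928, (48.590870, -122.937424))
service.set_vehicle_location(200001, (37.333041, -121.932043))  # made-up second vehicle
print(service.get_vehicle_location(133928))
print(service.locate_a_nearby_vehicle((37.483577, -122.225983)))  # prints 200001
```

Every answer depends on what earlier calls stored, which is exactly what makes the service stateful.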
Stateful services are harder to scale because it is not simply a matter of adding CPU power to make the service grow to handle more requests. You also have to consider where the data is stored and how you scale the database that is holding the data. This complicates the ability to scale a service.
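A small sketch of why adding servers alone is not enough: if each replica of a stateful service keeps its own private copy of the data, a write handled by one replica is invisible to the others. The class below is hypothetical and deliberately naive to make the problem visible.

```python
class InMemoryLocationService:
    """Hypothetical replica that keeps vehicle locations in its own memory."""

    def __init__(self):
        self._locations = {}

    def set_vehicle_location(self, vehicle_id, location):
        self._locations[vehicle_id] = location

    def get_vehicle_location(self, vehicle_id):
        # Returns None when this replica has never seen the vehicle.
        return self._locations.get(vehicle_id)


# Two replicas behind an imaginary load balancer.
replicas = [InMemoryLocationService(), InMemoryLocationService()]

# The write lands on replica 0 ...
replicas[0].set_vehicle_location(133928, (48.590870, -122.937424))

# ... so only replica 0 can answer the read.
print(replicas[0].get_vehicle_location(133928))  # (48.59087, -122.937424)
print(replicas[1].get_vehicle_location(133928))  # None
```

To scale such a service, the replicas need a shared or partitioned data store, and that store then has to scale as well.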
Where to Store Data
When you are building services that require data, it might seem obvious to store data in as few services and systems as possible — making as many services as possible stateless services. This might lead you to keep all data together in a centralized location. In theory, keeping the data close together reduces the number of services that store data.
Nothing could be further from the truth.
Instead, it’s important to localize your data as much as possible when building a service-based architecture. Have services and data stores manage only the data they need to manage to perform their jobs. In the above example, store the data that specifies where the vehicles are located in the vehicle location service.
This tends to spread out your application data across a larger number of services, putting the data closer to the services that require the data.
Localizing data this way provides a few benefits:
- Reduce the size of individual datasets. Because your data is split across multiple services, each dataset is individually smaller. A smaller dataset means less interaction with the data, which makes the database easier to scale. This is called functional partitioning: you are splitting your data along functional lines rather than by the size of the dataset.
- Localized access. When you access data in a database or data store, you often access all the data within a given record or set of records. Often, much of that data is not needed for a given interaction. By splitting your data into multiple, smaller datasets, you reduce the amount of unneeded data returned by each of your queries.
- Optimized access methods. By splitting your data into different datasets, you can choose the type of data store appropriate for each dataset. Does a particular dataset need a relational data store? Or is a simple key/value data store acceptable?
Keeping your data associated with the services that consume it creates a more scalable solution and an architecture that is easier to manage, and it allows your data requirements to expand more easily as your application grows.
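The points above, and the last one in particular, can be sketched with two services that each own a store suited to their data. Everything here is hypothetical: the delivery order service is invented for the example, a plain dictionary stands in for a key/value store, and an in-memory SQLite database stands in for a relational one.

```python
import sqlite3


class VehicleLocationStore:
    """Key/value store owned by the vehicle location service (a dict stands in)."""

    def __init__(self):
        self._data = {}

    def put(self, vehicle_id, location):
        self._data[vehicle_id] = location

    def get(self, vehicle_id):
        return self._data[vehicle_id]


class DeliveryOrderStore:
    """Relational store owned by a hypothetical delivery order service."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE orders ("
            "order_id INTEGER PRIMARY KEY, vehicle_id INTEGER, status TEXT)"
        )

    def add_order(self, order_id, vehicle_id, status):
        self._db.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, vehicle_id, status)
        )

    def orders_for_vehicle(self, vehicle_id):
        rows = self._db.execute(
            "SELECT order_id, status FROM orders WHERE vehicle_id = ? ORDER BY order_id",
            (vehicle_id,),
        )
        return rows.fetchall()


# Each service touches only its own, smaller dataset.
locations = VehicleLocationStore()
orders = DeliveryOrderStore()
locations.put(133928, (48.590870, -122.937424))
orders.add_order(1, 133928, "out for delivery")
orders.add_order(2, 133928, "queued")

print(locations.get(133928))
print(orders.orders_for_vehicle(133928))  # [(1, 'out for delivery'), (2, 'queued')]
```

A location lookup never reads order rows and an order query never reads coordinates, so each store can be sized and scaled for its own workload.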
Architecting Your Data with Your Services
Architecting a large, highly scalable web application is a complex task. Sometimes, you have to make decisions that seem wrong but ultimately improve the scalability, and hence the availability, of your application or service.
Determining your data architecture is one of those tasks. When architecting the structure of your application and the services that make up the application, it is essential to consider the data needs and requirements of those services.
Scaling your data storage and access is hard, and your data architecture can dramatically affect your data scalability. Even if you use a highly scalable database, such as Amazon DynamoDB or Cockroach Labs’ CockroachDB, you need to be mindful of your data architecture to meet your scaling needs.
A free two-chapter excerpt of O’Reilly’s Architecting for Scale is available for download.