How to Deal with Service Failures, So Your Customer Never Notices
New Relic sponsored this post.
It’s simple, really: services call other services and take actions based on the responses they receive. Sometimes that action is a success; sometimes it’s a failure. But whether it is a success or a failure depends on whether the interaction meets certain requirements. In particular, the response must be predictable, understandable and reasonable for the given situation. This matters because the service reading the response must be able to make appropriate decisions and not propagate garbage results. When a service gets a response it does not understand, it can act on that garbage response, and those actions can have dangerous side effects for your service and your application.
A Failing Service
Let’s take a look at an example. Assume we have an application that requires customers to maintain an account in order to use it, as most web-based applications do. A series of services controls and maintains these customers’ accounts.
Now let’s take a look at a subset of those services in more detail. Let’s assume we have the following set of services:
- ExpiredAccountDeterminer. The Expired Account Determiner Service. The purpose of this service is to return a list of expired accounts. This might be because the customer has abandoned their account, or stopped paying, or whatever. It has a single method, “GetListOfExpiredAccounts,” that returns a list of expired account IDs that need to be deleted.
- AccountDeletion. The Account Deletion Service. The purpose of this service is to take a given account ID and delete the corresponding account. It has a single method, “DeleteAccount,” which deletes the specified account ID.
- ExpiredAccountManager. The Expired Account Manager Service. The purpose of this service is to get the list of expired accounts, then delete each of these expired accounts one by one.
The ExpiredAccountManager service calls the ExpiredAccountDeterminer service to get a list of accounts to delete. The manager service then takes this list and calls the AccountDeletion service, one by one, for each account ready to be deleted. This is illustrated in Figure 1. In this case, the determiner returns three accounts and the manager dutifully deletes those three accounts. All is well.
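The happy path described above can be sketched in a few lines of Python. This is only an illustration: the in-process functions stand in for real service calls, and the account IDs, the `AccountDeletion` class and the `run_expired_account_manager` function are invented for the example rather than taken from any real system.

```python
from typing import Callable, Iterable, List


def get_list_of_expired_accounts() -> List[str]:
    """Stand-in for ExpiredAccountDeterminer.GetListOfExpiredAccounts."""
    # Hypothetical fixed data; a real service would inspect account state.
    return ["acct-101", "acct-102", "acct-103"]


class AccountDeletion:
    """Stand-in for the AccountDeletion service; it records what it deletes."""

    def __init__(self) -> None:
        self.deleted: List[str] = []

    def delete_account(self, account_id: str) -> None:
        self.deleted.append(account_id)


def run_expired_account_manager(
    get_expired: Callable[[], Iterable[str]],
    deletion: AccountDeletion,
) -> int:
    """ExpiredAccountManager: fetch the list, then delete each account in turn."""
    count = 0
    for account_id in get_expired():
        deletion.delete_account(account_id)
        count += 1
    return count


deletion_service = AccountDeletion()
deleted_count = run_expired_account_manager(
    get_list_of_expired_accounts, deletion_service
)
```

Note that the manager trusts whatever the determiner hands it. That is fine while the determiner is healthy, and it is exactly the weakness the failure scenario exposes.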
But what happens if the ExpiredAccountDeterminer service fails and returns garbage instead? What happens to the entire application then? Well, the answer depends on how the ExpiredAccountManager service responds to that condition. It’s certainly possible that the garbage answer generates an error in the manager service. But it is just as possible that the manager service misinterprets the garbage as a set of account IDs. In that case, the garbage result is effectively a large list of random accounts.
So, the ExpiredAccountManager service dutifully goes off and tells the AccountDeletion service to delete all of the specified accounts, one by one, until they are all gone. The fact that they are valid accounts that should not be deleted is beside the point; they are deleted anyway.
This is definitely not a reasonable response for the situation, which is illustrated in Figure 2. Here, the GetListOfExpiredAccounts method returns a garbage response and the ExpiredAccountManager service goes off and calls the DeleteAccount method, one by one, for each account ID it recognizes. The AccountDeletion service deletes them all indiscriminately.
This is obviously a bad situation, and one that can and should be avoided by handling service failures properly in all of your services.
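To make the failure concrete, here is a deliberately broken sketch of a manager that accepts anything resembling an account ID. The garbage payload, the in-memory `accounts` store and the parsing rule are all invented for illustration; the point is only that lenient parsing turns noise into deletions.

```python
import re
from typing import List

# Hypothetical account store: every ID here belongs to a valid customer.
accounts = {17: "alice", 305: "bob", 9482: "carol", 20001: "dave"}


def naive_parse_account_ids(raw_response: str) -> List[int]:
    """Treat anything that looks like a number as an account ID (the bug)."""
    return [int(token) for token in re.findall(r"\d+", raw_response)]


def delete_account(account_id: int) -> None:
    """Stand-in for AccountDeletion.DeleteAccount: deletes without question."""
    accounts.pop(account_id, None)


# An invented garbage payload, such as an error page returned by mistake.
garbage = "<html>500 at line 17: offset 9482, retry in 305ms</html>"

for account_id in naive_parse_account_ids(garbage):
    delete_account(account_id)
```

In this sketch, three of the four valid accounts are gone after the loop, and neither service raised an error along the way.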
Handling Service Failures
How do you handle service failures? When a service you depend on fails, how should you respond? The solution starts with setting expectations on what a response should be and how those responses should be interpreted by the calling service.
The response and your interpretation of the response must be:
● Predictable
● Understandable
● Reasonable for the situation
Let’s look at each of these three in turn.
A predictable response is essential if services are to depend on other services. If a service receives a response it did not anticipate, it has no framework for deciding what to do with that response. Without this framework, miscommunication leads to mistakes, and mistakes can lead to serious errors.
If a service’s downstream dependencies fail, it still has a responsibility to respond in a predictable manner. This predictable response could very reasonably be an appropriate error message. But it would not be reasonable to generate a garbage, unreadable response.
For example, if a service is asked to perform the operation “42 + 39,” the response is expected to be the number “81”. But if the service was asked to perform the operation “35 / 0,” then a predictable response would be “not a number” or “error, invalid request.” An unpredictable response would be if the service returned the result “392838383” one time and “192838388329” a different time.
In Figure 2, it was not predictable for the ExpiredAccountDeterminer service to generate a garbage set of data, even when the service is in a failure mode. It should predictably generate an error message if it can’t do what is expected to be done. If it can’t return an error code, it is better to return nothing. But under no conditions is it acceptable for it to return random, unpredictable or nonsensical results.
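The arithmetic example can be sketched as a tiny calculator service that always answers an invalid request with the same documented error. The `CalcResponse` envelope and the error text are assumptions made for this sketch, not part of any real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CalcResponse:
    """Hypothetical response envelope: exactly one of value or error is set."""

    value: Optional[float] = None
    error: Optional[str] = None


def add(a: float, b: float) -> CalcResponse:
    return CalcResponse(value=a + b)


def divide(numerator: float, denominator: float) -> CalcResponse:
    """Return the same documented error every time the request is invalid."""
    if denominator == 0:
        return CalcResponse(error="invalid request: division by zero")
    return CalcResponse(value=numerator / denominator)
```

Because the error case is handled explicitly, a caller asking for “35 / 0” gets an identical, recognizable answer on every call instead of a different arbitrary number each time.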
Understandable means that dependent services have an agreed-upon format and structure for responses between services. This constitutes a contract between dependent services. A service’s response must fit within the bounds of that contract, even if it has misbehaving dependencies. It is never acceptable for it to violate the API contract with its consumers just because a dependency violated its API contract. Instead, make sure the contracted interfaces provide enough support to cover all contingencies of action, including that of failed dependencies.
For example, if a service is asked to perform the operation “42 + 39,” the response is expected to be a number or a valid error message. These restrictions on the expected response are what make the response understandable. It would not be understandable for the service to return the answer “cherry,” or “yes”.
In Figure 2, the garbage returned from the ExpiredAccountDeterminer service is a response that is not understandable.
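One way a caller can enforce such a contract is to validate every response before acting on it. The sketch below assumes a contract that is not in the original example: a response is either `{"status": "ok", "account_ids": [...]}` with non-empty string IDs, or `{"status": "error", "message": "..."}`. The exception names are likewise hypothetical.

```python
from typing import Any, List


class ContractViolation(Exception):
    """The response does not match the agreed format at all."""


class DeterminerError(Exception):
    """The response is a valid, contracted error message."""


def parse_expired_accounts_response(payload: Any) -> List[str]:
    """Return the account IDs, or raise if the response is an error or garbage."""
    if not isinstance(payload, dict):
        raise ContractViolation("response is not an object")
    status = payload.get("status")
    if status == "error":
        message = payload.get("message")
        if not isinstance(message, str):
            raise ContractViolation("error response has no message")
        raise DeterminerError(message)
    if status != "ok":
        raise ContractViolation(f"unknown status: {status!r}")
    account_ids = payload.get("account_ids")
    if not isinstance(account_ids, list):
        raise ContractViolation("account_ids is missing or not a list")
    for account_id in account_ids:
        if not isinstance(account_id, str) or not account_id:
            raise ContractViolation(f"invalid account ID: {account_id!r}")
    return account_ids
```

Keeping the contracted error separate from the contract violation lets the caller tell “the determiner reported a problem” apart from “the determiner sent something nobody agreed on.”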
A reasonable response is a response that is indicative of what is actually happening with your service.
For example, if a service is asked to perform the operation “42 + 39,” the response is expected to be the result of adding 42 and 39. It would not be reasonable for the service to return the result of subtracting 39 from 42. Even if it correctly calculated the valid result of “42 – 39,” it would not be reasonable for the service to return that result.
In Figure 2, if you call the ExpiredAccountDeterminer service, it would not be reasonable for the service to return a list of all accounts, or a list of all valid accounts. It would only be reasonable for the service to return a list of accounts ready to be deleted, or return a valid error message if it couldn’t determine the right answer. In this particular case, it might even be reasonable for it to return an empty list (indicating that no accounts are ready to be deleted). But it would not be reasonable to return anything else, even in an error situation.
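As a rough sketch of what a reasonable determiner might look like, the function below returns only accounts whose expiry date has passed, returns an empty list when none qualify, and raises a contracted error when it cannot look up the data. The expiry-date mapping, the `None` signal for an unreachable data store and the `ExpiryLookupError` name are assumptions made for this example.

```python
from datetime import date
from typing import Dict, List, Optional


class ExpiryLookupError(Exception):
    """Contracted error: the service could not determine the right answer."""


def get_list_of_expired_accounts(
    expiry_dates: Optional[Dict[str, date]], today: date
) -> List[str]:
    """Return only the accounts that are ready to be deleted."""
    if expiry_dates is None:
        # Hypothetical signal that the backing data store was unreachable.
        # Never fall back to "all accounts" or a guess here.
        raise ExpiryLookupError("expiry data unavailable")
    return sorted(
        account_id
        for account_id, expires_on in expiry_dates.items()
        if expires_on < today
    )


today = date(2024, 1, 15)
sample = {
    "acct-1": date(2023, 12, 31),
    "acct-2": date(2024, 6, 1),
    "acct-3": date(2024, 1, 1),
}
expired = get_list_of_expired_accounts(sample, today)
```

The important design choice is the missing fallback: when the answer is unknown, the function says so rather than returning some other list that merely looks plausible.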
“Garbage In, Garbage Out” may be a very popular expression, but it should never be acceptable behavior for a service. It is not the way to build resiliency into your services.
The upstream services that depend on you expect you to provide a predictable response. They expect your response to be understandable and reasonable. Don’t output garbage if you’ve been given garbage as input. Don’t output garbage if your service is failing or has a dependency that is failing.
If you provide an unpredictable response to an unpredictable reaction from a downstream service, you just propagate the unpredictable nature up the value chain. Sooner or later, that unpredictable reaction injects invalid data into your business processes and systems. This can impact your business systems and your customers’ experience. In the case of the example in Figure 2, the impact can be catastrophic.
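Applied to the account example, a manager that refuses to propagate garbage might look like the following sketch. It checks the whole response before deleting anything and reports a predictable error of its own when the determiner misbehaves. The function name, the report format and the assumption that a valid response is a list of non-empty strings are all invented for illustration.

```python
from typing import Callable, Dict, List


def run_expired_account_manager_safely(
    fetch_expired: Callable[[], object],
    delete_account: Callable[[str], None],
) -> Dict[str, object]:
    """Delete expired accounts only if the determiner's response is valid."""
    try:
        response = fetch_expired()
    except Exception as exc:
        # The dependency failed outright: report it and delete nothing.
        return {
            "status": "error",
            "deleted": [],
            "reason": f"determiner failed: {exc}",
        }

    # Validate the entire list up front so a bad entry cannot cause
    # a partial run of deletions.
    is_valid = isinstance(response, list) and all(
        isinstance(item, str) and item for item in response
    )
    if not is_valid:
        return {
            "status": "error",
            "deleted": [],
            "reason": "determiner response violated contract",
        }

    deleted: List[str] = []
    for account_id in response:
        delete_account(account_id)
        deleted.append(account_id)
    return {"status": "ok", "deleted": deleted, "reason": None}
```

Whatever the determiner does, the manager’s own callers always receive a report in the same shape, and no account is deleted on the strength of a response the manager could not understand.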
Even if a service or its dependencies are failing, it is important that these problems are never apparent to your end users or customers. When a service can’t do what it is supposed to do, it still needs to return a predictable, understandable and reasonable response, even if that response simply indicates that an expected and appropriate error condition exists.
To do otherwise reduces the resiliency of your application and puts your application and your customers at risk.
Feature image via Pixabay.