How to Overcome Challenges in an API-Centric Architecture
This is the second in a two-part series. For an overview of a typical architecture, how it can be deployed and the right tools to use, please refer to Part 1.
Most APIs impose two kinds of limits: subscription limits, such as a maximum number of requests per month, and rate limits, such as a maximum of 50 requests per minute. Because a third-party API can be used by many parts of the system, handling subscription limits requires the system to track all API calls and raise alerts when the limit is about to be reached.
Increasing the limit often requires human involvement, so alerts need to be raised well in advance. The deployed system must track API usage data persistently so that counts survive service restarts and failures. Also, if the same API is used by multiple applications, collecting those counts and acting on them needs careful design.
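As a minimal sketch of persistent usage tracking, the following keeps one counter per API per calendar month in SQLite, so counts survive a restart. The class name `QuotaTracker`, the table layout and the 80% alert threshold are assumptions made for illustration, not part of any particular product.

```python
import sqlite3
import time


class QuotaTracker:
    """Hypothetical helper: persists monthly call counts per API in SQLite."""

    def __init__(self, db_path, monthly_limit, alert_ratio=0.8):
        self.monthly_limit = monthly_limit
        self.alert_ratio = alert_ratio  # assumed threshold: alert at 80% of the limit
        self.conn = sqlite3.connect(db_path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS usage ("
                "api TEXT, month TEXT, calls INTEGER NOT NULL, "
                "PRIMARY KEY (api, month))"
            )

    def record_call(self, api, now=None):
        """Count one call and return the total for the current month (UTC)."""
        month = time.strftime("%Y-%m", time.gmtime(now))
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR IGNORE INTO usage (api, month, calls) VALUES (?, ?, 0)",
                (api, month),
            )
            self.conn.execute(
                "UPDATE usage SET calls = calls + 1 WHERE api = ? AND month = ?",
                (api, month),
            )
            (calls,) = self.conn.execute(
                "SELECT calls FROM usage WHERE api = ? AND month = ?",
                (api, month),
            ).fetchone()
        return calls

    def should_alert(self, calls):
        """True once usage reaches the alert threshold."""
        return calls >= self.alert_ratio * self.monthly_limit
```

When several applications share one subscription, they would all need to write to the same store (or report to one service) for the count to be meaningful.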
Rate limits are more complicated. If rate limiting is left to individual developers, they will invariably add sleep statements. That solves the problem in the short term, but in the long run it leads to complicated issues when the timing changes. A better approach is to use a concurrent data structure that limits rates. Even then, if the same API is used by multiple applications, controlling rates is more complicated.
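One common data structure for this is a token bucket. The sketch below is a thread-safe version guarded by a lock; the class name `TokenBucket` and its parameters are illustrative assumptions. For a limit of 50 requests per minute, you might create it with `rate_per_sec=50 / 60` and `capacity=50`.

```python
import threading
import time


class TokenBucket:
    """Hypothetical thread-safe token bucket: one token is spent per request."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()
        self.lock = threading.Lock()

    def try_acquire(self):
        """Return True if a request may be sent now, False otherwise."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

Callers that get `False` can queue the request or fail fast, instead of sleeping for a hard-coded interval. This version only limits calls within one process; sharing a limit across applications needs a shared component.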
One option is to assign each application a portion of the rate limit. The downside is that some capacity will be wasted, because while some applications are waiting for capacity, others might be idle. The most practical solution is to send all calls through an outgoing proxy that can handle all limits.
Apps that use external APIs will almost always run into this challenge. Even internal APIs will face the same challenge if they are used by many applications, and if an API is used by only one application, there is little point in making it an API. It is therefore a good idea to provide a general solution that handles subscription and rate limits.
Overcoming High Latencies and Tail Latencies
Given a series of service calls, tail latencies are the latencies of the few calls that take the longest to finish. If tail latencies are high, some requests will take too long or time out. If API calls happen over the internet, tail latencies get worse. When we build apps that combine multiple services, each service adds latency, and the risk of timeouts increases significantly.
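To see why combining services raises the risk, consider a rough calculation. It assumes that service calls are independent and that each call has the same chance of landing in the slow tail, which real systems only approximate. The function name is hypothetical.

```python
def prob_request_hits_tail(num_calls, tail_fraction=0.01):
    """Chance that at least one of num_calls independent calls is a tail call.

    tail_fraction=0.01 means "slower than the 99th percentile".
    """
    return 1 - (1 - tail_fraction) ** num_calls


# With a 1% tail per call, 10 calls give roughly a 9.6% chance of a slow
# request, and 100 calls give roughly 63%.
for n in (1, 10, 100):
    print(n, round(prob_request_hits_tail(n), 3))
```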
Tail latency has been widely discussed, and we will not repeat that discussion here. However, it is a good idea to study this area if you plan to run APIs under high-load conditions.
But why is this a problem? If the APIs we expose do not provide service-level agreement (SLA) guarantees (such as responding in less than 700 milliseconds at the 99th percentile), it is impossible for downstream apps that use our APIs to provide any guarantees. Unless everyone can stick to reasonable guarantees, the whole API economy will come crashing down. Newer API specifications, such as the Australian Open Banking specification, define latency limits as part of the specification.
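Checking such a guarantee against measured latencies can be sketched as follows. This uses the nearest-rank percentile definition, which is one of several in use; the function names and the 700-millisecond default simply mirror the example above and are assumptions, not a standard.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms, pct=99, limit_ms=700):
    """True if the pct-th percentile latency is under limit_ms."""
    return percentile(latencies_ms, pct) < limit_ms
```

Note that a single slow call out of 100 does not break a 99th-percentile target under this definition, but two do.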
There are several potential solutions. If the use case allows it, the best option is to make tasks asynchronous. A request that calls multiple services often takes too long, and it is better to set the right expectations by promising to provide the results when they are ready than to force the end user to wait for the request.
When service calls do not have side effects (such as search), there is a second option: latency hedging, where we start a second call when the wait time exceeds the 80th-percentile latency and respond with whichever call returns first. This can help control the long tail.
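A hedged call can be sketched with `asyncio` as below. The helper name `hedged_call` and its parameters are assumptions for illustration: `make_call` is any side-effect-free coroutine function, and `hedge_after_s` would be set to the observed 80th-percentile latency. If the first call to finish raises an exception, this simple version propagates it rather than waiting for the other call.

```python
import asyncio


async def hedged_call(make_call, hedge_after_s):
    """Run make_call(); if it is still pending after hedge_after_s seconds,
    start a second attempt and return whichever attempt finishes first."""
    first = asyncio.create_task(make_call())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # fast path: no hedge needed

    second = asyncio.create_task(make_call())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower attempt is no longer needed
    return done.pop().result()
```

Hedging adds load to the dependent service, which is why the second call is delayed until the first one is already slower than most calls.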
The third option is to complete as much work as possible in parallel: instead of waiting for each response before making the next service call, we start as many service calls as possible at the same time. This is not always possible, because some service calls depend on the results of earlier ones. Also, code that calls multiple services in parallel, collects the results and combines them is much more complex than code that calls them one after the other.
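The following sketch shows the pattern with `asyncio.gather`. The three service functions are hypothetical stand-ins that simulate network latency with a short sleep; only the structure of `build_dashboard` matters.

```python
import asyncio


# Hypothetical stand-ins for real service calls.
async def fetch_profile(user_id):
    await asyncio.sleep(0.05)
    return {"id": user_id, "tier": "gold"}


async def fetch_orders(user_id):
    await asyncio.sleep(0.05)
    return [{"order": 1}, {"order": 2}]


async def fetch_offers(tier):
    await asyncio.sleep(0.05)
    return ["offer-for-" + tier]


async def build_dashboard(user_id):
    # These two calls are independent, so they run concurrently and this
    # step takes about as long as the slower of the two.
    profile, orders = await asyncio.gather(
        fetch_profile(user_id), fetch_orders(user_id)
    )
    # This call depends on the profile, so it cannot start any earlier.
    offers = await fetch_offers(profile["tier"])
    return {"profile": profile, "orders": orders, "offers": offers}
```

By default, `asyncio.gather` raises the first exception it sees, so real code also has to decide what to do with partial results.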
When a timely response is needed, you are at the mercy of your dependent APIs. Unless caching is possible, an application can’t work faster than any of its dependent services. When the load increases, if the dependent endpoint can’t scale while keeping response times within the SLA, we will experience higher latencies. If the dependent API can stay within the SLA given more capacity, we can get that capacity by paying for a higher level of service or by buying multiple subscriptions. When that is possible, staying within the latency target becomes a capacity planning problem, where we have to keep enough capacity to manage the risk of potential latency problems.
Another option is to have multiple API options for the same function. For example, if you want to send an SMS or email, there are multiple providers to choose from. However, that is not the case for many other services. It is possible that as the API economy matures, there will be multiple competing options for many APIs. When multiple options are available, the application can send more traffic to the API that responds faster, giving it more business.
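One way to send more traffic to the faster option is to track a moving average of each provider's latency and pick providers with probability inversely proportional to it. This is only one possible policy; the class name, the smoothing factor `alpha` and the provider names are assumptions for illustration.

```python
import random


class LatencyAwareRouter:
    """Hypothetical router: faster providers receive proportionally more traffic."""

    def __init__(self, providers, alpha=0.2, rng=None):
        self.latency = {name: None for name in providers}  # moving averages in ms
        self.alpha = alpha  # weight given to the newest measurement
        self.rng = rng or random.Random()

    def record(self, provider, latency_ms):
        """Update the exponentially weighted moving average for a provider."""
        previous = self.latency[provider]
        if previous is None:
            self.latency[provider] = latency_ms
        else:
            self.latency[provider] = (
                (1 - self.alpha) * previous + self.alpha * latency_ms
            )

    def choose(self):
        """Pick a provider with weight 1 / average latency."""
        names = list(self.latency)
        known = [v for v in self.latency.values() if v is not None]
        # Providers with no data yet are treated like the fastest known one,
        # so they still receive traffic and get measured.
        fallback = min(known) if known else 1.0
        weights = []
        for name in names:
            estimate = self.latency[name]
            if estimate is None:
                estimate = fallback
            weights.append(1.0 / max(estimate, 0.001))
        return self.rng.choices(names, weights=weights, k=1)[0]
```

Because the choice is weighted rather than winner-takes-all, the slower provider keeps receiving some traffic, so the router notices if it recovers.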
If our API has one client, things are simple: we can let the client use the API as much as our system allows. However, if we are supporting multiple clients, we need to reduce the possibility of one client slowing down the others. This is the same reason other APIs have rate limits, and we should define rate limits in our own API’s SLA. When a client sends too many requests too fast, we should reject its requests with a status code such as HTTP 429 (Too Many Requests). Doing this communicates to the client that it must slow down. This process is called backpressure: we signal to upstream clients that the service is overloaded, and that signal is eventually passed on to the end user.
If we are overloaded without any single client sending requests too fast, we need to scale up. If we can’t scale up, we still need to reject some requests. It is important to note that rejecting requests in this case makes our system unavailable, while rejecting requests in the earlier case, where one client is exceeding the limits in its SLA, does not count as unavailable time.
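The two kinds of rejection can be sketched together. In this illustration, a client that exceeds its own limit gets HTTP 429 (Too Many Requests, the status code defined for rate limiting), while a request rejected because the whole service is at capacity gets 503 (Service Unavailable). The class name, the fixed-window counting and the `max_in_flight` cap are assumptions, and the code is single-threaded for brevity.

```python
import math
import time


class AdmissionController:
    """Hypothetical admission check: per-client fixed window plus a global cap."""

    def __init__(self, per_client_limit, window_s, max_in_flight,
                 clock=time.monotonic):
        self.per_client_limit = per_client_limit
        self.window_s = window_s
        self.max_in_flight = max_in_flight
        self.clock = clock
        self.windows = {}   # client_id -> (window_start, request_count)
        self.in_flight = 0  # requests currently being processed

    def admit(self, client_id):
        """Return (status_code, headers) for an incoming request."""
        now = self.clock()
        start, count = self.windows.get(client_id, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # a new window begins

        if count >= self.per_client_limit:
            # This client is over its own limit: not counted as downtime.
            wait = max(1, math.ceil(start + self.window_s - now))
            return 429, {"Retry-After": str(wait)}
        if self.in_flight >= self.max_in_flight:
            # The service as a whole is overloaded: this is unavailability.
            return 503, {"Retry-After": "1"}

        self.windows[client_id] = (start, count + 1)
        self.in_flight += 1
        return 200, {}

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight -= 1
```

Keeping the two status codes separate also makes it easy to report availability correctly, since only the 503 responses count against it.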
Cold-start time (the time it takes a container to boot up and begin serving requests) is another source of latency. A simple solution is to keep a replica running at all times; this is acceptable for high-traffic APIs. However, if you have many low-traffic APIs, this could be expensive. In such cases, you can predict the traffic and warm up the container beforehand (using heuristics, AI or both). Another option is to optimize the startup time of the servers to allow for fast bootup.
Latency, scale and high availability are closely linked. Even a well-tuned system would need to scale to keep the system running within acceptable latency. If our APIs need to reject valid requests due to load, the API will be unavailable from the user’s point of view.
Managing Transactions across Multiple APIs
If we can run all code in a single runtime (such as the JVM), we can commit it as one transaction. For example, pre-microservices-era monolithic applications could handle most transactions directly with the database. However, as we break the logic across multiple services (and hence multiple runtimes), we cannot carry a single database transaction across multiple service invocations without doing additional work.
One solution has been programming language-specific transaction implementations provided by an application server (such as Java transactions). Another is to use WS-AtomicTransaction if your platform supports it. Yet another has been to use a workflow system (such as Apache ODE or Camunda) that has support for transactions. You can also use queues and combine database transactions and queue system transactions into a single transaction through a transaction manager like Atomikos.
This topic has been discussed in detail in the context of microservices, and we will not repeat those discussions here.
Finally, with API-based architectures, troubleshooting is likely more involved. It is important to have enough tracing and logs to help you find out whether an error is happening on our side of the system or the side of third-party APIs. Also, we need clear data we can share in case help is needed from a third-party API to isolate and fix the problem.
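As a minimal sketch of such tracing, the wrapper below logs a request ID, the API name, the outcome and the elapsed time for every outgoing call, and separates failures of the third-party API from failures in our own code. The names `traced_call` and `UpstreamError` and the log format are assumptions; a real system would more likely use a distributed tracing library.

```python
import logging
import time
import uuid

log = logging.getLogger("outbound")


class UpstreamError(Exception):
    """Hypothetical marker: raised by a client when the third-party API fails."""


def traced_call(api_name, func, *args, request_id=None, **kwargs):
    """Call func and log where any failure happened, plus timing."""
    request_id = request_id or uuid.uuid4().hex
    outcome = "unknown"
    started = time.monotonic()
    try:
        result = func(*args, **kwargs)
        outcome = "ok"
        return result
    except UpstreamError:
        outcome = "upstream_error"  # the third-party side failed
        raise
    except Exception:
        outcome = "local_error"     # our own code failed
        raise
    finally:
        elapsed_ms = (time.monotonic() - started) * 1000
        log.info(
            "request_id=%s api=%s outcome=%s elapsed_ms=%.1f",
            request_id, api_name, outcome, elapsed_ms,
        )
```

The request ID is the piece of data worth sharing with a third-party provider when asking for help, provided it is also sent along with the outgoing request.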
I would like to thank Frank Leymann, Eric Newcomer and others for their thoughtful feedback, which significantly shaped these posts.