Differences between Rate Limiting and Throttling

When dealing with cloud APIs, there will typically be the concept of Throttling when API consumption moves beyond a specific limit. The particular model of throttling (leaky bucket, burstable, etc) is outside the scope of this post. For the purposes of this post, we will assume a simple, numeric rate limit (ie: x API calls per second).

The key concept is that the API is expensive either time, or compute and so there is a need to restrict the rate that the calls are made. Most developer code assumes that all APIs are free. Tight loops are common to operate code as fast as possible. Cloud APIs that distribute compute are not free and need special handling.

Rate Limiting is a client side response to the maximum capacity of a channel. If a channel has capacity to consume requests at a given rate/sec then a client should be prepared to limit their request rate. A common response to avoid implementing rate limiting on the client is that the server should allow processing at the appropriate rate. However, n the case where the API triggers asynchronous operations the response rate to the API may be fast, but the operation in the background is much slower.

Throttling is a server side response where feedback is provided to the caller indicating that there are too many requests coming in from that client or that the server is overloaded and needs clients to slow down their rate of requests. When a throttling event happens, the general client side response use exponential backoff to ensure that the system can recover even with multiple making requests at the same time.

An argument is often made that exponential backoff is a type of rate limiting. Yes it is, but exponential backoff is a partial solution with a number of problems.

AWS uses throttling heavily across its APIs and can serve as a good background for this discussion. Some AWS APIs associated with EMR have very low default API request limit (2-4 requests/second). These APIs have typical response times around 100ms. As you can see the sustained rate that an a client can access the API exceeds request limit. This represents an interesting (but easy) challenge for systems that need to call an API repeatedly. So for this example, I’m going to focus on the AWS Elastic Map Reduce DescribeCluster API. A common system management operation will be to gather data for each active cluster. In this example, assume we have 1000 clusters, and that we can’t hook into the EMR events and take a transactional on-change approach.

With an assumed maximum rate of 10 requests/second and an API limit of 4 requests/second. We can immediately see that calling the API 1000 times we can sustain 100 seconds of requests. However the API itself would take 250 to complete the scan. This of course assumes that our client is the only caller inter that API, in a lot of cases AWS internally is making requests, you may have a secondary orchestrator making requests and finally you may have an infosec scanner making requests. So the in reality our client may only be able to get 1 or 2 requests/second.

So let’s look at the naive approach. We keep hammering until we’re done at max rate. This will place us into heavy contention with other clients and will basically make life miserable for all the clients.

Following AWS Guidance we can implement exponential backoff. The general idea is that every time we receive a Throttling exception, we will back off in an exponential way, typically 2 ^ retry * base wait unit. Most approaches will have a max retry before failing or have a max retry before maxing out at some delay. Some AWS SDKs also have a transparent retry that is hidden from the API caller meaning that when you receive a Throttling exception, the SDK implementation has already backed off at least 3 times. Now if we look at our case above, we can see that we will almost immediately hit a throttling exception within the first second. Assuming we take the exponential backoff with a max wait, we will likely get 4 requests in and then wait for may 10 seconds (assuming we have max retries of 10 and base wait of 10 ms 2^10 * 10 = 1024*10 = 10 seconds), and then we try again. So effectively we’re getting 0.4 requests per second, so our full scan will take 2500 seconds. This is also ignoring other clients. Our bad behavior will also be triggering throttling events for those clients as well, likely diminishing service for all those services as well.

So currently we’ve got a poorly acting client that is ruining it for everyone. So what do we do?

We can rate limit the client. This would involve either having prior knowledge or adapting our rate based of feedback from the server. A simple form of rate limiting is to use a RateLimiter implementation that blocks code until it is permitted to continue. In our example if we had a RateLimit of 4 permits per second, then we can guarantee that we will stay below the API limit. In a quiet environment we could work through our 1000 clusters without hitting any throttling exception.

Of course hard coding 4 requests/second is naive to both the actual limit and also to any other clients. So we would likely want it to be made adaptive. We’d start of with a slightly aggressive limit of say 10 requests/second. As we hit throttling exceptions then we would adjust the RateLimiter by an appropriate amount (say halving) from 10 requests/second to 5 requests/second to 2.5 requests/second. I haven’t come any strong guidance on adapting rate limiting, but my intuition says negative exponential is probably too aggressive, but linear is not aggressive enough. We do get to a sustained un-throttled request rate by using rate limiting that is also sensitive to any sustained requests by other clients.

Planning and respecting the API request limit is the best solution. All APIs are not free to call.

So our exploration has demonstrated that sustained calls need something other than exponential backoff, and that we can get to a more predictable API rate by using rate limiting. However, we’re not there yet. There will be transient peaks that mean a period of load doesn’t represent the long term use of the API.

So to deal with transient spikes we will still need to tell our client to back off slightly when the API is under load. This does slow us down slightly, but we will definitely know that our rate limiting is preventing our client from being the problem, and that it is general external load on the API that is causing the issue.

So with a combination of exponential backoff to deal with transient spikes and rate limiting to treat the API limit with respect we have a much better solution.

But wait, there’s more.

Our past throttling doesn’t indicate that our future is throttled. So we need to ensure that our code is adaptive to both events and other clients utilizing the API. Unfortunately through most APIs there is only feedback on our behavior, so we need to either periodically optimistically improve our backoff and our rate limiting to ensure that we are close to, but not exceeding the limits.

For the exponential backoff, we need reduce our retry count periodically so that we get back to zero retries after a period of success. In our example above, we could assume that after 10 un-throttled API requests that we can discount our retry, ultimately getting back to our 0 retry state. This will allow us to deal with short term events (seconds of throttling) as well as sustained periods of loads (minutes of throttling).

For our rate limiting, we need to periodically increase our rate to feel for what the total load on the API is and find the peak sustained. Remember that in the case of AWS, any other client could be creating sustained load on the API and so we cannot assume that our sustainable 1 request/sec on the past represents our future sustained request rate, particularly when there may be other clients that are consuming some portion of the API capacity which may no longer be there. I’d likely implement a negative exponential limit increase (1/2 each time), with an incremental linear return based on the number of successful requests.

So by including adaptive rate limiting that backs off quickly and then feels for the right level, and exponential backoff that reduces the backoff after no throttling we will end up with much better overall throughput and will also end up being a good API citizen.

Unfortunately, I have never seen implementations that hand both rate limiting and exponential backoff. Most of the implementations that I have seen do exponential backoff but then go back to hitting the API at the maximum rate once the’ve done their back off. Less common, is a naive rate limiting approach that is neither adaptive or allows for backoff.

Thoughts, comments, seen a good rate limiting and backoff implementation? Add them below.