Wednesday 18 April 2018

Rate Limiting

At work I was recently tasked with integrating a rate limiting solution into out API. The purpose of the rate limiter was to cap the number of requests a user can make against an API over a specific period of time. This can help:

  • Prevent spam attacks.
  • Stop spikes in traffic from overloading our servers and degrading performance.
  • Help identify and stop misbehaving clients.

A user in the context of rate limiting is any entity that you wish to limit. Users can be identified by any unique identifier. Some examples include:
  • IP Address
  • User ID
  • Device Identifier
  • Customer Name (for B2B customers)
  • URL. E.g. only X number of POST requests to /withdrawl URL
Or any combination of the above.

The main requirement for the solution was that our front end services can run on multiple containers in a distributed manner so our rate limiting must work across servers. This would involve using an external key-value store (redis) to store the rate limiting information. As a result of this many of the details described below are directly related to how redis works. These details can normally be applied to other cache services as required.

For the remainder of this blog post I will discuss the various rate limiting algorithms I investigated, our chosen algorithm and some additional implementation details.

Rate Limiting Algorithms

Fixed Window Counters

Fixed window counters are the simplest method of rate limiting.  They work by maintaining a counter with the number of times an entity has hit the endpoint within a fixed window. If the user makes more than the allocated number of requests, then new requests are rejected. At the start of the new window the counter is reset to 0 and the user may make additional requests. A window is typically associated with a time and the counter is reset at the start of the period. Some examples include:
  • Day window: Reset at the same time every day, e.g. midnight
  • Hour window: Reset on the hour, e.g. 12:00:00, 13:00:00 ...
  • 15 minute window: Reset every 15 minutes e.g. 12:00:00, 12:15:00 ...
The key for the cache service would be {id}_{windowTimestamp}, where id could be the user id of the user and windowTimestamp would be the timestamp at the start of the window.

 In the following example the user is allowed to make 3 requests in a 60 second window.

Time
Action
Counter Value
Additional
12:00:05user makes requestuser1_1515120000: 0 → 1Key is user1_1515120000
12:00:15user makes requestuser1_1515120000: 1 → 2
12:01:01user makes requestuser1_1515120100: 0 → 1
Counter is reset as the 60 second period is is over. New key is user1_1515120100
12:01:10user makes requestuser1_1515120100: 1 → 2
12:01:40user makes requestuser1_1515120100: 2 → 3
12:01:50user makes requestuser1_1515120100: 3 → 4User is rejected as they are over limit
12:02:20user makes requestuser1_1515120200: 0 → 1Allowed as counter reset and new key is user1_1515120200

Note: Old keys can be set to automatically expire a fixed period after they are done. This may require 2 calls to redis to have an INCR, then EXPIRE command called.

Pros:

The advantages of this approach are:
  • Simple to implement
  • INCR is an atomic redis command.
  • Low memory requirement.

Cons:

  • It can allow a user to go above their allowed quota in a rolling 60 second period. For example, in the above limit of 3 requests per 60 seconds, if the user made 3 requests at 12:00:59 and a further 3 requests at 12:01:00, this would allow the user to make 6 requests in a 2 second period. 
  • It may also cause bursts of traffic across many clients. For example, if every client has used their quota for the pervious minute they may retry until they are allowed. This could cause many users to hit the server in the first second of the new window.

 Sliding Log

Sliding log rate limiting involves storing a history of requests for each user along with the associated time stamp. As new requests come in you count the number of requests for a period. Logs can be stored in a sorted set per user, where the key and value are the time stamp of the requests. Logs older than the allowed period are dropped.

To recreate the previous example where the user is allowed 3 requests per 60 second window:

Time
Action
Set Value
Additional
12:00:05user makes request
user1: {
15151200050000: 15151200050000
}

12:00:15user makes request
user1: {
15151200050000: 15151200050000
15151200150000: 15151200150000
}

12:01:01user makes request
user1: {
15151200050000: 15151200050000
15151200150000: 15151200150000
15151201010000: 15151201010000
}

12:01:10user makes request
user1: {
15151200150000: 15151200150000
15151201010000: 15151201010000
15151201100000: 15151201100000
}
User is still under threshold as the initial request (15151200000000) was more than 60 seconds ago
12:01:40user makes request
user1: {
15151201010000: 15151201010000
15151201100000: 15151201100000
15151201400000: 15151201400000
}

12:01:50user makes request
user1: {
15151201010000: 15151201010000
15151201100000: 15151201100000
15151201400000: 15151201400000
15151201500000: 15151201500000
}
User is rejected as they have had 4 request in the last 60 seconds
12:02:20user makes request
user1: {
15151201400000: 15151201400000
15151201500000: 15151201500000
15151202200000:15151202200000
}
Request is allowed as user is below threshold.

Pros:

  • Allows high precision on the limits
  • Avoids bursts of data at the start of a window

Cons:

  • It requires a set item per request so has a large memory requirement.
  • Failed requests may still be added to the set. This could mean that even rate limited requests that perform no action could block a user.
  • May require multiple commands to perform the update and expire of old keys. 
    • This can be made atomic by using a redis MULTI command or lua script.

 Sliding Window

In a sliding window rate limiter, each user is given limits to be used within a time frame. These are broken up into smaller buckets that allow us to control the way limits are available over a more evenly distributed window. As requests are used they are removed from the smaller windows and as those small windows expire the tokens are re-added. You can add more precision to your requests by having multiple windows that work in parallel to even out he flow of traffic. The data would be stored in a has per user to allow access to the individual buckets. The key for each bucket would be the time stamp at the start of the window.

A simplified version allowing 3 requests per 60 seconds in 15 second buckets is:

Time
Action
Set Value
Additional
12:00:05user makes request
user1: {
1515120000: 1
}

12:00:15user makes request
user1: {
1515120000: 1
1515120015: 1
}

12:01:01user makes request
user1: {
1515120015: 1
1515120100: 1
}
Timestamp for 1515120000 is expired so is deleted
12:01:10user makes request
user1: {
1515120015: 1
1515120100: 2
}

12:01:40user makes request
user1: {
1515120100: 2
1515120130: 1
}
Timestamp for 1515120015 expires
12:01:50user makes request
user1: {
1515120100: 2
1515120130: 1
1515120145: 1
}
User is rejected as they have had 4 request in the last 60 seconds
12:02:20user makes request
user1: {
1515120130: 1
1515120145: 1
1515120215: 1
}
Timestamp for 1515120100 expires.

Request is allowed as user is below back threshold.

Pros:

  • Allows varying degrees of precision depending on bucket time frames.
  • Lower memory constraints than sliding log.
  • Avoids bursts of data at the start of a new window.

Cons:

  • The update is not atomic so would require a redis lua script to control key counting / expiry.

 Sliding Tail

 Sliding tail rate limiting is an alternative implementation of the sliding window algorithm. It can be also be thought of as a fixed window rate limiting with history. In this case, we store a count of the current fixed window and the previous window. When a request comes in we calculate usage based on the count of the number of requests in the current window + a weighted count of the previous window. For example, if the current window is 25% through it's time, then we use a weighted count including 75% of the previous count + the current window count. This count is used to create the current number of tokens used by the user.

The key for the information is the same as the fixed window example where we use {id}_{windowTimestamp}. A server can do an atomic INCR for the current window time stamp and a GET for the previous window. To improve efficiency after the first request for a time stamp the server could cache the previous window data as this would no longer change.

To redo the example of 3 requests per 60 seconds:

Time
Action
Counters Value
Additional
12:00:05user makes requestuser1_1515120000: 0 → 1No previous window so only using current window
12:00:15user makes requestuser1_1515120000: 1 → 2
12:01:01user makes request
user1_1515120000: 2
user1_1515120100: 0 → 1
Weighted count is (2 * ((60-1)/60)) + 1 = 2.9
12:01:10user makes request
user1_1515120000: 2
user1_1515120100: 1 → 2
Weighted count is (2 * ((60-10)/60)) + 2 = 3.6

12:01:40user makes request
user1_1515120000: 2
user1_1515120100: 2 → 3
Weighted count is (2 * ((60-40)/60)) + 3 = 3.6
12:01:50user makes request
user1_1515120000: 2
user1_1515120100: 3 → 4
Weighted count is (2 * ((60-50)/60)) + 4 = 4.3
User is above threshold so rejected
12:02:20user makes request
user1_1515120100: 4
user1_1515120200: 0 → 1
Updated to a new window so user1_1515120000 is dropped and we move to using the weighted count from user1_1515120100

Weighted count is (4 * ((60-20)/60)) + 1 = 3.6

In the above example you can see that the effect is the same as the fixed window for normal usage. However, if the user sent any more requests before 12:2:31, they would be rejected.

Pros:

  • Low memory usage
  • Increment is atomic, although there are 2 extra commands
    • EXPIRE of the current key
    • GET of the previous key
  • Prevents bursts of data at the start of a new window
  • Could be tuned to be more or less permissive based on weighting of the previous window
  • Could be tuned based on whether to round up or down

Cons:

  • Only an approximation of the last window as it assumes there was a constant rate of traffic.
  • New clients could send bursts of traffic at their first window
  • If using atomic increment rejected requests could still add to the count for the window

Leaky Bucket

The leaky bucket as a meter algorithm (related to token bucket) describes how an API can be limited to avoid burstiness. Each user has a count of tokens which is incremented as a request comes in. If the counter is above the threshold (the bucket size), then additional requests are discarded. To "empty" the bucket and give the user more tokens we decrement the counter by a set amount every period until it reaches zero.

An implementation could keep a hash which includes, the token count, and the time stamp of the last time the bucket was emptied. As each request comes in, it performs 2 operations:
  1. Decrement the token based on a steady counter
  2. Add a token
If the number is below the bucket size, the request is allowed.

In the following example, we allow a bucket size of 3 and a refresh rate of 1 per 20 seconds:

Time
Action
Set Value
Additional
12:00:05user makes request
user1: {
tokens: 1
ts: 1515120000
}
Add a token to the bucket
12:00:15user makes request
user1: {
tokens: 2
ts: 1515120000
}

12:01:01user makes request
user1: {
tokens: 1
ts: 1515120100
}
The previous ts was 1515120000 which means we should decrement the token count by up to 3.
Then add this token
12:01:10user makes request
user1: {
tokens: 2
ts: 1515120100
}

12:01:40user makes request
user1: {
tokens: 1
ts: 1515120140
}
The previous ts was 1515120100 which means we should decrement the token count by up to 2.
Then add this token
12:01:50user makes request
user1: {
tokens: 2
ts: 1515120140
}

12:02:20user makes request
user1: {
tokens: 1
ts: 15151200220
}
The previous ts was 1515120140 which means we should decrement the token count by up to 2.
Then add this token

 Pros:

  • Memory efficient
  • Prevents bursts of traffic

Cons:

  • Requires a redis lock or lua script to update.
  • Harder to tune requirements and implement.

 Implementation

For our purposes we decided to choose the sliding tail algorithm as it provided some protection against bursts of traffic while having low processing and memory usage. It is also easy to implement using standard redis commands that allow for atomic operations.

Additional Details

HTTP Responses

When rejecting a user who is over the limit we choose a 429 Too Many Requests response.

In addition to the 429 rejection, all responses include HTTP headers to inform the user what their limit is, and how many tokens are remaining. There appears to be no consensus on the HTTP header to use for this information. From looking at responses to other APIs we decided to go with:
  • X-Rate-Limit-Limit - to describe the limit the user has for this request
  • X-Rate-Limit-Remaining - to describe the number of tokens left for the user.

Tokens Used for each Request

As some operations are related but have different processing requirements, we decided to allow a configurable number of tokens per request type. This means that for certain endpoints we can allow the same bucket of tokens to rate limit the different requests. For example, a request to GET /user would use 1 token from the {id}:user:{timestamp} bucket, however a request to POST /user would remove 2 tokens from the same bucket.