Recently one of our websites was hit by a scraper. We could see the requests in our logs as it queried our site for different keywords. Instead of adding a bunch of IPs to our firewalls, we decided to implement more intelligent throttling.
rack-attack allows us to limit the number of requests our application will accept from the same IP in a given time period. It then builds a Redis key based on request.ip. On each request it performs a Redis INCR operation (which will either create the key if it doesn’t exist or increment it). When the key is created, it sets a TTL based on our time period.
Once the Redis key value exceeds the limit, rack-attack will block the request at the Rack middleware layer. When the key expires, access will be allowed again. Here is the wiki page with more details. Data in Redis will look like this:
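To make the counting concrete, here is a minimal sketch of the increment-and-expire logic described above. It does not use rack-attack or Redis: FakeRedis is a hypothetical in-memory stand-in, and the key layout is an assumption for illustration (rack-attack's real keys are built differently).

```ruby
# In-memory stand-in for the two Redis features the throttle relies on:
# INCR and key expiry. FakeRedis is a hypothetical helper, not a real client.
class FakeRedis
  def initialize
    @data = {} # key => [count, expires_at]
  end

  # Increment the counter, creating it with a TTL on first use
  # or after the previous counter has expired.
  def incr_with_ttl(key, ttl, now)
    count, expires_at = @data[key]
    if count.nil? || now >= expires_at
      count = 0
      expires_at = now + ttl
    end
    count += 1
    @data[key] = [count, expires_at]
    count
  end
end

# Returns true when the request should be blocked.
def throttled?(store, ip, limit:, period:, now:)
  key = "throttle:req/ip:#{ip}" # assumed key layout
  store.incr_with_ttl(key, period, now) > limit
end

store = FakeRedis.new
# Four requests in the same 60-second period with a limit of 3.
(1..4).each do |second|
  blocked = throttled?(store, "1.2.3.4", limit: 3, period: 60, now: second)
  puts "t=#{second} blocked=#{blocked}"
end
```

Only the request that pushes the counter past the limit is blocked, and once the key has expired the count starts over.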
This approach will keep out most scrapers, but someone determined can easily figure out the thresholds. It also depends on whether we want to truly restrict someone from abusing the system or just limit the stress on our servers.
To keep out more malicious users we can implement exponential backoff. This will create multiple keys for each IP and time period (using more Redis RAM). There is a clever example on the wiki page showing how to create multiple levels in the same loop.
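A rough sketch of the multi-level idea, assuming a base of 20 requests and periods that grow as powers of 8 seconds (both numbers are illustrative, not our production values). The levels are computed as plain data and registered only if rack-attack is loaded, so the file also runs on its own.

```ruby
# Build throttle levels where both the limit and the period grow per level.
# The base numbers (20 requests, powers of 8 seconds) are illustrative.
def backoff_levels(count)
  (1..count).map do |level|
    { name: "req/ip/#{level}", limit: 20 * level, period: 8**level }
  end
end

levels = backoff_levels(3)

# Register the levels only when rack-attack is loaded (for example in a
# Rails initializer); otherwise this file just computes the table.
if defined?(Rack::Attack)
  levels.each do |cfg|
    Rack::Attack.throttle(cfg[:name], limit: cfg[:limit], period: cfg[:period]) do |req|
      req.ip
    end
  end
end

levels.each { |cfg| puts "#{cfg[:name]}: #{cfg[:limit]} requests per #{cfg[:period]}s" }
```

Each level gets its own counter, which is where the extra Redis memory goes: one key per IP per level.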
But what if we have lots of legitimate users behind the same IP? We can add IPs to a safelist or blocklist. We could put the IPs in a config file, but that would require a code deploy to change. Why not use Redis to store these IPs in separate keys?
To add/remove these records we built a simple GUI so our internal users can respond quickly if needed. We also set a default TTL of 1 week so these IPs do not remain in the system permanently.
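Here is a minimal sketch of the one-key-per-IP idea with a one-week default TTL. IpList is a hypothetical in-memory stand-in for Redis, and the key prefixes are assumptions. The safelist and blocklist checks are registered only if rack-attack is loaded.

```ruby
ONE_WEEK = 7 * 24 * 60 * 60 # default TTL in seconds

# In-memory stand-in for "one Redis key per IP, with expiry".
# With a real client this would map to SET with an expiry and EXISTS.
class IpList
  def initialize(prefix)
    @prefix = prefix
    @expires = {} # key => expiry timestamp
  end

  def add(ip, now:, ttl: ONE_WEEK)
    @expires[key_for(ip)] = now + ttl
  end

  def remove(ip)
    @expires.delete(key_for(ip))
  end

  # True only while the entry exists and has not expired.
  def include?(ip, now:)
    expiry = @expires[key_for(ip)]
    !expiry.nil? && now < expiry
  end

  private

  def key_for(ip)
    "#{@prefix}:#{ip}"
  end
end

SAFELIST = IpList.new("safelist")
BLOCKLIST = IpList.new("blocklist")

if defined?(Rack::Attack)
  Rack::Attack.safelist("redis safelist") { |req| SAFELIST.include?(req.ip, now: Time.now.to_i) }
  Rack::Attack.blocklist("redis blocklist") { |req| BLOCKLIST.include?(req.ip, now: Time.now.to_i) }
end
```

The GUI then only needs to call add and remove; nothing has to be redeployed when the lists change.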
Customer-specific configuration for APIs
IP throttling can be used for websites, but it is also very common for APIs. We may have multiple customers using our API and want to control access for each one. The configuration examples above apply to the entire application, so we need something more flexible. Full confession - I have not implemented this solution in production, so be careful and please share feedback in the comments below.
Let’s assume that when a request hits our servers there is a customer_id param. Let’s also assume that we have Free, Pro and Enterprise tiers with the following limits:
- Free - 100 requests per hour.
- Pro - 100 requests per minute and 5K requests per hour.
- Enterprise - 200 requests per minute and 10K requests per hour.
We do not want to query our primary DB during the IP check, so we will store this data in Redis with the help of the redis-objects gem.
We are storing tier in both the primary DB and in Redis (with a before_save callback) because we need to query customers by tier. Data in Redis will look like this:
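As a rough illustration of the mirroring, the sketch below uses a plain Hash in place of Redis and a plain Ruby class in place of the real model. The key layout, the save method standing in for the before_save callback, and the fallback to the free tier are all assumptions made for the example.

```ruby
# Hypothetical stand-in for the Redis side of the model; a plain Hash plays
# the role of Redis and the key layout is an assumption.
TIER_STORE = {}

class Customer
  attr_reader :id
  attr_accessor :tier

  def initialize(id, tier)
    @id = id
    @tier = tier
  end

  # Plays the role of the before_save callback: mirror tier into the store
  # every time the record is saved.
  def save
    TIER_STORE["customer:#{id}:tier"] = tier
    true
  end
end

# Lookup used at request time, with no primary DB query involved.
# Unknown customers fall back to "free" (an assumption for this sketch).
def tier_for(customer_id)
  TIER_STORE.fetch("customer:#{customer_id}:tier", "free")
end

Customer.new(1, "enterprise").save
puts tier_for(1)
```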
The throttle check can now be modified. The challenge is that this check occurs in an initializer at the Rack layer, and we need to grab customer_id from the request to dynamically determine throttling.
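One way this could look, untested in production: rack-attack accepts a proc for limit, so the limit can be resolved per request from customer_id. The tier table below mirrors the list above; TIERS_BY_CUSTOMER and limit_for are hypothetical helpers standing in for the Redis lookup, and treating unknown customers as Free is an assumption.

```ruby
# Per-tier limits from the list above, keyed by period in seconds.
TIER_LIMITS = {
  "free"       => { 3600 => 100 },
  "pro"        => { 60 => 100, 3600 => 5_000 },
  "enterprise" => { 60 => 200, 3600 => 10_000 }
}.freeze

# Hypothetical lookup table; in the setup described above this would be
# a read of the customer's tier from Redis.
TIERS_BY_CUSTOMER = { "1" => "free", "2" => "pro" }

def limit_for(customer_id, period)
  tier = TIERS_BY_CUSTOMER.fetch(customer_id.to_s, "free")
  # Tiers without a limit for this period are effectively unthrottled.
  TIER_LIMITS.fetch(tier).fetch(period, Float::INFINITY)
end

# Register one throttle per period, only when rack-attack is loaded.
if defined?(Rack::Attack)
  [60, 3600].each do |period|
    Rack::Attack.throttle("api/customer/#{period}",
                          limit: proc { |req| limit_for(req.params["customer_id"], period) },
                          period: period) do |req|
      # Requests without a customer_id return nil and are not counted.
      req.params["customer_id"]
    end
  end
end

puts limit_for("2", 60)
```

Because the proc runs on every request, a customer who changes tier picks up the new limit without an app restart.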
For even more flexibility, we can store a unique configuration for each customer in Redis hashes.
This will allow 100 requests per minute, 1K requests per hour and 10K requests per day. The key is the period (number of seconds) and the value is the limit (max requests). We would then use the hash to configure the throttle checks.
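A small sketch of how such a hash could be interpreted, assuming it comes back from Redis with string fields and values (as Redis hashes do). parse_limits and allowed? are hypothetical helpers, and the sample values mirror the limits just described.

```ruby
# What reading the hash for one customer might return: fields and values
# come back as strings. The sample mirrors the limits described above.
raw = { "60" => "100", "3600" => "1000", "86400" => "10000" }

# Convert the raw hash into sorted [period, limit] integer pairs.
def parse_limits(raw)
  raw.map { |period, limit| [Integer(period), Integer(limit)] }.sort
end

# A request is allowed only if every window is still under its limit.
# counts maps period => requests already seen in that window.
def allowed?(limits, counts)
  limits.all? { |period, limit| counts.fetch(period, 0) < limit }
end

limits = parse_limits(raw)
puts allowed?(limits, { 60 => 10, 3600 => 999, 86_400 => 5_000 })
puts allowed?(limits, { 60 => 10, 3600 => 1000, 86_400 => 5_000 })
```

The second call is rejected because the hourly window is already full, even though the minute and day windows still have room.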
The problem is that we would need to restart the app to pick up these custom configurations. Honestly, I am not sure the custom Hash approach delivers much value, and it significantly complicates things. If anyone has suggestions, feel free to share them.