I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
Using Redis probabilistic data structures to track unique events
Aug 19, 2018
On a previous project we had to develop a process to de-dupe events from our web log files as we processed them through a data pipeline. We decided to use a hash of the IP & UserAgent combination to “uniquely” identify the users responsible for the events, and temporarily stored the data in Redis.
Storing hash of IP & UserAgent as separate Redis keys
Here is sample Python code implementing this logic. It generates “random” IPs and UserAgents, hashes the combination and checks whether the hash exists in Redis as a separate key. If the key exists, it returns False (dupe). Otherwise it creates the key (with a 900 second TTL per our time window requirements), increments a new_events counter and returns True (new event). To force “dupe” events this code uses a small array of UserAgents and limited IP ranges.
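A minimal sketch of that logic (the key names, hashing scheme and connection details here are illustrative):

```python
import hashlib
import random

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Small UserAgent list and limited IP ranges to force "dupe" events
USER_AGENTS = ['Mozilla/5.0', 'Chrome/68.0', 'Safari/11.1', 'Edge/17']

def random_event():
    ip = '10.0.{}.{}'.format(random.randint(0, 3), random.randint(0, 255))
    ua = random.choice(USER_AGENTS)
    return hashlib.sha1('{}|{}'.format(ip, ua).encode()).hexdigest()

def is_new_event_keys(uniq_hash):
    key = 'event:{}'.format(uniq_hash)
    if r.exists(key):
        return False           # dupe
    r.set(key, 1, ex=900)      # 900 second TTL per our time window
    r.incr('new_events')       # counter of unique events seen
    return True
```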
This solution worked, but the more data we processed, the more RAM it required. We wanted a more memory-efficient alternative.
HyperLogLog
Redis HyperLogLog is a probabilistic data structure that gives an approximate count of unique values using a fraction of the memory, with a standard error of 0.81% (99%+ accurate).
Redis HyperLogLog can count up to 2^64 items, and in honor of Philippe Flajolet the commands begin with PF.
To determine whether this was a new event, we checked the count of the HLL data structure, added the event, then checked the count again. Notice that we are using a separate Redis connection client (r_hll) pointed at a different Redis DB (1 vs 0).
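A sketch of that first attempt (key and counter names are illustrative; note the race condition discussed next):

```python
import redis

r_hll = redis.Redis(host='localhost', port=6379, db=1)

def is_new_event_hll_naive(uniq_hash):
    # Not safe under concurrency: another client can PFADD between these calls
    count = r_hll.pfcount('events_hll')
    r_hll.pfadd('events_hll', uniq_hash)
    if r_hll.pfcount('events_hll') > count:
        r_hll.incr('new_events_hll')  # counter of unique events
        return True
    return False
```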
The problem with this approach is that in a multi-threaded environment there is a very high likelihood of another thread (or process) adding a different item to our HLL between the calls, changing the count. We needed to guarantee that nothing happens between the three operations (pfcount, pfadd, pfcount). And we could not use MULTI/EXEC, because the results of queued commands are only returned at EXEC time, so there is no way to capture the first count and compare against it inside the transaction.
Lua script
Redis executes Lua scripts atomically (everything else is blocked while the script is running, so it must be fast). This approach also makes only one Redis API call per event (vs three), which is faster.
The Lua script will accept counter and uniq_hash as params (in real-world applications counters are likely to be time-series, such as daily counters). The script encapsulates the pfcount / pfadd / pfcount logic and returns true or false.
We load the Lua script from our Python code with the script_load command and then execute evalsha, passing in the arguments.
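A sketch of that flow, with the Lua script embedded as a string (key and counter names are illustrative):

```python
import redis

r_hll = redis.Redis(host='localhost', port=6379, db=1)

# KEYS[1] = counter key, KEYS[2] = HLL key, ARGV[1] = uniq_hash.
# The whole script runs atomically inside Redis, so no other
# command can interleave between the PFCOUNT / PFADD / PFCOUNT calls.
NEW_EVENT_LUA = """
local count = redis.call('PFCOUNT', KEYS[2])
redis.call('PFADD', KEYS[2], ARGV[1])
if redis.call('PFCOUNT', KEYS[2]) > count then
    redis.call('INCR', KEYS[1])
    return 1
end
return 0
"""

sha = r_hll.script_load(NEW_EVENT_LUA)

def is_new_event_hll(uniq_hash):
    # evalsha(sha, number_of_keys, key1, key2, arg1)
    return bool(r_hll.evalsha(sha, 2, 'new_events_hll', 'events_hll', uniq_hash))
```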
Now we are saving $ on memory and still achieving 99%+ accuracy.
ReBloom module
A Bloom Filter is a probabilistic data structure designed to tell us with a reasonable degree of certainty whether we have seen an item before. Redis does not support them natively, but there is a ReBloom module that enables that functionality.
In this Python code we first reserve a Bloom Filter for 10,000 entries with an error rate of 1 in 1,000. The more accurate we want this to be, the more memory and CPU will be required. When we execute BF.ADD, the response comes back as 0 if it’s not a new item and 1 if it is new.
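A minimal sketch using redis-py’s generic execute_command (the key name and DB choice are illustrative, and this assumes the ReBloom module is loaded on the server):

```python
import redis

r_bf = redis.Redis(host='localhost', port=6379, db=0)

# Reserve the filter up front: error rate 0.001 (1 in 1,000), 10,000 entries.
# BF.RESERVE errors out if the key already exists, hence the try/except.
try:
    r_bf.execute_command('BF.RESERVE', 'events_bf', 0.001, 10000)
except redis.exceptions.ResponseError:
    pass

def is_new_event_bloom(uniq_hash):
    # BF.ADD returns 1 if the item was newly added, 0 if (probably) seen before
    return bool(r_bf.execute_command('BF.ADD', 'events_bf', uniq_hash))
```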
Combined Python code
Now we create a new_event.py script combining the approaches.
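A condensed sketch of new_event.py, assuming random_event and the three is_new_event_* helpers from the snippets above:

```python
# new_event.py - run the same event stream through all three approaches
# and tally how many events each one reports as new.

if __name__ == '__main__':
    results = {'keys': 0, 'hll': 0, 'bloom': 0}
    for _ in range(100000):
        uniq_hash = random_event()
        results['keys'] += is_new_event_keys(uniq_hash)
        results['hll'] += is_new_event_hll(uniq_hash)
        results['bloom'] += is_new_event_bloom(uniq_hash)
    print(results)
```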
After running the code we can launch redis-cli and compare results. Note: results may vary depending on how the random IPs and UserAgents are generated.
Memory usage varied widely. Storing 1 million unique IP & UserAgent combinations took almost 100MB with separate keys vs ~1MB as a HyperLogLog. A Bloom Filter provisioned for 1 million items used about 3MB.
Which approach is best depends on the situation. Leveraging the ReBloom module requires either a hosting provider that supports it or full control of the server where Redis is running. Using separate keys gives the most granularity in expiring data after a time window but uses the most RAM. HyperLogLog works with standard Redis, but the Lua script introduces slightly more complexity.