Usually our applications have a DB (MySQL, Postgres, etc.) that we use to permanently store information about our users and other records. But there are also situations where we need to temporarily store data used by a background process. This data might be structured very differently and would not fit into our relational DB.
Recently we were doing large-scale analysis in our system (built on Rails 4.2) to determine which users might have duplicate records. It is a long story why we have such records, but I assure you there is a legitimate reason. What we needed was a process that would flag likely duplicates so humans could decide whether to merge them.
As the “unique” identifier for likely duplicates we decided to use the combination “first_name last_name”. We knew that there would be many false positives, but this was a starting point. In reality our business logic was much more complex, but I am omitting many confidential details.
We decided to use Redis to store the ephemeral data while we were running the analysis. For that we created a Redis connection with a separate key namespace, ‘record_match’.
We created a PORO service object.
One issue to be aware of is that REDIS_RM.del("*") will cause problems if two separate analysis processes are running at the same time.
First we loop through all user records, creating “unique” first and last name combinations. Then we use the Redis SET datatype to store user IDs under each combination. A SET guarantees uniqueness of its members, so the same user ID is never stored twice under one key.
Results will look like this when stored in Redis:
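As a minimal sketch of this grouping step, the snippet below uses hypothetical user hashes and a small in-memory stand-in (FakeRedis, an illustrative class, not a real library) that mimics only SADD, SMEMBERS and KEYS so it runs without a server. The helper names name_key and group_users, and the choice to downcase names, are assumptions rather than the original implementation.

```ruby
require "set"

# Hypothetical in-memory stand-in for the few Redis commands used here,
# so the sketch runs without a server. With a real connection you would
# pass the Redis client (REDIS_RM in this post) instead.
class FakeRedis
  def initialize
    @store = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # SADD: add a member to the set stored at key; duplicates are ignored.
  def sadd(key, member)
    @store[key].add(member.to_s)
  end

  # SMEMBERS: return all members of the set stored at key.
  def smembers(key)
    @store.key?(key) ? @store[key].to_a : []
  end

  # KEYS: return all key names (pattern matching is not modelled).
  def keys(_pattern = "*")
    @store.keys
  end
end

# Build the "first_name last_name" key; downcasing and stripping are assumptions.
def name_key(user)
  "#{user[:first_name]} #{user[:last_name]}".strip.downcase
end

# Store each user ID in the set that belongs to its name key.
def group_users(redis, users)
  users.each { |user| redis.sadd(name_key(user), user[:id]) }
end

USERS = [
  { id: 1, first_name: "John", last_name: "Smith" },
  { id: 2, first_name: "john", last_name: "smith" },
  { id: 3, first_name: "Mary", last_name: "Jones" },
  { id: 1, first_name: "John", last_name: "Smith" } # repeated on purpose
].freeze

redis = FakeRedis.new
group_users(redis, USERS)
redis.keys.each { |key| puts "#{key} => #{redis.smembers(key).inspect}" }
# john smith => ["1", "2"]
# mary jones => ["3"]
```

User 1 appears twice in the input but only once in its set, which is the uniqueness guarantee doing its job. Note that a real Redis server does not guarantee any particular order for SMEMBERS.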
Now we go through the Redis keys and delete those that have only one user ID in the SET (unique records). This could be combined with process_results to make the code a little faster (no need to loop through all Redis keys and check SET size).
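A rough sketch of this pruning pass might look like the following. The method name prune_unique_keys is hypothetical, and redis stands for any client that responds to keys, scard and del, as the redis-rb client does.

```ruby
# Delete every key whose SET holds exactly one user ID, because a single
# ID cannot be a duplicate of anything. Returns the number of keys removed.
# `redis` is assumed to respond to keys, scard and del.
def prune_unique_keys(redis)
  removed = 0
  redis.keys("*").each do |key|
    next unless redis.scard(key) == 1 # SCARD returns the SET size

    redis.del(key)
    removed += 1
  end
  removed
end
```

KEYS walks the entire keyspace in one blocking call, so on a large dataset an incremental iterator such as redis-rb's scan_each may be a gentler choice.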
Then we loop through the remaining keys and the members of their SETs to create potential match records in the main DB. These potential match records go through the manual review process, and the corresponding user records can then be merged (or not). Every member of a SET needs to be compared with every other member exactly once. A SET with key john smith and members [id1, id2, id3] will become 3 separate records comparing id1 to id2, id1 to id3 and id2 to id3.
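The pairing rule can be sketched with Ruby's Array#combination. The helper names comparison_pairs and potential_matches and the hash layout of each match are hypothetical; the real code would create records in the main DB instead of printing. Skipping groups with fewer than two members here is one way to fold the singleton check into this step.

```ruby
# Turn one group of user IDs into every unordered pair that needs review.
# A group of n members yields n * (n - 1) / 2 pairs.
def comparison_pairs(member_ids)
  member_ids.uniq.sort.combination(2).to_a
end

# Build potential matches for all groups, skipping groups that cannot
# contain a duplicate (fewer than two members).
def potential_matches(groups)
  groups.flat_map do |name_key, member_ids|
    next [] if member_ids.size < 2

    comparison_pairs(member_ids).map do |first_id, second_id|
      { name_key: name_key, first_id: first_id, second_id: second_id }
    end
  end
end

groups = { "john smith" => %w[id1 id2 id3], "mary jones" => %w[id4] }
potential_matches(groups).each do |match|
  puts "#{match[:first_id]} vs #{match[:second_id]} (#{match[:name_key]})"
end
# id1 vs id2 (john smith)
# id1 vs id3 (john smith)
# id2 vs id3 (john smith)
```

Sorting the IDs before pairing keeps each pair in a stable order, so the same two users always produce the same record regardless of how the SET returned them.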
We store the user ID comparisons in the main DB because that dataset is much smaller, so it does not take long to persist to disk. We also want to use relational DB validations, and the data structure fits our DB model.
Here are the various Ruby gems we used:
The examples above do not go deep into the actual details but instead focus on the usage of Redis and its flexible, fast data structures.