I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
Redis and ETL
May 28, 2017
Frequently our application captures highly volatile data in Redis but we also need to ETL some of those data elements to a different DB or data warehouse. We can change the same value (increment a counter) tens of thousands of times per second in Redis but we cannot (and don’t really need to) make the same updates in our SQL DB (where data is persisted to disk).
What we often need is to keep our SQL DB in near real-time sync with Redis. Our business users might not care if this data is 10-15 minutes delayed. How would we design such a system? I will explore that below using Ruby on Rails framework.
What if we had a blogging platform system where we tracked unique visitors? We want to give different experience to new vs. returning visitors. For example, comScore tracks unique monthly visitors using combination of IP and user agent. In a previous job I helped build very similar functionality.
We can take IP & UserAgent and hash the combination using MurmurHash.
We are using Rails helpers but alternatively code could be moved into different classes or modules.
return_visitor_check will be extremely fast and now we can use if @returning_visitor == true in our controllers or view templates. Data will be automatically purged at the end of the month using Redis TTL.
But what if our business users need to see how many total visitors site had that month and how many of them were returning? And they want to see this data by date. For that we aggregate data separately in Redis using a different namespace.
Now we need to move it to our SQL DB. Sidekiq is a great library for running background jobs and it also uses Redis as queue. We can wrap it in ActiveJob and use sidekiq-cron to run the process every 15 minutes. In our SQL DB we will have Visits table with date, total_count and unique_count columns.
We will need to create a couple Sidekiq config files. Since this ETL process is not very time sensitive we can put it in low priority queue so jobs in default and high queues will be processed first.
But what if we don’t want that 15 minute delay? Why not wrap that VisitCount class in a daemon running w/in our application? A couple of useful articles here and here.
There are also times when we need to move data to Redis. Redis will have no problem keeping up with updates made in relational DB (which have to be persisted to disk) so latency is unlikely to be an issue. Often the simplest choice is to implement default Rails caching
Another option is to use model callbacks. What if we wanted keep a counter in Redis for how many times article was commented on? We can utilize redis-objects. It will create methods directly in our model that will access data in Redis.
And we could use a background job. Our blogging platform allows internal admins to fetch data from remote server and update cached content in Redis.
A third possibility is that we need to move data w/in Redis transforming it in the process. We are storing REDIS_VISIT_COUNT_DATE in distinct Redis keys. What if we wanted to create one hash record of all dates and the number of visitors just before data was purged w TTL? We can call redis-objects library directly from our job class.
This article was inspired by mosql but I could not think of a way to extract these ideas into a gem. There is also Pentaho ETL tools for Redis but does not seem be maintained.