I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
Redis and ETL
Frequently our application captures highly volatile data in Redis, but we also need to ETL some of those data elements to a different DB or data warehouse. We can change the same value (increment a counter) tens of thousands of times per second in Redis, but we cannot (and don't really need to) make the same updates in our SQL DB, where data is persisted to disk.
What we often need is to keep our SQL DB in near real-time sync with Redis. Our business users might not care if this data is 10-15 minutes delayed. How would we design such a system? I will explore that below using the Ruby on Rails framework.
From Redis
What if we had a blogging platform where we tracked unique visitors? We want to give a different experience to new vs. returning visitors. For example, comScore tracks unique monthly visitors using a combination of IP and user agent. In a previous job I helped build very similar functionality.
We can take the IP and user agent and hash the combination using MurmurHash.
# config/initializers/redis.rb
redis_conn = Redis.new(host: 'localhost', port: 6379, db: 0, driver: :hiredis)
REDIS_VISIT_COUNT = Redis::Namespace.new('vst', redis: redis_conn)

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  include ApplicationHelper
  before_action :return_visitor_check
end

# app/helpers/application_helper.rb
module ApplicationHelper
  def return_visitor_check
    # https://github.com/ksss/digest-murmurhash
    key = Digest::MurmurHash1.hexdigest("#{request.remote_ip}:#{request.user_agent}")
    @returning_visitor = true if REDIS_VISIT_COUNT.get(key).present?
    # pipeline both commands into one network round trip
    REDIS_VISIT_COUNT.pipelined do
      REDIS_VISIT_COUNT.incr(key)
      REDIS_VISIT_COUNT.expireat(key, Time.now.end_of_month.to_i)
    end
  end
end
# data in Redis
{"db":0,"key":"86164f23","ttl":-1,"type":"string","value":"5","size":1}
We are using Rails helpers here, but the code could alternatively be moved into separate classes or modules. return_visitor_check will be extremely fast, and now we can check if @returning_visitor == true in our controllers or view templates. Data will be automatically purged at the end of the month using a Redis TTL.
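To confirm the expiration actually got set, we can inspect a key's remaining time-to-live from a Rails console; a quick sketch (the IP and user agent values are made up):

key = Digest::MurmurHash1.hexdigest("203.0.113.7:Mozilla/5.0")
REDIS_VISIT_COUNT.ttl(key) # seconds until end of month, or -2 if the key is already gone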
But what if our business users need to see how many total visitors the site had that month and how many of them were returning? And they want to see this data by date. For that we aggregate data separately in Redis using a different namespace.
# config/initializers/redis.rb
REDIS_VISIT_COUNT_DATE = Redis::Namespace.new('vst_date', redis: redis_conn)

# app/helpers/application_helper.rb
def visit_count_date_helper
  # one key per visitor per date, e.g. "86164f23:20170526", so we can later
  # count unique visitors for a date by matching the pattern "*:20170526"
  visitor = Digest::MurmurHash1.hexdigest("#{request.remote_ip}:#{request.user_agent}")
  key = "#{visitor}:#{Time.now.strftime('%Y%m%d')}"
  REDIS_VISIT_COUNT_DATE.pipelined do
    REDIS_VISIT_COUNT_DATE.incr(key)
    # expireat takes a Unix timestamp; expire would expect a number of seconds
    REDIS_VISIT_COUNT_DATE.expireat(key, Time.now.end_of_month.to_i)
  end
end
# data in Redis
{"db":0,"key":"vst:YYYYMMDD","ttl":2592000,"type":"string","value":"17834",...}
Now we need to move it to our SQL DB. Sidekiq is a great library for running background jobs, and it also uses Redis as its queue. We can wrap it in ActiveJob and use sidekiq-cron to run the process every 15 minutes. In our SQL DB we will have a Visits table with date, total_count and unique_count columns.
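A minimal migration for that table might look like this (the column types and the unique index are my assumptions, not from the original setup):

# db/migrate/xxx_create_visits.rb
class CreateVisits < ActiveRecord::Migration[5.1]
  def change
    create_table :visits do |t|
      t.date    :date, null: false
      t.integer :total_count,  default: 0
      t.integer :unique_count, default: 0
      t.timestamps
    end
    # one row per date; first_or_create below relies on this
    add_index :visits, :date, unique: true
  end
end
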
# app/models/visit.rb
class Visit < ApplicationRecord
end

# app/jobs/visit_count_job.rb
class VisitCountJob < ApplicationJob
  queue_as :low

  def perform(date: Time.now.strftime("%Y%m%d"))
    # pass the date through so scheduled and ad hoc runs behave the same
    VisitCount.new(date: date).perform
  end
end
# app/services/visit_count.rb
class VisitCount
  # can run this directly and specify a date
  def initialize(date: Time.now.strftime("%Y%m%d"))
    @date = date
  end

  def perform
    total_count = 0
    unique_count = 0
    # grab keys that match the pattern for that date, one key per unique visitor
    REDIS_VISIT_COUNT_DATE.keys("*:#{@date}").each do |key|
      total_count += REDIS_VISIT_COUNT_DATE.get(key).to_i
      unique_count += 1
    end
    # persist data
    visit = Visit.where(date: @date).first_or_create
    visit.update(total_count: total_count, unique_count: unique_count)
  end
end
We will need to create a couple of Sidekiq config files. Since this ETL process is not very time sensitive, we can put it in the low priority queue so jobs in the default and high queues will be processed first.
# config/sidekiq.yml
---
:queues:
  - [high, 3]
  - [default, 2]
  - [low, 1]

# config/initializers/sidekiq.rb
schedule_array = [
  { 'name'       => 'VisitCountJob',
    'class'      => 'VisitCountJob',
    'cron'       => '*/15 * * * *',
    'queue'      => 'low',
    'active_job' => true },
]

Sidekiq.configure_server do |config|
  config.redis = { host: 'localhost', port: 6379, db: 0, namespace: 'sidekiq' }
  Sidekiq::Cron::Job.load_from_array! schedule_array
end

Sidekiq.configure_client do |config|
  config.redis = { host: 'localhost', port: 6379, db: 0, namespace: 'sidekiq' }
end
But what if we don't want that 15 minute delay? Why not wrap that VisitCount class in a daemon running within our application?
# lib/visit_count_daemon.rb
class VisitCountDaemon
  def perform
    # re-sync Redis counts to SQL roughly once per second
    loop do
      VisitCount.new.perform
      sleep(1)
    end
  end
end
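To actually start it we need a long-running process. A minimal sketch (the script path and supervisor choice are my assumptions) is a runner that boots the Rails environment and then blocks:

# script/visit_count_daemon.rb
require_relative '../config/environment'
VisitCountDaemon.new.perform

We would launch it with bundle exec ruby script/visit_count_daemon.rb under a process supervisor such as systemd, so it gets restarted if it dies.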
Into Redis
There are also times when we need to move data into Redis. Redis will have no problem keeping up with updates made in a relational DB (which has to persist data to disk), so latency is unlikely to be an issue. Often the simplest choice is to use the default Rails caching.
# app/models/user.rb
class User < ApplicationRecord
  def my_method
    # Rails derives the cache key from the object's class, ID and
    # updated_at timestamp; we could also build our own key
    Rails.cache.fetch([self, __method__], expires_in: 150.seconds) do
      # ... expensive computation whose result gets cached in Redis
    end
  end
end
# data in Redis
{"db":0,"key":"namespace:User/1-1495838092/my_method","ttl":150,"type":"string",...}
Another option is to use model callbacks. What if we wanted to keep a counter in Redis of how many times an article has been commented on? We can utilize redis-objects, which creates methods directly in our model that access data in Redis.
# app/models/article.rb
class Article < ApplicationRecord
  include Redis::Objects
  has_many :comments
  counter :num_comments
end

# app/models/comment.rb
class Comment < ApplicationRecord
  belongs_to :article
  # after_create rather than after_save, so editing a comment does not double count
  after_create :update_num_comments

  private

  def update_num_comments
    article.num_comments.incr
  end
end
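Reading the counter back is just as direct; a quick sketch:

article = Article.find(1)
article.num_comments.value # current count, read straight from Redis
article.num_comments.reset # zero it out if we ever need to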
And we could use a background job. Our blogging platform allows internal admins to fetch data from a remote server and update cached content in Redis.
# app/controllers/update_redis_cache_controller.rb
class UpdateRedisCacheController < ApplicationController
  def create
    UpdateRedisJob.perform_later
    head :ok
  end
end
# app/jobs/update_redis_job.rb
class UpdateRedisJob < ApplicationJob
  def perform
    # grab the file and process each row; REDIS is a bare (un-namespaced)
    # connection defined in config/initializers/redis.rb
    CSV.parse(File.read('...'), headers: true).each do |row|
      key   = row['id']        # placeholder: generate key from the row
      value = row.to_h.to_json # placeholder: transform row into appropriate Redis structure
      REDIS.pipelined do
        REDIS.del(key)
        REDIS.set(key, value)
      end
    end
  end
end
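The controller above needs a matching route; a hypothetical entry (in a real app we would also restrict it to admins):

# config/routes.rb
post 'update_redis_cache', to: 'update_redis_cache#create'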
Within Redis
A third possibility is that we need to move data within Redis, transforming it in the process. We are storing REDIS_VISIT_COUNT_DATE in distinct Redis keys. What if we wanted to roll all dates and their visitor counts up into one hash record just before the data is purged with the TTL? We can call the redis-objects library directly from our job class.
# app/jobs/redis_visit_count_date_job.rb
class RedisVisitCountDateJob
  include Sidekiq::Worker

  def perform
    # Redis::HashKey talks to Redis.current by default
    date_hash = Redis::HashKey.new('redis_visit_count_date')
    dates_of_this_month.each do |each_date|
      date_total_count = 0
      REDIS_VISIT_COUNT_DATE.keys("*:#{each_date}").each do |key|
        date_total_count += REDIS_VISIT_COUNT_DATE.get(key).to_i
      end
      date_hash[each_date] = date_total_count
    end
  end

  private

  # every date from the start of the month through today, as YYYYMMDD strings
  def dates_of_this_month
    (Date.today.beginning_of_month..Date.today).map { |d| d.strftime("%Y%m%d") }
  end
end
# data in Redis
{"db":0,"key":"redis_visit_count_date","ttl":-1,"type":"hash",
"value":{"YYYYMMDD":"10","YYYYMMDD":"12",...},...}
This article was inspired by mosql, but I could not think of a way to extract these ideas into a gem. There are also Pentaho ETL tools for Redis, but they do not seem to be maintained.