I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
Redis and ETL
Frequently our application captures highly volatile data in Redis, but we also need to ETL some of those data elements to a different DB or data warehouse. We can change the same value (increment a counter) tens of thousands of times per second in Redis, but we cannot (and don't really need to) make the same updates in our SQL DB, where data is persisted to disk.
What we often need is to keep our SQL DB in near real-time sync with Redis. Our business users might not care if this data is 10-15 minutes delayed. How would we design such a system? I will explore that below using the Ruby on Rails framework.
From Redis
What if we had a blogging platform where we tracked unique visitors? We want to give a different experience to new vs. returning visitors. For example, comScore tracks unique monthly visitors using a combination of IP and user agent. In a previous job I helped build very similar functionality.
We can take the IP and user agent and hash the combination using MurmurHash.
# config/initializers/redis.rb
redis_conn = Redis.new(host: 'localhost', port: 6379, db: 0, driver: :hiredis)
REDIS_VISIT_COUNT = Redis::Namespace.new('vst', redis: redis_conn)

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  include ApplicationHelper
  before_action :return_visitor_check
end

# app/helpers/application_helper.rb
module ApplicationHelper
  def return_visitor_check
    # https://github.com/ksss/digest-murmurhash
    key = Digest::MurmurHash1.hexdigest("#{request.remote_ip}:#{request.user_agent}")
    @returning_visitor = true if REDIS_VISIT_COUNT.get(key).present?
    # pipeline both commands into one network round trip
    REDIS_VISIT_COUNT.pipelined do
      REDIS_VISIT_COUNT.incr(key)
      REDIS_VISIT_COUNT.expireat(key, Time.now.end_of_month.to_i)
    end
  end
end
# data in Redis
{"db":0,"key":"86164f23","ttl":-1,"type":"string","value":"5","size":1}
We are using Rails helpers here, but the code could alternatively be moved into separate classes or modules. return_visitor_check will be extremely fast, and now we can check if @returning_visitor == true in our controllers or view templates. Data will be automatically purged at the end of the month using a Redis TTL.
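To confirm the expiration actually got set, we can inspect a key's remaining time-to-live from a Rails console; a quick sketch (the IP and user agent values are made up):

key = Digest::MurmurHash1.hexdigest("203.0.113.7:Mozilla/5.0")
REDIS_VISIT_COUNT.ttl(key) # seconds until end of month, or -2 if the key is already gone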
But what if our business users need to see how many total visitors the site had that month and how many of them were returning? And they want to see this data by date. For that we aggregate data separately in Redis using a different namespace.
# config/initializers/redis.rb
REDIS_VISIT_COUNT_DATE = Redis::Namespace.new('vst_date', redis: redis_conn)

# app/helpers/application_helper.rb
def visit_count_date_helper
  # one key per visitor per date, e.g. "86164f23:20170526", so we can later
  # count unique visitors for a date by matching the pattern "*:20170526"
  visitor = Digest::MurmurHash1.hexdigest("#{request.remote_ip}:#{request.user_agent}")
  key = "#{visitor}:#{Time.now.strftime('%Y%m%d')}"
  REDIS_VISIT_COUNT_DATE.pipelined do
    REDIS_VISIT_COUNT_DATE.incr(key)
    # expireat takes a Unix timestamp; expire would expect a number of seconds
    REDIS_VISIT_COUNT_DATE.expireat(key, Time.now.end_of_month.to_i)
  end
end
# data in Redis
{"db":0,"key":"vst:YYYYMMDD","ttl":2592000,"type":"string","value":"17834",...}
Now we need to move it to our SQL DB. Sidekiq is a great library for running background jobs, and it also uses Redis as its queue. We can wrap it in ActiveJob and use sidekiq-cron to run the process every 15 minutes. In our SQL DB we will have a Visits table with date, total_count and unique_count columns.
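A minimal migration for that table might look like this (the column types and the unique index are my assumptions, not from the original setup):

# db/migrate/xxx_create_visits.rb
class CreateVisits < ActiveRecord::Migration[5.1]
  def change
    create_table :visits do |t|
      t.date    :date, null: false
      t.integer :total_count,  default: 0
      t.integer :unique_count, default: 0
      t.timestamps
    end
    # one row per date; first_or_create below relies on this
    add_index :visits, :date, unique: true
  end
end
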
# app/models/visit.rb
class Visit < ApplicationRecord
end

# app/jobs/visit_count_job.rb
class VisitCountJob < ApplicationJob
  queue_as :low

  def perform(date: Time.now.strftime("%Y%m%d"))
    # pass the date through so scheduled and ad hoc runs behave the same
    VisitCount.new(date: date).perform
  end
end
# app/services/visit_count.rb
class VisitCount
  # can run this directly and specify a date
  def initialize(date: Time.now.strftime("%Y%m%d"))
    @date = date
  end

  def perform
    total_count = 0
    unique_count = 0
    # grab keys that match the pattern for that date, one key per unique visitor
    REDIS_VISIT_COUNT_DATE.keys("*:#{@date}").each do |key|
      total_count += REDIS_VISIT_COUNT_DATE.get(key).to_i
      unique_count += 1
    end
    # persist data
    visit = Visit.where(date: @date).first_or_create
    visit.update(total_count: total_count, unique_count: unique_count)
  end
end
We will need to create a couple of Sidekiq config files. Since this ETL process is not very time sensitive, we can put it in the low priority queue so jobs in the default and high queues will be processed first.
# config/sidekiq.yml
---
:queues:
  - [high, 3]
  - [default, 2]
  - [low, 1]

# config/initializers/sidekiq.rb
schedule_array = [
  { 'name'       => 'VisitCountJob',
    'class'      => 'VisitCountJob',
    'cron'       => '*/15 * * * *',
    'queue'      => 'low',
    'active_job' => true },
]

Sidekiq.configure_server do |config|
  config.redis = { host: 'localhost', port: 6379, db: 0, namespace: 'sidekiq' }
  Sidekiq::Cron::Job.load_from_array! schedule_array
end

Sidekiq.configure_client do |config|
  config.redis = { host: 'localhost', port: 6379, db: 0, namespace: 'sidekiq' }
end
But what if we don't want that 15 minute delay? Why not wrap that VisitCount class in a daemon running within our application?
# lib/visit_count_daemon.rb
class VisitCountDaemon
  def perform
    # re-sync Redis counts to SQL roughly once per second
    loop do
      VisitCount.new.perform
      sleep(1)
    end
  end
end
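To actually start it we need a long-running process. A minimal sketch (the script path and supervisor choice are my assumptions) is a runner that boots the Rails environment and then blocks:

# script/visit_count_daemon.rb
require_relative '../config/environment'
VisitCountDaemon.new.perform

We would launch it with bundle exec ruby script/visit_count_daemon.rb under a process supervisor such as systemd, so it gets restarted if it dies.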
Into Redis
There are also times when we need to move data into Redis. Redis will have no problem keeping up with updates made in a relational DB (which has to persist data to disk), so latency is unlikely to be an issue. Often the simplest choice is to use the default Rails caching.
# app/models/user.rb
class User < ApplicationRecord
  def my_method
    # Rails derives the cache key from the object's class, ID and
    # updated_at timestamp; we could also build our own key
    Rails.cache.fetch([self, __method__], expires_in: 150.seconds) do
      # ... expensive computation whose result gets cached in Redis
    end
  end
end
# data in Redis
{"db":0,"key":"namespace:User/1-1495838092/my_method","ttl":150,"type":"string",...}
Another option is to use model callbacks. What if we wanted to keep a counter in Redis of how many times an article has been commented on? We can utilize redis-objects, which creates methods directly in our model that access data in Redis.
# app/models/article.rb
class Article < ApplicationRecord
  include Redis::Objects
  has_many :comments
  counter :num_comments
end

# app/models/comment.rb
class Comment < ApplicationRecord
  belongs_to :article
  # after_create rather than after_save, so editing a comment does not double count
  after_create :update_num_comments

  private

  def update_num_comments
    article.num_comments.incr
  end
end
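Reading the counter back is just as direct; a quick sketch:

article = Article.find(1)
article.num_comments.value # current count, read straight from Redis
article.num_comments.reset # zero it out if we ever need to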
And we could use a background job. Our blogging platform allows internal admins to fetch data from a remote server and update cached content in Redis.
# app/controllers/update_redis_cache_controller.rb
class UpdateRedisCacheController < ApplicationController
  def create
    UpdateRedisJob.perform_later
    head :ok
  end
end
# app/jobs/update_redis_job.rb
class UpdateRedisJob < ApplicationJob
  def perform
    # grab the file and process each row; REDIS is a bare (un-namespaced)
    # connection defined in config/initializers/redis.rb
    CSV.parse(File.read('...'), headers: true).each do |row|
      key   = row['id']        # placeholder: generate key from the row
      value = row.to_h.to_json # placeholder: transform row into appropriate Redis structure
      REDIS.pipelined do
        REDIS.del(key)
        REDIS.set(key, value)
      end
    end
  end
end
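The controller above needs a matching route; a hypothetical entry (in a real app we would also restrict it to admins):

# config/routes.rb
post 'update_redis_cache', to: 'update_redis_cache#create'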
Within Redis
A third possibility is that we need to move data within Redis, transforming it in the process. We are storing REDIS_VISIT_COUNT_DATE in distinct Redis keys. What if we wanted to roll all dates and their visitor counts up into one hash record just before the data is purged with the TTL? We can call the redis-objects library directly from our job class.
# app/jobs/redis_visit_count_date_job.rb
class RedisVisitCountDateJob
  include Sidekiq::Worker

  def perform
    # Redis::HashKey talks to Redis.current by default
    date_hash = Redis::HashKey.new('redis_visit_count_date')
    dates_of_this_month.each do |each_date|
      date_total_count = 0
      REDIS_VISIT_COUNT_DATE.keys("*:#{each_date}").each do |key|
        date_total_count += REDIS_VISIT_COUNT_DATE.get(key).to_i
      end
      date_hash[each_date] = date_total_count
    end
  end

  private

  # every date from the start of the month through today, as YYYYMMDD strings
  def dates_of_this_month
    (Date.today.beginning_of_month..Date.today).map { |d| d.strftime("%Y%m%d") }
  end
end
# data in Redis
{"db":0,"key":"redis_visit_count_date","ttl":-1,"type":"hash",
"value":{"YYYYMMDD":"10","YYYYMMDD":"12",...},...}
This article was inspired by mosql, but I could not think of a way to extract these ideas into a gem. There are also Pentaho ETL tools for Redis, but they do not seem to be maintained.