Last week I presented at RedisConf18. My talk was about integrating Redis with Elasticsearch (here are my slides). I spoke about how to use Redis as a temporary store during data processing and touched on Redis Streams (a new data type coming soon).

I was also able to attend interesting presentations and spend time with other engineers discussing Redis roadmap features. There were great keynote talks from Scott McNealy, Joel Spolsky and others. Here are my personal highlights from the conference.

Redis Streams and Consumer Groups

I used Streams as part of my presentation, but I also had a chance to attend a training session with @antirez on the more advanced features. The easiest way to think about Streams is as key-value pairs of data with unique, timestamp-based IDs. I wrote about it in this post.

In a simple implementation we would have one or more Producers adding items to a Stream (which is just a Redis key). We would then have a single Consumer grabbing data from this Stream using either the XREAD or XRANGE command. In this very basic example the Consumer class would be run by an hourly process and grab data for the last hour (this is NOT production quality code).

class Producer
  def perform
    # one stream key per day, e.g. "streams:2018-04-30"
    stream_key = "streams:#{Time.now.strftime("%Y-%m-%d")}"
    # '*' tells Redis to auto-generate a timestamp-based entry ID
    fields = ['key', 'value']
    REDIS_CLIENT.xadd(stream_key, '*', *fields)
  end
end

class Consumer
  def perform
    stream_key = "streams:#{Time.now.strftime("%Y-%m-%d")}"
    # Stream entry IDs are based on millisecond timestamps, so convert the time range
    start_id = ((Time.now - 3600).to_f * 1000).to_i
    end_id = (Time.now.to_f * 1000).to_i
    items = REDIS_CLIENT.xrange(stream_key, start_id, end_id)
    items.each do |item|
      # process the item here
    end
  end
end

But what if we wanted to send this data to multiple consumers? For that we will be able to leverage Consumer Groups and the XREADGROUP command. First we would create a group via XGROUP CREATE my_stream my_group $. Then we can read with XREADGROUP GROUP my_group consumer1 STREAMS my_stream >. Each consumer must have a unique name within the group.

Redis will track which messages were sent to which consumer, so we can stop and start individual consumers. There is also support for acknowledging receipt of messages with XACK and checking which messages are still pending with XPENDING.
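
Putting those pieces together, here is a rough sketch of the Consumer Group flow. It assumes REDIS_CLIENT simply passes the Stream commands through; both the client API and the reply format may change before the final release.

# create the group once, starting from new messages only ('$')
REDIS_CLIENT.xgroup('CREATE', 'my_stream', 'my_group', '$')

# each worker reads with its own unique consumer name; '>' asks for
# messages never delivered to any consumer in this group
entries = REDIS_CLIENT.xreadgroup('GROUP', 'my_group', 'consumer1', 'STREAMS', 'my_stream', '>')

# assuming the reply is flattened to [entry_id, fields] pairs for this single stream
entries.each do |entry_id, fields|
  # process the message, then acknowledge it so it leaves the pending list
  REDIS_CLIENT.xack('my_stream', 'my_group', entry_id)
end

# messages that were delivered but never acknowledged can be inspected later
REDIS_CLIENT.xpending('my_stream', 'my_group')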

Consumer Groups are still under development and the final syntax might change, so I do not want to provide information that may turn out to be incorrect. Stay tuned for a formal announcement on redis.io.

Probabilistic Data Structures

One of the first projects I did with Redis many years ago was using it to track monthly unique visitors. We did this by hashing the IP and UserAgent and using a Redis TTL to purge records at the end of the month.

# seconds remaining until the end of the month (end_of_month comes from ActiveSupport)
ttl = (Time.now.end_of_month - Time.now).to_i
unique_visitor = Digest::SHA1.hexdigest("#{ip}-#{user_agent}")
if REDIS_CLIENT.get unique_visitor
  # returning visitor
else
  # SETEX takes the time to live in seconds
  REDIS_CLIENT.setex unique_visitor, ttl, ''
  # 1st time visitor
end

The downside is that the more visitors came to our site, the more RAM we needed in Redis to store all those keys. I did a quick test and it took roughly 100MB of RAM to store 1 million keys.

Currently I am doing something similar to track unique web requests and I am concerned about long term memory usage. Speaking to other engineers at RedisConf I became intrigued with the idea of using the HyperLogLog data structure. It allows us to trade lots of RAM for a little bit of accuracy (~99% accurate).

unique_visitor = Digest::SHA1.hexdigest("#{ip}-#{user_agent}")
# compare the cardinality estimate before and after adding the element
before = REDIS_CLIENT.pfcount 'unique_visitors_hll'
REDIS_CLIENT.pfadd 'unique_visitors_hll', unique_visitor
after = REDIS_CLIENT.pfcount 'unique_visitors_hll'
if before == after
  # returning visitor (most likely)
else
  # new visitor
end

We count the size of our HyperLogLog, add the new element (the hash of IP & UserAgent) and count again. If the count has not changed, that likely means we have already seen this element before. We also need a process to delete the HyperLogLog key monthly via REDIS_CLIENT.del 'unique_visitors_hll' or by using a Redis TTL. The upside is that storing a similar 1 million IP & UserAgent hashes now takes only about 1 MB of RAM.
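
If we rely on a Redis TTL instead of a scheduled delete, a sketch could look like this (end_of_month comes from ActiveSupport; any way of computing the seconds until the end of the month works):

# expire the HyperLogLog key at the end of the month instead of deleting it manually
seconds_left = (Time.now.end_of_month - Time.now).to_i
REDIS_CLIENT.expire 'unique_visitors_hll', seconds_left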

But using HyperLogLog to do this uniqueness check is not very clean. A better-suited data structure is a Bloom filter, and Redis even has a ReBloom module that implements it. I might try out one of these solutions in the near future if our memory usage increases.
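
As a sketch of what that check could look like with ReBloom, assuming the module is loaded and the Redis client can send arbitrary module commands (the generic call below stands in for whatever mechanism the client provides):

unique_visitor = Digest::SHA1.hexdigest("#{ip}-#{user_agent}")
# BF.ADD returns 1 when the item was not in the filter before, 0 when it probably was
if REDIS_CLIENT.call('BF.ADD', 'unique_visitors_bf', unique_visitor) == 1
  # 1st time visitor
else
  # returning visitor (Bloom filters allow a small false positive rate)
end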

Redis Memory Optimizations

I also attended a great presentation by Sripathi Krishnan, CTO of HashedIn and creator of rdbtools. He spoke about ways of structuring Redis keys to make the best use of Redis memory.

For example, Redis Streams allow each entry to have different field names, but that means Redis has to store the field names with every entry. It can be more memory efficient to keep the field names the same for every entry in the stream (even storing null values) because Redis can then reference the common field names across entries.
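
As a small sketch (the field names are made up, and the XADD call mirrors the raw command as in the earlier example):

# every entry uses the same field names, even when a value is empty,
# so Redis can reference the shared field names instead of repeating them per entry
REDIS_CLIENT.xadd('my_stream', '*', 'user_id', '42', 'referrer', '')
REDIS_CLIENT.xadd('my_stream', '*', 'user_id', '43', 'referrer', 'google')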

Another idea is to use the MessagePack format instead of JSON, or to leverage Redis Hashes instead of JSON (if the JSON objects can be flattened). Here is a link to a worksheet he generously provided.
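
As a quick illustration of the MessagePack idea (using the msgpack gem; the key name and payload are made up):

require 'msgpack'

payload = { 'path' => '/home', 'user_id' => 42, 'duration_ms' => 13 }
# MessagePack produces a compact binary encoding of the same structure
REDIS_CLIENT.set 'request:42', payload.to_msgpack
restored = MessagePack.unpack(REDIS_CLIENT.get('request:42'))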

Overall the conference was another great event and now I am back at work applying some of the things I learned.

  • Slides for my presentation - http://bit.ly/2I5Fp7R
  • HyperLogLog - http://antirez.com/news/75
  • Streams - http://antirez.com/news/114 and http://antirez.com/news/116
  • Consumer Groups - https://gist.github.com/antirez/68e67f3251d10f026861be2d0fe0d2f4
  • rdbtools - https://github.com/sripathikrishnan/redis-rdb-tools