Elasticsearch and Redis
Jan 4, 2018
Elasticsearch and Redis are powerful technologies with different strengths. They are very flexible and can be used for a variety of purposes. We will explore different ways to integrate them.
ELK is Elasticsearch, Logstash and Kibana. Elasticsearch stores data in indexes and supports powerful searching capabilities. Logstash is an ETL pipeline for moving data to and from different data sources (including Redis). Kibana helps us build rich dashboards and do ad hoc searches. These tools are used not just by developers but by data analysts and devops engineers, who often have different skillsets.
Redis has speed and powerful data structures. It can almost function as an extension of application memory, but shared across processes / servers. The downside is that records can ONLY be looked up by key. Our applications can easily store all kinds of interesting data in Redis, but if that data needs to be extracted and aggregated in different ways, it requires writing code. There is no easy way to do ad hoc analysis (like writing SQL queries).
We are building a website for a nationwide chain of stores. The first requirement is enabling users to search for various products (in our case coffee brands, which were randomly generated). We will use Ruby on Rails with the searchkick library to simplify Elasticsearch integration. We set the callbacks: :async option so that, with Sidekiq configured, Redis is used to queue a background job that updates the documents in the products index whenever a record in the primary DB is modified.
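A minimal sketch of the model setup (the model and column names here are assumptions):

```ruby
# app/models/product.rb (a minimal sketch)
class Product < ApplicationRecord
  # index updates are queued as background jobs (Sidekiq + Redis)
  # instead of running inline when the record is saved
  searchkick callbacks: :async
end
```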
We are also caching the JSON output of the ProductSearch.new.perform method call in Redis, using the query param to generate the cache key.
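A rough sketch of what that caching wrapper could look like, assuming a ProductSearch class that takes the query as a keyword argument:

```ruby
# Sketch of the cached search; the class shape and serialized fields are assumptions.
class ProductSearch
  def initialize(query:, redis: Redis.new)
    @query = query
    @redis = redis
  end

  def perform
    cache_key = "ProductSearch:perform:#{@query}"
    cached = @redis.get(cache_key)
    return JSON.parse(cached, symbolize_names: true) if cached

    results = Product.search(@query).map { |p| { id: p.id, name: p.name } }
    @redis.set(cache_key, results.to_json, ex: 1.hour.to_i) # cache for up to an hour
    results
  end
end
```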
The downside is that any changes to products will take up to an hour to appear in cached search results. We can build a callback so that when a record is updated in the primary DB it not only updates the index but also flushes the cache. To keep things simple we will delete all Redis keys matching the ProductSearch:perform:* pattern, but this needs to be improved for scaling. To be honest, this caching technique might be more trouble than it’s worth.
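One way to sketch that callback (the scan-and-delete below is exactly the part that will not scale well):

```ruby
# app/models/product.rb (sketch of flushing cached searches on every change)
class Product < ApplicationRecord
  searchkick callbacks: :async

  after_commit :flush_search_cache

  private

  def flush_search_cache
    redis = Redis.new
    # scan the keyspace and drop every cached search result;
    # simple, but too heavy-handed once there are many keys
    redis.scan_each(match: "ProductSearch:perform:*") { |key| redis.del(key) }
  end
end
```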
Search by zipcode
Another important feature is enabling users to find stores by zipcode. Both Redis and Elasticsearch support geolocation searches. We need to map zipcodes to lon/lat coordinates. Here is a free data source.
Redis geo
One option is to use Redis to find zipcodes within a 5 mile radius and then query the primary DB for stores in those zipcodes.
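A sketch of this approach, assuming the zipcode coordinates from the CSV are loaded into a Redis geo set named zipcodes (the CSV column names are assumptions):

```ruby
require 'csv'
require 'redis'

redis = Redis.new

# one-time load: store each zipcode's coordinates in a single geo set
CSV.foreach('zipcodes.csv', headers: true) do |row|
  redis.geoadd('zipcodes', row['longitude'], row['latitude'], row['zipcode'])
end

# find zipcodes within a 5 mile radius of the user's zipcode,
# then query the primary DB for stores in those zipcodes
nearby_zipcodes = redis.georadiusbymember('zipcodes', '98174', 5, 'mi')
stores = Store.where(zipcode: nearby_zipcodes)
```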
Elasticsearch geo
Alternatively we can use Elasticsearch geo search. We need to create an index and specify lon/lat for each zipcode. Since we already have lon/lat stored in Redis we can use it for quick lookup (vs parsing CSV file).
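A sketch of the Store model using searchkick's geo support, pulling coordinates out of the Redis geo set at indexing time (field names are assumptions):

```ruby
class Store < ApplicationRecord
  # tell searchkick that location is a geo point we can filter on
  searchkick locations: [:location]

  def search_data
    # quick lookup of this zipcode's coordinates from the Redis geo set
    lon, lat = Redis.new.geopos('zipcodes', zipcode).first
    {
      name: name,
      zipcode: zipcode,
      location: { lat: lat, lon: lon }
    }
  end
end
```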
We run Store.reindex, verify that data shows up in Elasticsearch and modify StoreLocator.
Each document looks like this in Elasticsearch:
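(the values below are illustrative)

```json
{
  "_index": "stores",
  "_id": "42",
  "_source": {
    "name": "Downtown Seattle",
    "zipcode": "98174",
    "location": { "lat": 47.6, "lon": -122.33 }
  }
}
```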
Now we can take advantage of rich Elasticsearch querying capabilities including geo queries, get the IDs of matching stores and display data from the primary DB.
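A sketch of the modified StoreLocator (the class and parameter names are assumptions):

```ruby
class StoreLocator
  def initialize(zipcode:, radius: 5)
    @zipcode = zipcode
    @radius = radius
  end

  def perform
    # resolve the zipcode to coordinates via the Redis geo set,
    # then run a searchkick geo query and load matching stores from the primary DB
    lon, lat = Redis.new.geopos('zipcodes', @zipcode.to_s).first
    Store.search('*', where: {
      location: { near: { lat: lat, lon: lon }, within: "#{@radius}mi" }
    })
  end
end
```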
Search by product AND geo
Now our users want to know which stores in a specific area sell particular products, and they are not always sure how exactly to spell the product name. We also want to make our indexes more powerful first class objects, not just something tied to a model.
First we create a model mapping which products are available in which stores. Then we will integrate the chewy library, which works a little differently than the searchkick library we used before.
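A sketch of what the Chewy index and the model callback might look like; the StoreProduct join model, the field names and the older define_type style are assumptions:

```ruby
# app/chewy/stores_index.rb
class StoresIndex < Chewy::Index
  define_type Store.includes(:products) do
    field :name
    field :zipcode
    field :location, type: 'geo_point'
    field :products do
      field :name
    end
  end
end

# app/models/store.rb
class Store < ApplicationRecord
  has_many :store_products
  has_many :products, through: :store_products

  # re-index this store's document whenever the record changes;
  # Chewy can run this asynchronously with its :sidekiq strategy
  update_index('stores#store') { self }

  def location
    lon, lat = Redis.new.geopos('zipcodes', zipcode).first
    { lat: lat, lon: lon }
  end
end
```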
The update_index method ensures that Elasticsearch documents get updated when we update DB records, and Chewy supports async index updates via background jobs. Data in Elasticsearch looks like this:
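(the values below are just for illustration)

```json
{
  "_index": "stores",
  "_type": "store",
  "_id": "42",
  "_source": {
    "name": "Downtown Seattle",
    "zipcode": "98174",
    "location": { "lat": 47.6, "lon": -122.33 },
    "products": [
      { "name": "American Cowboy" },
      { "name": "American Select" }
    ]
  }
}
```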
We modify our search code:
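A sketch of the chewy-based version, combining a fuzzy match on product name with a geo_distance filter (the exact query DSL is an assumption):

```ruby
class StoreLocator
  def initialize(zipcode:, query:, radius: 5)
    @zipcode = zipcode
    @query = query
    @radius = radius
  end

  def perform
    lon, lat = Redis.new.geopos('zipcodes', @zipcode.to_s).first

    StoresIndex
      # fuzziness lets 'kowboy' still match 'Cowboy'
      .query(match: { 'products.name' => { query: @query, fuzziness: 'AUTO' } })
      .filter(geo_distance: { distance: "#{@radius}mi", location: { lat: lat, lon: lon } })
      .load # load the matching Store records from the primary DB
  end
end
```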
Now we can call StoreLocator.new(zipcode: 98174, query: 'kowboy').perform to find stores near the 98174 zipcode that sell American Cowboy coffee.
Autocomplete
This problem can also be solved with both Elasticsearch and Redis.
Redis
To keep data in sync between the primary DB and the Redis autocomplete keys, we will implement a separate class and leverage it from model callbacks. Keys will start at the first 2 letters of each term and extend one letter at a time up to the full term. We will also use Sorted Set scores to give higher weight to more common terms.
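A sketch of such a class (the key naming and method shapes are assumptions; remove_all would mirror add_all using zrem):

```ruby
class AutocompleteRedis
  def initialize(redis: Redis.new)
    @redis = redis
  end

  # e.g. add_all('product', 'name') indexes every Product#name value
  def add_all(model, column)
    model.classify.constantize.pluck(column).each { |value| add(value) }
  end

  def add(value)
    value.downcase.split.each do |term|
      # store every prefix of each term, from the first 2 letters up to the full term
      (2..term.length).each do |len|
        # incrementing the score gives more common terms a higher weight
        @redis.zincrby(key(term[0, len]), 1, value.downcase)
      end
    end
  end

  def search(prefix:)
    @redis.zrevrange(key(prefix.downcase), 0, 9).to_json
  end

  private

  def key(prefix)
    "autocomplete:#{prefix}"
  end
end
```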
We can add / remove all keys by running AutocompleteRedis.new.add_all('product', 'name') (or remove_all). Data in Redis will be stored in multiple sorted sets.
We can call AutocompleteRedis.new.search prefix: 'am' and get back JSON ["american select", "american cowboy"].
Elasticsearch
We will build a special index in Elasticsearch using the Chewy library. Read here about filter and analyzer configuration.
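A sketch of such an index; the analyzer settings and gram sizes are assumptions:

```ruby
class AutocompleteIndex < Chewy::Index
  settings analysis: {
    filter: {
      autocomplete_filter: { type: 'edge_ngram', min_gram: 2, max_gram: 20 }
    },
    analyzer: {
      autocomplete: {
        type: 'custom',
        tokenizer: 'standard',
        filter: %w[lowercase autocomplete_filter]
      }
    }
  }

  define_type Product do
    # index edge n-grams of each word, but search with the standard analyzer
    field :name, analyzer: 'autocomplete', search_analyzer: 'standard'
  end
end
```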
Now AutocompleteIndex.query(match: {name: 'am'}) returns American Cowboy, American Select AND Old America products. Elasticsearch is able to use the second word in the product name to match against.
ETL
Until now we have been moving data between the primary DB and Redis or Elasticsearch. Now we will ETL data between Redis and Elasticsearch directly.
Redis to Elasticsearch
The next requirement is to record which zipcodes are searched most often and when searches are performed (by hour_of_day and day_of_week). To capture this data in Redis we will use the leaderboard library to track searches and minuteman to count when those searches occur.
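A rough sketch of the tracking calls; the exact leaderboard and minuteman APIs depend on the gem versions, so treat these as assumptions:

```ruby
require 'leaderboard'
require 'minuteman'

class SearchTracker
  def initialize
    # backed by a Redis sorted set
    @zipcode_searches = Leaderboard.new('zipcode_searches')
  end

  def track(zipcode)
    # bump this zipcode's score
    @zipcode_searches.change_score_for(zipcode, 1)

    now = Time.now
    # record when searches happen so we can aggregate by hour_of_day / day_of_week
    Minuteman.track("searches:hour_of_day:#{now.hour}")
    Minuteman.track("searches:day_of_week:#{now.wday}")
  end
end
```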
This will be very fast: leaderboard keeps its per-zipcode counts in a Redis sorted set and minuteman tracks occurrences using Redis bitmaps.
But our internal business users do not want to look at raw data. Our choice is between writing a custom dashboard or pulling the data into Elasticsearch and leveraging Kibana. Once it’s in Elasticsearch we can also combine it with other data sources. We will use the elasticsearch-ruby library directly since this data does not relate to our application models.
We are specifying our aggregation metrics (zipcode, hour_of_day, day_of_week) as the ID of the Elasticsearch document.
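A minimal sketch of the loader using the elasticsearch-ruby client; the index/type names, document shape and how counts are read back out of Redis are assumptions:

```ruby
require 'elasticsearch'

class SearchStatsLoader
  def initialize(client: Elasticsearch::Client.new)
    @client = client
  end

  # metric is e.g. 'zipcode', 'hour_of_day' or 'day_of_week'
  def load(metric, value, count)
    @client.index(
      index: 'search_stats',
      type: 'search_stat',
      # the metric and its value form the document ID, so re-running the ETL
      # overwrites the same documents instead of creating duplicates
      id: "#{metric}:#{value}",
      body: { metric: metric, value: value, count: count, updated_at: Time.now }
    )
  end
end

# e.g. SearchStatsLoader.new.load('zipcode', '98174', 42)
```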
Elasticsearch to Redis
In our Elasticsearch cluster we have captured data from logs that contain IP and UserAgent. The combination of IP and UserAgent can be used to fairly uniquely identify users. Our next business requirement is to display a slightly different UI to users that we believe have visited our site before.
Now we will leverage Logstash with various plugins as our ETL pipeline. We will use the elasticsearch input plugin, the redis output plugin, and the ruby filter plugin to transform the data into the format expected by the ActiveJob background job framework, pushing it straight into a Redis List data structure.
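A rough sketch of the pipeline; the hosts, index name, field names and especially the job payload built in the ruby filter are assumptions, and the real payload has to match whatever format your ActiveJob / Sidekiq setup expects to find in its queue list:

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "weblogs"
    query => '{ "query": { "exists": { "field": "user_agent" } } }'
  }
}

filter {
  ruby {
    # reshape the event into a background-job style payload;
    # a real pipeline would also prune the remaining log fields
    code => "
      event.set('class', 'ReturningVisitorJob')
      event.set('args', [event.get('ip'), event.get('user_agent')])
      event.set('queue', 'default')
    "
  }
}

output {
  redis {
    host => "localhost"
    data_type => "list"
    key => "queue:default"
  }
}
```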
Now we create a very simple job, run via Sidekiq, that hashes IP & UA and sets the Redis keys to expire in a week.
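A sketch of that job; the class name, key format and hashing choice are assumptions:

```ruby
require 'digest'

class ReturningVisitorJob < ApplicationJob
  queue_as :default

  def perform(ip, user_agent)
    # hash IP + UserAgent so we store a fingerprint rather than the raw values
    fingerprint = Digest::SHA256.hexdigest("#{ip}:#{user_agent}")
    # expire after a week, so "returning visitor" means "seen in the last 7 days"
    Redis.new.set("returning_visitor:#{fingerprint}", 1, ex: 1.week.to_i)
  end
end
```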