Redis Workflow Engine
When we need to scale applications, Redis can be a great tool. Slow tasks such as sending emails can be moved to a background process, which is an easy win for user experience. In many situations we do not care about the order in which different jobs are executed.
- Simple workflow
- More complex workflow
- Redis data storage
- Scheduling
- Avoid overlapping workflows
- Alternative queues
- Links
But sometimes we do. What if we are working on an application that periodically runs data import processes? Data comes in from various sources in different formats. We build several classes (using Ruby on Rails ActiveJob) to appropriately handle the imports. Then we generate a report summarizing the data.
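As a rough sketch (the class names other than GenReportJob and the queues are placeholders I am using for illustration), the jobs could start out as plain ActiveJob classes:

```ruby
# app/jobs/import_csv_job.rb
class ImportCsvJob < ApplicationJob
  queue_as :default

  def perform
    # download the CSV feed and create/update DB records
  end
end

# app/jobs/import_xml_job.rb
class ImportXmlJob < ApplicationJob
  queue_as :default

  def perform
    # same idea for the XML feed
  end
end

# app/jobs/gen_report_job.rb
class GenReportJob < ApplicationJob
  queue_as :default

  def perform
    # summarize the imported data into a report
  end
end
```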
One option is to schedule the GenReportJob to run well after all the import jobs. But that may lead to problems when the individual imports take longer than expected, or when we need to start running imports more frequently. Now we are generating reports based on incomplete data and customers are upset.
Simple workflow
What we need is a workflow process that will run GenReportJob after ALL import jobs complete successfully. The gush library can help with that.
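A minimal sketch with gush might look like the code below. Gush jobs inherit from Gush::Job rather than ApplicationJob, and the import job names are still placeholders; the workflow wiring is the point here.

```ruby
# Gemfile: gem 'gush'

class ImportCsvJob < Gush::Job
  def perform
    # import CSV data
  end
end

class ImportXmlJob < Gush::Job
  def perform
    # import XML data
  end
end

class GenReportJob < Gush::Job
  def perform
    # summarize the imported data
  end
end

class ImportWorkflow < Gush::Workflow
  def configure
    run ImportCsvJob
    run ImportXmlJob
    # GenReportJob only starts after ALL import jobs succeed
    run GenReportJob, after: [ImportCsvJob, ImportXmlJob]
  end
end

# create and start the workflow
flow = ImportWorkflow.create
flow.start!
```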
More complex workflow
What if our CSV imports grow very large? Downloading a file and processing many thousands of records is slow. We can break it up into one job to download the data and then separate jobs to process each row (which will run in parallel). To keep things simple we will always save the file to the same location.
How do we know when ALL ImportCsvRowJob jobs have completed so we can run GenReportJob? We change our workflow.
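Gush lets a workflow build its jobs dynamically inside configure, and run returns a reference we can pass to after:. The sketch below assumes the row count is known (or passed in) when the workflow is created, and DownloadCsvJob is a placeholder name for the download step.

```ruby
class ImportWorkflow < Gush::Workflow
  def configure(row_count)
    download = run DownloadCsvJob   # saves the file to the agreed-upon location

    # one job per row, each depending only on the download job,
    # so they can run in parallel
    row_jobs = (0...row_count).map do |row_index|
      run ImportCsvRowJob, params: { row_index: row_index }, after: download
    end

    # GenReportJob waits for ALL row jobs to finish
    run GenReportJob, after: row_jobs
  end
end

flow = ImportWorkflow.create(1_000)
flow.start!
```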
Redis data storage
If the number of jobs in our workflow were static we could have reused the same flow by calling flow.start! with the same ID. But we would need to store that ID somewhere, and if the Redis data were deleted the ID would be useless. So it is best to re-create the workflow every time. Each workflow and job has a separate key in Redis. Gush serializes everything as JSON and stores it in Redis strings. Each job contains a reference to its workflow.
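We can peek at what gush writes with the redis gem. The key layout in the comments below is an assumption and can differ between gush versions, so treat this as a debugging aid rather than a contract:

```ruby
require 'redis'

redis = Redis.new

# Roughly what to expect (exact names vary by gush version):
#   gush.workflows.<workflow_id>           - serialized workflow JSON
#   gush.jobs.<workflow_id>.<JobClassName> - serialized job JSON
redis.scan_each(match: 'gush.*') do |key|
  puts "#{key} (#{redis.type(key)})"
end
```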
But each workflow record contains the list of ALL jobs in that workflow, so it gets pretty big. It also slows things down: as the imports grow to thousands of records, the ImportWorkflow Redis key needs to be re-serialized on each job completion. In a few perf tests I did it got slow past a few hundred jobs and was not practical past a few thousand. So I do not recommend this approach for something like importing individual records.
Scheduling
Now we want to schedule this workflow for regular execution. Gush does not provide scheduling functionality, so we will create a wrapper job to manage the entire process and kick it off using something like sidekiq-cron. This job can contain additional logic to query the DB, start other workflows, etc. It can also be used to clean up Redis data created by previous workflows.
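A sketch of that wrapper job and its sidekiq-cron schedule might look like this (the cron expression and initializer path are just examples):

```ruby
# app/jobs/workflow_manager_job.rb
class WorkflowManagerJob < ApplicationJob
  queue_as :default

  def perform
    # optionally: clean up Redis data from previous workflows,
    # query the DB, decide which workflows to start, etc.
    flow = ImportWorkflow.create   # pass whatever arguments configure needs
    flow.start!
  end
end

# config/initializers/sidekiq_cron.rb
Sidekiq::Cron::Job.create(
  name:  'import workflow - every hour',
  cron:  '0 * * * *',
  class: 'WorkflowManagerJob'
)
```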
Avoid overlapping workflows
Since we do not know how long our workflow execution will take, we might want to avoid starting the next scheduled workflow iteration while the current one with the same class is still running.
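One way to do that from the wrapper job is to ask gush for its persisted workflows and skip this run if one of the same class has not finished. This is a sketch and assumes old workflow data is cleaned up regularly, since all_workflows loads every workflow stored in Redis:

```ruby
class WorkflowManagerJob < ApplicationJob
  def perform
    client = Gush::Client.new

    # skip this iteration if a previous ImportWorkflow is still running
    still_running = client.all_workflows.any? do |workflow|
      workflow.is_a?(ImportWorkflow) && workflow.running?
    end
    return if still_running

    ImportWorkflow.create.start!
  end
end
```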
Alternative queues
The great thing about ActiveJob is that we can easily switch between different queue backends. We might decide to use AWS SQS with shoryuken or RabbitMQ with sneakers. In that case gush will still use Redis to store workflow and job data but NOT use Redis as a queue. All we need to do is create a gush queue in SQS, set config.active_job.queue_adapter = :shoryuken and provide AWS creds. Even though SQS does not guarantee the order of messages, the workflow managed by gush will ensure that GenReportJob runs at the very end.
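The switch itself is mostly configuration; a sketch (the application module name is a placeholder):

```ruby
# config/application.rb
module MyApp  # placeholder application name
  class Application < Rails::Application
    # route ActiveJob (and therefore gush's internal worker) through SQS
    config.active_job.queue_adapter = :shoryuken
  end
end

# AWS credentials come from the usual SDK sources (env vars such as
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION, an instance
# profile, etc.), and the gush queue has to exist in SQS and be listed
# in shoryuken's queue configuration.
```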
Since SQS does not support scheduling recurring jobs we would need to use a different mechanism. Or we could run WorkflowManagerJob via Sidekiq with sidekiq-cron by setting self.queue_adapter = :sidekiq inside that job class. Other jobs will run via SQS.
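That per-class override is a standard ActiveJob feature; a sketch of the wrapper job pinned to Sidekiq:

```ruby
class WorkflowManagerJob < ApplicationJob
  # only this job goes through Sidekiq so sidekiq-cron can schedule it;
  # everything else keeps the global :shoryuken adapter and runs via SQS
  self.queue_adapter = :sidekiq

  def perform
    ImportWorkflow.create.start!
  end
end
```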