I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
When we need to scale applications, Redis can be a great tool. Slow tasks such as sending emails can be moved to a background process, which is an easy win for user experience. In many situations we do not care about the order in which different jobs are executed.
- Simple workflow
- More complex workflow
- Redis data storage
- Avoid overlapping workflows
- Alternative queues
Simple workflow
But sometimes we do. What if we are working on an application that periodically runs data import processes? Data comes in from various sources in different formats. We build several classes (using Ruby on Rails ActiveJob) to handle the imports appropriately. Then we generate a report summarizing the data.
One option is to schedule `GenReportJob` to run well after all the import jobs. But that may lead to problems when the individual imports take longer than expected, or when we need to start running imports more frequently. Then we are generating reports based on incomplete data and customers are upset.
What we need is a workflow process that will run `GenReportJob` after ALL import jobs complete successfully. The gush library can help with that.
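With gush, a workflow declares its jobs and their dependencies in a `configure` method using `run ... after: ...`. Here is a minimal sketch of that idea; the job names are this article's hypothetical imports, and the tiny in-file stand-in for `Gush::Workflow` exists only so the snippet runs without Redis (a real app would `require 'gush'` and subclass the gem's classes):

```ruby
# Tiny stand-in for Gush::Workflow so this sketch runs without Redis.
# Real code: require 'gush' and subclass the gem's Gush::Workflow.
module Gush
  class Workflow
    attr_reader :jobs

    def initialize
      @jobs = []
      configure
    end

    # Mirrors gush's DSL: run JobClass, after: [dependencies]
    def run(klass, after: [])
      node = { klass: klass, after: Array(after) }
      @jobs << node
      node
    end
  end
end

class ImportCsvJob; end
class ImportJsonJob; end
class ImportXmlJob; end
class GenReportJob; end

class ImportWorkflow < Gush::Workflow
  def configure
    run ImportCsvJob
    run ImportJsonJob
    run ImportXmlJob
    # GenReportJob only starts after ALL imports complete successfully
    run GenReportJob, after: [ImportCsvJob, ImportJsonJob, ImportXmlJob]
  end
end
```

With the real gem you would kick this off with `flow = ImportWorkflow.create; flow.start!`: the three import jobs are enqueued in parallel, and `GenReportJob` is held back until every one of them succeeds.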
More complex workflow
What if our CSV imports grow very large? Downloading a file and processing many thousands of records is slow. We can break it up into one job to download the data and then separate jobs to process each row (which will run in parallel). To keep things simple we will always save the file to the same location.
How do we know when ALL `ImportCsvRowJob`s complete so we can run `GenReportJob`? We change our workflow.
Redis data storage
If the number of jobs in our workflow were static, we could have reused the same flow by calling `flow.start!` with the same ID. But we would need to store that ID somewhere, and if the Redis data were deleted the ID would be useless. So it is best to re-create the workflow every time. Each workflow and each job has a separate key in Redis. Gush serializes everything as JSON and stores it as Redis strings. Each job contains a reference to its workflow.
But each workflow record contains the list of ALL jobs in that workflow, so it gets pretty big. It also slows things down: as the imports grow to thousands of records, the `ImportWorkflow` Redis key needs to be re-serialized on each job completion. In a few perf tests I did, it got slow past a few hundred jobs and became impractical past a few thousand. So I do not recommend this approach for something like per-record imports.
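The slowdown is easy to see on the back of an envelope: completing each of n jobs rewrites a workflow record whose size is itself proportional to n, so total serialization work grows roughly quadratically. A rough illustration (the payload shape here is hypothetical, not gush's actual schema):

```ruby
require 'json'

# Hypothetical workflow record that, like gush's, embeds every job
def workflow_json(job_count)
  {
    id: 'wf-1',
    jobs: (1..job_count).map { |i| { name: "ImportCsvRowJob-#{i}", finished: false } }
  }.to_json
end

# One full rewrite of the workflow record per completed job
def total_bytes_serialized(job_count)
  (1..job_count).sum { workflow_json(job_count).bytesize }
end
```

Going from 100 row jobs to 1,000 multiplies the total bytes serialized by roughly 100x, not 10x, which matches the slowdown I saw in practice.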
Now we want to schedule this workflow for regular execution. Gush does not provide scheduling functionality, so we will create a wrapper job to manage the entire process and kick it off using something like sidekiq-cron. This job can contain additional logic to query the DB, start other workflows, etc. It can also be used to clean up Redis data created by previous workflows.
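A sketch of that wrapper job is below. The stand-ins for `Sidekiq::Job` and `ImportWorkflow` exist only so the snippet runs standalone; a real app would `require 'sidekiq'` and `require 'gush'`, and `WorkflowManagerJob` is a name I am assuming for this article's wrapper:

```ruby
# Stand-ins so this sketch runs without the sidekiq and gush gems.
module Sidekiq; module Job; end; end

class ImportWorkflow
  def self.create
    new
  end

  def start!
    @started = true
  end

  def started?
    !!@started
  end
end

# Wrapper job: re-creates the workflow on every run (see the Redis data
# storage section for why we do not reuse workflow IDs) and kicks it off.
class WorkflowManagerJob
  include Sidekiq::Job

  def perform
    # Additional logic could go here: query the DB, clean up Redis data
    # left behind by previous workflows, start other workflows, etc.
    flow = ImportWorkflow.create
    flow.start!
    flow
  end
end
```

With sidekiq-cron, the job can then be scheduled with something like `Sidekiq::Cron::Job.create(name: 'imports', cron: '0 * * * *', class: 'WorkflowManagerJob')` (check the exact signature against your sidekiq-cron version).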
Avoid overlapping workflows
Since we do not know how long our workflow execution will take, we might want to avoid starting the next scheduled iteration while a workflow of the same class is still running.
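One way to do this is to have the wrapper job bail out when a previous run is still in flight. With the real gem you would fetch persisted workflows (e.g. via `Gush::Client`) and check each one's status; verify those calls against your gush version. The stand-in struct below keeps the sketch runnable without Redis:

```ruby
# Stand-in for the workflow records a Gush client would return from Redis.
FakeWorkflow = Struct.new(:class_name, :running_state) do
  def running?
    running_state
  end
end

# True only if no workflow of the same class is still running
def safe_to_start?(workflows, workflow_class_name)
  workflows.none? { |w| w.class_name == workflow_class_name && w.running? }
end
```

The wrapper job would call `safe_to_start?` before creating a new `ImportWorkflow`, and simply return (or reschedule itself) when the previous iteration has not finished.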
The great thing about ActiveJob is that we can easily switch between different queue backends. We might decide to use AWS SQS with shoryuken or RabbitMQ with sneakers. In that case gush will still use Redis to store workflow and job data, but NOT as a queue. All we need to do is create a `gush` queue in SQS, set `config.active_job.queue_adapter = :shoryuken`, and provide AWS credentials. Even though SQS does not guarantee message order, the workflow managed by gush will ensure that `GenReportJob` runs at the very end.
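The switch itself is only a few lines of Rails configuration. A sketch, assuming a standard Rails app; queue names and credential handling will vary by setup:

```ruby
# Gemfile
gem 'shoryuken'

# config/application.rb
config.active_job.queue_adapter = :shoryuken

# AWS credentials come from the usual SDK sources (ENV vars,
# ~/.aws/credentials, or an instance profile), and a queue named
# "gush" must already exist in SQS.
```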
Since SQS does not support scheduling jobs we would need to use a different mechanism. Or we could run `WorkflowManagerJob` via Sidekiq with sidekiq-cron by setting `self.queue_adapter = :sidekiq` inside that job class. Other jobs will still run via SQS.
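The per-class override looks like this. `self.queue_adapter = :sidekiq` is ActiveJob's class-level setter from the article; the `ApplicationJob` stand-in below only mimics that setter so the sketch runs without Rails:

```ruby
# Stand-in for ActiveJob::Base's class-level queue_adapter= setter.
# In a real Rails app, ApplicationJob < ActiveJob::Base provides this.
class ApplicationJob
  class << self
    attr_accessor :queue_adapter
  end
end

class WorkflowManagerJob < ApplicationJob
  # Override the app-wide adapter (shoryuken/SQS) for this job only,
  # so sidekiq-cron can enqueue and run it via Sidekiq/Redis
  self.queue_adapter = :sidekiq

  def perform
    # ImportWorkflow.create.start! would go here in the real job
  end
end
```

Every other job class keeps the app-wide `:shoryuken` adapter and flows through SQS as before.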