I am a Sr. Software Developer at Oracle Cloud. The opinions expressed here are my own and not necessarily those of my employer.
What is the right size for background jobs?
In a previous post I wrote about pre-generating cache via background jobs. I described an example of an online banking app where we pre-generate a cache of `recent_transactions`. This helps even out load on the system by pushing some of the data into cache before visitors come to the site. In this post I will compare a few ways to structure these jobs:
- One job for all records
- One job for each record
- Loop through records in slices
- Different queues and workers
One job for all records
The simplest design is to loop through all records in one job.
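A minimal sketch of what that might look like (the job class and the `generate_recent_transactions_cache` method are hypothetical names used for illustration):

```ruby
# One job that walks every record and regenerates its cached data.
class RegenerateAllCachesJob < ApplicationJob
  queue_as :default

  def perform
    MyModel.find_each do |record|
      # Hypothetical method that rebuilds the recent_transactions cache for this record.
      record.generate_recent_transactions_cache
    end
  end
end
```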
The downside of this approach is that if we have millions of `MyModel` records it can take a very long time for this job to complete. And what if we need to deploy code that restarts the background job workers? We won’t know which records have been processed and which have not. Best practices for background jobs recommend keeping them small and idempotent.
One job for each record
We can queue one job per record by separating our code into two jobs.
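A sketch of the two jobs, reusing the same hypothetical cache method as above:

```ruby
# Parent job: only enqueues work, one child job per record.
class QueueRecordCacheJobsJob < ApplicationJob
  def perform
    MyModel.find_each do |record|
      RegenerateRecordCacheJob.perform_later(record.id)
    end
  end
end

# Child job: looks up a single record by ID and regenerates its cache.
class RegenerateRecordCacheJob < ApplicationJob
  def perform(record_id)
    MyModel.find(record_id).generate_recent_transactions_cache
  end
end
```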
Each job will complete very quickly and they will run in parallel. Since it is not recommended to serialize complete objects into the queue, we will use some kind of record identifier (such as a GlobalID). But this will cause a lot of queries against the primary DB to look up records one at a time.
Loop through records in slices
And now we come to the Goldilocks solution - not too big and not too small. We want to break up the process into smaller chunks but instead of processing one record at a time we will process several (let’s say 10).
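A sketch of the sliced version (again with hypothetical names), plucking the IDs and enqueuing one job per slice of 10:

```ruby
# Parent job: slices the full list of IDs into groups of 10.
class QueueCacheSlicesJob < ApplicationJob
  def perform
    MyModel.pluck(:id).each_slice(10) do |ids|
      RegenerateSliceCacheJob.perform_later(ids)
    end
  end
end

# Child job: regenerates the cache for one slice of records with a single lookup query.
class RegenerateSliceCacheJob < ApplicationJob
  def perform(ids)
    MyModel.where(id: ids).each do |record|
      record.generate_recent_transactions_cache
    end
  end
end
```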
One downside of this approach is that `pluck` will request IDs for ALL records from the primary DB, store them in an array, and loop through them. Different ORMs support a `batch_size` option for querying records, so we can do the equivalent of `select id from TableName limit 10 offset ...`.
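With ActiveRecord, for example, `in_batches` can replace the big `pluck` so that only one batch of IDs is loaded at a time (a sketch; note that ActiveRecord batches by primary key ranges rather than OFFSET):

```ruby
class QueueCacheSlicesJob < ApplicationJob
  def perform
    # Loads 10 records' worth of IDs per query instead of all IDs up front.
    MyModel.in_batches(of: 10) do |relation|
      RegenerateSliceCacheJob.perform_later(relation.pluck(:id))
    end
  end
end
```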
Different queues and workers
The same approach can be applied to other situations (not just cache pre-generation). When a record is created or updated we might have a callback (see the previous post) to update various reports. The primary `UpdateReportsJob` will be called from the `after_save` callback. We want it to complete as quickly as possible and queue a separate `UpdateEachReportJob`, passing the appropriate report ID. We can process these jobs through separate queues, as sketched below.
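Here is one way the two jobs and their queues could be wired up (the `Report` model, its `source_id` column, and the `regenerate` method are placeholders for illustration):

```ruby
# Enqueued from the after_save callback; finishes fast because it only queues more jobs.
class UpdateReportsJob < ApplicationJob
  queue_as :high

  def perform(record_id)
    # Placeholder lookup of the reports affected by this record.
    Report.where(source_id: record_id).pluck(:id).each do |report_id|
      UpdateEachReportJob.perform_later(report_id)
    end
  end
end

# Does the heavy lifting for a single report on the low-priority queue.
class UpdateEachReportJob < ApplicationJob
  queue_as :low

  def perform(report_id)
    Report.find(report_id).regenerate
  end
end
```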
We can even assign dedicated Sidekiq workers to watch only specific queues. Here is a sample configuration for capistrano-sidekiq:
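A sketch of what this could look like in `config/deploy.rb`, assuming a capistrano-sidekiq version that supports `:sidekiq_options_per_process`:

```ruby
# config/deploy.rb
set :sidekiq_processes, 4

# First process watches only the high queue; the other three share default and low.
set :sidekiq_options_per_process, [
  "--queue high",
  "--queue default --queue low",
  "--queue default --queue low",
  "--queue default --queue low"
]
```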
This way each server will have a dedicated process watching only the `high` queue to ensure that those jobs complete as quickly as possible and do not get backlogged. The other three workers will process the `default` queue (used for other jobs) and the `low` queue (used for reports).