In a previous post I wrote about using Redis and Sidekiq to do bulk data imports. But as with all scalability challenges, this solution works only up to a certain level. What if we have very large imports with millions of records?
At that point even queuing Sidekiq jobs (one per record) can take a long time. What if we re-deploy code and restart our application server? The jobs that were already queued will still get processed, but it will be hard to tell how many and which records never made it into the queue.
Here is one way to improve this process. When a file with records to be imported is uploaded, we first save it to AWS S3. We then fire QueueRecordImportJob.perform_later and send an “import has begun” message back to the user.
QueueRecordImportJob downloads the file from S3 and starts iterating through it. It keeps a counter (also stored in Redis) of the last row it finished. If the Sidekiq process restarts, QueueRecordImportJob will begin anew: it will download the file from S3 again, check the counter, and resume processing the file from the next row.
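The resume logic can be sketched in plain Ruby. This is only an illustration under assumptions: CounterStore is a hypothetical in-memory stand-in for Redis, the rows array stands in for the file downloaded from S3, the yielded block stands in for RecordImportJob.perform_later, and the key name is made up.

```ruby
# Hypothetical in-memory stand-in for Redis (only get/set are used).
class CounterStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value.to_s
  end
end

# Raised below to simulate the Sidekiq process restarting mid-import.
class SimulatedRestart < StandardError; end

# Sketch of the loop inside QueueRecordImportJob.
def queue_rows(rows, store, key)
  done = store.get(key).to_i       # rows already handled; nil.to_i is 0 on the first run
  rows.each_with_index do |row, index|
    next if index < done           # skip rows handled before the restart
    yield row                      # stand-in for RecordImportJob.perform_later(row)
    store.set(key, index + 1)      # checkpoint only after the row is enqueued
  end
end

store = CounterStore.new
key = "import:1:rows_done"         # hypothetical key name
rows = %w[a b c d e]
enqueued = []

# First run: interrupted while handling the fourth row.
begin
  queue_rows(rows, store, key) do |row|
    raise SimulatedRestart if row == "d"
    enqueued << row
  end
rescue SimulatedRestart
  # the job would be retried after the restart
end

# Second run: picks up from the stored counter.
queue_rows(rows, store, key) { |row| enqueued << row }
```

Because the counter is written after the row is enqueued, a crash between those two steps would enqueue that row again on the next run, so in this sketch the per-record work would need to tolerate an occasional duplicate.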
This creates a very long-running QueueRecordImportJob, which is usually not good practice. If this job fails to complete and the process restarts, Sidekiq will try to push it back into the queue (details here).
QueueRecordImportJob does not actually import the records. It simply calls RecordImportJob.perform_later, passing each row. This speeds up QueueRecordImportJob, and now we can have multiple Sidekiq workers processing individual records via RecordImportJob.
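The split between the job that enqueues and the job that imports can be illustrated without Sidekiq. In this rough sketch, Ruby's built-in Queue stands in for the Sidekiq queue, threads stand in for Sidekiq workers, and the import_record lambda (a made-up placeholder that just upcases the row) stands in for the body of RecordImportJob.

```ruby
jobs = Queue.new       # stand-in for the Sidekiq queue
imported = Queue.new   # thread-safe collection of finished work

# Stand-in for RecordImportJob#perform: the slow per-record work.
import_record = ->(row) { imported << row.upcase }

# Stand-in for several Sidekiq workers pulling jobs concurrently.
workers = 3.times.map do
  Thread.new do
    while (row = jobs.pop) != :done
      import_record.call(row)
    end
  end
end

# Stand-in for QueueRecordImportJob: it only enqueues, it never imports.
rows = %w[alice bob carol dave]
rows.each { |row| jobs << row }
workers.size.times { jobs << :done }   # one stop marker per worker
workers.each(&:join)

results = []
results << imported.pop until imported.empty?
```

The enqueuing loop stays fast because each iteration is only a push, while the record work runs in parallel and finishes in no particular order.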
To ensure that QueueRecordImportJob starts right away after a Sidekiq restart, we set it to run in a different queue with higher priority.
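To see why a separate, higher-priority queue helps, here is a small simulation of a worker that checks queues in a fixed order, which is how Sidekiq treats queues listed without weights. The queue names are hypothetical, and this is a model of the idea rather than Sidekiq's own code.

```ruby
# Hypothetical queue names, highest priority first.
QUEUE_ORDER = %w[import_queueing default].freeze

# Returns [queue_name, job] for the first job in the first non-empty queue,
# or nil when every queue is empty.
def next_job(queues)
  QUEUE_ORDER.each do |name|
    job = queues[name].shift
    return [name, job] if job
  end
  nil
end

# State after a restart: a backlog of record jobs plus the re-queued queuing job.
queues = {
  "default" => ["RecordImportJob(row 1)", "RecordImportJob(row 2)"],
  "import_queueing" => ["QueueRecordImportJob(import 1)"]
}

picked = next_job(queues)
```

Even with a backlog of record jobs, the queuing job is picked first, so the rest of the file starts flowing into the queue again without waiting for the backlog to drain.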