Bulk data import - part two
In a previous post I wrote about using Redis and Sidekiq to do bulk data imports. But as with all scalability challenges, that solution only works up to a certain point. What if we have very large imports with millions of records?
At that scale even queuing the Sidekiq jobs (one per record) can take a long time. What if we re-deploy code and restart our application server midway through? The jobs that were already queued will get processed, but it will be hard to tell how many records, and which ones, never made it into the queue.
Here is one way to improve this process. When a file with records to be imported is uploaded, we first save it to AWS S3. We then fire QueueRecordImportJob.perform_later and send an "import has begun" message back to the user.
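Here is a rough sketch of what that upload step could look like. The controller name, the IMPORT_BUCKET environment variable and the response message are just placeholders:

```ruby
class ImportsController < ApplicationController
  def create
    s3_key = "imports/#{SecureRandom.uuid}.csv"

    # Store the raw file in S3 first so a background job can
    # re-download it even after a process restart.
    Aws::S3::Resource.new
      .bucket(ENV.fetch("IMPORT_BUCKET"))
      .object(s3_key)
      .upload_file(params[:file].tempfile.path)

    QueueRecordImportJob.perform_later(s3_key)

    render json: { message: "Import has begun" }, status: :accepted
  end
end
```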
QueueRecordImportJob downloads the file from S3 and starts iterating through it. It keeps a counter (also stored in Redis) of the last row it finished. If the Sidekiq process restarts, QueueRecordImportJob will begin anew: it downloads the file from S3 again, checks the counter, and resumes processing the file from the next row.
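A minimal sketch of that resume-from-a-counter idea, assuming a global REDIS connection, a CSV file with headers and the same IMPORT_BUCKET variable as above (as described below, each row is handed off to RecordImportJob rather than imported inline):

```ruby
require "csv"
require "tmpdir"

class QueueRecordImportJob < ApplicationJob
  def perform(s3_key)
    counter_key = "import:#{s3_key}:last_row"
    last_row = REDIS.get(counter_key).to_i  # 0 when nothing has been queued yet

    path = download_from_s3(s3_key)

    CSV.foreach(path, headers: true).with_index(1) do |row, row_number|
      next if row_number <= last_row        # already queued before a restart

      RecordImportJob.perform_later(row.to_h)
      REDIS.set(counter_key, row_number)    # remember how far we got
    end
  end

  private

  def download_from_s3(s3_key)
    path = File.join(Dir.tmpdir, File.basename(s3_key))
    Aws::S3::Resource.new
      .bucket(ENV.fetch("IMPORT_BUCKET"))
      .object(s3_key)
      .download_file(path)
    path
  end
end
```

Because the counter is bumped after each perform_later, a crash between those two calls can enqueue the same row twice, so the per-record job should be idempotent.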
This creates a very long running QueueRecordImportJob, which is usually not a good practice. If this job fails to complete and the process restarts, Sidekiq will try to push it back into the queue (details here).
But QueueRecordImportJob does not actually import the records. It simply calls RecordImportJob.perform_later, passing it each row. This speeds up QueueRecordImportJob, and now we can have multiple Sidekiq workers processing individual records via RecordImportJob.
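What RecordImportJob does with each row depends entirely on the domain. As a placeholder, here is an idempotent upsert keyed on a made-up external_id column of a made-up Record model:

```ruby
class RecordImportJob < ApplicationJob
  def perform(attributes)
    # Upsert by a natural key so a row that gets queued twice
    # (for example around a restart) does not create a duplicate record.
    record = Record.find_or_initialize_by(external_id: attributes["external_id"])
    record.update!(attributes)
  end
end
```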
To ensure that QueueRecordImportJob starts right away after a Sidekiq restart, we set it to run in a different queue with a higher priority.
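In Sidekiq terms that usually means giving the job its own queue and weighting that queue when the worker process starts. The queue name and weights below are made up:

```ruby
class QueueRecordImportJob < ApplicationJob
  queue_as :imports_high  # dedicated queue, separate from the per-record jobs

  # ... perform as sketched above ...
end

# Start the Sidekiq process with that queue weighted above the others, so the
# queueing job is picked up almost immediately after a restart instead of
# waiting behind the backlog of RecordImportJobs:
#
#   bundle exec sidekiq -q imports_high,5 -q default,1
```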