Structuring Your ETL like a Queue

This is a follow-on to [[Performant ETL and SSIS Patterns]].  I really need to do a longer, explanatory post on this.  Two of the largest performance problems I see with SSIS packages is their lack of parallelism and the fact that they are written to run large batches during a defined time window.  The former is totally unnecessary, the latter is unecessary if you structure your processing smartly.  

There are lots of tricks to parallelize your SSIS workflows, but the easiest way to do it is to rethink your problem.  Look at your ETL processing as a series of manageable chunks that you can pick-off throughout the day.  Then you won't have one giant long running end-of-day task.  It will then be much easier to parallelize your chunks.  This is semantically the same as queue processing in your data tier or building your SSIS packages using the "Map Reduce" paradigm.

What does an ETL queue look like? 

  • The "queue" is a SQL Server table (or you can use Service Broker) that has a list of items to be processed.  This can be rows that have changed, have been inserted, ranges of new IDs to process, etc.  
  • Each package has a loop that picks off some number of elements from the queue and begins processing them 
  • A lock is held on those queue items to ensure no other instance of the package begins processing those rows.  
  • Do the necessary ETL
  • Mark the queue elements as processed and restart the loop.  

There are some things to look out for:

  • Many times parallel package execution will require certain things to run serially.  This can be handled with applocks that I've written about in the past.  
  • If you are batch-window constrained then use priority queues to ensure the longest-running, most important ETL tasks take precedence.