
ETL Best Practices

This is a summary post of the ETL "best practices" I've learned over the years, gathered up into one place.  These are geared primarily towards SQL Server and DTS/SSIS, but the principles apply to any ETL tool or DBMS.  

Support/Testing

  • Consider using database snapshots for testing data loads.  When there is a failure it is easy to roll back (a sketch follows this list).  Performance overhead will probably keep you from using this technique for production ETL loads.  
  • Handle as many data issues as possible programmatically.  It keeps support costs down. 
  • Error handling.  Lots of it.  Each error must be immediately identifiable and traceable.  I like using an error number system where blocks of numbers are assigned to each package/subpackage. 
  • Installability.  Don't hard-code paths, server names, or passwords; everything should be configurable via an XML file.  Consider building an MSI package to install the process.  Finally, have a clear backup plan for the ETL process, as well as a tested recovery plan. 
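
To illustrate the snapshot approach, here is a minimal sketch.  The database, logical file, and path names are placeholders; adjust them to your environment.

    -- Take a snapshot of the destination database before the test load.
    CREATE DATABASE SalesDW_PreLoad
        ON (NAME = SalesDW_Data, FILENAME = 'D:\Snapshots\SalesDW_PreLoad.ss')
        AS SNAPSHOT OF SalesDW;

    -- ...run the test load against SalesDW...

    -- If the load fails, revert the database to its pre-load state
    -- (any other snapshots of SalesDW must be dropped first).
    RESTORE DATABASE SalesDW
        FROM DATABASE_SNAPSHOT = 'SalesDW_PreLoad';

    -- Clean up the snapshot when testing is finished.
    DROP DATABASE SalesDW_PreLoad;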

Development

 

  • Use stored procedures.  Yes, you can use nifty .NET code in an SSIS package, but what happens when a dataflow can't be reused because the columns aren't identical?  You don't really have that problem with a stored proc.  Further, stored procs are not compiled code, so I can quickly make changes on the fly.  There are many other reasons why stored procedures really are the way to go in most cases.  
  • Workflows should be simple.  You don't get a raise for a nifty, convoluted graphical workflow.  A simple workflow is more tolerant of changes in the source data, and even in timing, and a failed step can usually be restarted much more easily.  Think incremental updates. 
  • Most people design a "full load" set of SSIS packages and then a separate set of "incremental load" packages.  Avoid that duplicate work by assuming every load is an incremental load.  The full load should merely "clear down" the destination tables first.  
  • Using TSQL's MERGE statement eliminates the need for the Lookup Transformation SSIS component (a sketch follows this list).  
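
Here is a minimal sketch of an incremental upsert with MERGE in place of a Lookup-plus-conditional-split pattern.  The staging and dimension table names and columns are hypothetical; a "full load" would simply clear down dbo.DimCustomer before running the same statement.

    -- Upsert the dimension from a staging table (names/columns are placeholders).
    MERGE dbo.DimCustomer AS tgt
    USING stage.Customer AS src
        ON tgt.CustomerKey = src.CustomerKey
    WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName
                   OR tgt.Region       <> src.Region) THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.Region       = src.Region
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerKey, CustomerName, Region)
        VALUES (src.CustomerKey, src.CustomerName, src.Region);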

 

SSIS Package Structure

  • Have one master package that calls only subpackages, one for each dimension and fact table.  This follows good OO principles.  
  • Each subpackage should have a clear delineation between the E, T, and L portions of the "process".  This makes adding a new dimension/fact table a simple matter of using an existing subpackage as your template.  
  • Your subpackages are all "table-oriented", so what happens when you need to run just the "Extracts" or just the "Transforms"?  Do you need to visit each subpackage?  No.  Each subpackage also needs to be "process-oriented", i.e. it needs to accept variables that tell it to execute just the E, just the T, just the L, or everything.  So in the case where we just want to test the T phase, we set the E and L variables to "Skip".  
  • Back to the master package: create three Sequence Containers, one each for the E, T, and L phases.  Now add Execute Package Tasks for all of the subpackages to each Sequence Container.  This means each subpackage is referenced (and partially executed) three times.  Each Sequence Container passes its execution-phase variable in to the subpackages it calls (a sketch of the pattern follows this list).  
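
The same phase-variable idea, sketched as a stored procedure for clarity (in SSIS the values would be passed to the child package as parent-package variables).  All object names and the 'Execute'/'Skip' convention are hypothetical.

    -- Hypothetical "subpackage" for one dimension, driven by phase variables.
    -- The master process calls it three times, setting one phase to 'Execute'
    -- and the others to 'Skip' on each call.
    CREATE PROCEDURE etl.LoadDimCustomer
        @RunExtract   varchar(10) = 'Execute',
        @RunTransform varchar(10) = 'Execute',
        @RunLoad      varchar(10) = 'Execute'
    AS
    BEGIN
        IF @RunExtract = 'Execute'
            EXEC etl.ExtractCustomer;    -- placeholder: pull source rows into staging
        IF @RunTransform = 'Execute'
            EXEC etl.TransformCustomer;  -- placeholder: cleanse/conform staged rows
        IF @RunLoad = 'Execute'
            EXEC etl.LoadCustomer;       -- placeholder: MERGE staged rows into the dimension
    END;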

Performance

  • When using a Data Flow Task, make sure the Source object starts returning data as quickly as possible.  This is different from making sure your query is tuned to run as fast as possible overall.  The sooner some data is passed from the source to the Data Flow Task, the sooner it can begin its transforms.  By default a data flow buffer holds 10,000 rows, and the transforms can begin work as soon as the first buffer fills.  So consider adding OPTION (FAST 10000) to your query (a sketch follows this list).  Also avoid an ORDER BY clause, or anything else (GROUP BY, TOP, etc.) that would cause the hint to be ignored.  The hint may make the query itself more expensive, but it should lead to better overall ETL performance.  YMMV. 
  • Consider using RunInOptimizedMode.  
  • If you have a high volume of INSERTs relative to UPDATEs, consider disabling or dropping your indexes prior to the load and rebuilding them afterwards (a sketch follows this list).  
  • Use hashes to speed up incremental data loads (a sketch follows this list).
  • Know your Perfmon counters, especially:
    • LogicalDisk
    • Processor
    • Bulk Copy Rows/sec
    • Log Bytes Flushed/sec (to measure transaction log activity)
    • Bytes Total/sec (for network bandwidth)
  • Understand your physical environment, especially storage.  
  • Know your waits (a wait-stats query sketch follows this list).
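
A sketch of the FAST hint on a Data Flow source query; the table, columns, and filter are placeholders.

    -- Source query for a Data Flow: start streaming rows to the pipeline ASAP.
    -- FAST 10000 optimizes the plan for returning the first 10,000 rows quickly,
    -- matching the default data flow buffer of 10,000 rows.
    SELECT s.SaleID, s.CustomerKey, s.SaleDate, s.Amount
    FROM   dbo.SalesStaging AS s
    WHERE  s.LoadBatchID = 42            -- no ORDER BY / GROUP BY / TOP here
    OPTION (FAST 10000);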
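
One way to remove index maintenance from an INSERT-heavy load, sketched with hypothetical names.  Disabling a nonclustered index and rebuilding it afterwards avoids re-scripting its definition; don't disable the clustered index, since that makes the table inaccessible.

    -- Before the load: disable nonclustered indexes on the destination.
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;

    -- ...bulk load dbo.FactSales...

    -- After the load: rebuild (which also re-enables) the index.
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales REBUILD;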
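
A sketch of the hash technique, assuming SQL Server 2012+ for SHA2_256 and hypothetical table/column names: persist a hash of the comparable columns on the dimension, then compare hashes instead of every column when detecting changed rows.

    -- Persisted hash of the "change detection" columns.
    ALTER TABLE dbo.DimCustomer
        ADD RowHash AS CAST(HASHBYTES('SHA2_256',
                ISNULL(CustomerName, '') + '|' + ISNULL(Region, ''))
            AS varbinary(32)) PERSISTED;

    -- Incremental detection: only rows whose hash differs need an UPDATE.
    SELECT src.CustomerKey
    FROM   stage.Customer  AS src
    JOIN   dbo.DimCustomer AS tgt
           ON tgt.CustomerKey = src.CustomerKey
    WHERE  tgt.RowHash <> HASHBYTES('SHA2_256',
               ISNULL(src.CustomerName, '') + '|' + ISNULL(src.Region, ''));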
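
A quick way to see your top waits using sys.dm_os_wait_stats; the exclusion list of benign background waits below is abbreviated.

    -- Top waits since the last restart (or since the stats were last cleared).
    SELECT TOP (10)
           wait_type,
           wait_time_ms,
           signal_wait_time_ms,
           waiting_tasks_count
    FROM   sys.dm_os_wait_stats
    WHERE  wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                             'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT',
                             'CHECKPOINT_QUEUE', 'BROKER_TASK_STOP')
    ORDER BY wait_time_ms DESC;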

Pre-Sorting Data, Indexing, and Partitioning

  • Consider pre-sorting data before doing any INSERTs.  The data should be sorted on the table's clustered index key, or, if you have removed your indexes prior to the load, on the eventual clustered index key.  Use the ORDER hint so the optimizer knows the data is already sorted (a sketch follows this list).  The insert is minimally logged into an empty table, or into any table if trace flag 610 is set.  A single stream will be faster with TABLOCK too.  All of this helps with fragmentation as well. 
  • Avoid fragmentation in your destination tables as much as possible.  You want your data pages packed as full as possible, which ultimately means you will keep more data in the buffer cache.  You accomplish this by pre-sorting the data prior to the INSERT. 
  • Use a different filegroup on a different set of spindles to reduce fragmentation even further during INSERTs. 
  • Insert your data without a clustered index...in most cases.  Then build the clustered index with MAXDOP = 1 (we want sequential processing of the data), creating it on the destination filegroup that backs the target partition.  Then use SWITCH, which is a metadata-only operation (a sketch follows this list). 
  • Use Partitioned Tables whenever possible.  
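
A sketch of a pre-sorted, minimally logged bulk load with BULK INSERT.  The file, table, and column names are placeholders, and the destination's clustered index is assumed to be on (SaleDate, SaleID).

    -- The data file is already sorted to match the destination clustered index.
    -- ORDER tells the engine the sort is guaranteed; TABLOCK enables minimal
    -- logging (into an empty table, or any table under trace flag 610).
    BULK INSERT dbo.FactSales
    FROM 'D:\Loads\FactSales_20120101.dat'
    WITH (
        ORDER (SaleDate ASC, SaleID ASC),
        TABLOCK
    );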
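
A sketch of the heap-load-then-SWITCH pattern, with hypothetical names.  The staging table must live on the same filegroup as the target partition and needs a CHECK constraint matching that partition's boundary.

    -- 1. Bulk load the staging heap (no indexes) on the destination filegroup.
    -- 2. Build the clustered index serially so the rows are processed in order.
    CREATE CLUSTERED INDEX CIX_FactSales_Stage
        ON dbo.FactSales_Stage (SaleDate, SaleID)
        WITH (MAXDOP = 1)
        ON FG_2012Q1;    -- same filegroup as the target partition

    -- 3. Switch the fully built staging table into the partitioned fact table:
    --    a metadata-only operation.
    ALTER TABLE dbo.FactSales_Stage
        SWITCH TO dbo.FactSales PARTITION 5;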

Logging and Locking

  • Know how to bulk load using the various methods and their benefits/drawbacks.  There are many different tools, including SSIS, bcp, BULK INSERT, SELECT INTO, and SqlBulkCopy (ADO.NET). 
  • Know everything you can about minimally-logged operations.  Hopefully you can take advantage of them.  
  • Have a batching/"chunking" strategy.  Understand BATCHSIZE vs ROWS_PER_BATCH (a sketch follows this list).  
  • Understand Lock Escalation
  • Try to design your loads to be multi-threaded.  First, ensure you can load multiple tables concurrently.  For very large loads into a single table you want multiple streams.  Pre-sort the incoming data streams and make sure there is no overlapping data across streams, which would otherwise cause the streams to block each other.  
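
A sketch of the chunking and lock-escalation options, with placeholder file and table names.  BATCHSIZE commits every N rows (each batch is its own transaction), while ROWS_PER_BATCH is only a cardinality hint for a single-transaction load; the two serve different purposes.

    -- Option A: commit in chunks of 100,000 rows, keeping the transaction log
    -- and lock footprint of any single batch small.
    BULK INSERT dbo.FactSales
    FROM 'D:\Loads\FactSales_20120101.dat'
    WITH (BATCHSIZE = 100000, TABLOCK);

    -- Option B: load in one batch, but tell the optimizer roughly how many
    -- rows to expect.
    BULK INSERT dbo.FactSales
    FROM 'D:\Loads\FactSales_20120101.dat'
    WITH (ROWS_PER_BATCH = 5000000, TABLOCK);

    -- Lock escalation can also be controlled per table (SQL Server 2008+).
    ALTER TABLE dbo.FactSales SET (LOCK_ESCALATION = AUTO);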

