
ETL

SSIS Subpackages and Version Control

I use subpackages in SSIS for just about everything.  But I've noticed that I'm in the minority on this.  Instead, I see these huge monolithic SSIS/DTS packages that attempt to do a lot of different things.  Many people have written much about why you should consider moving more logic from your main package(s) to subpackages.  I'll briefly rehash that here.  One reason for using subpackages I rarely see mentioned is that it really helps with version control.  

Common Reasons for Subpackages

  1. Separation of concerns.  It is best to keep your "extract" logic separate from your "transform" logic.  This is just good fundamental design.  If you would normally use a separate function or module in your code, consider using a new SSIS package. 
  2. It aids unit testing, which is already extremely difficult with SSIS.  

One Big Pitfall with Subpackages

Just like with any other modern programming language, subpackages/modules/functions have certain scoping and lifetime issues that you need to account for.  A child package has access to everything declared in the parent package.  The problem is that child packages cannot alter certain variables in parent packages.  Also, child packages cannot pass messages back to the parent other than a simple SUCCESS or FAILURE.  These issues don't come up very often, and with some creativity you can bypass them by storing "state" somewhere else instead of passing it around.  
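
One hedged way to handle that (a sketch, not something from the original post): persist whatever "state" the packages need to share in a small control table that both the parent and child packages read and write through Execute SQL Tasks.  All names here are hypothetical.

    -- Hypothetical control table for sharing ETL "state" between parent and child packages
    CREATE TABLE dbo.EtlPackageState (
        RunID       uniqueidentifier NOT NULL,   -- one value per master-package execution
        PackageName sysname          NOT NULL,
        StateKey    varchar(100)     NOT NULL,
        StateValue  nvarchar(4000)   NULL,
        UpdatedAt   datetime2(0)     NOT NULL DEFAULT (SYSUTCDATETIME()),
        CONSTRAINT PK_EtlPackageState PRIMARY KEY (RunID, PackageName, StateKey)
    );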

Version Control

The most overlooked reason for using subpackages is version control.  SSIS is horrible when it comes to version control.  If you use any of the source control plug-ins for Visual Studio you'll immediately notice that even VIEWING the properties of a task requires the .dtsx file to be checked out.  How ridiculous is that?  And even simple changes like changing a default package variable value can result in many esoteric changes to the underlying .dtsx file (which is XML).  So a visual diff is rendered worthless if you want to do something like an "svn blame" (determine who broke what, when, using a visual compare).  

SSIS Team Development Best Practices

  1. Avoid having multiple team members work on the same .dtsx packages during a given release/sprint.  Due to the visual merge/diff issues just mentioned, you WILL have a difficult time merging discrete changes to different branches.  Plan your development accordingly.  In fact, don't even bother to try to merge some changes to other branches.  It won't work.  For merging purposes, treat your .dtsx files as binary objects, not as source code.  
  2. Use BIDS Helper and its SmartDiff utility.  It will help you distinguish the changes that really matter from mere formatting changes and the internal SSIS versioning noise within the XML.  I can't say enough good things about it.  
  3. Your developers should make small, discrete changes per commit and avoid mixing layout changes with functional changes.  You want to limit how much change is done per commit.  This is why subpackages are also so critical.  If you must go in and reorganize the control flow layout, just make sure you do that in a separate commit.  This makes finding logic bugs easier later.  
  4. Every package has various Version properties.  Depending on your build tool and your environment these may be autopopulated for you.  Regardless, I find it helps to populate these values whenever I make a change to a package.  If you automate this, or at least try to enforce this rule on your team, you have a nice version control mechanism embedded in the file.  I know that embedding versioning info in source code is generally frowned upon...we can get that information from the VCS repo...but I think it makes sense with .dtsx XML files that have the inherent versioning problems I've mentioned thus far in this blog post.  
  5. It's also interesting to note that if you use the project-based deployment mechanism (.ispac) using Integration Services Catalogs you get some additional versioning information that can be helpful when troubleshooting.  You can see this by right clicking your Project under Integration Services Catalogs and selecting "Versions...".  In the screenshot I can see that Version: 2.6.1.13 was deployed at '4/1/2014 at 10:38AM'.  
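
If you prefer T-SQL to the SSMS dialog, something like the following should surface the same version history.  This is a sketch that assumes the SQL Server 2012+ SSISDB catalog views (specifically catalog.object_versions); the project name is hypothetical.

    SELECT ov.object_name, ov.object_version_lsn, ov.description,
           ov.created_by, ov.created_time
    FROM SSISDB.catalog.object_versions AS ov
    WHERE ov.object_name = 'MyEtlProject'   -- hypothetical project name
    ORDER BY ov.created_time DESC;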

You have just read "SSIS Subpackages and Version Control" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Getting SSIS Packages to Run Faster in Visual Studio

I thought this was commonly known but I find that quite a few SSIS developers do not know this.  SSIS packages are known to be horrendously slow when running in debug mode (F5) in Visual Studio.  It probably has a lot to do with all of the UI painting.  If you really don't need to see debugging messages (perhaps you have a good logging framework you can rely on instead) and are really just executing your packages for a smoke test or whatever, there is a faster way: Ctrl+F5 (Start Without Debugging on the Debug menu).  The package will execute in a cmd window using dtexec.  Radically better runtimes (YMMV, but mine is an order of magnitude faster) and also far less system resource consumption (memory utilization especially).  

This is a huge timesaver for me.  

CHANGE_TRACKING_CURRENT_VERSION() not always reliable with Log Shipping

There may be a "bug" with CHANGE_TRACKING_CURRENT_VERSION() when using Change Tracking and log shipping.  The returned value will not be accurate if ALLOW_SNAPSHOT_ISOLATION is set to OFF.  Change Tracking is one of a million ways to query for data changes and then do interesting things with those changes...such as ETL.  The best article I've ever read on Change Tracking (CT) is by the folks at Solidq.  Here are parts 1, 2, and 3.  

With the ETL use case the process is to take the CT value from the last ETL run and use that as the basis for the new changes you want to see using CHANGETABLE (CHANGES blah, @last_synchronization_version).  The process Microsoft outlines specifically says that, "...we recommend that you use snapshot isolation. This will help to ensure consistency of change information and avoid race conditions..."  
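
In T-SQL the pattern looks roughly like this.  A minimal sketch, assuming a hypothetical dbo.Orders table with change tracking enabled and a @last_synchronization_version value persisted by the previous ETL run:

    DECLARE @last_synchronization_version bigint = 12345;   -- value saved by the previous run

    SELECT ct.SYS_CHANGE_OPERATION, ct.SYS_CHANGE_VERSION, o.*
    FROM CHANGETABLE(CHANGES dbo.Orders, @last_synchronization_version) AS ct
    LEFT JOIN dbo.Orders AS o
           ON o.OrderID = ct.OrderID;      -- deleted rows will have no match

    -- capture the baseline version for the next run
    SELECT CHANGE_TRACKING_CURRENT_VERSION();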

That's the problem...not EVERY db in the world can easily be converted to snapshot isolation without a lot of code rewrite and regression testing.  

It's also common for your ETL to run on a "copy" of your OLTP prod system.  For instance, I've seen lots of customers run their ETL processes on readable secondaries, transactional replication subscribers, or log shipping destinations.  In these situations you want to be able to run your CT queries against your "replicas" to reduce load on your primary server.  Ah, but that may not work if you are not using ALLOW_SNAPSHOT_ISOLATION ON.  

I don't have the time to fully repro this for a blog post or a Microsoft Connect bug, but here are some screenshots of the issue and a possible solution.  

First, it's important to understand CHANGE_TRACKING_CURRENT_VERSION().  By definition it should always give you the same answer as SELECT MAX(commit_ts) FROM sys.dm_tran_commit_table, regardless of isolation levels.  Here is a screenshot of my prod system that has many, active databases ... some with snapshot isolation ON, some with it OFF.  Note that the values are the same, regardless of whether snapshot isolation is ON or OFF.  There are cases where the values will be slightly off because transactions may be committing between when the first and second queries run.  Note that the MAX() query can take a bit of time to run on a busy system, but I've never seen it cause blocking.  
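
The comparison I'm describing is simply the following, run in each database of interest (a sketch; nothing is assumed beyond change tracking being enabled):

    SELECT CHANGE_TRACKING_CURRENT_VERSION()                     AS ctcv,
           (SELECT MAX(commit_ts) FROM sys.dm_tran_commit_table) AS max_commit_ts;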

So on a "writeable" db, regardless of isolation level, CHANGE_TRACKING_CURRENT_VERSION() works great.  Here is the result from running the same queries on my log shipped replica.  We use this replica as the basis for ETL daily loads.  We only restore log backups around 3 AM daily, so during the day we can assume that the replica is out of sync because we have the log restores turned OFF.  

Note the wild discrepancy between CTCV() and the MAX query.  Knowing what you know about my setup, clearly the MAX query is right but CTCV() is not in cases where ALLOW_SNAPSHOT_ISOLATION is OFF.  

Why?  

When I first saw this symptom I had no idea what could possibly cause it.  We have about 50 dbs on these servers and each is configured slightly differently regarding snapshot isolation settings, ANSI settings, whether Service Broker is enabled, etc.  After an hour or so I realized the pattern was that the erroneous data only happened on our older "legacy" dbs where we can't/didn't ALLOW snapshot isolation.  

I double-checked the MS documentation for CHANGE_TRACKING_CURRENT_VERSION() to see if I missed anything that would account for this anomaly.  Otherwise this would be a bug.  I found nothing. 

But I was interested in why this would happen.  We all know that we can't view the source code for system functions to see if we can spot the bug, but perhaps instead we could see what sys.dm_tran_commit_table is "looking at" under the covers.  Sometimes you can do sp_helptext on DMVs...and this is one of them.  Here's what you'll see:  

    --sp_helptext 'sys.dm_tran_commit_table'
    SELECT commit_ts, xdes_id, commit_lbn, commit_csn, commit_time
    FROM sys.syscommittab (READCOMMITTEDLOCK)
    UNION
    SELECT commit_ts, xdes_id, commit_lbn, commit_csn, commit_time
    FROM OpenRowset (table SYSCOMMITTABLE, db_id(), 0, 0)

That's an interesting little query, isn't it?  Frankly, I found myself doing all kinds of googling on what OpenRowset (table) does under the covers.  I definitely got a little shiny object syndrome looking at that code.  But that's another blog post.  Trust me, you can't run that query from SSMS directly, unfortunately.  

My thinking is that it is POSSIBLE that CTCV() is doing something like this under the covers and is getting confused and hence is returning the wrong information.  Again, I have no way of knowing without reviewing Microsoft source code...and that ain't happenin'.  

My suggestion is that if you are:

  • using Change Tracking
  • with Log Shipping
  • and are not using at least ALLOW_SNAPSHOT_ISOLATION

...then you should not rely on CHANGE_TRACKING_CURRENT_VERSION() on your log shipped copy to accurately determine the last committed row change that exists on your subscriber.  Use MAX(commit_ts) FROM sys.dm_tran_commit_table instead.
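
A hedged workaround sketch for the log shipped copy:

    -- derive the ETL baseline version from the commit table itself,
    -- rather than from CHANGE_TRACKING_CURRENT_VERSION()
    SELECT MAX(commit_ts) AS current_version
    FROM sys.dm_tran_commit_table;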

I hope this helps someone in the future.  

Obnoxious SSIS Designer Layout Issue

I don't want to bloviate too much, but there are some annoyances with SSIS, most of them well documented on the blogosphere.  My biggest annoyance is that the mouse is required for EVERYTHING.  Want to do a task conditionally?  That would be an IF branch in any programming language, but because SSIS is so visual it requires lots of mouse clicks.  

Structuring Your ETL like a Queue

This is a follow-on to Performant ETL and SSIS Patterns.  I really need to do a longer, explanatory post on this.  Two of the largest performance problems I see with SSIS packages are their lack of parallelism and the fact that they are written to run large batches during a defined time window.  The former is totally unnecessary; the latter is unnecessary if you structure your processing smartly.  

Performant ETL and SSIS Patterns

It's a helluva job market out there right now if you have ETL, SSIS, DataStage, or equivalent experience.  I guess you can make some generalized deductions about this:

  • More companies are trying to integrate their data stores.
  • More companies need to copy data from OLTP to OLAP systems.  
  • It's hard to find good ETL people.  

Unfortunately, too many job postings ask candidates to have specific ETL tooling experience such as SSIS or DataStage.  This is unfortunate.  Too many candidates have great tooling experience but have very little grounding in ETL best practices, regardless of chosen tool.  I've been called in a lot lately to help fix various ETL processes.  Each one is using a different ETL tool and each one is exhibiting the same terrible ETL anti-patterns.  When I fix those anti-patterns everything just magically runs better.  I have yet to touch actual ETL code.  

To quickly summarize the most egregious issue...developers are doing too much work in the ETL tool and not enough work in their RDBMS.  The RDBMS will almost always do things faster than the ETL tool can.  There are a few exceptions to this rule (string manipulation and regexes, for instance, are better handled in most ETL tools than in SQL).   

I've written tons of blog posts (here's a link to an entire series of ETL Best Practices) about how to do performant ETL with good patterns.  However, I find myself constantly searching my blog to find a succinct list of things to check whenever I'm brought into another ETL engagement.  Here's the biggies:

  • ETL Best Practices.  I'm not going to list all of them.  That's a long blog post.  You should go reference that.  
  • Do more in SQL and less in the ETL tool.  Examples:
    • SSIS is not available in SQL Express 
    • Sorts are better handled in the RDBMS.  Use an ORDER BY clause on your SQL statement instead of relying on your ETL tool to sort.  If using SSIS, mark the OLEDB source metadata on the data source as sorted (see the sketch after this list).  
    • Any set-based, relational operation will be faster in SQL than in your ETL tool.  Your RDBMS will likely automatically determine the best parallelism and memory management to use.  On the contrary, you'll never get this right in your ETL tool.  
    • Big packages are not good.  If your SSIS package is swapping to disk then you are not running efficiently.  There is an SSIS performance counter called "Buffers Spooled".  It should always stay at 0.  Always.  Otherwise you are using your swap file.  
    • If you run Integration Services on your SQL Server then use the "SQL Server Destination" vs the "OLEDB destination".  Performance is markedly better.  
  • Understand and optimize your data structures.  Examples:
    • Understand locking, Lock Escalation, and transaction isolation semantics for your given RDBMS.  If using SSIS then understand the difference between BATCHSIZE and ROWS_PER_BATCH and ensure they are optimized for your destination RDBMS.  
    • Partitioned Tables.  Understand them and try to use them if your destination is SQL Server.  The SWITCH statement is your friend, maybe your best friend.  
    • Make sure you are always doing Minimally Logged Operations whenever possible.  And always verify with testing and performance monitoring.  Every RDBMS has different rules for what equates to a minimally logged operation.
    • Determine whether it is faster to disable/drop your indexes before your data loads.  There is lots of conflicting guidance and there is no substitute for testing.  I've found that every engagement needs a different setting.   Here is some additional guidance you may not find elsewhere:
      • A commit size of 0 is fastest on heaps
      • if you can't use 0 because you have concurrency concerns then use the highest value you can to reduce the overhead of multiple batches and transaction control.  
      • A commit size of 0 on a clustered index is a bad idea because all incoming rows must be sorted.  This will likely cause spooling to tempdb which should be avoided.  
  • There are network-related settings that are invaluable for performance:
    • jumbo frames will increase your packet payload from 1500 bytes/frame to 9000 bytes/frame.  
    • Change sqlConnection.PacketSize.  The default is 4096 bytes.  That's small for moving around large amounts of data.  32767 would be better.  
    • use network affinity at the OS level.  
  • When using the OLEDB source, always set the Data Access Mode to "SQL Command".  Never "table or view".  Performance is almost always better.  The reason is that "SQL Command" calls sp_prepare under the covers and therefore has a better shot at getting a good execution plan, while "table or view" does the equivalent of a SET ROWCOUNT 1 to get its column metadata, which can lead to bad execution plans.  
  • You should seriously consider restructuring your ETL processes so they work like a queue.  I'll cover this in the next post, Structuring Your ETL like a Queue.  
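
To illustrate the "push the sort to the RDBMS" point above, here is a sketch of what a source query for an OLEDB source in "SQL Command" mode might look like.  The table, columns, and date filter are hypothetical; after using a query like this you mark the source output as sorted (IsSorted/SortKeyPosition) in the SSIS advanced editor instead of adding a Sort transformation.

    -- hypothetical source query; the RDBMS does the sort, not the data flow
    SELECT CustomerID, OrderID, OrderDate, Amount
    FROM dbo.Orders
    WHERE OrderDate >= '20140101'
    ORDER BY CustomerID, OrderID;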

ETL Best Practices

This is the summary post of my ETL "Best Practices" I've learned over the years that I wanted to gather up into one place.  These are geared primarily towards SQL Server and DTS/SSIS but the principles are useful using any ETL tool/DBMS.  

Support/Testing

  • Consider using database snapshots for testing data loads.  When there is a failure it's easy to roll back (see the sketch after this list).  Performance will probably hinder you from implementing this technique for production ETL loads.  
  • Handle as many data issues as possible programmatically.  Keep support costs down. 
  • Error handling.  Lots of it.  Each error must be immediately identifiable and traceable.  I like using an error number system where blocks of numbers are assigned to each package/subpackage. 
  • Installability.  Don't hard code paths, server names, passwords.  Everything should be configurable via an XML file.  Also consider building an MSI package to install it.  Also, have a clear backup plan for the ETL process, as well as a tested recovery plan. 
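
A minimal sketch of the database snapshot technique, assuming a hypothetical staging database (StageDW), logical file name, and snapshot path; create the snapshot before the test load and revert to it if the load fails.

    CREATE DATABASE StageDW_PreLoad
    ON (NAME = StageDW_Data, FILENAME = 'C:\snapshots\StageDW_PreLoad.ss')
    AS SNAPSHOT OF StageDW;

    -- ...run the test load; if it fails, revert:
    -- RESTORE DATABASE StageDW FROM DATABASE_SNAPSHOT = 'StageDW_PreLoad';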

Development

 

  • Use stored procedures.  Yeah, you can use nifty .NET code in an SSIS package, but what happens when a dataflow can't be reused because the columns aren't identical?  We don't really have this problem with a stored proc.  Further, stored procs are not compiled code so I can quickly make changes on the fly.  There are so many other reasons why stored procedures really are the way to go in most cases.  
  • Workflows should be simple.  You don't get a raise for a nifty, convoluted graphical workflow.  If it is simple it is more likely to be tolerant to changes in the source data and even timing.  It is likely a failed step can be restarted much more easily.  Think incremental updates. 
  • Most people design a "full load" set of SSIS packages and then "incremental load" packages.  Avoid that work by assuming every load will be an incremental load.  The full load should merely "clear down" the destination tables first.  
  • Using TSQL's MERGE Statement eliminates the need for the Lookup Transformation SSIS component.  
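
A minimal sketch of that MERGE pattern, assuming hypothetical staging and dimension tables keyed on CustomerID:

    MERGE dbo.DimCustomer AS tgt
    USING stg.Customer AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
        UPDATE SET tgt.CustomerName = src.CustomerName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, CustomerName)
        VALUES (src.CustomerID, src.CustomerName);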

 

SSIS Package Structure

  • Have one master package that calls only subpackages, one each for each dimension and fact table.  This follows good OO principles.  
  • Each subpackage should have a clear delineation between the E, T, and L portions of the "process".  This makes adding a new dimension/fact table a simple matter of using an existing subpackage as your template.  
  • Your subpackages are all "table-oriented", so what happens when you need to run just all of the "Extracts" or all of the "Transforms"?  Do I need to visit each subpackage?  No.  Each subpackage needs to be "process-oriented", i.e., it needs to accept a variable that tells the subpackage to execute just the E, just the T, just the L, or everything.  So in the case where we just want to test the T phase, we set the E and L variables to "Skip".  
  • Back to the master package...we create 3 Sequence Containers, one each for E, T, and L phases.  Now add the Execute Package Tasks for all of the subpackages in each Sequence Container.  This means each subpackage is referenced/partially-executed 3 times.  Each Sequence Container has an associated execution phase variable that it sends in to the subpackage.  

Performance

  • When using a Data Flow Task make sure the Source object is returning data as quickly as possible.  This is different than making sure your query is tuned to run as fast as possible.  The sooner some data is passed from the source to the Data Flow Task, the sooner it can begin its transforms.  In most cases the data flow buffer holds 10000 rows and the task can begin its work once it fills a buffer.  So consider adding OPTION (FAST 10000) to your query (see the sketch after this list).  Also, avoid an ORDER BY clause, or anything else (GROUP BY, TOP, etc.) that forces the whole result set to be processed before the first rows can stream, since that defeats the hint.  The OPTION will make the query potentially more expensive, but should lead to better overall ETL performance.  YMMV. 
  • Consider using RunInOptimizedMode.  
  • If you have a high volume of INSERTs relative to UPDATEs consider removing your indexes prior to the load and re-indexing afterwards.  
  • Use Hashes to Speed Up Incremental Data Loads
  • Know your Perfmon Counters, especially:
    • Logical disk
    • Processor
    • Bulk copy rows/sec
    • log bytes flushed/sec to measure tran log activity
    • bytes total/sec for bandwidth
  • Understand your physical environment, especially storage.  
  • Know your waits
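
Here is the FAST hint sketch promised in the first bullet above; the table and columns are hypothetical.

    -- the optimizer favors a plan that streams the first 10000 rows quickly,
    -- so the data flow buffers start filling sooner
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM dbo.Orders
    OPTION (FAST 10000);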

Pre-Sorting Data, Indexing, and Partitioning

  • Consider pre-sorting data before doing any INSERTs (see the sketch after this list).  The data should be sorted based on the clustered index on the table or, if you have removed your indexes prior to the load, the eventual clustered index.  Use the ORDER hint so the optimizer knows this fact.  This is minimally logged to an empty table, or to any table if trace flag 610 is set.  Single streams will be faster with TABLOCK too.  All of this helps with fragmentation as well. 
  • Avoid fragmentation in your destination tables as much as possible.  You want your data pages as packed with data as possible.  Ultimately this means you will keep as much data in the buffer cache as possible.  You accomplish this by presorting the data prior to the INSERT. 
  • Use a different filegroup on a different set of spindles to reduce fragmentation even further during INSERTs. 
  • Insert your data without a clustered index...in most cases.  Then add your clustered index with MAXDOP=1 (we want sequential processing of the data), building it against the partition on the destination filegroup.  Then use SWITCH, which will be a metadata-only operation. 
  • Use partitioned tables whenever possible.  
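
Here is the pre-sorted load sketch referenced in the first bullet above.  It assumes a hypothetical data file that is already sorted on (OrderDate, OrderID), matching the destination's (eventual) clustered index; with TABLOCK and an empty target (or trace flag 610 on older versions) the load can be minimally logged.

    BULK INSERT dbo.FactOrders
    FROM 'C:\etl\FactOrders_presorted.dat'   -- hypothetical, pre-sorted file
    WITH (
        TABLOCK,
        ORDER (OrderDate, OrderID)           -- tells the engine the data is already sorted
    );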

Logging and Locking

  • Know how to bulk load using various methods and their benefits/drawbacks.  There are many different tools including:  SSIS, BCP, BULK INSERT, SELECT INTO, SQLBulkCopy(ADO.NET). 
  • Know everything you can about minimally-logged operations.  Hopefully you can take advantage of them.  
  • Have a batching/"chunking" strategy.  
  • Understand Lock Escalation
  • You want to try to design your loads to be multi-threaded.  First, ensure you can load multiple tables concurrently.  For very large loads to a single table you want multiple streams.  Pre-sort the incoming data streams, ensuring no overlapping data across streams, which would otherwise cause streams to block each other.  

DTS and SSIS

Lock Escalation

Understanding lock escalation can help you design more performant ETL data loads (or really, design any SQL data access in a more performant manner).  I briefly mentioned some information on lock escalation here.  Now I want to cover some additional details.  
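
One concrete knob worth knowing (a sketch, assuming SQL Server 2008 or later and a hypothetical table name): escalation behavior can be controlled per table, and AUTO allows escalation to the partition level on partitioned tables instead of locking the whole table.

    ALTER TABLE dbo.FactOrders SET (LOCK_ESCALATION = AUTO);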

Partitioned Tables

When you are designing your ETL process try to use partitioned tables whenever possible.  Here is a little primer I wrote on Partitioning a long time ago.   A fact table can be loaded offline then added to the partition scheme in a matter of milliseconds with no performance penalty for users executing queries.   This involves SWITCH (requires one side of the swap to be empty), SPLIT, and MERGE.  

A switched out partition is just a regular table.  Immediately after the switch out, add the check constraint needed to match the target partition so the data can be switched back in.  SWITCH is metadata only, so your old partition can still be active while your ETL process is running without fear of locking issues.  This increases concurrency dramatically, which may someday be a requirement for you, even if it isn't today. 

You have to design everything correctly so that data does not move from one partition to another.  SWITCH won't cause data movement; you are really just updating the metadata.  SPLIT can cause data movement if new, empty partitions are not the result of the SPLIT.  Same with MERGE: use it with at least one empty partition.  Always use the same filegroup for the source and target partitions.  Again, we don't want data to move to another file.  A SWITCH does require a SCH-M lock on both the source and target, meaning that you can have blocking issues if you do this on a busy transactional table.  But again, this should occur quickly.  
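
A minimal sketch of the switch-in pattern, assuming a hypothetical fact table partitioned by month, a staging table on the same filegroup, and an empty target partition (the partition number and date range are illustrative):

    -- constraint must match the target partition's boundaries before the switch in
    ALTER TABLE dbo.FactOrders_Staging
        ADD CONSTRAINT ck_FactOrders_Staging_Month
        CHECK (OrderDate >= '20140301' AND OrderDate < '20140401' AND OrderDate IS NOT NULL);

    -- metadata-only operation; partition 15 is illustrative
    ALTER TABLE dbo.FactOrders_Staging
        SWITCH TO dbo.FactOrders PARTITION 15;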

After a SWITCH, run UPDATE STATISTICS, since stats are calculated at the table level. 

REBUILD INDEX can be done while the partition is switched out.  If you need to load data to multiple partitions, switch out each partition to its own table, making concurrent load operations much easier. 

One drawback...it seems like a sort always happens regardless of whether you presort. This could be because I'm using SQL 2005, or could be because I'm doing something incorrectly.  I only determined this because I always check my load performance.  

BATCHSIZE vs ROWS_PER_BATCH

There are two interesting SSIS settings that can really help with performance, BATCHSIZE and ROWS_PER_BATCH.  
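
The same two knobs surface in T-SQL bulk loads, so a hedged BULK INSERT sketch (hypothetical file and table) helps show what they mean: BATCHSIZE controls how many rows are committed per transaction, while ROWS_PER_BATCH is only a hint about the total row count that helps the engine plan the load (generally you use one or the other, not both).

    BULK INSERT dbo.FactOrders
    FROM 'C:\etl\FactOrders.dat'     -- hypothetical data file
    WITH (
        TABLOCK,
        BATCHSIZE = 100000           -- commit every 100,000 rows
    );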
