DaveWentzel.com            All Things Data

August 2011

Free Space on Mount Points

It seems overly difficult to write a quick script to check the free space on a mount point, mostly because mounted volumes have no drive letter, so the usual drive-based checks miss them.  It turns out there is a simple way to do this using PowerShell: win32_volume reports every volume, mount points included.  

get-wmiobject win32_volume -filter "FileSystem='NTFS'" | select name, capacity, freespace

Sequential or Random Reads?

Hopefully you already know the difference between a sequential and random read.  Sequential reads are better, especially on rotational media (SSDs seem to be less sensitive to the issue in my limited experience).  As DBAs we are taught to be careful and avoid random read access whenever possible.  A simple example is that we all learn to separate data files from log files: the transaction log is written sequentially, and putting it on the same spindles as randomly-accessed data files destroys that sequential pattern.

Do You Explicitly Drop Your Temp Tables?

Do you explicitly drop your #TempTables after you are through with them in your stored procedures?  Whenever I'm asked to performance tune a stored procedure and I see an explicit DROP #TempTable at the bottom of the code, I remove it.  Another developer asked me today why I do this.  I just assumed this would be common sense, but I guess not.  

4GL coders were taught to close object references at the end of their code block.  This isn't taught as much these days since most languages have garbage collection that handles it for you.  The same principle applies to stored procedures: let SQL Server clean up after you.  I could provide some test scripts using the standard FOO and BAR tables, but just thinking through the logic should be enough to convince you that an explicit DROP is not needed.  

Performing a DROP, no matter how small the temp table, still takes some amount of time.  While the DROP is executing, the procedure may still be holding locks (depending on your isolation level, etc.) that it no longer needs.  Instead, let the proc complete and allow SQL Server to naturally clean up the temporary objects for you.  Remember that #TempTables are scoped to the code block that created them, as well as any children; when the procedure execution completes they go out of scope and are destroyed naturally.  
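A minimal sketch (the procedure and the dbo.Orders table are made up for illustration):

CREATE PROCEDURE dbo.GetRecentOrders
AS
BEGIN
    CREATE TABLE #RecentOrders (OrderID int, OrderDate datetime);

    INSERT INTO #RecentOrders (OrderID, OrderDate)
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE OrderDate >= DATEADD(day, -7, GETDATE());

    SELECT OrderID, OrderDate
    FROM #RecentOrders;

    -- No DROP TABLE #RecentOrders here.  The table goes out of scope when the
    -- procedure completes and SQL Server cleans it up for us.
END;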

One More Reason 

Here's one more reason: many data access utilities and ORMs require that temp tables NOT be DROPped at the end of the procedure.  Most people don't know this.  A very popular ORM (which will remain nameless) cannot generate a "model" against a stored procedure that SELECTs data out of a #TempTable.  The ORM "model" is basically a way to map the underlying OO model to your stored procedure inputs and outputs.  The "model generator" reads metadata about the stored procedure to determine what its output result set looks like (column names, order, and data types).  Suffice it to say that SQL Server does not natively provide the metadata the ORM model generator needs, so it barfs.  The vendor even states that #TempTables are not supported and that you should use @TableVariables instead.  Never mind that this could have drastic performance ramifications.  Please don't think this is just bad ORM vendor code; SSRS has similar problems, and the advertised workaround is to SET FMTONLY OFF within the procedure.  This trick also works with the ORM model generator, but only if you do NOT explicitly DROP your #TempTables.  
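Here is roughly what that workaround looks like (a hedged sketch; the names are made up, and note that SET FMTONLY OFF causes the procedure body to actually execute during metadata discovery):

CREATE PROCEDURE dbo.GetRecentOrdersForOrm
AS
BEGIN
    -- When a client wraps the call in SET FMTONLY ON to sniff the metadata,
    -- this forces the body to really run so the #TempTable's shape is visible.
    SET FMTONLY OFF;

    CREATE TABLE #Results (OrderID int, OrderDate datetime);

    INSERT INTO #Results (OrderID, OrderDate)
    SELECT OrderID, OrderDate
    FROM dbo.Orders;

    SELECT OrderID, OrderDate
    FROM #Results;
    -- ...and again, no explicit DROP.
END;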

Again, granted, the performance improvement likely won't be much, but it is still a Best Practice to not explicitly DROP your #TempTables.  In my opinion anyway.  


Performance Implications of UDFs

I have a brief article I wrote on UDFs if you are unfamiliar with them in SQL Server.  I've never liked UDFs, especially the scalar-valued variety.   

I've always just blindly told people, "Don't encapsulate reusable business logic in a scalar UDF; at a minimum use an inline TVF."  I've never bothered to put together a blog post or repro script to prove this to people.  Usually I show performance metrics before and after converting a scalar UDF and they are satisfied.  Today I stumbled across an excellent blog post that illustrates the performance problem with scalar UDFs very succinctly.  

If you are not familiar with the performance problems with UDFs, then go read that article now, especially if you are one of those TSQL developers who loves to encapsulate code in UDFs.  

After reading that post you are probably thinking, "where is the performance problem?"  The post did not specifically get into that, but let's use some common sense.  It demonstrates that a new date value (or a decrypted key in the second example) was calculated for every row in dbo.TableWithManyRows.  That's a lot of work, and it's exactly what happens when you encapsulate logic in a scalar UDF: the function is invoked once per row, its cost is largely hidden from the optimizer, and it blocks a parallel plan.  It's not set-based; it's RBAR (Row By Agonizing Row).  
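To make the contrast concrete, here is a minimal sketch (table, column, and function names are made up).  The scalar UDF runs row by row; the inline TVF is expanded into the calling query's plan, so it stays set-based:

-- Scalar UDF: called once per row.
CREATE FUNCTION dbo.fn_LineTotal (@Qty int, @Price money)
RETURNS money
AS
BEGIN
    RETURN @Qty * @Price;
END;
GO

-- Inline TVF: the expression is folded into the outer query's plan.
CREATE FUNCTION dbo.itvf_LineTotal (@Qty int, @Price money)
RETURNS TABLE
AS
RETURN (SELECT @Qty * @Price AS LineTotal);
GO

-- RBAR:
SELECT dbo.fn_LineTotal(Qty, Price) AS LineTotal FROM dbo.OrderLines;

-- Set-based:
SELECT lt.LineTotal
FROM dbo.OrderLines
CROSS APPLY dbo.itvf_LineTotal(Qty, Price) AS lt;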

August 15...Happy 40th Anniversary Fiat Currency

This year has been marked by constant talk of the US "defaulting" on its debt obligations if the debt ceiling is not raised and the U.S. stops paying some of its bills, including interest payments on bonds.  It's interesting to note that something similar has happened before, albeit in a different form.  This is the 40th anniversary of the end of the Bretton Woods System, which totally severed the convertibility of the US Dollar into gold.  

Set up after WWII, the Bretton Woods System allowed other countries, but not individuals, to convert their dollars into gold on demand.  This is what made the dollar the "world reserve currency" we all hear about today.  In the early 1970s the US had a huge trade deficit and bills to pay (a huge war in Vietnam), so it printed more dollars than it had gold to back.  Other countries saw this and immediately tried to devalue their own currencies to maintain their pegs and protect their exports.  But inflation was getting out of hand; West Germany soon pulled out of Bretton Woods entirely, and other countries such as France demanded huge amounts of gold from the U.S.  On August 15, 1971 Nixon made the dollar a true, undebatable fiat currency: it had no backing by gold, not even for other countries.  Essentially this means the US defaulted on its debt obligations to other countries; one US dollar was worth, well, one US dollar.  Nixon's famous quote was, "we are all Keynesians now."  

What followed was stagflation (out-of-control inflation coupled with high unemployment...an impossibility according to Keynesians) and the slow march to worthlessness of the US Dollar.  If you want to see how easily and stealthily the government can make your savings worthless and ruin an economy, just watch this video from August 15, 1971.  

At the time gold cost $35/oz.  Many pundits predicted gold would fall to a few dollars per ounce since the dollar was no longer propping it up.  Just last week gold hit an intraday high of about $1800/oz.  Mathematically, that works out to roughly a 10% annualized return over the 40 years.  However, I don't look at it as the price of gold rising roughly 10% per year; I look at it as the value of the dollar falling by roughly 10% per year.  Essentially gold isn't really a good investment...but it is a good store of value, something the dollar is not.  

Happy Anniversary!

 

SAN and Storage AntiPatterns

I've been writing a lot of blog posts over the years about various AntiPatterns in data architecture.  A "pattern" is simply a reusable solution to a common problem; an antipattern is a commonly-used solution that creates more problems than it solves.  I see the same mistakes repeated with the same faulty logic, and then touted as "best practices".  I think the most egregious antipatterns in IT these days are in the storage realm.  These storage antipatterns apply mostly to systems where a performant IO subsystem is critical.

ETL Best Practices

This is the summary post of the ETL "Best Practices" I've learned over the years, gathered into one place.  They are geared primarily toward SQL Server and DTS/SSIS, but the principles are useful with any ETL tool or DBMS.  

Support/Testing

  • Consider using database snapshots for testing data loads.  When a load fails it's easy to roll back by reverting to the snapshot (see the sketch after this list).  Performance overhead will probably keep you from using this technique for production ETL loads.  
  • Handle as many data issues as possible programmatically.  It keeps support costs down. 
  • Error handling.  Lots of it.  Each error must be immediately identifiable and traceable.  I like using an error number system where blocks of numbers are assigned to each package/subpackage. 
  • Installability.  Don't hard-code paths, server names, or passwords; everything should be configurable via an XML file.  Consider building an MSI package to install it.  And have a clear backup plan for the ETL process, as well as a tested recovery plan. 
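Here is the database snapshot idea from the first bullet as a minimal sketch (the database name, logical file name, and path are made up; the logical name must match your source database's data file):

-- Take a snapshot before the test load.
CREATE DATABASE DW_Snapshot
ON (NAME = DW_Data, FILENAME = 'D:\Snapshots\DW_Data.ss')
AS SNAPSHOT OF DW;

-- If the load fails, revert the whole database to the snapshot.
RESTORE DATABASE DW FROM DATABASE_SNAPSHOT = 'DW_Snapshot';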

Development

 

  • Use stored procedures.  Yes, you can write nifty .NET code in an SSIS package, but what happens when a dataflow can't be reused because the columns aren't identical?  You don't really have that problem with a stored proc.  Further, stored procs aren't compiled into a deployed binary, so I can quickly make changes on the fly.  There are many other reasons why stored procedures really are the way to go in most cases.  
  • Workflows should be simple.  You don't get a raise for a nifty, convoluted graphical workflow.  A simple workflow is more tolerant of changes in the source data (and even in timing), and a failed step can be restarted much more easily.  Think incremental updates. 
  • Most people design a "full load" set of SSIS packages and then "incremental load" packages.  Avoid that work by assuming every load will be an incremental load.  The full load should merely "clear down" the destination tables first.  
  • Using TSQL's MERGE statement eliminates the need for the Lookup Transformation SSIS component (a sketch follows this list).  
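As a sketch of the MERGE idea above (the dimension, staging table, and column names are made up), a single statement handles both the insert and update paths that a Lookup Transformation would otherwise split:

MERGE dbo.DimCustomer AS tgt
USING staging.Customer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
    UPDATE SET tgt.Name = src.Name, tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, Name, City)
    VALUES (src.CustomerKey, src.Name, src.City);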

 

SSIS Package Structure

  • Have one master package that calls only subpackages, one for each dimension and fact table.  This follows good OO principles.  
  • Each subpackage should have a clear delineation between the E, T, and L portions of the "process".  This makes adding a new dimension/fact table a simple matter of using an existing subpackage as your template.  
  • Your subpackages are all "table-oriented", so what happens when you need to run just the "Extracts" or just the "Transforms"?  Do you need to visit each subpackage?  No.  Each subpackage also needs to be "process-oriented", i.e. it needs to accept a variable that tells it to execute just the E, just the T, just the L, or everything.  So when we want to test only the T phase, we set the E and L variables to "Skip".  
  • Back to the master package...we create 3 Sequence Containers, one each for the E, T, and L phases.  Then add Execute Package Tasks for all of the subpackages to each Sequence Container.  This means each subpackage is referenced (and partially executed) 3 times.  Each Sequence Container has an associated execution-phase variable that it passes to the subpackages.  

Performance

  • When using a Data Flow Task make sure the Source object starts returning data as quickly as possible.  This is different from making sure your query is tuned to run as fast as possible overall.  The sooner some data is passed from the source, the sooner the Data Flow Task can begin its transforms.  In most cases the data flow buffer is 10,000 rows, and work can begin as soon as a buffer fills.  So consider adding OPTION (FAST 10000) to your source query (see the sketch after this list).  Also avoid ORDER BY, GROUP BY, TOP, or anything else that forces the entire result to be produced before the first rows can be returned, which defeats the purpose of the hint.  The OPTION may make the query itself more expensive, but should lead to better overall ETL performance.  YMMV. 
  • Consider using RunInOptimizedMode.  
  • If you have a high volume of INSERTs relative to UPDATEs consider removing your indexes prior to the load and re-indexing afterwards.  
  • Use Hashes to Speed Up Incremental Data Loads
  • Know your Perfmon Counters, especially:
    • LogicalDisk
    • Processor
    • Bulk Copy Rows/sec (SQLServer:Databases)
    • Log Bytes Flushed/sec (SQLServer:Databases) to measure tran log activity
    • Bytes Total/sec (Network Interface) for bandwidth
  • Understand your physical environment, especially storage.  
  • Know your waits
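Here is the OPTION (FAST 10000) idea as a minimal sketch of a source query (the table and column names are made up).  The hint asks the optimizer for a plan that returns the first 10,000 rows quickly, which lets the data flow start filling buffers sooner:

SELECT s.OrderID, s.OrderDate, s.CustomerID, s.Amount
FROM dbo.StagingOrders AS s
OPTION (FAST 10000);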

Pre-Sorting Data, Indexing, and Partitioning

  • Consider pre-sorting data before doing any INSERTs.  The data should be sorted on the table's clustered index (or, if you removed the indexes prior to the load, the eventual clustered index).  Use the ORDER hint so the optimizer knows the data is already sorted (see the sketch after this list).  The insert is minimally logged into an empty table, or into any table if trace flag 610 is set.  Single streams will be faster with TABLOCK too.  All of this helps with fragmentation as well. 
  • Avoid fragmentation in your destination tables as much as possible.  You want your data pages as packed with data as possible.  Ultimately this means you will keep as much data in the buffer cache as possible.  You accomplish this by presorting the data prior to the INSERT. 
  • Use a different filegroup on a different set of spindles to reduce fragmentation even further during INSERTs. 
  • Insert your data without a clustered index...in most cases.  Then build the clustered index with MAXDOP=1 (we want the data processed sequentially), creating it on the destination filegroup and aligned with the target partition.  Then use SWITCH, which is a metadata-only operation. 
  • Use Partitioned Tables whenever possible.  
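Here is the pre-sort/ORDER hint idea as a hedged sketch (the table, file, and column names are made up).  The data file is already sorted on the destination's clustered index key, so the ORDER hint lets SQL Server skip its own sort; TABLOCK helps get minimal logging under the bulk-logged or simple recovery model:

BULK INSERT dbo.FactSales
FROM 'D:\loads\factsales.dat'
WITH (TABLOCK, ORDER (SaleDate ASC, SaleID ASC));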

Logging and Locking

  • Know how to bulk load using the various methods and their benefits/drawbacks.  There are many different tools, including SSIS, BCP, BULK INSERT, SELECT INTO, and SqlBulkCopy (ADO.NET). 
  • Know everything you can about minimally-logged operations.  Hopefully you can take advantage of them.  
  • Have a batching/"chunking" strategy.  BATCHSIZE vs ROWS_PER_BATCH.  
  • Understand Lock Escalation
  • You want to try to design your loads to be multi-threaded.  First, ensure you can load multiple tables concurrently.  For very large loads to a single table you want multiple streams.  Pre-sort the incoming data streams, ensuring no overlapping data across streams, which would otherwise cause streams to block each other.  


Lock Escalation

Understanding lock escalation can help you design more performant ETL data loads (or really, any SQL data access).  I mentioned some information on lock escalation briefly here.  Here I want to cover some additional details.  
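As one example of taking control during a load (a hedged sketch; the table name is made up): SQL Server 2008 added a per-table LOCK_ESCALATION setting, while on SQL 2005 the only global switches are trace flags 1211 and 1224.

-- AUTO   : allow escalation to the partition (HoBT) level on a partitioned table
-- TABLE  : the default; escalate straight to a full table lock
-- DISABLE: suppress escalation in most cases (watch your lock memory)
ALTER TABLE dbo.FactSales SET (LOCK_ESCALATION = AUTO);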

Partitioned Tables

When you are designing your ETL process, try to use partitioned tables whenever possible.  Here is a little primer I wrote on Partitioning a long time ago.  A fact table can be loaded offline and then added to the partition scheme in a matter of milliseconds, with no performance penalty for users executing queries.  This involves SWITCH (which requires one side of the swap to be empty), SPLIT, and MERGE.  

A switched-out partition is just a regular table.  Before you can switch it back in, make sure the table has the CHECK constraint needed to match the target partition's boundaries.  SWITCH is metadata-only, so the rest of the partitioned table can stay active while your ETL process runs without fear of locking issues.  This increases concurrency dramatically, which may someday be a requirement for you, even if it isn't today. 

You have to design everything correctly so that data does not move from one partition to another.  SWITCH won't cause data movement; you are really just updating metadata.  SPLIT can cause data movement if the partition being split is not empty, so only SPLIT empty partitions or ranges with no data.  Same with MERGE: use it with at least one empty partition.  Always use the same filegroup for the source and target of a SWITCH; again, we don't want data moving between files.  A SWITCH does require a SCH-M lock on both the source and the target, so you can have blocking issues if you do this on a busy transactional table.  But again, the operation itself completes quickly.  

After a SWITCH, run UPDATE STATISTICS, since statistics are maintained at the table level. 

REBUILD INDEX can be done while the partition is switched out.  If you need to load data into multiple partitions, switch out each partition to its own table; that makes concurrent load operations much easier. 
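Here is the switch-out/switch-in pattern as a minimal sketch (all object names and boundary values are made up; the staging table must live on the same filegroup and have the same structure and indexes as the partition):

-- 1. Switch the target partition out to an empty staging table.
ALTER TABLE dbo.FactSales SWITCH PARTITION 5 TO dbo.FactSales_Staging;

-- 2. Load and index dbo.FactSales_Staging offline.

-- 3. Add a CHECK constraint that matches partition 5's boundaries.
ALTER TABLE dbo.FactSales_Staging
    ADD CONSTRAINT CK_FactSales_Staging_Range
    CHECK (SaleDate IS NOT NULL AND SaleDate >= '20110801' AND SaleDate < '20110901');

-- 4. Switch it back in.  Both SWITCHes are metadata-only but take SCH-M locks.
ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 5;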

One drawback...it seems like a sort always happens regardless of whether you presort. This could be because I'm using SQL 2005, or could be because I'm doing something incorrectly.  I only determined this because I always check my load performance.  

BATCHSIZE vs ROWS_PER_BATCH

There are two interesting bulk-load settings that can really help with performance, BATCHSIZE and ROWS_PER_BATCH (the SSIS OLE DB Destination exposes roughly the same knobs as Maximum insert commit size and Rows per batch).  
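Roughly, the difference looks like this (a hedged sketch; the file path and table are made up):

-- BATCHSIZE commits every N rows as its own transaction: a failure only rolls
-- back the current batch, and the log has a chance to be truncated or backed
-- up between batches.
BULK INSERT dbo.FactSales
FROM 'D:\loads\factsales.dat'
WITH (TABLOCK, BATCHSIZE = 100000);

-- ROWS_PER_BATCH keeps the load as one transaction but tells the optimizer
-- approximately how many rows are coming so it can plan the bulk load better.
BULK INSERT dbo.FactSales
FROM 'D:\loads\factsales.dat'
WITH (TABLOCK, ROWS_PER_BATCH = 5000000);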
