DaveWentzel.com            All Things Data

Data Architecture

Performant ETL and SSIS Patterns

It's a helluva job market out there right now if you have ETL, SSIS, DataStage, or equivalent experience.  I guess you can make some generalized deductions about this:

  • more companies are trying to integrate their data stores.
  • more companies need to copy data from OLTP to OLAP systems.  
  • It's hard to find good ETL people.  

Unfortunately, too many job postings ask candidates to have specific ETL tooling experience such as SSIS or DataStage.  This is unfortunate.  Too many candidates have great tooling experience but have very little grounding in ETL best practices, regardless of chosen tool.  I've been called in a lot lately to help fix various ETL processes.  Each one is using a different ETL tool and each one is exhibiting the same terrible ETL anti-patterns.  When I fix those anti-patterns everything just magically runs better.  I have yet to touch actual ETL code.  

To quickly summarize the most egregious issue...developers are doing too much work in the ETL tool and not enough work in their RDBMS.  The RDBMS will almost always do things faster than the ETL tool can.  There are few exceptions to this rule (string manipulation and regexp is better in most ETL tools than in SQL for instance).   

I've written tons of blog posts (here's a link to an entire series of ETL Best Practices) about how to do performant ETL with good patterns.  However, I find myself constantly searching my blog to find a succinct list of things to check whenever I'm brought into another ETL engagement.  Here's the biggies:

  • ETL Best Practices.  I'm not going to list all of them.  That's a long blog post.  You should go reference that.  
  • Do more in SQL and less in the ETL tool.  Examples:
    • SSIS is not available in SQL Express 
    • Sorts are better handled in the RDBMS.  Use an ORDER BY clause on your SQL statement instead of relying on your ETL tool to sort.  If using SSIS, mark your OLEDB source metadata on the data source as sorted.  
    • Any set-based, relational operation will be faster in SQL than in your ETL tool.  Your RDBMS will likely automatically determine the best parallelism and memory management to use.  On the contrary, you'll never get this right in your ETL tool.  
    • Big packages are not good.  If your SSIS package is swapping to disk then you are not running efficiently.  There is a SSIS performance counter called "Buffers Spooled".  It should always stay at 0.  Always.  Otherwise you are using your swap file.  
    • If you run Integration Services on your SQL Server then use the "SQL Server Destination" vs the "OLEDB destination".  Performance is markedly better.  
  • Understand and optimize your data structures.  Examples:
    • Understand locking, Lock Escalation, and transaction isolation semantics for your given RDBMS.  If using SSIS then understand the difference between BATCHSIZE vs ROWS_PER_BATCH and ensure they are optimized for your destination RDBMS.  
    • Partitioned Tables.  Understand them and try to use them if your destination is SQL Server.  The SWITCH statement is your friend, maybe your best friend.  
    • Make sure you are always doing Minimally Logged Operations whenever possible.  And always verify with testing and performance monitoring.  Every RDBMS has different rules for what equates to a minimally logged operation.
    • Determine whether it is faster to disable/drop your indexes before your data loads.  There is lots of conflicting guidance and there is no substitute for testing.  I've found that every engagement a different setting is needed.   Here is some additional guidance you may not find elsewhere:
      • A commit size of 0 is fastest on heaps
      • if you can't use 0 because you have concurrency concerns then use the highest value you can to reduce the overhead of multiple batches and transaction control.  
      • A commit size of 0 on a clustered index is a bad idea because all incoming rows must be sorted.  This will likely cause spooling to tempdb which should be avoided.  
  • There are network-related settings that are invaluable for performance:
    • jumbo frames will increase your packet payload from 1500 bytes/frame to 9000 bytes/frame.  
    • Change sqlConnection.PacketSize.  The default is 4096 bytes.  That's small for moving around large amounts of data.  32767 would be better.  
    • use network affinity at the OS level.  
  • When using the OLEDB source, always set the Data Access Mode to "SQL Command".  Never "table or view".  Performance is almost always better.  The reason is that sp_prepare is called under the covers using the latter and therefore has a better shot at getting a good execution plan.  The former does the equivalent of a SET ROWCOUNT 1 to get its column metadata.  That can lead to bad execution plans.  
  • You should seriously consider restructuring your ETL processes so they work like a queue.  I'll cover this in the next post, Structuring Your ETL like a Queue.  

Are some new SQL Server features a response to the NoSQL movement?

Are some new SQL Server features a response to the NoSQL movement?


The last few releases of SQL Server, in my mind, were focused heavily on business intelligence.  And I believe the best SQL Server jobs in the last few years are focused on BI and data warehousing.  OLTP does not seem to be where the new features and money are these days.  Here is a quick list of relatively new SQL Server features targeted to "bigger data":

  • Partitioning.  Yes, you can use it for OLTP, but specific features like SWITCH are meant specifically for data warehouses.   
  • Sparse Columns.  Great for denormalized data structures.  
  • compression.  Again, great for OLTP, but seems to be more tailored for OLAP applications.  
  • Parallel Data Warehouse
  • And there are just way too many obvious BI and SSIS enhancements that I don't even want to list them.  PowerPivot for instance.  

But there is a shift.  I see many more new features centered around "BigData" and as a response to the NoSQL movement (if it is a movement, is it of type religious...or bowel?).  Here are some examples:

  • StreamInsight:  allows you to analyze streams of incoming data in real-time.  
  • HDInsight:  The HD is for Hadoop, and is available only on Azure.  
  • Columnstore indexes:  This is where you data tuples are organized by column vs by row.  This aids in aggregations of bigger data sets that tend to be read-mostly.  First MS gave us this feature, how long will it be befoe these are updated (it looks like this may be SQL Server 2014)?  Or like SAP HANA...will magically migrated from row to column oriented, and back, as the engine optimizes your data for you.  
  • Hekaton.  In-Memory tables.  Microsoft removed the DBCC PINTABLE feature a few releases ago, but it was a good idea at the time, and it still is.
  • QL Server 2014 will introduce "delayed transaction durability" to support "extreme write workloads."  http://msdn.microsoft.com/en-us/library/dn449490(v=sql.120).aspx

    The latest CTP for SQL Server 2014 has a "delayed transaction durability feature".  According to the article this is meant to assist with "extreme write workloads".  It looks like this is only for Hekaton, for now.  This feature is clearly meant to help SQL Server compete with NoSQL products with "eventual consistency" features.    

  • It looks like SQL Server 2014 will support the in-memory buffer pool being extended to SSDs.  This is a great alternative to a RAMdrive.  I hope this feature makes it to RTM.  
  • Hybrid cloud features.  One of the touted benefits of NoSQL is your ability to deploy it in-house and then migrate it into the cloud.  This is definitely something Microsoft has latched on to.  

Will MS Licensing Drive Customers to NoSQL?

Will MS Licensing Drive Customers to NoSQL?

Yep.  But that's just my opinion.  I have been asked to assist my client with evaluating various alternatives to SQL Server for an enterprise-class, multi-TB transactional database.  Both the client and me have very realistic expectations...we don't think we will find a good alternative, but we can always begin rolling out new features under a second DBMS.  Clearly if we stayed relational then something like PostgreSQL or MySQL would be great choices.  But perhaps now is the time to think about alternative data organization paradigms that are not relational.  Why not a network model or graph model?  The next bunch of blog posts I'm going to do are my experiences working hands-on with various NoSQL solutions.  

For this post I want to talk about Microsoft practices that I believe are pushing more customers towards SQL Server alternatives.  SQL Server 2012 now has an option for core licensing for Standard and Enterprise.  You must buy 4 cores at a minimum and additional packs are a minimum of 2 cores.  Core licensing gets a little hairy for VMs.   Each VM CPU is considered a core, and the 4 core minimum still applies.  However, if you have Ent with Software Assurance and license all cores on the host then you can run unlimited VMs with SQL Server configured however you like.  

Prior to SQL Server 2012, the processor license cost was kinda high, so the accepted strategy was to get as much performance and capacity as you could afford for each expensive processor socket license that you purchased. In other words, max out each socket.  But socket licensing is gone.  With Enterprise Edition, which is what I see in-use almost exclusively at my customers, you only have core-based licensing.  Standard and BI Edition have "Server + CAL"-based licensing as well.  But BI Edition does not have core-based licensing.  Got all that?  Yeah yeah, Microsoft eased some confusion by ditching Datacenter Edition, so I guess it's a wash.  And there's still Web Edition for hosting providers only.  Whew.  

But again, my guess is core-based licensing is what most enterprise customers are choosing.  

Frankly I don't see what the point of Standard Edition is.  In both 2008 R2 and 2012 the RAM limit is 64GB and 16 physical cores.  The RAM limit is the killer.  Right now my laptop has 32GB of RAM.  RAM is so dirt cheap that anyone running Standard with less than 64GB is, frankly, nuts.  The licensing costs for the software vs the price for the RAM is just that skewed.  Here I'm a fan of using Microsoft's rules to my advantage.  If you have, say, 128GB RAM in your server, why not run two instances of SQL Server with the same core-based licensing?  Granted, you'll need some kind of "sharding" strategy to do this, but isn't it time you had this anyway?  

So that means Enterprise Edition is going to be what, I believe, most enterprise customers will deploy.  And I gotta say, online index operations and data compression, by themselves, easily justifies the cost difference.  And then if you really need AlwaysOn availability groups, Enterprise is your only choice.  

Let's get back to the mathematics of core-based licensing.  Again, in the past the best practice was to pay the price premium for the absolute best processor available for each socket in your database server.  That may kill your budget with SQL 2012 EE.  In core-based licensing each physical socket in your server must use a minimum of four core licenses. Let's say you have some old dual-core processor hardware lying around...I really wouldn't run EE on that since you'll need to pay MS for four core licenses for each socket.  So if your intention is to recycle old hardware and maybe just upgrade it from SQL 2008 to 2012, you might want to rethink that.  Granted, this kind of hardware is probably 5 years old anyway and should be retired anyway.  

So, what are some LEGAL ways to cut down on licensing fees?

  • Use Standard Edition and forego the EE features.  
  • You can still use lots of RAM, but you'll need to run multiple SE instances 
  • Use SE and any RAM over 64GB consider using a RAMdrive.  For instance, put tempdb on your RAMdrive in any physical RAM you may have over and above 64GB.  Microsoft even graciously gave us a KB article about it.  
  • Pay nothing and use Express Edition.  Granted, you are limited to about 1.5GB of RAM and 10GB databases, but if you are CAREFUL and shard properly this can actually be viable.  I worked for a company that sold software run entirely on Express and we had "databases" that were well over 100GB because we used sharding and FILESTREAM effectively.  I wrote about this in my blog post Express vs Standard and How to save a *lot* of money with SQL Server Express.  

I have a client that is actively looking to replace SQL Server with less costly alternatives.  I doubt they'll find a replacement that has a better TCO after factoring in the massive SQL code rewriting that will need to happen. But my next few posts will be my experiences with various NoSQL solutions.  I've been working on this project for months and I've learned (and re-learned) a lot of great things that I'm excited to share.  

Some Life Lessons from VAX/VMS

HP announced back in June that end-of-support for OpenVMS will be 2020.  I've been compiling this blog post since then.  I have very fond memories of VAX/VMS... I used one to store my pillow for two years.  Funny story.  I learned a lot of life lessons from VMS.  Originally this was one big 'ol blog post but I decided to break it up so the post wasn't so long.  These posts are just a collection of funny VAX/VMS stories from my youth.    I really believe that I learned the most in my career from my few years working on VMS.  

Life Lesson Number 1: you need to have a sense of humor to work in this industry

Many people have speculated that HAL in 2001: A Space Odyssey was really supposed to be a play-on-words.  HAL comes one letter before IBM.  I guess that makes it "one better."  Mathematically, the equation is...IBM (IBM rot-1) = HAL.  The guy who wrote VMS was vehemently anti-IBM and when he ported VMS to WNT, which was the acronym for Windows NT before it was known as Windows NT, it is believed that WNT was "the next evolution" of VMS.  Mathematically...VMS(rot 1) = WNT.  That is a great sense of humor.
Life Lesson Number 2: Hard Work Pays Off
When I was in college I had a work/study job as a computer lab technician.  Basically a lab tech is supposed to sit at the front of the "lab" and help people print documents, log in to e-mail, etc.  This was a few years before computers were ubiquitous in every dorm room.  I was the only lab tech that actually walked around and helped kids and didn't just sit there playing games.  I'm serious...  It was so bad that I was promoted to head lab technician by Thanksgiving of my freshman year.  
I was promoted because I showed initiative and because I came up with snappy little "tricks" to make computers easier for the less

 savvy.  An example...our campus email was housed on a big VAX/VMS system.  We had to either login from a VT100 terminal or use a program called Kermit to access our email on old DOS 286's.  Kermit was actually a very popular communications protocol.  It is a backronym for "KL10 ErrorFree Reciprocal Microprocessor Interchange Over TTY lines."
I digress.  To say the least VT100's and Kermit were not user-friendly.  Students would send email and it would immediately bounce if the address was not formatted perfectly.  The syntax was "smtp%'email@address.com'".  And, yes, those are single quotes embedded in double-quotes.  In those days it was not well known that smtp meant "simple mail transfer protocol".  Lots of kiddies would type "stmp", "smpt", etc.  The error message was beyond cryptic when the format was not correct.  This led to lots of frustration.  I always explained that it was very simple, "send mail to postman."  People like when the difficult is distilled down to the simple.  I eventually wrote a VMS "macro" to make the syntax even easier for people.  
Just before Christmas break the VP of Information Services asked me if I would be interested in being the "computer operator" for the department, again, because I showed initiative.  The job was simple, take the backup tapes off-site every morning, check the status of scheduled jobs, answer emails for special requests, etc.  
I now had an easier job, could make my own hours, and was accountable to almost no one.  As long as the job got done, nobody bothered me. 
Hard work does pay off.  
Life Lesson Number 3:  When you fail, fail big.  (But not too big).  Or, your first intuition is probably your best intuition. 
You always want your failures to be big enough that they get you promoted because you solved a big problem, but not so big that you get fired for them.  

I had a part-time job while I was in college as the computer operator for a VMS system for a large, formerly monopolistic, telecom company.  This was way back before email was ubiquitous.  VMS has two e-mail programs…PINE and MAIL.  Neither has the concept of a “sent items” folder.  PINE is the “preferred” email program everyone used.  To set up a “sent items” folder you did this:

customized-hdrs=fcc: SENTITEMS

…where fcc stands for “folder carbon copy”.  It makes a copy instead of sending a copy.  Cool right?  In VMS, folders were not allowed to have spaces, much like the old DOS.  So if you typed this:

customized-hdrs=fcc: SENT ITEMS

…or this…

customized-hdrs=fcc: SENT,ITEMS

…you got the same thing, two copies of your email…one in a folder called SENT and another in a folder called ITEMS.  With me?   A comma and a space were the same thing. 

In Outlook you have the “Outbox” which is the temporary holding area where your mail sits until you can connect to Exchange.   VMS had the same thing but it was called OUTMAIL.  So, I really wanted to have my “Sent Items” called “OUTMAILCOPY”.  But I screwed up and typed this:

customized-hdrs=fcc: OUTMAIL COPY

Notice the problem?  Every email was copied to two folders…one of course being OUTMAIL, and that caused an infinite loop and caused email to go down for a few hours.  

So I crashed the email system by building my own little "mail bomb" macro.  I managed to fix it myself but by then people were paged and wondering what was going on.  I tried to hide what I did, "Don't worry it's fixed now."  

And surprisingly, that worked.  "Really?  You fixed mail all by yourself?  That's great Dave, we should promote you."  

This is where hubris of the young takes over, "Oh, it was no big deal really.  I know what I'm doing."   But of course I didn't know what I was doing.  At this point my failure was big but I had the opportunity to be promoted.  The next morning management began investigating the logs and saw that clearly it was my mail account that caused all of the problems.  Management's tone went from "let's promote Dave" to "let's fire Dave".  

I was spared.  But I was told I could no longer use PINE for email.  

Sidenote:  In VMS MAIL you use this command for “Sent Items”:


…which cc’s you on every mail you send, right to NEWMAIL (Inbox).  

More funny stories to come...


Metrics are a funny thing.  You hear people spout out metrics all the time.  "My Page Life Expectencies are 200."  "My disk latencies are 20 ms."  Just regurgitating the given metric says nothing about context.  There is no good guidance for PLEs or disk latencies and if you ask ten knowledgeable data professionals you'll likely get ten different answers.  But put all of the opinions together and you'll get a good "feel" for what might be good and what might be bad.  My general rule of thumb, I call it the "smell test", is if I don't have a benchmarked baseline for a metric then I look at "reasonable guidance".  If my metric is close then I look at some other aspect of the system to tune.  If my metric is wildly out of the range of reasonable guidance, then that is worth investigating further.  

Disk metrics are really tough.  There is so much abstraction going on with a system hooked up to a SAN that any given metric could be normal even if it falls wildly outside of the norm.  Latencies are one such metric.  About a year ago I pinned down an MS SQL Server field engineer who stated what the then current latencies guidance was (10 ms).  I blogged about that in my post Latency and IOPs.  I had a harder time extracting that guidance than that Microsoft employee's dentist had extracting his abcessed tooth.  

At the time I was told that IOPS of 100 Page Reads/sec was the guidance for "high".  We were at the time averaging 400 with sustained peaks over 4000.  This guidance made sense to me at the time because I knew our database code was horrendous (too much tempdb bucketing in procedures, lousy index decisions, procedures that returned 64MB result sets, lots of PAGEIOLATCH_SH waits, etc) and I was working on fixing the most egregious things.  So, using "Dave's Smell Test" I knew that our IOPS were probably outside of the norm and warranted further investigation.  

Now, I've always thought IOPS was a kinda nebulous term.  What is the definition of an IO?  Is it a 4KB block, or a 64KB block, which is of course what SQL Server cares about.  Depending on what the definition is your IOPS number could be off by a factor of 16.  Clearly it is in the interests of the vendor to spout off the highest IOPS number possible in their marketing material...so do you think the published number is for 4KB or 64KB blocks?  Read the fine print.  

It just seems a bit too confusing, which is why I prefer my disk metrics to always be in terms of "latency", not IOPS.  Your IO subsystem's goal should be, IMHO, the least amount of latency as possible, with the highest IOPS.  This is why you'll often see DBAs ask for their HBAs to have higher Q depth settings than the factory defaults.  Lower (default) Q depths will sacrifice some latency for better IOPS.  That seems backwards to me on an OLTP system.  

And then of course you have to take into consideration your "effective IOPS".  This is a term I made up (I think anyway) and it takes into consideration whether you are using RAID and what RAID level.  Single disk drives have a raw IOPS number.  Using RAID 10 effectively halves your raw IOPS number.  Why?  An extra write is required for the mirror.  It's twice as bad with RAID 5 (RAID 5 Write Penalty).  For instance, SATA drives average about 150 IOPS.  If I had 6 of those I would have about 900 IOPS, but I need to halve that number to get the true 450 IOPS at RAID 10.  This is all theoretical and confusing which is why I tend to ignore IOPS.  

IOPS numbers tend to NEVER hit the published numbers, especially on rotational media.  This is because database access tends to be somewhat random in nature.  Sequential transaction log access is the exception.  The more that disk head has to move, the worse the effective IOPS.  That's another reason why I firmly believe that ALTER INDEX...REORGANIZE is one of the best ways to speed up your effective IOPS.  

But even the latest SSDs will rarely give you IOPS numbers anywhere close to what they publish.  You'll likely saturate the PCI-E bus first.  

So I find myself not really relying on published IOPS or even "reasonable guidance".  There is no substitute for measuring your disk subsystem yourself.  Even so, once you have a good measurement, is there anything you can really do about it if it isn't up to the smell test?  Other than changing HBA QDs or using local SSDs for tempdb, I haven't had much luck in my career getting IO subsystems upgraded, even when they were woefully under-spec'd/misconfigured given my measured latency and IOPS.  The only thing I can ever really do is to lessen my reliance on IO.  And that of course means good indexing, good data hygiene, good caching, adding RAM, and tweaking IO intensive queries.  

Is a billionaire richer in Bucharest or Philly?

I was just code-reviewing a Romanian's code.   We got into a bit of an argument when we were looking at the query plan and saw that some operation was running this many times (sorry, no screenshot):


In the US (I think in most regions of India too), to make that easier to read we would write it as


…and most of us probably know that in most countries the decimal separator is actually the comma.  So in Romania the above number would be written as


...I think we all knew that.  But the interesting thing in the argument is how we would all “say” the number:

Me:  “about 2 trillion”

Alin (when speaking English):  “about 2 billion”

UK (in 1996):  “about 2 billion”

UK (present):  “about 2 trillion”

India:  (not sure)

This always confuses me when speaking to someone from Europe because they (mostly) use Long Scale notation and US (UK too for like 15 years now) uses Short Scale notation. 

Long Scale:  every term over a million = a million times the previous term (billion means a million million)

Short Scale:  every term over a million = a thousand times the previous term (billion means a thousand million). 

So for us short scalers…how would a Romanian say this number (in English of course):


Short scalers say “one billion”  and I *think* most long scalers would say “one thousand millions” ( or maybe “milliard” like in France).

The numbers are still the same it’s just how we say them. 

Side note:  I can guarantee you that Zimbabwe uses the Short Scale system.  The largest denomination currency bill ever produced was the Zimbabwe One Hundred Trillion Dollar note a few years back. They had a bit of a problem with hyperinflation...that note is now worth about 5 cents (actually about five bucks on eBay due to its historical and novelty value...I own a few).  Note the number of ZEROS.  Definitely short scale.  


So how do you make this less ambiguous when we are talking?  No clue. Use power notation?  Stop talking to each other and always work from home?

I find this fascinating.  It's a shame that something that is so scientific and should be a priori knowledge is in fact confusing and cumbersome when we try to communicate it verbally.  So, a billionaire is richer in Bucharest.  I guess.  

Useless Unit Tests

In my last post (Paradox of Unit Testing?) I mentioned that unit tests can give you a false sense of security.  This is especially true if you don't know how and what to unit test.  My examples here will be in pseudo-SQL, not tsqlt.  The concepts are equally applicable to junit or nunit or whatever.  The following is a list of useless unit tests that I see every day.  This list is not exhaustive.  

Having Too Many Tests

How many tests are too many and how many tests are too few?  This is totally subjective.  Assume the following code:

It's very simple, it takes a bit parameter and PRINTs it to the "standard output."  Note that I included 6 unit tests where I am attempting to exercise what exactly happens when I pass in different things for @Value.  But is any of this valuable?  No.  Any person with some experience will SQL Server will know how the rules of parameters for stored procedures work.  These six tests add no value.  They did not exercise the code and add no documentation value.  But you will often see procedures with hundreds of tests...and many/most are tests like this.  I think developers feel as though the more unit tests they have, the better their code will be received by their peers.  Reading hundreds of tests is not useful.  When I write unit tests, if I find I have too many (again "many" being subjective) I then consider breaking the logic up into multiple routines with a smaller concern.  

Commutativity and Associativity Tests

This is a contrived example but I see unit tests where the net effect of the test, usually unintentional, is testing commutativity.  Given the following screenshot:

You may see tests like "Assert that passing in arguments in a different ordinal order generates the same results."  In the case above, that kind of test is pointless.  Commutativity and associativity must be assumed by the developer of a stored procedure or java class.  You shouldn't need to assert that the language holds rules of arithmetic true.  

Attempting to Assert an Invariant and the Null Hypothesis

There are some big words there.  An "invariant" is some element in your code that cannot or should not change.  It is impossible to test an invariant that cannot or should not change using a unit test.  The correct place is to assert your invariant directly in your code.  

Why is it impossible to test that an invariant is invariant in a unit test?  Simple logic dictates that if you could prove an invariant can vary, then it is NOT truly an invariant, and therefore the entire premise for what you are testing is flawed...your code is flawed, your tests are flawed, possibly even your architecture, framework, and design are flawed.  A "varying invariant" can only be discovered "in the wild" and when that happens you realize your assumptions and code are flawed.  

This is known as "null hypothesis testing."  The null hypothesis states that I may not be able to prove that in every (possibly infinite) scenario my code does not break, but that does not mean it may be flawed.  Instead you must prove only ONE case where my code does fail to prove it is flawed.  In other words, I may not be able to always prove "correctness", but you should be capable of proving falseness once.  

OK.  Enough theory.  Assume this piece of contrived code:  

Clearly, following the principles of mathematics, the result must always be >= 1.  It can be nothing else.  We know this because, a priori, an absolute value function will always return a non-negative value and adding 1 to that result must therefore give me at least 1.  Knowledge that is a priori is simply knowledge that does not need experiential justification.  It just "is".  

So, given the laws of additon and absolute value we should know that the above procedure will always return a value >= 1.  Yet you may encounter developers who insist on writing unit tests to prove that.  It can't be done.  The result is always an invariant.  It will be >= 1.  Adding a unit test called something like "assert that the result set is always >= 1", first, is not testable (you would need the null hypothesis, otherwise the assumption must be true), and second, just clutters up any valuable unit tests that may exist.  

Testing Language Elements (a posteriori tests)

First a definition for a posteriori...a posteriori knowledge is experiential knowledge...you need to experience the knowledge, or empirical test it, to understand it.  This is the category of useless unit tests that I see the most.   On the left is our AbsPlusOne procedure from earlier.  I ALWAYS see unit tests that exercise what happens when the procedure is passed the wrong data type.  

You hopefully already knew that if you passed a string to an int parameter that you would generate an error, such as we see above on the right.  If you really knew your TSQL esoterica then you knew this would generate Msg 8114 with the EXACT text displayed in the screenshot.  Most of us don't know our esoterica that well, we simply know that we should never ever pass a string to an int.  How do we know that?  Through experience, basically a posteriori knowledge.  

So the question becomes, is this a valid unit test?  Should a unit test assume the reader has no knowledge of how the language works?  I say no.  If you don't know TSQL, then the unit tests are way over your head.  These unit tests are just noise to the serious developer who is trying to figure out what your procedure does by looking at your comments and unit tests.  

Here is a more blatant example:

Here we have 5 tests for a ridiculously simple procedure.  But simplicity of the procedure does not mean that unit tests are not necessary.  I've written many simple procs that have a handful of tests to help document assumptions for the next developer.  But in this case we have 5 tests that simply assert that "IF" switch branching works as advertised by Microsoft and Sybase in TSQL.  Those tests are not helpful and just clutter up and add noise to things.  

Paradox of Unit Testing?

A quick google search shows no hits for "paradox of unit testing", but there are equivalents, so I don't claim to invent the term.  What I'm about to tell you is so patently obvious that it is galling to me that no one else doesn't see this and attempt to change it.  My "paradox of unit testing" is quite simple...the more we unit test, the worse our code quality is and the less software we sell.  Please don't construe that I am against testing or unit tests.  I'm not.  I'm against excessive testing that shows no value.  And automated testing may actually be making your product worse!  
Over-Reliance on Automated (Unit) Testing Tools
For the last half decade every one of my employers has tried to automate more and more of the QA work.  True, for the small, less-than-5% of the automated tests, the code is entirely bug-free.  However, a machine cannot automate and test bugs it doesn't know how to test for.  Here's a well known story that you can google if you don't believe me.  When M$ was developing the Vista release of Windows there was an internal management push to use automated testing tools.  The reports indicated all of the tests passed.  Yet, the public reception of Vista was less-than-stellar.  Users felt the interface was inconsistent and unpolished and full of bugs.  An interface becomes aesthetically pleasing when the interface has consistent patterns of behavior and look-and-feel.  How do you automate testing of that?  You don't.  QA testers will only file bugs for those issues after they use features repeatedly and note the inconsistencies.  These obvious problems (to a human) do not fit the definition of a bug by an automated testing tool.  Taken in toto, these "bugs" led users to generally feel that Vista was inferior to XP.  
I am not at all claiming that automated testing tools should be tossed.  I'm merely saying that an over-reliance on them can be detrimental to the general perception of your software product.  
Goddamn Sonar and Code Coverage Metrics!
This is a family blog but occassional blasphemy is necessary when something is too egregious.  Sonar is code quality management software.  Sonar can tell you, for instance, what Java classes have unit tests and how many code branches have ZERO tests.  It can then drill-down into your code and determine where the bugs are given the technical debt.  This is less-than-perfect, but it gives you a good feel for where your bugs may be and I'm all for that.  It's another tool in the toolbelt.  
The problem is the tool gets a bit cute with its management reporting capabilities.  For instance, let's say your "Code Coverage" is a measly 7% (ie, 7% of your code has identifiable unit tests).  Is that bad?  If I was management, I'd be pissed.  The fact is, you don't need to unit test EVERY line of code.  Do you need to ASSERT that an "if" statement can evaluate a binary proposition and correctly switch code paths?  I think not.  If we needed a formal junit test for every line of code our projects would be even further behind.  
There is another report for "Code Duplication".  It's not uncommon to see this at 25% which again freaks out management.  Management says, "I thought you said you weren't supposed to copy/paste code.  Shouldn't that be a class or method? I don't understand you developers."  Well, the fact is that sometimes duplicated code is a good thing.  It's well known that tsql scalar functions (a code reusability feature) performance horrendously.  So in that case, a little duplication is good.  
But my favorite report is "Not enough or too many comments."  Whoa.  Is that ever subjective?  Coders:  "Uh oh.  We're only at 26.8% API documentation.  Let's spend the next sprint getting to 75%."  Does that sound like a good use of time?  
The bottom line is management is always looking for metrics to improve productivity.  Tools like Sonar cause developers agita as they need to refocus on meaningful work to adding possibly unnecessary tests and code comments simply to get higher sonar scores.  Will quality improve?  Doubtful.  The only metric that management or coders should ever worry about is "software sold."  Everything else is poppycock.  
Wasteful unit tests that assert language constructs and invariants
I'm going to cover this in the next blog post with examples.  What I see is that many unit tests people write are merely asserting that an IF statement works the way the language vendor has documented it.  That adds no value.  Another example is asserting an invariant has not changed in your code.  A unit test is not the place for this.  More to come in Useless Unit Tests.  
Management vs Developers
I hear management saying we have poor quality software.  Sometimes they'll dictate to the coder how to solve the problem..."we are a Kanban shop now.  We will follow good kanban methods and quality shall improve."  Or, usually, management will ask the coders how to improve quality.  Unfortunately, the coders usually say, "more and better testing." The answer should really be more time and resources, but we know that ain't happening.  
But the he solution is rarely more unit tests.  In fact, I don't even agree with the claim that software quality is poor.  The computer software you use daily is of really good quality compared to everything else in your life.  It only *seems* like software quality sucks because you are biased towards only remembering what you don't like.  Sorry if you don't agree with me.  

The solution to software quality problems is simple.  There is management and there are coders.  Coders want to solve software quality by adding more software...automation tools, TDD, more unit tests.  These tools are meant to prove to management that software is bug-free.  But how do you prove definitively that software is totally bug-free?  You can't.  The logic we coders use is flawed.  

Management doesn't care about any of this.  They want:

  1. more feature functionality
  2. less bugs
  3. faster to market
  4. cheaper

Management couldn't care less (or shouldn't) about TDD or dynamic logic code or more formal testing.  (But they do like the pretty Sonar graphs that they can use to show that their developers are lunk heads).  But management does understand good economics and the Law of Diminishing Returns.  If we focus on tracking down and fixing every last bug we may get better quality, but we will have lost in the marketplace.  

Good management knows that the goal is not bug-free software, the goal is software that has just few enough bugs that a sucker, er customer, will spend money on it.  Many coders don't realize this or have forgotten it.  Stop worrying about endless testing.  

What is the best way for a coder to test her code?

Given that you believe me and my contention that we test too much, what is the ONE thing we absolutely should do as developers to affect quality?  

Code reviews

There is no substitute.  I have spent 10 minutes looking at another person's code and have pulled out handfuls of bugs.  This is code that had full unit tests.  Likewise, I've had other, "junior" level people code review my work and within 10 minutes they've spotted bugs that I thought I had test coverage for.  

Always code review.  

You're Only As Good As Your Last Success

"Could you take a look at ticket 56789?"  I nodded my head in the affirmative.  I was already behind in my existing tasks so I was little annoyed.
I opened up the ticket in the system and read the following:  "Customer says that procedure FooBar consistently executes in 5 seconds.  Customer wants to know if there is any way we could make that a little bit faster.  Customer feels this is not a priority but since they use the function multiple times a day they would save some time."
Our system has almost 4000 stored procedures and I didn't immediately recognize the FooBar procedure as something I've worked on.  I went into our source control system and found FooBar.  The last check-in was seven years ago and, surprisingly, it was by me.  Even more surprisingly, the last check-in note stated:"Per Schlemiel, if you ever have to improve performance remove line 262."
What?  I was baffled.  Why would I write a check-in comment like that?  After pondering the issue for a few seconds, I remembered!  I shook my head, chuckled, remove line 162, checked-in my code, and went to find Schlemiel.  
Flashback seven years ago...
I was still a new employee and was trying to make a name for myself.  Our benevolent dictator at the time was Schlemiel.  He has since been promoted to a bigger team with more responsibility.  He was a well respected, godlike architect and I admired him very much.  Schlemiel always saw something in me and I was his to-go guy when he had data issues.
Schlemiel asked me if I could look at the FooBar procedure because it was running a little slow.  I asked him for the ticket number.
"12345.  Just do this as a favor for me.  A couple years back Schlomo wrote this FooBar procedure as a skunkworks project.  We never thought customers would really use it so we never bothered to ensure it performed over larger data sets.  Now it takes about 14 minutes to execute for most customers.  Customers are starting to complain and Schlomo can't figure out what to fix.  Take a look, make some recommendations, and we'll figure out how to prioritize."
I pulled up the ticketing system and opened up 12345.  Sure enough:  "Customer states the FooBar procedure is totally unusable.  It takes about 14 minutes to execute.  User state she clicks submit and then goes for coffee and bagel.  Sometimes it's done by the time she gets back.  Please fix it."
I pulled up the code for the FooBar procedure and began looking at it.  Within 20 minutes I figured it out.  One of the queries joined 12 different tables and one of the JOIN/ON conditions was missing.  Effectively there was a cartesian product causing billions and billions of rows to qualify for the final result set.  Thankfully the WHERE clause was filtering out those rows before they were sent to the client.  Through dumb luck to query was working, but it was not efficient.
I added the JOIN/ON condition and began writing some quick unit tests.  My new version proved to be functionally equivalent but took orders of magnitude fewer reads and executed in 500 milliseconds vs about 9 minutes on my test server.  I immediately scurried off to tell Schlemiel.
Schlemiel was less-than-enthusiastic.  "Now you've gone and gotten yourself a bit of a problem."
Puzzled, I replied, "Why?  The tests are perfect, clearly this is a slam dunk. Simple oversight by Schlomo, if you will."  
"I've no doubt you solved the problem, but the fact is, you made the performance TOO good.  If you release this version of your procedure then management and the customer will never believe that such a dramatic increase in performance doesn't introduce too much risk.  The change will never be approved.  No one will ever believe that simply adding an ON clause could fix an issue like this.  Instead, management and the customer will demand to see endless proofs and reams of unit tests before they have a comfort level.  Meanwhile, you are creating agita for yourself and I frankly need you to work on other, more important issues."  
As every word passed his lips I felt more and more deflated.  "What do you suggest?"
"I would suggest adding a WAITFOR DELAY '00:00:05' somewhere in your code.  Let's see that code."  He peered over my shoulder, "Right here on Line 262 looks good to me.  Once you've done that go ahead and check it in and note that I code reviewed and approved your work.  And next time, try not to be so efficient."  
The look of dejection on my face must have been overwhelming.  Perhaps he noticed a stray tear running down my face.  He tried to console me, "Dave, don't worry so much.  You did a great job.  You took a 14 minute procedure and go it down to 5 seconds.  You're a hero.  But just remember, today's hero can be tomorrow's goat.  Someday you'll do something stupid and you won't be hero.  At that moment just remember that you can always go back and remove that artificial WAITFOR DELAY and you'll be the hero again.  Remember, you're only as good as your last success."  
I pulled up the ticketing system and opened up 12345:  " customer states the FooBar procedure is totally unusable.  It takes about 14 minutes to execute.  User state she clicks submit and then goes for coffee and bagel."
I pulled up the code for the FooBar procedure and began looking at it.  Within an hour I figured it out.  One of the queries joined 12 different tables and one of the join conditions was missing.  Effectively there was a cartesian product causing billions and billions of rows to qualify for the final result set.  Thankfully the where clause was filtering out those rows before they were sent a client.  Through dumb luck to query was working, but it was not efficient.
I added the join condition and began writing some quick unit tests.  My new version proved to be functionally equivalent but took orders of magnitude fewer reads and executed in 500 milliseconds vs about 9 minutes on my test server.  I immediately scurried off to tell Schlemiel.
Schlemiel was less-than-enthusiastic.  " you've got a bit of a problem.
Flashforward to the present...
"Hey Schlemiel," I found him in the hallway in another building, "remember seven years ago when I tweaked FooBar and it performed so well that you wanted me to put a WAITFOR DELAY in the code?  Well, we just got another ticket by a customer asking us to tweak FooBar again.  I took out the WAITFOR DELAY and checked the code back in."  
I could feel my chest was puffed out to the fullest, I was proud of my accomplishment.  Schlemiel paused a second and then broke out in a terrific belly laugh.  "You idiot!  I can't believe you actually checked in that WAITFOR DELAY.  I was kidding of course.  No sane person would do something like that.  Didn't you realize I was kidding?"  
I walked away...again...dejected.  I had two major successes, both times the customer was thrilled, and both times I was the goat.  

The Wooden Badger

"The Wooden Badger" ... this is the story of how I got that nickname.  But first, let me say that I really don't "get" Monty Python.  I rarely laugh at their sketches.  But other people do.  Especially programmers for some reason.  The same thing happens with This is Spinal Tap.  Programmers love it.  

Here's an experiment:

  • Ask a random person if he enjoyed This is Spinal Tap
  • If the response is any of the following:
    • "But these go to Eleven" followed by excessive belly laughter
    • "It's like, how much more black could this be? And the answer is none. None ... more black."
    • "Remember when Harry Shearer got stuck in his pod". 
  • Then you've got yourself a programmer.  

I don't frankly find Spinal Tap all that funny, but programmers do, so best to speak on their terms. Basically, when in Rome speak Roman.

But I digress.

There is one scene from Monty Python and the Holy Grail that is hilarious. King Arthur and his Kinghts of the Roundtable lay siege on a French castle. They build a Trojan Horse, except their version is a Trojan rabbit.  At the end of the scene we see the English watching at a distance while the French wheel the wooden rabbit into the castle. King Arthur asks how this plan is supposed to work.  "Now we wait until nightfall and we all jump out of the rabbit, taking the French by surprise."  They all simultaneously realize the flaw in their plan and slap their foreheads.  Gallahad responds, "Um, well look, if we built this large, wooden badger."  Simultaneous groans could be heard.  

So, why do programmers in particular find this so funny?  My theory is that we feel we can do the same thing over and over again expecting different results. In my experience, that doesn't happen much.  For instance, we all know that not having proper declarative referential integrity is asking for trouble.  Yet every "new development" project I work on I invariably hear someone say, "we don't need foreign keys, we'll actually be able to ensure integrity at the Java tier this time."  And somehow the data folks always tend to lose the argument and the initial version of the system is delivered without keys.  

There is something even worse than "repeated insanity" and that is the Second System Syndrome.  In this case, we take an elegant, simple system, and we refactor it and add needless complexity and call it Version 2.  This is much worse because you HAD a working system, now you don't.  Imagine what King Arthur's wooden badger would've looked like.  More comfortable seating, gilded ornamentation...basically a bunch of useless stuff.  

So, how did I get the "Wooden Badger" nickname?  

It all started with 2 sprints left in the development cycle.  I was asked to do a "whiteboard" design review for an ordered queueing system.  Generally a whiteboard review session occurs in the first few sprints because if the design is faulty it's best to know that before too much code is written.  There is only one reason to hold a design review this late in the cycle...someone is married to their design approach and knows it is not right and that I will refute it which will cause lots of rework.  My arguments, architecturally sound as they always are, can be rebutted with a simple sentence, "Dave, you know, you may be right, but we have to get this release out the door, so we can't go back to the drawing board now."  And this argument only works if it is used at the last minute.  

So I went to whiteboard session and learned about the requirements first:

  • There are multiple queues, some ordered, some not, some ordered by different, varying keys.  
  • We must support tens of thousands of messages per minute.  

This doesn't sound like anything I haven't seen before.  Let's look at the design:

  • The queues will be stored in the database as a single master queue table (not a Service Broker queue).  
  • There will be 3 outrigger tables to handle supporting details.  No foreign keys needed.  And since we support multiple, varying keys, let's store them as XML and key-value pairs.  
  • Ordering logic, when needed, will be performed during enqueueing.  And that will require shredding the XML for existing data to determine the keys.  
  • Since ordering logic is so complex we need to single-thread ordered queue processing, so they used SQL Server's applock feature...which is a big mutex.  
  • The queues will not "drain", they will keep 14 days of successful messages in the queue tables, and 30 days for any failure messages.
  • Support people will be able to query the queue tables real-time to look at throughput metrics or to look at individual message routing status.  

If you didn't understand the queueing mumbo jumbo above, never fear.  Rest assured that EVERY bullet point is a well-known, established anti-pattern.  So I began my retort:

  • Why are we not using JMS?  It is an anti-pattern to build queues in a database.
    • "It's too hard to support ordering across multiple JMSs without going single-threaded."  
    • But an applock is a way to make sure a process in SQL Server is single-threaded.  
    • "Well, we like this pattern better."
  • Why not use Service Broker?
    • "What is System Broker?"  
    • Oh boy.  
  • Why not have separate queues for ordered and unordered processing?
    • "Well, it's too late to change the design."  
  • Why not drain the queues and on dequeueing simply write the dequeued data to a History table?
    • "We can do that in Version 2."

Clearly the designers were happy and nothing was going to change.  There was consensus that some people would look at my concerns and consider tweaking the design for Version 2.  

Six months later our first customer experienced queueing problems.  I again expressed my design concerns and the consensus was that we would begin working on them.  

But of course break/fix work is not sexy like feature functionality, so the work was de-prioritized.  Another 6 months elapsed until the next customer experienced problems.  This time we hit the "knee of the curve" and the design simply could not scale any further.  I again suggested we prioritize my simple design changes.  I made the case that my few changes above yield radical improvements without a full-scale redesign.  We could tweak the implementation and design to be more of a standard queueing system.  

(Many developers relate to Star Trek too)

"It sounds like you want to build another one of your wooden badgers Dave."

Me:  "I'm sorry, but I'm just trying to adapt a bad design a bit and make it follow best practices a little more.  I think the final design won't be perfect, but we'll get the scalability without the cost of a full redesign."  

"No, you signed off on the last design, now you just want to keep tweaking it hoping for different results every time.  When will you learn that we can't keep designing Wooden Badgers, we have to do things using industry best practices.  So, you're new nickname is Badger and we are going to redesign this the right way this time."  

"Sounds good to me."  

UPDATE:  We've had a few additional large customer outages due to this design.  Each time I've been asked to participate in giving band-aid fixes.  And after every incident I always ask my collegues, "So, Badgers, when are you going to have that redesign done?"  


Subscribe to RSS - Data Architecture