DaveWentzel.com            All Things Data

April 2014

Find and fix untrusted foreign keys

Seems like most of the databases I work on have untrusted foreign keys.  This can happen for a few different reasons that I covered in my post Untrusted Constraints.  I think the biggest reason is that people temporarily disable the FK to do a data load/purge and then forget that the CORRECT syntax for re-enabling the constraint and making it trusted again is a bit goofy.  

WITH CHECK CHECK CONSTRAINT

Yeah, notice CHECK is in there twice.  The first CHECK (part of WITH CHECK) tells SQL Server to validate the existing rows; the second (part of CHECK CONSTRAINT) simply re-enables the constraint.  One declares an action, the other names the object the action applies to.  Leave out the first CHECK and the constraint is enabled but still untrusted.  
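Here is a minimal before-and-after sketch; the table and constraint names are made up for illustration:

    -- Re-enables the constraint but does NOT validate existing rows,
    -- so the FK remains untrusted:
    ALTER TABLE dbo.OrderDetail CHECK CONSTRAINT FK_OrderDetail_Order;

    -- Re-enables the constraint AND validates existing rows,
    -- so the optimizer can trust it again:
    ALTER TABLE dbo.OrderDetail WITH CHECK CHECK CONSTRAINT FK_OrderDetail_Order;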

Here is a script that will help you determine if you have untrusted FKs.  It is dynamic SQL and it does NOT actually do any work: it is "SQL that generates SQL" that you can run at your leisure.  On larger tables you may want to run both the validation script and the WITH CHECK CHECK statements off-hours to avoid concurrency issues, lock escalation, etc.  It might also be worthwhile to index the underlying FK columns (the referenced PK will already be indexed).  
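The gist of the generator (a rough sketch, not the exact script from the repo) is just a query over sys.foreign_keys:

    -- Generate the fix statements for every untrusted or disabled FK.
    -- Review the output and run it at your leisure (off-hours on big tables).
    SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(fk.parent_object_id))
         + '.' + QUOTENAME(OBJECT_NAME(fk.parent_object_id))
         + ' WITH CHECK CHECK CONSTRAINT ' + QUOTENAME(fk.name) + ';'
    FROM sys.foreign_keys AS fk
    WHERE fk.is_not_trusted = 1
       OR fk.is_disabled = 1;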

The script can be found in my git repo.  Here's a screenshot of the header.  

You have just read Find and fix untrusted foreign keys on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


PerformanceCollector and the Heisenberg Uncertainty Group

This is the last post in my series on my PerformanceCollector tool.  This post is a warning and an anecdote that you can over-monitor your server.  By default PerformanceCollector (PC) monitors quite a bit of information, but I've had the utility running on high-throughput production OLTP workloads and could not discern any negative performance impact using the default configuration.  It has all of the necessary fail-safes in place (I hope) to ensure it doesn't go off the rails.  For instance, there is a configuration setting that will cap the size of the database and will auto-purge data if needed.  

The Heisenberg Uncertainty Principle may eventually come into play however and you should be aware of that.  If you are not familiar, the idea is this: the more you monitor something, the more the act of monitoring itself will adversely impact the system.  Strictly speaking that is the "observer effect" rather than Heisenberg's principle, but the two get conflated all the time.  

If you find that PC is a bit too heavyweight for your environment then you can disable some of the AddIns or extend the interval between runs.  Or you can tweak the code to fit your needs.  The possibilities are endless.  In older versions of the tool I ran into some cases where the system was under such stress that PC could not accurately log the waiting and blocking on the system.  That kinda renders the tool useless when you are trying to diagnose severe performance problems.  But I haven't had that happen in a few years now, so either my code is getting better or M$ has improved their DMVs to hold up under harsh conditions.  I assume the latter.  

But you can take performance monitoring too far.  Here's a quick story to illustrate the fact.  

I consulted at a company that had deployed 2 third party performance monitoring tools that shall remain nameless.  They worked ok but did not capture the wait and subsystem information like PC does so the client decided to deploy my scripts as well.  

There were now 3 tools running.  

PC only collects waits and blocks every 15 seconds and it's likely it won't capture EVERY "performance event" that occurs during that interval.  But again, I'm worried about HUP.  

Primum non nocere.

Like any good doctor, "first, do no harm."  

Frankly, users are generally tolerant of an occasional performance blip, but anything over 15 seconds and rest assured some user will open a help desk ticket.  That is why the monitoring interval is 15 seconds.  But this interval is a bit like Prunes Analysis (when taking prunes...are 3 enough or are 4 too many?).  You can certainly make the case that every 10 seconds is better, or that every minute is optimal.  Decisions like this are very experiential.  

This particular client decided that with a 15 second interval too much valuable information was not being logged.  They decided that if a block or wait occurred, the 15 second monitoring interval would be shrunk to every second to get even better metrics.  This required one of their developers to change the WHILE 1=1 loop in PC.  If, after 60 seconds of monitoring at the one-second interval, no blocking/waiting was observed, the code reverted back to every 15 seconds.  

IMHO, that's far too much logging.  It is quite normal for a busy system to constantly experience some variant of waiting.  That meant that this new code was ALWAYS running in one-second interval mode.  This hyper-logging mode was implemented after I left the client and they did it without consulting me, or really anyone who understands even a modicum of performance monitoring on SQL Server.  

It didn't take long until performance of the system was worse than ever.  At one point the instance was crashing every few days due to tempdb log space issues.  Further, PerformanceCollector did not have all of the fail-safes at the time and the db was about 1TB in size, even with 30-day purging in place.  Another performance consultant was brought in (I was out-of-pocket) and he quickly determined that PC was the biggest consumer of tempdb and IO resources on the instance.  I looked thoroughly stupid when I was called back in to the client to explain myself.  I quickly determined that my code was not at fault; rather, the change to shorten the monitoring interval and increase the frequency of logging was the cause.  

Without finger-pointing I explained that we should put my PC code back to the defaults and re-evaluate why hyper-logging was needed at all.  The other consultant agreed and within a day or so everyone was calm and happy again.  Mistakes like this happen.  

As I was wrapping up this engagement I finally asked, "So, have you identified and corrected many issues that PC has found?  Clearly if you thought hyper-logging was needed you must've corrected quite a few performance problems with the tool."

The DBA responded, pensively, "Well, we keep the trend data but we haven't opened up any tickets to the dev team to fix specific stored procs yet.  We don't have any budget for a DBA to look at the missing indexes either.  But maybe we'll get to do all of that some day."  

The devs, architects, and I choked back the laughter.  All of that monitoring and nothing to show for it.  To this day the other architects at that client call the DBAs the HUGs (Heisenberg Uncertainty Group) behind their backs.  I have since returned to that client for other engagements and I have witnessed this for myself.  Those of us in the know chuckle at the joke.  But the lead HUG didn't get the joke apparently…he got angry once and asked not to be compared with Walter White. 

Moral of the story: monitor as much as you need to, and nothing more.  If you don't have the budget or resources to fix the issues you find during your monitoring then it is best to consider NOT monitoring.  No need to make your performance worse with too much monitoring.  That will keep Heisenberg from rolling over in his grave.  


You have just read PerformanceCollector and the Heisenberg Uncertainty Group on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

PerformanceCollector AddIns

How do the AddIns Work?

There are a series of procedures named PerformanceCollector.Addin*. PerformanceCollector.RunAddIns is tickled every 5 minutes from the SQL Agent AddIns job. This proc builds a cursor over ANY entry in PerformanceCollector.Config that is enabled and runs it.

The AddIn procedures are responsible for determining when a given AddIn is actually supposed to run (every 8 hours, once a day, etc.).  Each one determines this by querying PerformanceCollector.Config and then calls the actual worker proc if it needs to.
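The actual procs are in the repo, but the driver pattern looks roughly like the sketch below.  Treat the Config column names (AddInName, Enabled) as illustrative placeholders, not the real schema:

    -- Rough sketch of the PerformanceCollector.RunAddIns driver pattern.
    -- Column names AddInName and Enabled are assumptions for illustration;
    -- each AddIn proc still checks its own schedule against Config.
    DECLARE @AddInName sysname, @sql nvarchar(4000);

    DECLARE addin_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT AddInName
        FROM PerformanceCollector.Config
        WHERE Enabled = 1;

    OPEN addin_cur;
    FETCH NEXT FROM addin_cur INTO @AddInName;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'EXEC PerformanceCollector.' + QUOTENAME(@AddInName) + N';';
        EXEC sp_executesql @sql;
        FETCH NEXT FROM addin_cur INTO @AddInName;
    END;
    CLOSE addin_cur;
    DEALLOCATE addin_cur;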

AddIns are responsible for populating the other tables in the PerformanceCollector schema.  The following tables store the performance metrics that I find most interesting.  Each table is structured so the first columns are CurrTime and ServerName. This aids in troubleshooting. You can  aggregate all of this data across multiple customers/instances to do trending in a centralized repository.

Each table is listed below with its default runtime interval and its purpose.

  • AdHocPlanCacheBloat (every 7 hours): Logs single-use ad-hoc query plans that are a total waste of plan cache space. If this is really big then the solution is to enable 'optimize for ad hoc workloads', which reclaims all of the memory and makes sure that ad hoc plans are not fully cached. This is an "analyze once" script; once you've enabled 'optimize for ad hoc' then you are done and you can disable this AddIn. But we run this in case a new framework (ruby seems susceptible) is deployed which causes a radical increase in ad hoc plans. You can also use the query text to determine where you are not doing parameterized queries correctly.
  • BufferStats (every 5 hours): Gathers stats on the counts and size of each object in the buffer cache.  Larger values for some objects might mean a new index is required.
  • Config (no schedule): Using this table you can configure certain AddIns to NOT run or change their schedule.
  • DeadlockSnapshot (every 4 hours): Shows all deadlocks that have occurred and tries to decode the deadlock XML to show the Victim and Survivor and what they were doing. Please note that there is a lag time between when a deadlock occurs and when it is decoded here that could be AT MOST 8 hours. You can manually run the Deadlock AddIns to do the decoding at any time if you really need to see up-to-the-minute deadlock data.
  • IndexBlocking (every 4 hours): Logs the amount of time each index spends in a blocking/waiting state.  Useful to determine if better indexes may be needed.
  • IndexesFromQueryPlanDtls (not yet implemented): This will read the missing indexes from the query plans instead of relying on the DMVs, which seems to be far more accurate. It will list the TotalCost (somewhat nebulous) as well as the table and index columns that the optimizer believes would be beneficial.
  • IndexesFromQueryPlansSummary (not yet implemented): An aggregate of the Dtls table that quickly shows the best indexes to be created.
  • IOStalls (every 7 hours): Lists the IOStalls metrics by database and filetype.  See more at Monitoring Disk Latencies with PerformanceCollector.
  • MissingIndexes (every 4 hours): Snaps the sys.dm_db_missing_index* DMVs. However, there is a better way to do this that looks just at cached query plans (see IndexesFromQueryPlanDtls). Even so, this will show the most egregious examples of missing indexes.
  • PageLifeExpectancies (every 5 minutes): Logs the PLEs for the instance.
  • PerfCounters (every hour): Gathers instance-level perf counters in EAV format for aggregation later.
  • PlanCacheMetrics (every 7 hours): Logs stored proc stats (reads/writes/CPU/duration) by metric.  By looking at procs that overlap the different metrics you can quickly determine what you should focus on when you are looking for performance problems.  The metrics are:
      -- Top Cached SPs By Execution Count
      -- Top Cached SPs By Avg Elapsed Time
      -- Top Cached SPs By CPU
      -- Top Cached SPs By Total Logical Reads
      -- Top Cached SPs By Total Physical Reads
      -- Top Cached SPs By Total Logical Writes
  • PlanTypeStats (every 7 hours): Similar to AdHocPlanCacheBloat. This will show you the breakdown of adhoc vs prepared vs proc, etc. The USE_COUNT will show you if parameterization is working for you.
  • PossibleBadIndexes (every 4 hours): Logs cases where writes>reads. This may mean the index can be dropped, but you must understand your workload. However, if Total Reads is ZERO then you can generally drop that index, unless you have a year-end process that needs it or some other edge case where the index is only needed occasionally.
  • TempDBLogWatch (every 5 minutes): Lists transactions and query plans that are using more than 50MB of tempdb log space.
  • TempDBUsage (every 5 minutes): Shows spids and statements that are using more than 50MB of tempdb space.
 

You have just read PerformanceCollector AddIns on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

PerformanceCollector.BlockAndWaitMonitor

This is Part 3 of my blog series on PerformanceCollector.  This utility will help you monitor and baseline the performance of your SQL Server.  You can download the source code here. It is totally open source and there is no copyright.  It follows the Oracle YAPP method of performance monitoring where we monitor our response times and try to improve them by eliminating unnecessary waiting on the server.  Today I am going to cover PerformanceCollector.BlockAndWaitMonitor which is a view that shows you any process that was waiting or blocking/blocked and the cause. This is snapshotted from various DMVs every 15 seconds. We maintain this data for 30 days.  Both the snapshot interval and data retention settings are configurable in PerformanceCollector.Config.  

The Typical Use Case

In general people will complain about application performance by saying, "The app was slow from 1:45 to 2:00 today.  Can you look at the database?" In that case start by querying this view and filtering where CurrTime BETWEEN '1:45' and '2:00'.
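For example (the date literals here are just placeholders; CurrTime is the snapshot timestamp and the remaining columns vary by version of the tool, hence the SELECT *):

    SELECT *
    FROM PerformanceCollector.BlockAndWaitMonitor
    WHERE CurrTime BETWEEN '2014-04-15 13:45' AND '2014-04-15 14:00'
    ORDER BY CurrTime;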

You are looking for:  

  • any blocking/blocked spids
  • any spids waiting and what they are waiting on.
  • sql statements/procedures with longer durations than normal (may be missing an index or stats need to be updated).  

There are infinitely more things to look for that I can't possibly cover in a short blog post.  Experience and google are your friends here.  You'll likely find the cause of the problem if you look at the data and think logically.  

Baselining

It's difficult to baseline wait events and determine whether they are trending in the wrong direction.  However, you can do basic trending by looking for "canaries in the coal mine".  Every system I've ever worked on has had patterns of performance behavior that, when observed, were good indicators that your users were unhappy with performance.  An example is lock escalations.  When a given query is running in READ COMMITTED isolation and scans an index, it runs the risk of locking the entire data structure because the fine-grained row locks are escalated to table locks to conserve resources.  When this happens lots of spids tend to be blocked behind the lock-escalating spid.  When the lock-escalating spid runs for more than a few seconds, users will think your database is down.  So this is a use case I baseline.  It is fuzzy logic, but we know to react to these event chains quickly, before the calls start coming in.  

(sample output from PerformanceCollector.BlockAndWaitMonitor)

Manually Capturing a Spid

You can manually run EXEC PerformanceCollector.WaitAndBlockInspector at any time and it will log current waiting and blocking data. This may be useful if you notice your query is taking longer than it should. You can also pass a @spid parameter like this EXEC PerformanceCollector.WaitAndBlockInspector @spid = 100; and it will show you any waiting and blocking as well as any activity for the spid. This data is output to the screen and logged in the table.

Abandoned Spid Handling

BlockAndWaitMonitor will also show "abandoned spids." An abandoned spid is any spid with an open transaction for at least 60 seconds that is blocking something else. Here is the most important aspect...the spid must be sitting on AWAITING COMMAND. That is what differentiates an abandoned spid from an ordinary, run-of-the-mill blocking spid. When the spid is sitting on AWAITING COMMAND that means the SQL Server is waiting for another command from the spid to do something...most likely issue a COMMIT TRAN.
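If you want to eyeball these yourself, a rough equivalent using the standard DMVs (not the tool's exact query) is to look for sessions that hold an open transaction, have no active request (i.e., AWAITING COMMAND), and are blocking at least one other session:

    -- Rough sketch using standard DMVs; not the exact query the tool uses.
    -- Add a duration filter (e.g., on s.last_request_end_time) to approximate
    -- the 60-second rule described above.
    SELECT s.session_id, s.login_name, s.host_name, t.transaction_id
    FROM sys.dm_exec_sessions AS s
    JOIN sys.dm_tran_session_transactions AS t
          ON t.session_id = s.session_id
    WHERE NOT EXISTS (SELECT 1 FROM sys.dm_exec_requests AS r
                      WHERE r.session_id = s.session_id)          -- idle: AWAITING COMMAND
      AND EXISTS (SELECT 1 FROM sys.dm_exec_requests AS r
                  WHERE r.blocking_session_id = s.session_id);    -- blocking someone else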

I have never seen a case where the spid will come back and issue the COMMIT.  Hopefully your developers are smart enough that they don't allow user input while a database transaction is open.  In every case an abandoned spid occurs because a data access framework thinks it is NOT in implicit_transaction mode and SQL Server thinks it is. For instance, various combinations of jdbc and websphere produce this behavior because jdbc does not issue sp_reset_connection commands like .NET does.

An abandoned spid is first logged after 60 seconds. After 5 minutes of being in an abandoned state, and after another spid has been blocked by it for 60 seconds, the spid is killed. It is logged to PerformanceCollector.SpidsKilled.

You can of course disable this feature in PerformanceCollector.Config.  

Summary

BlockAndWaitMonitor is the most important aspect of the PerformanceCollector tool.  It will show you how the SQL Server subsystems (IO, memory, lock manager, log manager, etc.) are contributing to "waiting" on your SQL Server.  You will see more than just the SQL command that is blocking...you'll see an approximation of why it is blocking.  This is important.  Just because a procedure is long-running does not mean a new index will fix it, nor will a rewrite.  You have to understand why it is behaving the way it is and which subsystem is contributing.  Only then will you know what needs to be fixed.  

In the next post I'll cover some of the other features of Performance Collector.  


You have just read "PerformanceCollector.BlockAndWaitMonitor" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Data Quality

The New York Observer ran a version of this story recently; the original quote is reproduced at the end of this post.
I was consulting with a healthcare IT vendor and one of their attorneys told me a story about a woman in labor who showed up at the hospital without her identity card (driver's license).  The hospital receptionist pulled up her data on their admissions system and turned her away because, according to their records, she had already had the baby six months earlier.  Clearly this was a miscarriage (pun intended?) of justice/insurance/ObamaCare.  Turns out someone with some common sense stepped in and admitted her anyway.  I assume baby and mom were none the worse for wear.  Turns out her identity was stolen and apparently some other pregnant woman had a baby using her insurance/identity.  
 
Bizarre?  Not as bizarre as you may think if you've ever worked in healthcare IT.  Healthcare data quality is notoriously horrendous and, surprisingly, there are often some very good reasons for that.  
 
My Story Is Almost as Funny/Disturbing
(names, locations, and events have been changed to protect the guilty.  Otherwise, this is a true story...you can't make this stuff up!!!)
 
I worked on a project a long time ago to remove duplicate patients/medical records in an FDA-regulated medical "device" used at hospitals in a large metropolitan area.  Each hospital/acute care facility/clinic/nursing home/etc had its own "device".  And each "device" had its own little database of patient information.  In the biz we call this the "medical record."  
 
This large metropolitan area has its own unique socialized medicine system that has evolved from what was once a bunch of independent, mostly religious-affiliated hospitals (Presbyterian and Jewish hospitals, for instance).  The oldest hospital in the system is close to 300 years old.  As you can imagine, after that many years of mergers, acquisitions, and consolidations there were a lot of duplicate patient records in each of these little device databases.  This causes problems, both from clinical and billing perspectives.  If PatientA gets a prescription at Hospital X and then goes to Hospital Y, which has no knowledge of the prescription, possible drug interactions may occur.  You can imagine, I'm sure, what billing issues would arise from this as well.  Wouldn't it be nice if we could have a "best-of-breed" billing system that alerted each facility/device database/medical record of updates?  
 
Anyway, I was tasked with merging patient records.  I'm a firm believer in agile so I wanted to deliver something quickly that met a subset of the requirements so that user-acceptance testing could begin.  There was no magic with this project.  Very simply, using a simple algorithm I was able to merge patient records for "John A Smith" and "John Smith Jr" if they had 11 data points that were similar other than name.   Data points might be SSN, address, DOB. The hard part was merging the records together which, unfortunately, wasn't merely a case of changing MedicalRecordID 1234 to 6789.  (Bad normalization at play).  Basically, pretty simple requirements, pretty simple code.  
 
Within a couple of sprints I was done, QA signed off, and the "customer" was thrilled. The rollout went smoothly.  Since we had to merge MILLIONS of medical records the code took a "creepy/crawly" approach...it built some cursors (yes, there are good uses for cursors) and slowly started migrating data, ensuring we had no possibility of deadlocking or performance issues, migrating "active" data first, worrying about "archived" (read dead people) data later.  We could never do this during downtime (no such thing in the clinical healthcare space).  With smaller data sets the "devices" could all now communicate with each other and share patient records real time.  Performance was better, data quality was better, we even reclaimed disk space.  
 
So, where's the problem?
I arrived at work one day a month later and was told by a frantic VP that I needed to immediately initiate a rollback of the code.  He was out-of-breath and clearly frazzled, saying my code was faulty, etc etc, all the while gasping for air.  I tried to remain calm, "What's wrong? Is it something we can fix before we take drastic measures?"
 
"Heroin addicts can't get their methadone!"  
 
I kid you not, that was the exact statement.  "Come again?  Heroin addicts can't get their what?"  My mind was trying to figure out how my change could affect heroin addicts.  
 
The Problem
We did lots of testing so I felt confident in my code.  But at times like this I tend to panic too.  When you deal with FDA regulations you learn not to cut corners with QA.  Some of us developers pay off the DBAs with beer, in healthcare you pay off the QA folks (but not in beer).  You want QA to be coherent and you want them finding every bug.  You don't just pass off your code to QA, you make sure QA understands what you did and why.  Pairing and scrum are necessities in regulated industries.  
 
Pretty soon conference calls were arranged and I learned about the true nature of the problem.  In the good ol' days before my code was in place, "John A Smith" would register at Hospital A and receive his methadone.  His patient record at Hospital A would be updated to reflect that his next methadone treatment was scheduled for x days out.  Along with that he was given a subway token to get back home. How thoughtful.  
 
Ah, but nothing inspires ingenuity like a drug addict trying to get his next fix.  Entire TV shows are based on this simple premise.  So, instead of going home Mr Smith would use his subway token to go to Hospital B where he was also registered, albeit under John Smith Jr and a slightly different address or DOB.  Our Mr Smith would get his allotted methadone, again, plus another subway token, which he would use to go to Clinic C.  
 
Rinse and repeat, always repeat.  
 
After a month of my code being in place, with its "creepy/crawly" features, the formerly-over-medicated patients were now experiencing rapid detox.  Perhaps I should take a step back.  Heroin is a dangerous, addictive opiate.  Methadone is a synthetic opioid that helps wean addicts off heroin if dosed carefully.  But when taken in excess it is just as addictive as heroin and can lead to overdose.  (I hope you learned something about drug culture in this blog post...you can now wow your friends with your new street cred).  
 
Now, I'm a very sheltered guy and it came as a complete surprise to me that rapidly detoxing drug addicts aren't a good thing to have running around a large metropolis.  Much better to have them over-medicated than under, I suppose.  
 
No one, from the customer (read government politicians who run this city's socialized medicine), to the BAs, to the QA people, to me, realized that de-duplicating data like this would be a problem...until it was a problem.  Only the hospital staff knew this would be a problem, having experienced these issues for years...but no one bothered to ask those users their opinion on some benign system change like purging duplicate medical records.  
 
In fact, months later we learned that it was a conscious decision by some of these clinics to ensure that patients were in their systems numerous times.  The system would limit the methadone dosing for patients that clearly needed more.  To override a dosing schedule would've put the prescriber on a "red list".  The easiest way to get a patient a bigger dose was to make a duplicate of the patient.  Once hospital staff had worked with our software for years they knew the idiosyncrasies that would allow them to enter a duplicate patient record by, say, altering an address from "1357 W 98th St" to "1357 W 98th Av".  
 
In Summary
A reasonable person might give me plaudits for my beautiful code that stopped this fraud...even if it was by accident.  But we're dealing with the irrational.  The politicians were livid because, even though they now had more accurate medical records and properly-medicated patients, they had a whole bunch of wild heroin addicts that couldn't get their overdose of methadone.  I had the change backed out in 48 hours and we eventually coded a system that eliminated duplicate patients from the devices...unless they were on methadone.  Those dups stayed.  
 
So, basically, we turned a blind eye to data quality...not to mention proper health care.  
 
Again, this story is modified to protect the guilty.  I assure you the actual circumstances were just as ridiculous.
 
True story.  From The New York Observer:
 
Lisa Schifferle, an attorney with the FTC, told me a story about a pregnant woman who, after having her identity stolen, showed up at the hospital in labor and was turned away because, according to the hospital's records, she had already had a baby six months earlier.
 
That's nothing.  My story is better.  I worked on a project to remove duplicate patient names in an FDA-regulated medical device used at many hospitals in NYC...which has its own unique socialized medicine system.  Using heuristics I was able to merge patient records for John A Smith and John Smith Jr if they had 11 data points that were similar other than name.  The requirements and code were signed-off by all parties and everyone was happy.  
 
Until 2 weeks after it was "turned on" at all of the hospitals.  I arrived at work and was told that if I didn't roll back the change I would be fired on the spot.  I asked what was wrong.  
 
"Your system is working too well and our heroin addicts can't get their methadone!"  
 
Seriously, that was the statement.  Previously John A Smith would register at Hospital A and receive his methadone and a subway token to get back home, and his patient record would be updated to reflect that his methadone was scheduled for refill in x weeks.  Instead of going home Mr Smith would use his subway token to go to Hospital B where he was registered as John Smith Jr and would receive his allotted methadone and a subway token.  He would take his token to Hospital C where he was registered as John A. Smith Jr...you get the point.  
 
Now, one would think I should get plaudits for stopping this fraud...even if it was by accident.  Nope.  The politicians were livid because they now had more accurate hospital records but a whole bunch of wild heroin addicts that couldn't get their overdose of methadone.  I had the change backed out in 48 hours and we eventually coded a system that eliminated duplicate patients from the systems...unless they were on methadone.  
 
True story...

You have just read "Data Quality" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

PerformanceCollector Part 2: How It Works

This is Part 2 of my blog series on PerformanceCollector.  This utility will help you monitor and baseline the performance of your SQL Server.  You can download the source code here. It is totally open source and there is no copyright.  It follows the Oracle YAPP method of performance monitoring where we monitor our response times and try to improve them by eliminating unnecessary waiting on the server.  

Overview

PerformanceCollector looks at various DMVs and at given intervals takes a snapshot of the most interesting DMV data that you can then query later.  It is implemented as two SQL Agent jobs.  

  1. PerformanceCollector BlockAndWaitMonitor:  runs in a "WHILE 1=1" loop and will rerun every 5 mins if it is stopped. Within the loop, every 15 seconds (configurable in PerformanceCollector.Config) it will collect blocking and waiting data into PerformanceCollector.BlockAndWaitMonitor (a stripped-down sketch of this loop appears after this list).
  2. PerformanceCollector.RunAddIns: runs every 5 mins to determine what AddIns are scheduled to run. This is discussed more in a future blog post.  The AddIns capture data at less frequent intervals.  You can write your own AddIns as needed.  
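The collection loop in the first job is just the standard agent-job polling pattern.  Here is a stripped-down sketch: the real job reads its interval from PerformanceCollector.Config rather than hard-coding 15 seconds, and I'm calling WaitAndBlockInspector here because that is the proc that logs this data (check the repo for exactly what the job runs):

    -- Stripped-down sketch of the collector loop; not the full proc.
    WHILE 1 = 1
    BEGIN
        -- snapshot current waiting/blocking into PerformanceCollector.BlockAndWaitMonitor
        EXEC PerformanceCollector.WaitAndBlockInspector;

        -- sleep for the configured interval (15 seconds by default)
        WAITFOR DELAY '00:00:15';
    END;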

Installation

Initial deployment and upgrades are handled by the same Powershell script: Deploy-PerformanceCollector.ps1. There is one parameter: $servername. It assumes you are running as an NT acct with sysadmin permissions.  Simply download the code and execute it like the screenshot.  

It will install a new database for your instance called PerformanceCollector that contains all of the code.  It will also create the new SQL Agent jobs noted above.    The installer only takes a few seconds to create or upgrade the utility.  The installer uses a variant of my MD3 (Metadata Driven Database Deployments) utility to install or upgrade a database via scripts.  

 

In the next blog post in this series I'll cover the BlockAndWaitMonitor which is the core component of the tool and the go-to object when I am presented with a performance problem on a production system.  


You have just read "PerformanceCollector Part2: How It Works" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

PerformanceCollector: a performance monitoring tool for SQL Server

About 15 years ago I wrote a suite of performance monitoring scripts for SQL Server 7 that would try to tell you why a given query was running slowly.  Back in those days there was no concept of DMVs to help you out.  Most people I knew relied solely on sqltrace (Profiler).  The problem is, Profiler will tell you how long something ran, but it won't tell you what subsystem of SQL Server is causing those long runtimes.  Is it a memory issue?  Slow IO?  Blocking?  The only other information available was sysprocesses.  And it contained excellent information.  

I came from an Oracle background and in those days (Oracle8) there was no equivalent to Profiler.  If you wanted to look at your performance metrics you needed to query the sys and system schemas and the v$ views.  These are functionally equivalent to DMVs.  In fact, the folks at Oracle published a bunch of whitepapers on using these DMV-like tools and called it YAPP (Yet Another Performance Profiling Methodology).  YAPP was quite simply tuning based on the subsystems that caused long response times.  

I never realized how simple and elegant this approach to performance management was until I started working with SQL Server 6.5.  Other than sysprocesses, there was nothing else to tell you where the bottleneck was.  sqltrace was great for baselining but given a big query that had sporadic performance problems, where do you look next?  I actually started to do some independent learning about Performance Management that I began to document.  

But how could I do YAPP in SQL Server, which had no publicly-exposed wait interface?  Well, it turns out it did.  sysprocesses would show you the waittype, waittime, and waitresource for any spid that is waiting on some subsystem.  
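Even back then you could do crude wait analysis with a query like this (sysprocesses still exists today as a compatibility view):

    -- Old-school (pre-DMV) wait analysis against sysprocesses.
    SELECT spid, status, blocked, lastwaittype, waittype, waittime, waitresource, cmd
    FROM master.dbo.sysprocesses
    WHERE waittime > 0
       OR blocked <> 0;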

I actually started this website in 1997 partially to document my findings with sysprocesses and ways to do YAPP-like analysis.  Subsystem Information Home, which now has lots of information on DMVs and response time tuning techniques, was initially a series of static pages on interesting, undocumented things I found.  Remember, there was no google yet.  

I began to see more performance engineers and DBAs relying on sysprocesses as their tool of choice for performance troubleshooting.  Slowly DBAs started complaining to M$ to expose even more information in sysprocesses and give us something more YAPP-like.  With the release of SQL Server 2005 we were given our first DMVs.  I was thrilled.  But there were so many DMVs and they were still not entirely documented.  I created a new section of my website, DMV Info where I started to document what these DMVs did and how they helped in response time and wait analysis.  

I certainly wasn't the only DBA doing this, nor even the best.  In the last 10 years a complete shift in how SQL Server DBAs do performance engineering has occurred.  Nowadays I don't see anyone using Profiler and everyone uses the DMVs.  How the times have changed.  

Anyway, my first attempt at a performance monitoring and baselining solution was a SQL Executive (now known as SQL Agent) job that merely took a snapshot of sysprocesses and squirrelled it away so I could look at it later when someone complained about poor performance.  It was invaluable.  I rewrote it with the advent of DMVs.  Slowly over time I was monitoring even more "stuff" as every new release of SQL Server gave us new and exciting DMVs.  

I found this so useful I was installing it at every client engagement.  I decided to release a version of it a few years ago and posted it on my blog under the name Performance Engineer-in-a-Box.  I've since installed it at even more client sites and I finally got around to properly releasing this in a manner where I can do controlled releases as needed.  I now call the tool "PerformanceCollector".  The source code is available at my GitLab installation and is open to the public.  Feel free to modify it, fork it, steal it, change it, or whatever you would like.  I really don't care and there is no form of copyright on it.  I would appreciate an attribution if you do use it, but no payment is necessary.  

Over the next few blog posts I'll cover in detail what PerformanceCollector does and how it works.  


You have just read "PerformanceCollector: a performance monitoring tool for SQL Server" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

The Sunk Cost Trap

"We want you to float on Project Automation and get back to us in a week with your thoughts.  We are a little concerned that this project has cost us $14 million and 14 months.  It is scheduled for code freeze in 4 months and we have yet to see a working proof-of-concept."  The worry was audible in the VP's voice.  
 
And this is how it starts.  I both relish and fear these moments.  On one hand I love coming in and saving a failing project, on the other hand, you never win and friends doing it.  You see, there was a customer perception that our enterprise software required too much manual intervention and Project Automation (PA) was meant to relieve some of that burden.  It was a contractual demand from our largest customers.  We either succeeded or we went out of business.  
 
I quickly began spending my days in the PA team room, at first absorbing what I was hearing, then asking questions, then offering suggestions.  When you get to that last stage the reticence is noticeable from your co-workers.  Within two weeks it was clear to me that most of the team viewed this as a death march project.  The team knew key aspects needed to change if it was to be a success but everyone was hesitant to make those changes knowing that there was only 4 months left.  Left alone, inertia guaranteed this project would fail.  
 
Within three weeks I had a working list of features that, as implemented, would NOT meet customer requirements regardless of time or money spent at this point.  The approach was wrong.  I also worked out a better approach that would meet the requirements, albeit scaled back for Version 1, and could be coded-up and unit tested within the remaining 3 months.  Granted, this would require some serious hunkering down and probably some OT, but we would show success.  
 
It took another week until I could convince the entire team that the only way they could show success was to ignore the sunk costs and change direction.  Within that week I had a somewhat-working proof-of-concept that I could demo and the team was won over.  It took another week to convince management to ignore the sunk cost and change direction.  Management was not happy about time and money being flushed away, but it was the only way to show success.  
 
We delivered Project Automation about a month late and with scaled-back requirements.  In the enterprise software development space, that's an unqualified success!  And we worked almost no overtime (I strongly believe in work/life balance).  By the time the release was GA'd we already had a service pack to address the missing requirements and even added a few new features that we realized were lacking after usability testing.  Customers were thrilled.  
"We want you to float on 'Project Automation' and get back to us in a week with your thoughts.  We are a little concerned that this project has cost us $14 million, 140 man-years, and 14 months of time.  It is scheduled for code freeze in 4 months and we have yet to see a working proof-of-concept."  If 'worry' has a sound it was audible in the VP's voice that day.  
 
And this is how it starts.  I both relish and fear these moments.  The moment that I am asked to work on an impossible project.  On one hand I love coming in late and saving a failing project; on the other hand, I never win friends doing it.  And I've only ever failed at one of these opportunities.  That's not braggadocio.  There were times I went days without sleep to get to success.  It's just my nature, and I'm neither the only one nor the exception to the rule.  
 
Project Automation (PA) was critical to the future of our software.  You see, there was a customer perception that our enterprise software required too much manual intervention and PA was meant to relieve some of that burden.  It was a contractual demand from our largest customers.  We either succeeded at PA or we went out-of-business.  The details of what PA did are irrelevant to this story.  Just understand that it was critical.  
 
I wasn't the first person asked to help out with PA.  Like I said, there are lots of people far better than me.  For months management tried to get some control.  They tried the obvious things...shorter sprints, a move from scrum to kanban, enforced pair programming, more unit testing, less unit testing.  Nothing worked.  Those obvious things, while often helpful, will not turn around a failing project.  I know that, management sometimes does not.  
 
I quickly began spending my days in the PA team room, at first only listening.  Soon I began asking questions, then later offering suggestions.  When you get to that last stage the reticence is noticeable from your co-workers.  Within two weeks it was clear to me that most of the team viewed this as a death march project.  The team knew key components needed to change if it was to be a success but everyone was hesitant to make those changes knowing that there were only 4 months left.  Left alone, inertia guaranteed this project would fail.  
 
Within three weeks I had a working list of features that, as implemented, would NOT meet customer requirements regardless of time or money.  The approach was wrong.  I worked out a better approach that would meet the requirements, albeit scaled back, and could be coded-up and unit tested within the remaining 3 months.  Granted, this would require some serious hunkering down, but we would show success.  
 
It took another week until I could convince the entire team that the only way they could show success was to ignore the sunk costs and change direction.  Within that week I had a somewhat-working proof-of-concept that I demo'd and the team was won over.  It took another week to convince management to ignore the sunk cost and change direction.  Management was not happy about time and money being flushed away, but it was the only way to show success.  More out of desperation for their failing business than faith in my ideas, they acquiesced.  The PA team would change direction and do it my way.  
 
We delivered Project Automation about a month late and with scaled-back requirements.  In the enterprise software development space, that's an unqualified success!  And we worked almost no overtime (I strongly believe in work/life balance...and it was NHL playoff time too).  By the time the release was GA'd we already had a service pack to address the missing requirements and even added a few new features that we realized were lacking after usability testing.  Customers were thrilled.  We remain(ed) in business.  

Why did this project succeed?  It wasn't because I'm some kind of architect-superstar.  We know that's not true.  It was a success solely because I could convince others to ignore the sunk costs, especially so late in the development cycle, and switch focus to something else that would work instead.   

The Sunk Cost Trap

I feel I'm constantly explaining the sunk cost trap to project and product managers.  Recognizing a sunk cost trap when you see it in your software projects can save your company.  

A sunk cost in accounting parlance is a cost that has already been incurred and can't be recovered.  A sunk cost is not necessarily bad if it is a cost that can yield a positive return in the future.  The sunk cost trap arises when people cannot see that an investment will not ever show a return, yet they continue to waste resources pursuing it.  

There are lots of pithy aphorisms that deal with the sunk cost trap that you might've heard:  

  • "When you first find yourself in a hole, stop digging."
  • "Your first loss is your best loss."  
  • "Don't throw good money after bad."  
  • "Let bygones be bygones."  

Each of these aphorisms is conveying the message, more or less, not to fall for the sunk cost trap.  Yet people fall for this fallacy all the time.  We have the mistaken belief...it is really very irrational...that we've already committed a lot of resources to a given endeavor and we are past the point of no return.  We must continue with the chosen path and see it through.  The sunk cost trap, like all fallacies, can cost you and your company a lot of money when not identified.  

It is important as a software architect to not be a victim of fallacies.  This may even be the most important part of a software architect's job description.  

In your everyday life you probably fall for the sunk cost trap more than you realize.  Have you ever finished watching a two-hour movie that you were not enjoying after the first half hour?  You already had thirty minutes invested in it right?  It might get better so you keep watching.  Or you've finished a terrible dinner at a posh restaurant because you aren't the type to complain and walk out without paying.  If you're going to pay anyway, why not finish the meal?  Or you hold onto a bad stock in your portfolio because it is down too much to sell now, even though you'd never buy more of it today.  

Try to recognize fallacious activity. If you suddenly realize that a bad decision was made and it is preventing you from attaining your goals, and a better alternative exists, then you need to ensure that you change.  The truly good architects will do this...the mediocre architects will focus on the sunk costs and will try, come hell or high water, to make a subpar solution work at any cost.  

Prevention and Cure

Here are some tricks that I use to prevent sunk cost traps in my projects and overcome common objections when they do arise:   

  • "Fail early instead of fail late".  I preach this.  Never embark on a 6 month (or longer) project.  Instead, pare down the scope until you can have a proof-of-concept (POC) deliverable in a few sprints.  In your sprint retrospectives you should be evaluating whether or not the POC is working and if there are any alternatives that may be better.  POCs are really just "science projects" but many people are not willing to deviate after a few sprints due to the sunk cost trap.  I just finished a very lengthy blog series on various POCs I did using NoSQL solutions.  As soon as we found an issue that caused a given product to be sub-optimal for our needs we dropped that POC immediately.  We did not continue to throw good money at it just because we already wrote a few hundred lines of code.  I dropped a couple NoSQL POCs after just a few hours because of fundamental flaws I saw in the products given our requirements.  
  • Understand lost opportunity costs.  By continuing to invest in something that is sure to fail you lose TWICE.  You are flushing money down the toilet on the existing solution and you can't properly invest in an alternative that is more promising (that is the "lost opportunity cost").  Management fails to understand opportunity costs constantly.  For instance, senior, strategy-minded architects are forced to perform support tasks because there is a shortage of support personnel.  The opportunity cost is the senior architect cannot think about how to relieve the technical debt that causes inflated support costs in the first place.  
  • "This was a learning cost, albeit an expensive one".  At a minimum, make your mistakes educational so they won't happen again.  

For years after Project Automation I still had managers griping about the $14 million "learning cost" we spent before we changed direction.  This is entirely the wrong thing to focus on.  It isn't that we wasted $14 million, it is that we are still in business and are profitable.  I can assure you that we never had another fiasco like PA again.  Of course we made sure we had better project governance, but most importantly we recognized when the sunk cost trap was causing us to not do the right things.  

Metrics That Count (and it ain't "points")

I've lived through my fair share of software productivity and management frameworks.  Kanban, agile, scrum, XP, waterfall...there's probably more that I'm trying to subconsciously suppress as I write this.  Each of these is concerned in some respect with metrics.  How do we evaluate how much work is getting done and who is doing it?  How do we use metrics to improve "cycle times"?  How do we improve "burn down"?  How can we use these metrics at performance review time?  Well, IMHO, none of these "metrics" really matter.  What matters is shipped software, delighted customers, rising stock prices, and stress-free employees who get to go home at the end of the day and spend quality time with their spouses, significant others, and/or kids.  Nothing else matters.  

Management thinks it needs hard numbers and metrics to determine if the program is meeting its software development goals.  It also needs numbers to determine which developers are meeting expectations and which are unsatisfactory.  One problem is that software development is not assembly line work.  In a Toyota factory, management has various mechanisms to determine efficiency of individuals.  Number of defects, speed of the line, number of work stations mastered, the ability to train new employees, etc etc.  

Art vs Science

Software development is *not* assembly line work, no matter what new language or Big Data system the cool kids are all using.  Software development is more "art" than "science".  And management of "art", with its inherent lack of metrics, is so much harder to do than managing something with defined metrics and "numbers"...something like science or math.  

Think I'm exaggerating?  Do you think Pope Julius II evaluated Michelangelo based on the number of cherubs he painted on the ceiling of the Sistine Chapel every day?  It's true that they argued over the scope of the work and budget, but the Pope never tried to evaluate The Master based on some concocted metric. 

There is so much in software development that simply cannot be measured up-front.  We generally call these things the "non-functional requirements."  Some shops call them "-ilities".  Performance is generally considered a non-functional requirement.  We all try very hard to evaluate the performance of our software before it ships, often using tools such as LoadRunner.  But more often than not we find that we have not met the necessary performance metrics once the software is in the customer's hands.  So, how do you measure a performance metric early?  You really can't.  So, do we ding the team or individual for failing this metric? 

The only metric that matters

... in software development is working, released features that a customer wants.  If the feature has not shipped then you get zero credit.  There is no A for Effort.  Even if you are 80% feature-complete, you get no credit.  If you shipped it but the customer doesn't like it, you get no credit either.  I hear developers complain that it isn't their fault that the product failed...the requirements from the analysts were wrong and the developers merely implemented the requirements as given.  I appreciate that argument, and I feel your pain, but the only metric that counts is a happy customer.  When your company goes bankrupt because your product failed because of bad requirements, I'm not sure your mortgage company is going to care.  

Other Metrics

There are lots of metrics management uses to evaluate us.  Here are a few and my rebuttal as to why they don't work for project evaluation:

  • Tickets closed:  I've worked at shops where each branch of code needed its own ticket for check-in purposes.  And we always had 4 supported versions/branches open at any time.  So a given bug may need 4 tickets.  That's called "juking the stats."

  • Lines of code written:  so now we are incentivizing people to write more code instead of elegant, short-and-sweet, supportable code.  More lines of code = more bugs.  

There are two ways to design a system: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.  (C.A.R. Hoare)

  • Story Points:  A quick google search for "what is a story point?" yielded this article which pretty much concludes that you shouldn't use story points for metrics.  Oops.  
  • Velocity:  this supposedly shows management the "rate of progress" in our development.  In a perfect world our velocity will improve when we develop new tools that help us work smarter and faster, such as automation.  But many times velocity is merely going up because developers are incentivized to make the velocity improve and they do this the simplest way possible...cut corners.  
  • Code Test Coverage:  there are lots of tools that will analyze how many lines of code you have that have no unit tests.  I covered this in my blog post Paradox of Unit Testing?.  This leads to people juking the stats again...writing a bunch of tests to make the code coverage analysis tool happy.  
  • Unit Tests Written:  see the Paradox of Unit Testing? post again.  I have worked with people who have refused to add new features to their code because there were too many unit tests that would need to be rewritten.  

The last two are the WORST offenders.  Most developers realize that lines of code, points, and tickets closed are ridiculous metrics, but many otherwise thoughtful developers fall for static code analysis and unit tests.  I've seen whole teams spend entire sprints writing unit tests for code that was 5 years old with no reported bugs because there was no test coverage.  It sucks the life out of the product and the team.  

I once tried to remedy this situation, and I hate to admit this, by merely adding empty test bodies to get the metrics acceptable.  And I've seen lots of people merely comment out broken tests to avoid getting weenied for that.  

Why do we rely on metrics?

Numbers suggest control.  Management likes control.  But this is totally illusory.  I've worked on teams where every sprint had some task called something like "refactor the Widget interface settings" that was assigned 15 points.  If the team had a bad sprint they merely claimed these points to make the numbers look good.  No work was ever really getting done and management had no idea.  That same team, after a 12 month release cycle, had ZERO features to contribute to the product's release.  Management was not happy.  But every sprint showed progress and burndown.  

Heisenberg and Perverse Incentives

When something is measured too much, the measurement itself will skew the system under measurement.  Loosely, this is known as the Heisenberg Uncertainty Principle (strictly, it is the observer effect again).  I've worked on teams where there was an over-reliance on points as the metric to determine productivity.  People knew they were being measured and they geared their activities to those things that could generate the most points.  This usually meant pulling simple, low-risk cards with small points.  The more important, higher-point but longer-duration "architecture" cards were never worked on.  They were too risky: you either got all of the points, or none of them.  

Summary

I'm sorry about this long rant on software development metrics.  Every team is unique and will determine how best to structure itself for optimal efficiency.  So many of these metrics are an effort to shoe-horn the team into a structure that management is comfortable with, even if that is forsaking the goals of the program.  Let the team figure out what metrics it needs.  Management's only metric of concern should be shipped features.  Nothing else matters.  When management evaluates its development teams on that metric I believe you will see teams that are super-productive because they are no longer juking the stats and wasting time on anything that is not leading them to more shipped features.  

But nothing is going to change.  It is too risky for management to not have a solid metric in which to evaluate us and our products.  Very depressing.   


You have just read "Metrics That Count" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

On Being a Team Player

Ah spring!  Birds chirping, flowers blooming.   As I sit here smelling the freshly cut grass my mind wanders to thoughts of whether the grass may actually be greener on the proverbial other side.  An email just popped into my inbox from some recruiter I've never heard of for a consulting gig that appears to be a good fit for me.  Advanced SQL Server skills, knowledge of data modeling, ability to do performance tuning.  And it's a great rate and good location to boot.  

But there's one thing that's an instant turnoff for me.  "Must be a team player."  Please don't get me wrong...I love working on teams and I enjoy discussing technical problems with knowledgeable colleagues.  But "Must be a team player" is just fluff to me.  Of course your company wants a team player!  Why must that be stated prominently on the job req?  Makes no sense to me.  No company would ever advertise for Benevolent Dictators for their team and they would never advertise for a non-team player either.  I've written before about other needless phrases on job reqs...like "senior-level".  Everybody thinks they are senior-level.  Instead, tell me the exact skills the job requires.  

I'm waiting for the day when "Must have good oral hygiene" is listed on a req.  Honestly I'd rather see that on a req than "team player."  There are SO many people walking around with bad breath and it affects team productivity when you have to work in a close-knit scrum room.  

Further, have you ever heard anyone describe themselves as "not being a team player"?  I've been accused of not being a team player and I'm sure others have too.  Being accused of not being a team player is sometimes a sign that you have a maverick who is willing to buck consensus for what she believes is right.  I'd rather have that "non-team player" than a team player who is merely a yes-man for the status quo.  That's how companies lose their market edge and go bankrupt.  Too many team players that don't have the backbone to affect positive change.  

I really question whether a company would want a "team player" for certain roles regardless.  Team players are great if you just need a staff programmer or jr DBA.  But if you need a good architect on a Death March Project, then maybe a team player isn't what you need.  Perhaps you need someone to question the most basic premises in your organization.  Perhaps you need someone who won't be afraid of being a maverick if it means the difference between success and failure, or profits and bankruptcy.

I just really hate seeing "must be a team player" on a job req.  It's a big turnoff for me.  Might as well say, "Requires 5 years of experience with SQL 2012"  (it's 2014 as of this post's publication).  Or perhaps, "requires proficiency with word processing applications and email".  Really?  This is for an IT job and I'm expected to be functionally literate with Outlook?  Oh my.  

Seriously though, I'm not sure how any senior-level person (I hate "senior-level" too, see above, but it's apropos for this post) could fathom applying for a job with "must be a team player" in the req.  I would rather apply for a job with a more concise description that does not contain fluff.  
