PerformanceCollector and the Heisenberg Uncertainty Group

This is the last post in my series on my PerformanceCollector tool.  This post is a warning, and an anecdote, about how you can over-monitor your server.  By default PerformanceCollector (PC) monitors quite a bit of information, but I've had the utility running against high-throughput production OLTP workloads and could not discern any negative performance impact using the default configuration.  It has all of the necessary fail-safes in place (I hope) to ensure it doesn't go off the rails.  For instance, there is a configuration setting that caps the size of the database and auto-purges data if needed.  

The Heisenberg Uncertainty Principle may eventually come into play, however, and you should be aware of that.  If you are not familiar, the HUP (at least as it is popularly applied) basically says that the more you monitor something, the more the act of monitoring itself will adversely impact the system.  You may also hear this called the "observer effect".  

If you find that PC is a bit too heavyweight for your environment then you can disable some of the AddIns or extend the interval between runs.  Or you can tweak the code to fit your needs.  The possibilities are endless.  In older versions of the tool I ran into some cases where the system was under such stress that PC could not accurately log the waiting and blocking of the system.  That kinda renders the tool useless when you are trying to diagnose severe performance problems.  But I haven't had that happen in a few years now so either my code is getting better, or M$ has improved the performance of their DMVs to work under harsh environments.  I assume the latter.  

But you can take performance monitoring too far.  Here's a quick story to illustrate the fact.  

I consulted at a company that had deployed two third-party performance monitoring tools that shall remain nameless.  They worked OK, but they did not capture the wait and subsystem information that PC does, so the client decided to deploy my scripts as well.  

There were now 3 tools running.  

PC only collects waits and blocks every 15 seconds and it's likely it won't capture EVERY "performance event" that occurs during that interval.  But again, I'm worried about HUP.  

Primum non nocere.

Like any good doctor, "first, do no harm."  

Frankly, users are generally tolerant of an occasional performance blip, but anything over 15 seconds and rest assured some user will open a help desk ticket.  That is why the monitoring interval is 15 seconds.  But this interval is a bit like Prunes Analysis (when taking prunes...are 3 enough or are 6 too many?).  You can certainly make the case that every 10 seconds is better, or that every minute is optimal.  Decisions like this are very much experiential.  

This particular client decided that, with a 15-second interval, too much valuable information was going unlogged.  They decided that if a block or wait occurred, the 15-second monitoring interval would shrink to every second to get even better metrics.  This required one of their developers to change the WHILE 1=1 loop in PC.  If, after 60 seconds of monitoring at the one-second interval, no blocking/waiting was observed, then the code reverted back to every 15 seconds.  

IMHO, that's far too much logging.  It is quite normal for a busy system to constantly experience some variant of waiting.  That meant this new code was ALWAYS running in one-second-interval mode.  This hyper-logging mode was implemented after I left the client, and they did it without consulting me, or really anyone with even a modicum of understanding of performance monitoring on SQL Server.  

It didn't take long until performance of the system was worse than ever.  At one point the instance was crashing every few days due to tempdb log space issues.  Further, PerformanceCollector did not have all of its fail-safes at the time, and the db was about 1TB in size even with 30-day purging in place.  Another performance consultant was brought in (I was out-of-pocket) and he quickly determined that PC was the biggest consumer of tempdb and IO resources on the instance.  I looked thoroughly stupid when I was called back in to the client to explain myself.  I quickly determined that my code was not at fault; rather, the change that shortened the monitoring interval to one second was the cause.  

Without finger-pointing I explained that we should put my PC code back to the defaults and re-evaluate why hyper-logging was needed at all.  The other consultant agreed and within a day or so everyone was calm and happy again.  Mistakes like this happen.  

As I was wrapping up this engagement I finally asked, "So, have you identified and corrected many issues that PC has found?  Clearly if you thought hyper-logging was needed you must've corrected quite a few performance problems with the tool."

The DBA responded, pensively, "Well, we keep the trend data but we haven't opened up any tickets to the dev team to fix specific stored procs yet.  We don't have any budget for a DBA to look at the missing indexes either.  But maybe we'll get to do all of that some day."  

The devs, architects, and I choked back the laughter.  All of that monitoring and nothing to show for it.  To this day the other architects at that client call the DBAs the HUGs (Heisenberg Uncertainty Group) behind their backs.  I have since returned to that client for other engagements and I have witnessed this for myself.  Those of us in the know chuckle at the joke.  But the lead HUG apparently didn't get the joke...he got angry once and asked not to be compared to Walter White. 

Moral of the story: monitor as much as you need to, and nothing more.  If you don't have the budget or resources to fix the issues your monitoring finds, then consider NOT monitoring at all.  There is no need to make your performance worse with too much monitoring.  That will keep Heisenberg from rolling over in his grave.  



PerformanceCollector AddIns

How do the AddIns Work?

There are a series of procedures named PerformanceCollector.Addin*. PerformanceCollector.RunAddIns is tickled every 5 minutes from the SQL Agent AddIns job. This proc builds a cursor over ANY entry in PerformanceCollector.Config that is enabled and runs it.

Each AddIn procedure is responsible for determining when its AddIn is actually supposed to run (every 8 hours, once a day, etc.). It determines this by querying PerformanceCollector.Config, and then calls the actual worker proc if it needs to.
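To make the mechanism concrete, here is a minimal sketch of what a RunAddIns-style dispatcher could look like.  This is not the tool's actual code; the Config column names (AddInProcName, IsEnabled) are assumptions for illustration.

-- hypothetical sketch: loop over the enabled entries in PerformanceCollector.Config
-- and execute the corresponding PerformanceCollector.AddIn* procedure
DECLARE @AddInProcName sysname, @sql nvarchar(400);

DECLARE addin_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT AddInProcName                  -- assumed column: name of the AddIn proc
    FROM PerformanceCollector.Config
    WHERE IsEnabled = 1;                  -- assumed column name

OPEN addin_cur;
FETCH NEXT FROM addin_cur INTO @AddInProcName;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- each AddIn proc decides for itself whether it is actually due to run
    SET @sql = N'EXEC PerformanceCollector.' + QUOTENAME(@AddInProcName) + N';';
    EXEC (@sql);
    FETCH NEXT FROM addin_cur INTO @AddInProcName;
END
CLOSE addin_cur;
DEALLOCATE addin_cur;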

AddIns are responsible for populating the other tables in the PerformanceCollector schema.  The following tables store the performance metrics that I find most interesting.  Each table is structured so the first columns are CurrTime and ServerName. This aids in troubleshooting. You can aggregate all of this data across multiple customers/instances to do trending in a centralized repository.

PerformanceCollector object, default runtime interval, and purpose:

  • AdHocPlanCacheBloat (every 7 hours): Logs single-use ad-hoc query plans that are a total waste of plan cache space. If this is really big then the solution is to enable 'optimize for ad hoc workloads', which reclaims all of the memory and makes sure that ad hoc plans are not fully cached (see the example after this list). This is an "analyze once" script: once you've enabled 'optimize for ad hoc workloads' you are done and you can disable this AddIn. But we keep it running in case a new framework (Ruby seems susceptible) is deployed that causes a radical increase in ad hoc plans. You can also use the query text to determine where you are not doing parameterized queries correctly.
  • BufferStats (every 5 hours): Gathers stats on the count and size of each object in the buffer cache.  Larger values for some objects might mean a new index is required.  
  • Config (no schedule): Using this table you can configure certain AddIns to NOT run, or change their schedules.  
  • DeadlockSnapshot (every 4 hours): Shows all deadlocks that have occurred and tries to decode the deadlock XML to show the Victim and Survivor and what they were doing. Please note that there is a lag of AT MOST 8 hours between when a deadlock occurs and when it is decoded here. You can manually run the Deadlock AddIns to do the decoding at any time if you really need to see up-to-the-minute deadlock data.
  • IndexBlocking (every 4 hours): Logs the amount of time each index spends in a blocking/waiting state.  Useful to determine if better indexes may be needed.  
  • IndexesFromQueryPlanDtls (not yet implemented): This will read the missing indexes from the query plans instead of relying on the DMVs, which seems to be far more accurate. It will list the TotalCost (somewhat nebulous) as well as the table and index columns that the optimizer believes would be beneficial.
  • IndexesFromQueryPlansSummary (not yet implemented): An aggregate of the Dtls table that quickly shows the best indexes to be created.
  • IOStalls (every 7 hours): Lists the IOStalls metrics by database and filetype.  See more at Monitoring Disk Latencies with PerformanceCollector.
  • MissingIndexes (every 4 hours): Snaps the sys.dm_db_missing_index* DMVs. There is a better way to do this that looks just at cached query plans (see IndexesFromQueryPlanDtls), but even so this will show the most egregious examples of missing indexes.
  • PageLifeExpectancies (every 5 mins): Logs the PLEs for the instance.  
  • PerfCounters (every hour): Gathers instance-level perf counters in EAV format for aggregation later.  
  • PlanCacheMetrics (every 7 hours): Logs stored proc stats (reads/writes/CPU/duration) by metric.  By looking at procs that overlap the different metrics you can quickly determine what you should focus on when you are looking for performance problems.  The metrics are:
      -- Top Cached SPs By Execution Count 
      -- Top Cached SPs By Avg Elapsed Time
      -- Top Cached SPs By CPU
      -- Top Cached SPs By Total Logical Reads
      -- Top Cached SPs By Total Physical Reads
      -- Top Cached SPs By Total Logical Writes
  • PlanTypeStats (every 7 hours): Similar to AdHocPlanCacheBloat. This will show you the breakdown of adhoc vs prepared vs proc, etc. The USE_COUNT will show you if parameterization is working for you.
  • PossibleBadIndexes (every 4 hours): Logs cases where writes > reads. This may mean the index can be dropped, but you must understand your workload. If Total Reads is ZERO then you can generally drop that index, unless you have a year-end process or some other edge case where the index is only needed occasionally.
  • TempDBLogWatch (every 5 mins): Lists transactions and query plans that are using more than 50MB of tempdb log space.
  • TempDBUsage (every 5 mins): Shows spids and statements that are using more than 50MB of tempdb space.
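As referenced in the AdHocPlanCacheBloat entry above, 'optimize for ad hoc workloads' is a one-time, instance-level setting.  It is an advanced option, so it has to be surfaced first:

-- enable 'optimize for ad hoc workloads' at the instance level
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;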


PerformanceCollector.BlockAndWaitMonitor

This is Part 3 of my blog series on PerformanceCollector.  This utility will help you monitor and baseline the performance of your SQL Server.  You can download the source code here. It is totally open source and there is no copyright.  It follows the Oracle YAPP method of performance monitoring where we monitor our response times and try to improve them by eliminating unnecessary waiting on the server.  Today I am going to cover PerformanceCollector.BlockAndWaitMonitor which is a view that shows you any process that was waiting or blocking/blocked and the cause. This is snapshotted from various DMVs every 15 seconds. We maintain this data for 30 days.  Both the snapshot interval and data retention settings are configurable in PerformanceCollector.Config.  

The Typical Use Case

In general people will complain about application performance by saying, "The app was slow from 1:45 to 2:00 today.  Can you look at the database?" In that case start by querying this view and filtering where CurrTime BETWEEN '1:45' and '2:00'.

You are looking for:  

  • any blocking/blocked spids
  • any spids waiting and what they are waiting on.
  • sql statements/procedures with longer durations than normal (may be missing an index or stats need to be updated).  
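A minimal first-pass query for that scenario could look something like this.  It is just a sketch: aside from CurrTime and ServerName (described earlier in this series as the first columns of every table), the remaining columns depend on your version of the view, and the date values are placeholders.

-- everything the collector captured during the complaint window
DECLARE @start datetime = '20140401 13:45',
        @end   datetime = '20140401 14:00';

SELECT *
FROM PerformanceCollector.BlockAndWaitMonitor
WHERE CurrTime BETWEEN @start AND @end
ORDER BY CurrTime, ServerName;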

There are infinitely more things to look for that I can't possibly cover in a short blog post.  Experience and google are your friends here.  You'll likely find the cause of the problem if you look at the data and think logically.  

Baselining

It's difficult to baseline wait events and whether they are trending in the wrong direction.  However, you can do basic trending by looking for "canaries in the coal mine".  Every system I've ever worked on has had patterns of performance behavior that, when observed, were good indicators that your users were unhappy with performance.  An example is lock escalations.  When a given query is running in READ COMMITTED isolation and scans an index it runs the risk of locking the entire data structure because the fine-grained row locks are escalated to table locks to conserve resources.  When this happens lots of spids tend to be blocked behind the lock escalating spid.  When the lock escalating spid runs for more than a few seconds users will think your database is down.  So, this is a use case I baseline.  It is fuzzy logic but we know to react to these event chains quickly before the calls start coming in.  
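As an example of watching for that particular canary, a trend query along these lines could count lock-wait rows per day.  This is a sketch that assumes the view exposes a wait type and a blocking-session column; the actual column names in your copy may differ.

-- hypothetical column names (wait_type, blocking_session_id) used for illustration
SELECT CAST(CurrTime AS date) AS CollectionDay,
       COUNT(*)               AS BlockedOnLockRows
FROM PerformanceCollector.BlockAndWaitMonitor
WHERE wait_type LIKE 'LCK[_]M[_]%'         -- lock waits, the usual symptom of escalation
  AND blocking_session_id <> 0
GROUP BY CAST(CurrTime AS date)
ORDER BY CollectionDay;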

(sample output from PerformanceCollector.BlockAndWaitMonitor)

Manually Capturing a Spid

You can manually run EXEC PerformanceCollector.WaitAndBlockInspector at any time and it will log current waiting and blocking data. This may be useful if you notice your query is taking longer than it should. You can also pass a @spid parameter like this EXEC PerformanceCollector.WaitAndBlockInspector @spid = 100; and it will show you any waiting and blocking as well as any activity for the spid. This data is output to the screen and logged in the table.

Abandoned Spid Handling

BlockAndWaitMonitor will also show "abandoned spids." An abandoned spid is any spid with an open transaction for at least 60 seconds that is blocking something else. Here is the most important aspect...the spid must be sitting on AWAITING COMMAND. That is what differentiates an abandoned spid from an ordinary, run-of-the-mill blocking spid. When the spid is sitting on AWAITING COMMAND that means the SQL Server is waiting for another command from the spid to do something...most likely issue a COMMIT TRAN.

I have never seen a case where the spid will come back and issue the COMMIT.  Hopefully your developers are smart enough that they don't allow user input while a database transaction is open.  In every case an abandoned spid occurs because a data access framework thinks it is NOT in implicit_transaction mode and SQL Server thinks it is. For instance, various combinations of jdbc and websphere produce this behavior because jdbc does not issue sp_reset_connection commands like .NET does.

An abandoned spid is first logged after 60 seconds. After 5 minutes of being in an abandoned state, and after another spid has been blocked by it for 60 seconds, the spid is killed. It is logged to PerformanceCollector.SpidsKilled.
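For reference, the gist of detecting an abandoned spid can be expressed against the modern DMVs roughly like this.  It is a sketch of the idea, not the tool's actual logic, and it assumes SQL Server 2012+ for sys.dm_exec_sessions.open_transaction_count.

-- spids with an open transaction, sitting idle (the DMV equivalent of AWAITING COMMAND),
-- that are blocking at least one other spid
SELECT s.session_id,
       s.login_name,
       s.last_request_end_time,
       r.session_id AS blocked_session_id
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
  ON r.blocking_session_id = s.session_id
WHERE s.open_transaction_count > 0
  AND s.status = 'sleeping'
  AND s.last_request_end_time < DATEADD(SECOND, -60, GETDATE());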

You can of course disable this feature in PerformanceCollector.Config.  

Summary

BlockAndWaitMonitor is the most important aspect of the PerformanceCollector tool.  It will show you how the SQL Server subsystems (IO, Memory, lock manager, log manager, etc) are contributing to "waiting" on your SQL Server.  You will see more than just the SQL command that is blocking...you'll see an approximation of why it is blocking.  This is important.  Just because a procedure is long-running does not mean a new index will fix it, nor will a rewrite either.  You have to understand why it is behaving like it is and which subsystem is contributing.  Only then will you know what needs to be fixed.  

In the next post I'll cover some of the other features of Performance Collector.  



PerformanceCollector Part 2: How It Works

This is Part 2 of my blog series on PerformanceCollector.  This utility will help you monitor and baseline the performance of your SQL Server.  You can download the source code here. It is totally open source and there is no copyright.  It follows the Oracle YAPP method of performance monitoring where we monitor our response times and try to improve them by eliminating unnecessary waiting on the server.  

Overview

PerformanceCollector looks at various DMVs and at given intervals takes a snapshot of the most interesting DMV data that you can then query later.  It is implemented as two SQL Agent jobs.  

  1. PerformanceCollector BlockAndWaitMonitor:  runs in a "WHILE 1=1" loop and will restart every 5 mins if it is stopped. Within the loop, every 15 seconds (configurable in PerformanceCollector.Config) it collects blocking and waiting data into PerformanceCollector.BlockAndWaitMonitor (a sketch of this loop follows the list).
  2. PerformanceCollector.RunAddIns: runs every 5 mins to determine which AddIns are scheduled to run. This is discussed more in a future blog post.  The AddIns capture data at less frequent intervals.  You can write your own AddIns as needed.  
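To illustrate the shape of that collection loop, here is a simplified sketch (not the job's actual step code); the 15-second delay is the configurable default mentioned above.

-- simplified shape of the BlockAndWaitMonitor job step
WHILE 1 = 1
BEGIN
    -- snapshot current waiting/blocking into the table; in the real tool this work
    -- is done by PerformanceCollector.WaitAndBlockInspector
    EXEC PerformanceCollector.WaitAndBlockInspector;

    WAITFOR DELAY '00:00:15';   -- default interval, driven by PerformanceCollector.Config
END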

Installation

Initial deployment and upgrades are handled by the same PowerShell script: Deploy-PerformanceCollector.ps1. There is one parameter: $servername. It assumes you are running as an NT acct with sysadmin permissions.  Simply download the code and execute it as in the screenshot.  

It will install a new database for your instance called PerformanceCollector that contains all of the code.  It will also create the new SQL Agent jobs noted above.  The installer only takes a few seconds to create or upgrade the utility.  The installer uses a variant of my MD3 (Metadata Driven Database Deployments) utility to install or upgrade a database via scripts.  


In the next blog post in this series I'll cover the BlockAndWaitMonitor which is the core component of the tool and the go-to object when I am presented with a performance problem on a production system.  



PerformanceCollector: a performance monitoring tool for SQL Server

About 15 years ago I wrote a suite of performance monitoring scripts for SQL Server 7 that would try to tell you why a given query was running slowly.  Back in those days there was no concept of DMVs to help you out.  Most people I knew relied solely on sqltrace (Profiler).  The problem is, Profiler will tell you how long something ran, but it won't tell you which subsystem of SQL Server caused that long runtime.  Is it a memory issue?  Slow IO?  Blocking?  The only other information available was sysprocesses.  And it contained excellent information.  

I came from an Oracle background and in those days (Oracle8) there was no equivalent to Profiler.  If you wanted to look at your performance metrics you needed to query the sys and system schemas and the v$ views.  These are functionally equivalent to DMVs.  In fact, the folks at Oracle published a bunch of whitepapers on using these DMV-like tools and called it YAPP (Yet Another Performance Profiling Methodology).  YAPP was quite simply tuning based on the subsystems that caused long response times.  

I never realized how simple and elegant this approach to performance management was until I started working with SQL Server 6.5.  Other than sysprocesses, there was nothing else to tell you where the bottleneck was.  sqltrace was great for baselining, but given a big query with sporadic performance problems, where do you look next?  I actually started doing some independent learning about performance management, which I began to document.  

But how could I do YAPP in SQL Server, which had no publicly exposed wait interface?  Well, it turns out it did.  sysprocesses would show you the waittype, waittime, and waitresource for any spid that was waiting on some subsystem.  
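Even today you can see exactly the columns being described, since sysprocesses survives as a compatibility view:

-- who is waiting, and on what, via the legacy compatibility view
SELECT spid, status, blocked, waittype, waittime, lastwaittype, waitresource
FROM sys.sysprocesses
WHERE waittime > 0 OR blocked <> 0;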

I actually started this website in 1997 partially to document my findings with sysprocesses and ways to do YAPP-like analysis.  Subsystem Information Home, which now has lots of information on DMVs and response time tuning techniques, was initially a series of static pages on interesting, undocumented things I found.  Remember, there was no google yet.  

I began to see more performance engineers and DBAs relying on sysprocesses as their tool of choice for performance troubleshooting.  Slowly DBAs started complaining to M$ to expose even more information in sysprocesses and give us something more YAPP-like.  With the release of SQL Server 2005 we were given our first DMVs.  I was thrilled.  But there were so many DMVs and they were still not entirely documented.  I created a new section of my website, DMV Info where I started to document what these DMVs did and how they helped in response time and wait analysis.  

I certainly wasn't the only DBA doing this, nor even the best.  In the last 10 years a complete shift in how SQL Server DBAs do performance engineering has occurred.  Nowadays I don't see anyone using Profiler and everyone uses the DMVs.  How the times have changed.  

Anyway, my first attempt at a performance monitoring and baselining solution was a SQL Executive (now known as SQL Agent) job that merely took a snapshot of sysprocesses and squirrelled it away so I could look at it later when someone complained about poor performance.  It was invaluable.  I rewrote it with the advent of DMVs.  Slowly over time I was monitoring even more "stuff" as every new release of SQL Server gave us new and exciting DMVs.  

I found this so useful I was installing it at every client engagement.  I decided to release a version of it a few years ago and posted it on my blog under the name Performance Engineer-in-a-Box.  I've since installed it at even more client sites and I finally got around to properly releasing this in a manner where I can do controlled releases as needed.  I now call the tool "PerformanceCollector".  The source code is available at my GitLab installation and is open to the public.  Feel free to modify it, fork it, steal it, change it, or whatever you would like.  I really don't care and there is no form of copyright on it.  I would appreciate an attribution if you do use it, but no payment is necessary.  

Over the next few blog posts I'll cover in details what PerformanceCollector does and how it works.  



SSIS Subpackages and Version Control

I use subpackages in SSIS for just about everything.  But I've noticed that I'm in the minority on this.  Instead, I see these huge monolithic SSIS/DTS packages that attempt to do a lot of different things.  Many people have written much about why you should consider moving more logic from your main package(s) to subpackages.  I'll briefly rehash that here.  One reason for using subpackages I rarely see mentioned is that it really helps with version control.  

Common Reasons for Subpackages

  1. Separation of concerns.  It is best to keep your "extract" logic separate from your "transform" logic.  This is just good fundamental design.  If you would normally use a separate function or module in your code, consider using a new SSIS package. 
  2. It aids unit testing, which is already extremely difficult with SSIS.  

One Big Pitfall with Subpackages

Just like with any other modern programming language, subpackages/modules/functions have certain scoping and lifetime issues that you need to account for.  A child package has access to everything declared in the parent package.  The problem is that child packages cannot alter certain variables in parent packages.  Also, child packages cannot pass messages back to the parent other than a simple SUCCESS or FAILURE.  These issues are not very common, and with some creativity you can bypass them by storing "state" somewhere else instead of passing it around.  

Version Control

The most overlooked reason for using subpackages is version control.  SSIS is horrible when it comes to version control.  If you use any of the source control plug-ins for Visual Studio you'll immediately notice that even to VIEW the properties of a task requires the .dtsx file to be checked out.  How ridiculous is that?  And even simple changes like changing a default package variable value can result in many esoteric changes occurring to the underlying .dtsx file (which is XML).  So a visual diff is rendered worthless if you wanted to do something like a "svn blame" (determine who broke what, when, using a visual compare).  

SSIS Team Development Best Practices

  1. Avoid having multiple team members work on the same dtsx packages during a given release/sprint.  Due to the visual merge/diff issues just mentioned you WILL have a difficult time merging discrete changes to different branches.  Plan your development accordingly.  In fact, don't even bother to try to merge some changes to other branches.  It won't work.  For merging purposes, treat your .dtsx files as binary objects, not as source code.  
  2. Use BIDS Helper and its SmartDiff utility.  It will help you separate the key changes you have made from mere formatting changes and internal SSIS versioning noise within the XML.  I can't say enough good things about it.  
  3. Your developers should make small, discrete changes per commit and avoid mixing layout changes with functional changes.  You want to limit how much change is done per commit.  This is why subpackages are also so critical.  If you must go in and reorganize the control flow layout, just make sure you do that in a separate commit.  This makes finding logic bugs easier later.  
  4. Every package has various Version properties.  Depending on your build tool and your environment these may be auto-populated for you.  Regardless, I find it helps to populate these values whenever I make a change to a package.  If you automate this, or at least try to enforce this rule on your team, you have a nice version control mechanism embedded in the file.  I know that embedding versioning info in source code is generally frowned upon...we can get that information from the VCS repo...but I think it makes sense with dtsx XML files, which have the inherent versioning problems I've mentioned thus far in this blog post.  
  5. It's also interesting to note that if you use the project-based deployment mechanism (.ispac) using Integration Services Catalogs you get some additional versioning information that can be helpful when troubleshooting.  You can see this by right clicking your Project under Integration Services Catalogs and selecting "Versions...".  In the screenshot I can see that Version: 2.6.1.13 was deployed at '4/1/2014 at 10:38AM'.  


Getting SSIS Packages to Run Faster in Visual Studio

I thought this was commonly known but I find that quite a few SSIS developers do not know this.  SSIS packages are known to be horrendously slow when running in debug mode (F5) in Visual Studio.  It probably has a lot to do with all of the UI painting.  If you really don't need to see debugging messages (perhaps you have a good logging framework you can rely on instead) and are really just executing your packages for a smoke test or whatever, there is a faster way.  CTRL+F5.  (Start without Debugging from the Debug menu).  The package will execute in a cmd window using dtexec.  Radically better runtimes (YMMV but mine is an order of magnitude faster) and also far less system resource consumption (memory utilization especially).  

This is a huge timesaver for me.  

Post-restore Process

If you are involved in a lot of development work with a lot of environments then you probably get frustrated when you are given a new environment and it isn't set up correctly.  For instance, the CM guy (or DBA) forgets to sync the sql logins and db users.  Or, someone forgets to enable service broker.  Or, you use the CLR (sloppily, I might add) and your restored db isn't TRUSTWORTHY.  In all honesty, it's tough to blame the DBA or CM Guy.  We devs love to change our requirements constantly and turn on every whiz bang new feature that Microsoft gives us.  This causes the DBAs fits when creating repeatable environment builds.  I once consulted at a place that had a 3 page document on miscellaneous "tasks" that had to be performed in a newly restored database before it could be handed over as a non-prod environment.  

That's just crazy.  And it's difficult to maintain a document like that because at any given moment you could have multiple build documents based on how many releases and new features you need to support (like Service Broker).  Or you may have customers at various releases that need different processes executed.  

The solution is simple.  

  1. Create a PostRestore stored procedure.
  2. Compile it in your db.  
  3. Train your DBAs and CM guys to ALWAYS run this proc whenever a db is restored.  Any db.  

What are the benefits of doing this?  First of all, you can distill that 3 page restore manual into a simple stored proc call for them.  That's a VERY good example of DevOps.  Second, you have the ability to "version control" your db restore process.  If you decide you need to ENABLE_BROKER in the next release of your software you simply make that change to your PostRestore proc and check it in with that release's code.  Now your DBA/CM Guy doesn't even need to know this is a new "feature" you are enabling/deploying. 
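Here is a minimal sketch of what such a proc might contain.  The individual steps (Service Broker, TRUSTWORTHY, orphaned users) are just examples of the kinds of items that live in those 3 page documents, not any particular client's script, and CREATE OR ALTER assumes SQL Server 2016 SP1 or later.

CREATE OR ALTER PROCEDURE dbo.PostRestore
AS
BEGIN
    SET NOCOUNT ON;

    -- 1. Re-enable Service Broker if the restore left it disabled.
    IF EXISTS (SELECT 1 FROM sys.databases
               WHERE database_id = DB_ID() AND is_broker_enabled = 0)
        ALTER DATABASE CURRENT SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;

    -- 2. Mark the db TRUSTWORTHY (only if your CLR usage truly requires it).
    IF EXISTS (SELECT 1 FROM sys.databases
               WHERE database_id = DB_ID() AND is_trustworthy_on = 0)
        ALTER DATABASE CURRENT SET TRUSTWORTHY ON;

    -- 3. Re-map orphaned SQL users to server logins of the same name.
    DECLARE @fix nvarchar(max) = N'';
    SELECT @fix += N'ALTER USER ' + QUOTENAME(dp.name)
                 + N' WITH LOGIN = ' + QUOTENAME(sp.name) + N';'
    FROM sys.database_principals AS dp
    JOIN sys.server_principals  AS sp ON sp.name = dp.name
    WHERE dp.type = 'S'
      AND dp.sid <> sp.sid;
    EXEC (@fix);
END;

The DBA's entire post-restore checklist then becomes a single EXEC dbo.PostRestore in the newly restored database.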

It would be really great if Microsoft gave us a way to run this automatically whenever a db is restored.  I've tried server triggers and custom events and have not found a better way to do this, short of creating a SQL Agent job that runs every minute looking for a newly restored db and then running the command...and that seems insane.  Somehow this seems like it should be a part of a contained database.  


CHANGE_TRACKING_CURRENT_VERSION() not always reliable with Log Shipping

There may be a "bug" with CHANGE_TRACKING_CURRENT_VERSION() when using Change Tracking and log shipping.  The returned value will not be accurate if ALLOW_SNAPSHOT_ISOLATION is set to OFF.  Change Tracking is one of a million ways to query for data changes and then do interesting things with those changes...such as ETL.  The best article I've ever read on Change Tracking (CT) is by the folks at Solidq.  Here are parts 1, 2, and 3.  

With the ETL use case the process is to take the CT value from the last ETL run and use that as the basis for the new changes you want to see using CHANGETABLE (CHANGES blah, @last_synchronization_version).  The process Microsoft outlines specifically says that, "...we recommend that you use snapshot isolation. This will help to ensure consistency of change information and avoid race conditions..."  
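For reference, that ETL pattern looks roughly like this; dbo.Orders, its key OrderID, and the etl.SyncState bookkeeping table are hypothetical names used purely for illustration.

DECLARE @last_sync bigint, @current_version bigint;

-- the version we synchronized up to on the last run (stored wherever your ETL keeps state)
SELECT @last_sync = last_sync_version
FROM etl.SyncState
WHERE table_name = 'dbo.Orders';

-- the version we will synchronize up to on this run
SET @current_version = CHANGE_TRACKING_CURRENT_VERSION();

-- rows changed since the last run; deleted rows will not join back to the base table
SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION, o.*
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync) AS ct
LEFT JOIN dbo.Orders AS o
       ON o.OrderID = ct.OrderID;

-- remember where we left off for the next run
UPDATE etl.SyncState
SET last_sync_version = @current_version
WHERE table_name = 'dbo.Orders';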

That's the problem with that recommendation...not EVERY db in the world can easily be converted to snapshot isolation without a lot of code rewrite and regression testing.  

It's also common for your ETL to run on a "copy" of your OLTP prod system.  For instance, I've seen lots of customers run their ETL processes on readable secondaries, transactional replication subscribers, or log shipping destinations.  In these situations you want to be able to run your CT queries against your "replicas" to reduce load on your primary server.  Ah, but that may not work if you are not using ALLOW_SNAPSHOT_ISOLATION ON.  

I don't have the time to fully repro this for a blog post or a Microsoft Connect bug, but here are some screenshots of the issue and a possible solution.  

First, it's important to understand CHANGE_TRACKING_CURRENT_VERSION().  By definition it should always give you the same answer as SELECT MAX(commit_ts) from sys.dm_tran_commit_table, regardless of isolation levels.  Here is a screenshot of my prod system that has many active databases ... some with snapshot isolation ON, some with it OFF.  Note that the values are the same, regardless of whether snapshot isolation is ON or OFF.  There are cases where the values will be off slightly, because transactions may be committing between when the first and second queries run.  Note that the MAX() query can take a bit of time to run on a busy system, but I've never seen it cause blocking.  
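The two values being compared in those screenshots are simply:

-- these two should agree (give or take in-flight commits) on a writeable database
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS change_tracking_current_version;

SELECT MAX(commit_ts) AS max_commit_ts
FROM sys.dm_tran_commit_table;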

So on a "writeable" db, regardless of isolation level, CHANGE_TRACKING_CURRENT_VERSION() works great.  Here is the result from running the same queries on my log shipped replica.  We use this replica as the basis for ETL daily loads.  We only restore log backups around 3 AM daily, so during the day we can assume that the replica is out of sync because we have the log restores turned OFF.  

Note the wild discrepancy between CTCV() and the MAX query.  Knowing what you know about my setup, clearly the MAX query is right but CTCV() is not in cases where ALLOW_SNAPSHOT_ISOLATION is OFF.  

Why?  

When I first saw this symptom I had no idea what could possibly cause this.  We have about 50 dbs on these servers and each is configured slightly differently regarding snapshot isolation settings, ansi settings, Service Broker enabled, etc.  After an hour or so I realized the pattern was that the erroneous data only happened on our older "legacy" dbs where we can't/didn't ALLOW snapshot isolation.  

I double-checked the MS documentation for Change_Tracking_Current_Version() to see if I missed anything that would account for this anomaly.  Otherwise this would be a bug.  I found nothing. 

But I was interested in why this would happen.  We all know that we can't view the source code for system functions to see if we can spot the bug, but perhaps instead we could see what sys.dm_tran_commit_table is "looking at" under the covers.  Sometimes you can do sp_helptext on DMVs...and this is one of them.  Here's what you'll see:  

--sp_helptext 'sys.dm_tran_commit_table'
SELECT commit_ts, xdes_id, commit_lbn, commit_csn, commit_time
FROM sys.syscommittab (READCOMMITTEDLOCK)
UNION
SELECT commit_ts, xdes_id, commit_lbn, commit_csn, commit_time
FROM OpenRowset (table SYSCOMMITTABLE, db_id(), 0, 0)

That's an interesting little query, isn't it?  Frankly, I found myself doing all kinds of googling on what OpenRowset (table) does under the covers.  I definitely got a little shiny object syndrome looking at that code.  But that's another blog post.  Trust me, you can't run that query directly from SSMS, unfortunately.  

My thinking is that it is POSSIBLE that CTCV() is doing something like this under the covers and is getting confused and hence is returning the wrong information.  Again, I have no way of knowing without reviewing Microsoft source code...and that ain't happenin'.  

My suggestion is that if you are:

  • using Change Tracking
  • with Log Shipping
  • and are not using at least ALLOW_SNAPSHOT_ISOLATION

...then you should not rely on Change_Tracking_Current_Version() on your log-shipped copy to accurately determine the last committed row change that exists on your subscriber.  Use SELECT MAX(commit_ts) FROM sys.dm_tran_commit_table instead.

I hope this helps someone in the future.  

Prunes Analysis vs Fletcher's Castoria

Did you ever have somebody ask you:  

  • At what fragmentation level should I REORG vs REBUILD?
  • How many indexes should I have on my table?  
  • How often should I update my statistics?
  • Should I change my oil every 3000 miles or 5000?
  • What is the correct federal funds rate to ensure full employment and low inflation?  

I call these "Prunes Questions".  And I call the process of arriving at a satisfactory answer to these questions "Prunes Analysis".  I named this process after the famous Fletcher's Castoria commercial from the early '60s.  

Just like the commercial says...The problem with prunes is...if I'm constipated, is 3 enough?  Is 6 too many?  

This question haunted housewives and moms for years until the advent of Fletcher's Castoria in the mid-1800s.  Generally, the best laxative at the time was a good dose of prunes.  But how many do you eat?  If you don't eat enough, you're still constipated.  If you eat too many you'll have diarrhea.  

I know, lovely thought.  Fletcher's was supposed to eliminate this enigma.  You just give your kid the dose on the bottle and forget about it.  But of course, even with Castoria you might need to take a second and third dose.  

Where am I going with this?  People, especially IT people, like every question to have a firm (pun intended?) answer.  Given a choice between some prunes and a proper dose of Castoria, most people will take the Castoria every time.  People think of Castoria as the "known" and the prunes as the "unknown".  Known is viewed as being better.  

But not every problem has one "known" answer.  Sometimes "it depends" is a valid answer.  The questions I posited at the start of this post are examples of "prunes" questions.  Every person may have a slightly different answer and who is say who is right and who is wrong?  The answers are often difficult to prove.  In every case the "it depends" is referring to all of the variables that cannot be held constant.  

But most people hate being told "it depends" as the answer to their query.  They view the tergiversator (ambiguous respondent) as lacking knowledge or being standoffish.  But that's not always the case.  Answering "it depends" can be especially perilous if you are a consultant.  Clients like to know that their consultants have all of the answers.  

So, unless I'm trying to be purposely equivocal, my stock answer to these types of questions is, "Is 3 enough? Is 6 too many?".   This answer makes the questioner stop and think.  I'll then give my answer/opinion, given the knowns, without having to say "it depends."  It's usually good for a chuckle too, especially if the questioner is a bit older.  
