DaveWentzel.com            All Things Data

Data Architecture

The Paradoxes and Ironies of Automation

I have some side work where I help automate tedious, time-consuming operational DBA tasks.  If I strung these processes out serially it would take about 18 hours to do everything.  Management's goal is to get this to a totally lights-out process, which serially would take about 6 hours (certain tasks, such as copying TB-sized files across a WAN, can't be sped up with automation).  Total automation is a noble, albeit unattainable, goal.  Management does not totally agree with me.  Total automation is rarely a good idea.  I'm all for as much automation as possible, but there are serious side effects to automating EVERYTHING, or even attempting it.
Costs of Automation

Full automation does not make sense from a business perspective.  At the beginning of an automation project, the cost of automating the most egregious things is very cheap and the benefit is very high.  When you near the end of an automation project the cost is very high and the benefit is very low.  This is something akin to the 80/20 rule. The last 20% of the automation tasks will take 80% of your budget.  
If the software you are trying to automate needs frequent maintenance and enhancements then it's very likely that your automation work will need the same, or more, maintenance.  This costs money.  So how much do you automate?  As much as you can, and can afford, and nothing else.
Paradox of Automation
The Paradox of Automation is defined as:  
"The more efficient the automated system, the more essential the human contribution needed to run it.  Humans are less involved in heavily automated systems, but their involvement becomes more critical."
If an automated system experiences an ABEND, the system is often completely unavailable until the error can be identified and fixed.  Finding the error in an automated system is a much more labor-intensive, specialized operation.  Entry-level programmers probably won't be able to find the abend.  It will require senior people to find the errors, and those senior people will need more domain knowledge.  Let's not even mention the fact that many automated systems run but fail to properly trap errors.  In these cases we need to look for the conditions that are not right and try to determine why the automated system is failing.  This is really, really difficult to do.  And just because the system is running at this moment doesn't mean it's running correctly.

During an automation project you will often find that the manual operators are hostile to what you are doing.  "You are automating me out of a job."  That's rarely the case (unfortunately).  If the automated system experiences an error, its effects will multiply until it is fixed or shut down.  This makes the operator even more important and valuable.  Automation requires more specialized skills.  The operator also needs to be able to spot "the canary in the coal mine" and be proactive to avoid errors.  
Irony of Automation
There is a certain "irony" to a successful automation project.  The more reliable the automation project is, the less opportunity there will be for the operator to practice direct intervention, and the more difficult will be the demands of the remaining tasks requiring operator intervention.
I am a firm believer in the DevOps movement.  A core component of DevOps is automation.  However, you can't just discard your Human Monitors.  Otherwise the result is deskilled and novice operators who cannot fix a failed system.  Any automated system failure is used as evidence of human failure (whether the human operator or programmer) resulting in more effort being expended to automate the system even further!  
More fail-safes and redundancies are built into the system.  The increased automation now hides even more of the small errors that would've occurred and could've been solved easily with "canary in the coal mine" monitoring.  Now latent errors are hidden until the failure becomes even more catastrophic and difficult to resolve. 
The "Three Mile Island" incident in the 1970s is often used as an example of the Irony of Automation.  So much was automated that the reactor was in a critical state before alerting began.  At that point there was no operator experienced enough to know exactly what to do.  It is said that an inexperienced engineer began turning knobs and made the problem even worse.  It has been speculated that if the operator had just walked away and gotten some coffee, the system would've resolved itself.  In this case the problem wasn't over-automation, it was a lack of qualified people who understood the complex system.  


I recently turned down a consulting day because the customer insisted on using a NoSQL solution when I was convinced the requirements did not warrant it.
Background…  a biopharma company contacted me to help them set up their Drupal CMS.  This isn't what I normally do; I am more of a data architect.  But they had googled around, saw that I have experience with Drupal and that I'm local, and decided to reach out to see if I was interested.  They wanted to store FDA-regulated data in an HBase solution.  I insisted that I had no comparative advantage in either HBase or Drupal, but they REALLY wanted me to come in.  
So I did.  They wanted users to fill out their documents using a built-in WYSIWYG editor in Drupal and simply store the generated HTML as a big ol' unstructured BLOB document in the HBase solution.  I pointed out that, yes, Drupal (or any CMS) could store all of their HTML "data" as a schema-less blob.  This is actually extremely easy to do.  But the real question is: is that the right thing to do?
We all learned years ago that keeping our raw data separate from the presentation markup is a very wise thing to do.  Using things like CSS we can change the presentation layer very easily at a later date as requirements change without losing any data fidelity.  Why would we want to take a step backward and violate this basic principle?
As I said, this is also FDA-regulated data, and government data is highly structured.  It would make sense to maintain that structure somehow.  I suggested, at a minimum, having users complete a series of metadata questions that would allow the data to more easily meet the government requirements without having to do a bunch of MapReduce queries later.  The customer was hell-bent against using any structure.  They wanted their HBase schema-less solution at any cost.
We had an hour-long conversation about why I really felt this was a bad idea.  I tried everything I could think of to talk them out of doing this.  Although Drupal allows everything to be unstructured, clearly some structure would aid taxonomy searches, yielding a faster and more accurate system.  This is in fact a CMS Best Practice.
I even tried to convince the customer that unstructured data is a contradiction in terms.  Data has to have some kind of structure, otherwise it's just bits and bytes.  Random bits are not data.  Data has meaning.  Meaning has structure.  They weren't buying any of my hyperbole.  I even tried to convince them that HBase does actually have data "structure" behind it because database management, at its core, is nothing but manipulation of some data structure.  They still weren't buying it.  
I even asked how many BigData engineers they had on staff that could architect and engineer the solution.  (They were hoping I would help out with that as well.)  
At this point I realized I was dealing with another BigData Icarus Project (the second one in as many weeks).  They insisted on using a schema-less solution even though there was no requirement, functional or otherwise, to do so.  I kindly thanked them for inviting me into their company and said that I thought they could do much better than contract with me.  
BigData BigProblems, LittleData LittleProblems.  

BigData Icarus Project

I was invited to work on another BigData project that was doomed to failure.  I turned it down.  But not because it was using a NoSQL solution.  It was an "Icarus Project".  
The project's requirements were very simple:  Be able to store and retrieve any tiny bit of patient health data ever generated during the lifetime of a patient.  This data would be used not only to improve patient health outcomes but as a mechanism for the provider to spot "trends".  The data must be able to be queried quickly and read activity cannot conflict with the real-time streaming of new data from the source systems.  By "any tiny bit of data", I mean absolute minutiae, such as individual heartbeats.
Wow, that's a lot of data.  
I have very limited BigData experience but I definitely agreed that a typical RDBMS solution was probably not suited for this task.  Frankly, I think all of the project's functional requirements are a little bizarre and nebulous, but after we started talking through the requirements further I realized this project was doomed to failure.
As I started to talk about adding a little bit of structure to the data, I was immediately told that this MUST be a schema-less solution.  We absolutely could not have any structure to the data.  Why?  Certainly some of the data has structure.  I was finally told the real requirement of this project.
This was to be the company's first foray into BigData.  Now we were getting somewhere.  The company was losing software contracts to other shops with BigData backend implementations.  This had the telltale signs of an Icarus Project.  
An "Icarus Project" is the term I use to describe the situation where managers or executives initiate an overly ambitious project which then fails.  Fed by excitement for the project, the executives are unable to rein in their misguided enthusiasm before it is too late to avoid failure.  In Greek mythology, Icarus and his father, Daedalus, were imprisoned by King Minos.  Daedalus constructed wings from wax and feathers for them to use to escape by flying.  Daedalus warned his son not to fly too close to the sun.  Icarus was overcome with the excitement of flying and ignored his father's warning.  As he approached the sun the wax melted and Icarus fell to his death.
An Icarus Project is most often seen when companies plow into new technologies that they have no experience with.  As they spend more and more money to try and catch up to companies already dominant with the new technology, they use up the cash reserves built up by their profitable products.  In the past, in the realm of data, the biggest Icarus projects were around data warehousing.  That's changed in the last two years.  Now the biggest Icarus Projects seem to be around BigData.

What Goes Around Comes Around

At the turn of the millennium, when it was clear that web architecture was the future and was going to displace client-server programming as the de facto mechanism for future applications, it occurred to me that, indeed, what goes around comes back around again.  The first computer I ever had was a 286 and every application was stand-alone.  What little networking existed was via modem and simple BBS applications.  When I got to college I maintained various VAX/VMS and Ultrix-based "dumb terminal" applications.  People hated them because of the esoteric commands needed to be productive.  To send an email required the user to construct a command-line string similar to smtp%"email@domain.com".  Please remember that I was dealing with freshman college kids with little to no experience with networked computers, not to mention Luddite college professors who believed all innovation in "word processing" stopped with the advent of the IBM Selectric II typewriter.  I had little "tricks" I taught the professors to make things easier for them to remember.  Example:  "SMTP" was "Send Mail To Postman".  Even that was too hard to remember!  

"I typed smtm% just like you taught me and it's saying 'Invalid command'."

"I'm sorry you are having problems Professor Mankiewicz.  It's actually smtp, send mail to postman, not smtm.  Why would you think it's smtm?"

"I thought it was 'send mail to mailbox'.  Sorry about that."

"Is there anything else I can do for you today doctor?  No?  OK sir, see you in Econ 101 tomorrow.  Bye."  


I digress.  

My first "real" job out of college was as a programmer on an IBM mainframe TSO system.  TSO stands for Time Sharing Option.  It's quite simple: a user can log in to the system and is unaware that other users are currently interacting with it.  This is no different than a Unix login over telnet.  Around the same time TSO was introduced it was becoming common for universities to "rent" time on their computers to interested folks who could develop their own programs.  You've probably heard of people like young Bill Gates who "rented" off-hours, unused cycles from colleges and companies that had these expensive computers.  

Of course the concept of "renting" computing resources went out the window with the advent of technologies like the web and VMWare, right?  

Wrong!  Renting off-hours, unused cycles is again en vogue with the advent of the cloud.  


As I've mentioned in a few recent blog posts, I need to research and prototype a number of "NoSQL" solutions.  I could of course build these in-house, but some of these products are extremely difficult to set up and install unless you hold a PhD in CompSci.  It's much easier to provide Amazon with a credit card and use Amazon AWS.  After an hour and less than a hundred bucks I had a running Riak system to play with.  Very cool.  But AWS is not really time sharing.  Or so I thought, until I researched all of the AWS offerings a bit more.  

AWS and EC2

On the AWS website is a link to powerof60.com.  This explains Amazon's EC2 offering.  EC2 is an acronym for Elastic Compute Cloud.  With this offering you purchase processing cycles just like you purchase electricity.  You pay for as much as you use, and you pay more during peak hours.  Services are available just like a utility.  This sounds a lot like time-sharing to me.  Amazon has lots of case studies where companies (even NASA) have occasional need for lots of processing power.  Amazon will let you spin up the resources as you need them, when you need them, without the up-front capital expense of purchasing your own hardware and software (not to mention labor).  If you can do your processing off-hours you pay less.  You can even have Amazon spin up your processing for just a few minutes (off-peak) to save even more $$$.  


Although I can't find this documented by the Azure folks, I just heard that Windows Azure is going to a per-minute pricing strategy in the future.  Apparently you can power down your VMs when (if) your developers leave the office at 5pm and you won't be billed again until you power them back up the next morning.  

What goes around comes around again.  


Schema-less solutions may need a little more schema after all

A few months ago my company renegotiated its SQL Server licensing agreement with Microsoft.  We got a really terrible deal.  Because of that our data architects have been tasked with looking at alternatives to SQL Server.  We've had a few vendors pop in and give us brief demos.  Our most recent visitor was a vendor for a commercial version of MongoDB.  I'm much less jaded about NoSQL/BigData/schema-less database management systems than my co-workers (some of whom would NEVER EVER consider using one of these solutions).  I have seen a few use cases where it would be worthwhile to consider a proof-of-concept of a NoSQL solution over a standard RDBMS. 

I digress.  

The vendor started out his demonstration by saying Mongo was a pure "schema-less" solution.  He proceeded to persist a JSON document in the system without first declaring a "table", a "key", an "index", or anything else remotely representing a "schema" in the traditional RDBMS-sense of the word.  He then went on for a few minutes spouting hyperbole about how this is a more natural process for Java guys than having to deal with a RDBMS with its object/relational impedance mismatch.  No need for even the limited structure that an ORM would provide.  MongoDB simply persists a document object just like a Java Jock would create it.  The presenter noted how Mongo assigned an "_id" to our document.  He showed us how this was basically the equivalent of a "Primary Key" that Mongo handled for you for free.  Every document must have this.  He finished his demo by writing some quick "queries" to retrieve the JSON and display it to the screen.  

So far, hyperbole aside, I thought his demo was intellectually honest.  I did ask him whether his "_id" attribute wasn't actually a Primary Key, and thus "schema" after all.  He agreed, but said that was the minimal schema that Mongo needed to operate.  

He proceeded with his demo, showing how to "shard" the data across nodes.  This is where the demo got interesting.  He wanted to shard the data by a "region" attribute in all of his documents.  He had three nodes and three possible values of "region" across all of his documents.  

I interrupted him.  "Excuse me, but your original document only contained the attributes _id and name.  There is no region element, so which shard will contain that data?"  

"For the demo we won't actually shard the original documents I inserted, we'll blow them away and use some properly-prepared documents that have the necessary region information."  

I had him.  "So, are you saying that if we wanted to maintain those original JSON documents we would need to perform an update and somehow assign them a region attribute?"  

"That is correct."  

"So, are you saying that the region element is a required element of all documents if we want to shard by that data element?"

"That is correct."

"So, doesn't that mean that there is a defined 'schema' necessary in MongoDB?  Basically, Mongo does REQUIRE at least some semblance of schema to work as designed when introducing sharding, replication, and redundancy.  Correct?"

At this point I thought I saw him roll his eyes.  "Well, yes, if you want to implement sharding correctly so it performs.  But we could've just sharded on a hash of _id and we wouldn't need any schema at all." 

"Interesting.  So what you are saying is that if you want a properly-performant system then, yes, some amount of up-front planning is kinda required.  The fact is, this Mongo solution has not really eliminated any of the logical data modeling activities that should be occurring on any system.  Yes, Mongo does not require data typing at the column/property/attribute level, does not require foreign keys, does not require domain constraints, and does not require nullability checks, but then again, tables in SQL Server, Oracle, and DB2 do not require these either.  The only thing I can see that Mongo has eliminated is the most basic 'CREATE TABLE blah' command that would need to be run on a standard RDBMS first.  I've worked on RDBMS-based systems that had absolutely no PKs or FKs defined; this was 'handled' in the Java tier.  The RDBMS was simply a dumb persistence mechanism.  In all honesty these systems ran fabulously well in many cases and customers were thrilled.  

Problems only really manifested themselves when the schema had to evolve due to new requirements.  This is where the lack of up-front schema planning (logical data modeling if you will) meant lots and lots of rework to get the new requirements shoe-horned in.  And I've never had the luxury, like you just did, of simply discarding my old data that did not fit the new schema requirement and 'start over'.  In each case the data had to be migrated to the new schema, and that was difficult and painful.  But it needed to be done. "  

Crickets could be heard.  

The vendor's eyes were fully glazed over.  I allowed him to collect his thoughts and continue with his demo without further interruption from me.  Other people pointed out the patently obvious problems with NoSQL solutions, such as the lack of good report writers (Crystal Reports, etc. won't work with these solutions because they only understand "SQL" and Mongo doesn't have a SQL-to-cURL bridge or SQL-to-MapReduce bridge).  The vendor assured us that Cognos and Microstrategy were actively working on reporting solutions and would have something soon.  

The fact is, there are lots of good uses for NoSQL products.  But I think it is disingenuous to say that your product is "schema-less" when in fact a good "schema" is the only way to make your product work correctly.  There is no substitute for good logical data modeling and designing good schemas.  No substitute.  When people ask me during interviews what is my least favorite thing about being a data architect I always say, without hesitation, "logical data modeling."  I've worked on so many systems (and designed even more unfortunately) where the schemas were just flat out wrong.  It was too difficult/risky to change them and performance and customer satisfaction always suffered.  For those reasons I'm always fearful I'll design something that will not work.  

Too many people want to short-circuit the logical design phase of a product.  This is usually because they never had a good education on the subject.  Instead, they rely on product-specific knowledge of modeling.  In other words, they only know about modeling what they learned in a book about their particular RDBMS.  Or, in the case of Java Jocks, they don't want to be constrained at all by any structure.  It's not "sexy".  Code is sexy.  JSON is sexy.  "Schema-less" is sexy.  This is terribly short-sighted.  We are unlearning best practices that have evolved over the last 40 years.  Again, there are good uses for many NoSQL/BigData products (eventual consistency is AWESOME if you need it), but eliminating schema and design just isn't one of them.  

They Should Call Them Evil Spools

In my last post, SCHEMABINDING Performance Considerations, I mentioned problems with Eager Spools.  I've written about these many times.  They really should be called Evil Spools.  To be clear, they do serve an important purpose, but they can often be eliminated without detrimental data quality side effects.  

This is just a really quick follow up to that last post (here) that covers some other Eager Spool conditions and methods to remove them from query plans.  Most of this I've written about before.  

What is an Eager Spool? 

An Eager Spool:
  • gets a "read-consistent" view of the data, solving infinite-loop and "Halloween Protection" problems with UPDATE statements. 
  • is also required (for Halloween Protection) for:
    • INSERT statements where the target table is also referenced in the SELECT statement.  
    • DELETE statements with self-join relationships.  
    • certain CTEs (which is one reason why CTE performance is usually worse than many of the alternatives).  I avoid CTEs whenever possible because of this.  
    • recursive CTEs (unfortunate, because recursive CTEs really are very elegant, but Halloween Protection is needed here to get the read-consistent view of the data).  
    • many "hole-filling queries"...I got this term from an excellent blog series Paul White did on the Halloween Problem.  He does a fabulous job on all of this.  More on "hole-filling queries" in a moment...
  • is a "blocking" operator: that "branch" of the query serializes behind the Spool, and the Eager Spool MUST be fully satisfied before continuing.  An operator is non-blocking if it can take a single row and pass it to the next operator (ex: Nested Loops).  An operator is blocking if the next pipeline step cannot occur until the entire operator is satisfied (ex: Hash Match...probing cannot occur before the entire hash is built).
  • uses tempdb.
  • (usually) uses rebinds, which are worse than rewinds.
  • since it is a blocking operator, takes more locks and longer-duration locks.  



Hole-Filling Queries

A hole-filling query is where you need to INSERT into a table where the key does not already exist.  The implementation is usually something like this:  

INSERT INTO #TempTable...WHERE NOT EXISTS (select ObjId FROM #TempTable)

The query plan will include an Eager Spool to ensure we have a read-consistent view of the data.  These can be fixed by using a LEFT JOIN and a WHERE ObjId IS NULL clause.  
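The before-and-after can be sketched like this; the table and column names are illustrative, not from the original demo:

```sql
-- NOT EXISTS version: the plan gets an Eager Spool for Halloween Protection
-- because the INSERT target is also read in the SELECT
INSERT INTO dbo.TargetTable (ObjId, SomeVal)
SELECT s.ObjId, s.SomeVal
FROM dbo.SourceTable s
WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable t WHERE t.ObjId = s.ObjId);

-- LEFT JOIN rewrite: same semantics, but the optimizer can usually
-- satisfy it without the spool
INSERT INTO dbo.TargetTable (ObjId, SomeVal)
SELECT s.ObjId, s.SomeVal
FROM dbo.SourceTable s
LEFT JOIN dbo.TargetTable t ON t.ObjId = s.ObjId
WHERE t.ObjId IS NULL;
```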
A MERGE statement is far more elegant, though (h/t to Paul White for pointing this out in his blog post).  Two caveats:  
  • the WHEN NOT MATCHED BY TARGET clause *must exactly* match the USING/ON clause.  
  • the target must have *a* unique key (anything will do).  
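A minimal MERGE sketch of the same hole-fill (names again illustrative, with the ON predicate mirrored exactly by the not-matched branch):

```sql
-- MERGE hole-fill: assumes the target has a unique key on ObjId; the
-- WHEN NOT MATCHED BY TARGET branch inserts exactly the rows the
-- USING/ON predicate failed to match
MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.ObjId = s.ObjId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ObjId, SomeVal) VALUES (s.ObjId, s.SomeVal);
```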

Here is an example of a query plan that performed a hole-filling query:  


That's a very expensive Eager Spool.  

I rewrote the query using a LEFT JOIN and look at the difference: 

The wall-clock time improvement was just as staggering.  

SCHEMABINDING Performance Considerations

If you use T-SQL UDFs (and in many cases you really shouldn't), make sure you use WITH SCHEMABINDING.  The performance improvement is STAGGERING.  Let's look at an example.  Here we have a typical scalar UDF: it takes a series of inputs and nicely formats a telephone number.  Very simple.

For the demo I have 2 versions of this function...one WITH SCHEMABINDING and one without.  Here is the demo file.  And here are the functions I will use for the demo.  
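The demo files aren't reproduced here, but the pair of functions probably looked something like this; the names and signatures are my reconstruction, not the originals:

```sql
-- Unbound version: SQL Server must assume it accesses system and user data
CREATE FUNCTION dbo.FormatPhone (@Area char(3), @Exchange char(3), @Line char(4))
RETURNS varchar(14)
AS
BEGIN
    RETURN '(' + @Area + ') ' + @Exchange + '-' + @Line;
END;
GO

-- Schemabound version: identical logic, one extra clause
CREATE FUNCTION dbo.FormatPhone_SB (@Area char(3), @Exchange char(3), @Line char(4))
RETURNS varchar(14)
WITH SCHEMABINDING
AS
BEGIN
    RETURN '(' + @Area + ') ' + @Exchange + '-' + @Line;
END;
GO
```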

After looking at the UDF it is clear that it does NOT actually access any data in our database.  We built a new database from scratch and have not created a single table, so that should be obvious.  

Whenever you create a UDF, two metadata properties are flagged: whether or not the function accesses system data and whether it accesses user data.  Here's where it gets interesting.  If we don't specify SCHEMABINDING then SQL Server marks our function as accessing both system and user data.  That is basically the DEFAULT.  
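You can inspect both flags yourself with OBJECTPROPERTYEX; the function names below are a hypothetical unbound/schemabound pair, not the demo's actual names:

```sql
-- 1 = the engine assumes the function accesses that class of data.
-- An unbound scalar UDF reports 1 for both flags even though it touches
-- no data at all; the schemabound version reports 0 for both.
SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.FormatPhone'),    'SystemDataAccess') AS SysAccess_Unbound,
       OBJECTPROPERTYEX(OBJECT_ID('dbo.FormatPhone'),    'UserDataAccess')   AS UserAccess_Unbound,
       OBJECTPROPERTYEX(OBJECT_ID('dbo.FormatPhone_SB'), 'SystemDataAccess') AS SysAccess_Bound,
       OBJECTPROPERTYEX(OBJECT_ID('dbo.FormatPhone_SB'), 'UserDataAccess')   AS UserAccess_Bound;
```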

Why does SQL Server do this?  There is, I guess, no way for SQL Server to know that anything referenced by the function has not changed since the function was created.  SQL Server would need to do this check on EVERY invocation causing a big performance hit.  So the safe thing to do is to mark the function as accessing system and user data.  You can think of this as marking the UDF as UNSAFE.  Again, I'm assuming this is the rationale the folks at Microsoft used. 

Whenever the values are set to 1 then any query plan referencing the UDF will have an extra "Eager Spool" operator (I've written about these lots of times) inserted around the UDF if it is accessed as part of an INSERT, UPDATE, or DELETE statement.  Why?  "Halloween Protection".

What is Halloween Protection?

I've written about Halloween Protection before.  In its simplest form, the Halloween Problem occurs when we read and write the keys of a common index.  Think of an Employee table with an index on Salary.  The requirement is to give everyone with a salary > 20000 a 20% raise.  
UPDATE Employee SET Salary = Salary * 1.20 WHERE Salary > 20000
The index on Salary will be used to find the rows to be updated.  But the process of updating a row's Salary will cause the index key to migrate to a new location which could cause it to be re-read and re-updated, even causing an infinite loop situation.  Halloween Protection is the process (actually defined in the ANSI SQL standard) where a read-consistent view of the updateable keys is squirreled away in temporary storage and then used as the replacement for the WHERE clause condition.  
The process of building a read-consistent view of the data is implemented in SQL Server as an "Eager Spool".  These are a performance nightmare and should be eliminated whenever possible.  
How is this relevant to unbound scalar UDFs?
If a UDF is not schemabound then SQL Server must assume that the UDF may actually change data referenced in the outer query's FROM clause, meaning that we see an Eager Spool in our query plan.  
Continuing the Example
Let's build a PhoneNo table with columns that will map to our scalar UDF.  Let's also throw some junk data into the table.  
Just running a simple SELECT will not give us an Eager Spool.  Nothing is changing so we don't need to worry about it.  A standard shared lock during the read (assuming you are running in read committed mode) is good enough.  
But let's assume we need to INSERT the output of our scalar UDF call into a temp table, possibly for further processing in a stored procedure.  The example code below makes 4 different calls to our 4 different scalar functions.  In this screenshot we call the unbound function.  Note the Eager Spool.  
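A sketch of the setup; the table shape, row count, and FormatPhone function names are stand-ins for the demo's actual files:

```sql
-- A table whose columns map to the scalar UDF's parameters, with junk data
CREATE TABLE dbo.PhoneNo (Area char(3), Exchange char(3), Line char(4));

INSERT INTO dbo.PhoneNo (Area, Exchange, Line)
SELECT TOP (100000)
       '610', '555',
       RIGHT('0000' + CAST(ABS(CHECKSUM(NEWID())) % 10000 AS varchar(4)), 4)
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

-- Unbound UDF inside an INSERT: the plan carries an Eager Spool
SELECT dbo.FormatPhone(Area, Exchange, Line) AS Phone
INTO #Unbound
FROM dbo.PhoneNo;

-- Schemabound UDF, same statement shape: no spool
SELECT dbo.FormatPhone_SB(Area, Exchange, Line) AS Phone
INTO #Bound
FROM dbo.PhoneNo;
```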
This makes sense given that the query engine must handle Halloween Protection.  It only costs 4% of the total query plan, but this is a simple example (and please hold that "only 4%" thought for a moment).  Now let's run the schemabound function:  
Cool.  No Eager Spool is needed this time, which makes sense.  
OK, but the Eager Spool was only 4% of the query plan.  No big deal.  
If we run the two tests again, but run DBCC FREEPROCCACHE first, we can see what our performance numbers look like: 
TEST 1 is the plan with the Eager Spool.  That is incredibly poor performance, due simply to forgetting to schemabind your function.  
In the example files (here and here) I also have calls to scalar UDFs that access user data.  SCHEMABINDING these UDFs doesn't yield much of a performance improvement.  In fact, in some cases the performance is EXACTLY the same, almost indicating that SCHEMABINDING makes no difference on scalar UDFs that access persisted table data.  Other times performance is slightly better.  On very rare occasions I do see staggering improvements, but it is rare.  I'm not sure what the pattern is...I'm still researching.  
Ultimately, the correct solution is to *not* use scalar UDFs unless you absolutely need to.  Performance is ALWAYS better using an alternative such as inlining the function logic or using a stored procedure.  YMMV.  
My next post will be a deeper dive into Eager Spools.  

ALTER SCHEMA TRANSFER for Zero Downtime Database Upgrades

Our software has a Zero Downtime requirement for database upgrades.  In the past it took us hours to upgrade our databases from one release of our software to the next.  After doing some basic process-improvement stuff (like taking log backups instead of full backups after taking the system down) we next worked on making sure our index changes were able to utilize the ONLINE=ON option.  We could even refactor/change the normalization of tables and keep the data in sync using triggers and Service Broker.  There were a few things we just couldn't seem to do while the application was online.  This blog post is how, I think, you can perform just about ANY database refactoring while your application is online.  Without incurring a noticeable performance penalty or affecting concurrency.  

But first, some background.  Our software/database is deployed at about 50 customers, ranging in size from ~100GB to 13TB.  Each customer can be, unfortunately, on a different version of the software at any point in time, and some customers even run multiple versions of the software.  The database must be upgradeable from any version to the "current".  The software and databases can be hosted at the customer site or in our own "cloud".  I've written in the past about how we reduced downtime when migrating from SQL 2005 to SQL 2008.  I also wrote about my database porting process numerous times and how it handles upgrading any version of any database to the "current" build.  Here specifically I compare and contrast it with Data Tier Applications.  All of these things got us "reduced" downtime, but not "zero" downtime.  I did a write-up a number of years ago about a skunkworks project I led where we tried to get to the elusive "zero" downtime.  

Using the method I'm about to outline we finally got to "zero" downtime.  Quick definition of "zero"...it's actually not "zero" downtime, it's "near-zero".  We actually require our application and database to be quiesced for about 10 seconds while we "true-up" the system and make the switch.  In theory we could use this technique for total zero downtime database upgrades if we spent a little more time on "rolling upgrades" of our Java tier, but the cost outweighed the benefits.  It turns out 10 seconds (or so) of downtime was a good enough requirement.  


On to the approach.  We decided to call this process "Darwin" since it takes the current database and "evolves" it under the covers.  I am a firm believer in evolutionary database design.  The Darwin approach is to:

  • Use a separate "Stage" schema to create your n+1 release database code changes.  
  • Use a mechanism such as replication, triggers, or Service Broker to keep the data synchronized if a table's schema is changing.  
  • Do a "check-out" of some kind to ensure that Darwin is in-place correctly and data is "evolving".  
  • When you are ready to switch over simply turn off your data synchronization mechanism and issue ALTER SCHEMA TRANSFER commands to transfer out the old versions of your code to a "Deprecated" schema and then transfer your "Stage" schema to your production schemas.  The ALTER SCHEMA TRANSFER is a near-instantaneous metadata operation.  
  • Rerun your permission (GRANT) scripts.  These are the only objects that are not carried-forward with ALTER SCHEMA TRANSFER.  
  • If anything goes wrong run the reverse ALTER SCHEMA TRANSFER commands and your system is totally reverted without having to resort to a database RESTORE.  

Real World Example

For the example I show next we will assume that our production schema is "dbo" and our data synchronization mechanism is simple transactional replication.  The problem is we have a table called dbo.BatchTot with a series of columns holding monetary data, defined as decimal(9,4).  The requirements have changed and now those columns must be declared as decimal(19,4).  The table holds ~2 billion rows at our largest customer.  

If you attempt to run a command like "ALTER TABLE dbo.BatchTot ALTER COLUMN Blah decimal(19,4) NOT NULL" where Blah is currently a decimal(9,4), a "size of data" operation occurs and the ENTIRE table must be rewritten to accommodate the larger datatype.  While that occurs the table is serialized (Sch-M locks) and you risk running out of transaction log space.  

Many of our architects proposed using what I call the "deprecation model" to accomplish this.  In essence:

  • Add new cols (called BlahNEW, for instance) to hold the values from the original columns.  
  • Put a trigger on the table to keep the cols in sync.  
  • Write a little SQL that causes every row to be "updated" and fire the trigger.  
  • Change all code to use the BlahNEW cols.  
  • Drop the Blah cols.  
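The deprecation model above might look something like this sketch (column and key names are hypothetical, and the trigger is simplified):

```sql
-- 1. Add the new, wider column alongside the old one
ALTER TABLE dbo.BatchTot ADD BlahNEW decimal(19,4) NULL;
GO

-- 2. Trigger keeps old and new columns in sync on every write
CREATE TRIGGER dbo.tr_BatchTot_SyncBlah ON dbo.BatchTot
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE b
    SET b.BlahNEW = b.Blah
    FROM dbo.BatchTot b
    JOIN inserted i ON i.BatchTotId = b.BatchTotId;
END;
GO

-- 3. Backfill in small batches so the trigger populates BlahNEW;
--    repeat until no rows are affected
UPDATE TOP (10000) dbo.BatchTot
SET Blah = Blah            -- no-op update just to fire the trigger
WHERE BlahNEW IS NULL;

-- 4. Repoint all application code at BlahNEW, and eventually:
-- ALTER TABLE dbo.BatchTot DROP COLUMN Blah;
```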

I hate the deprecation model because it requires you to change all of your references from Blah to BlahNEW.  In a large application this is really hard to test.  But the real issue is a "people" problem.  As deadlines slip we of course decide that we can reduce testing to stay on schedule.  That's a mistake.  Or we decide that there is no time to write the code that drops the old cols, so they stay there forever, along with the deprecation code in the Java tier.  

It's just a bad pattern in my opinion.  

So we decided to use Darwin.  


These steps run while the system is up and do not impact online users.  First, we create our Stage and Deprecated schemas.  Stage is where our new objects will go.  Deprecated is where the old dbo objects will be placed when we switch over to the new version of the software.  
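The schema setup is just two statements:

```sql
-- Stage holds the n+1 version of the objects; Deprecated will
-- receive the current (dbo) objects at switchover time.
CREATE SCHEMA Stage AUTHORIZATION dbo;
GO
CREATE SCHEMA Deprecated AUTHORIZATION dbo;
GO
```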


Now we will create all of our n+1 version objects in the database.  Remember, these are applied to the Stage schema.  For simplicity I really just care about BatchTot and the new length of my columns.  
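For this example the only interesting change is the wider monetary columns (the column names below are hypothetical; only BatchTot is shown):

```sql
-- The n+1 version of the table lives in Stage until switchover
CREATE TABLE Stage.BatchTot (
    BatchTotId  bigint        NOT NULL PRIMARY KEY,
    BatchAmt    decimal(19,4) NOT NULL,  -- was decimal(9,4) in dbo.BatchTot
    FeeAmt      decimal(19,4) NOT NULL   -- was decimal(9,4) in dbo.BatchTot
);
```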


Now we need a mechanism to keep the data synchronized.  My choice for this is transactional replication.  So we create a new publication with an article of dbo.BatchTot.  The subscriber is the same database.  The destination is simply Stage.BatchTot.  The sync_type should be 'database snapshot' so that snapshotting the table holds locks on dbo.BatchTot for the shortest possible duration.  
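A hedged sketch of that setup follows; the publication and database names are hypothetical and the exact parameters will vary with your topology (the database publishes to itself, with the article redirected to the Stage schema):

```sql
-- Enable the database for publishing
EXEC sp_replicationdboption @dbname = N'MyDb', @optname = N'publish', @value = N'true';

EXEC sp_addpublication @publication = N'DarwinBatchTot',
    @repl_freq = N'continuous', @status = N'active';

-- Publish dbo.BatchTot, but land it in Stage.BatchTot
EXEC sp_addarticle @publication = N'DarwinBatchTot',
    @article = N'BatchTot',
    @source_owner = N'dbo', @source_object = N'BatchTot',
    @destination_owner = N'Stage', @destination_table = N'BatchTot';

-- Subscriber is the same server/database; 'database snapshot'
-- minimizes lock duration on dbo.BatchTot during initialization
EXEC sp_addsubscription @publication = N'DarwinBatchTot',
    @subscriber = @@SERVERNAME, @destination_db = N'MyDb',
    @sync_type = N'database snapshot';
```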

Once replication is synchronized you can do a "check-out" of your new application if desired.  You are done with the pre-downtime steps.  The beauty is that if anything goes wrong we have not impacted our production users of the current version of our software.  Replication adds only a slight performance penalty, but you should of course determine what is best for your needs.  



When you are ready for your downtime you simply need to quiesce your system.  This means that you have no active transactions and the distribution agents have completed.  We are "truing-up" the system.  

Now we run two ALTER SCHEMA TRANSFER commands.  The first moves the current version objects to the Deprecated schema from dbo.  The second moves the Stage schema objects to dbo.  This process moves all "dependent" objects as well to the destination schema.  By "dependent" I mean any indexes, constraints, and keys.  It does not, however, move permissions.  Those are "lost" during the transfer, so you simply need to reapply them.  I would assume this is a conscious decision by Microsoft to ensure we don't introduce any gaping security holes.  
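The switchover itself is one TRANSFER per object (shown here for BatchTot only; the role name in the GRANT is hypothetical):

```sql
BEGIN TRAN;
    -- Move the current production object out of the way...
    ALTER SCHEMA Deprecated TRANSFER dbo.BatchTot;
    -- ...and promote the staged n+1 version into production
    ALTER SCHEMA dbo TRANSFER Stage.BatchTot;
COMMIT;

-- Permissions are NOT carried forward by the transfer; reapply them
GRANT SELECT, INSERT, UPDATE, DELETE ON dbo.BatchTot TO MyAppRole;
```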

As mentioned above we now also need to teardown replication.  

Next Steps

You can also remove the Deprecated and Stage schemas whenever it is convenient.  I usually keep them around "just in case" for a few days.  

The ALTER SCHEMA TRANSFER process never seems to take us longer than 7 seconds, even for large tables or Stage schemas with many objects.  I assume under-the-covers that Microsoft has implemented this as a "metadata-only" operation.  


I have only scratched the surface of what can be done with "Darwin" and ALTER SCHEMA TRANSFER.  I believe this is an extremely simple method to get to near-zero downtime database upgrades.  It is not as costly as Oracle's GoldenGate and is conceptually simple to understand for most people.  It even has a built-in "rollback" mechanism.  I hope this blog post was interesting and helps you in the future.  

TSQL Unit Testing Tool Comparisons


I once had a manager who liked to say, "I want to see checked-in code with tests that prove it doesn't work."  Very Yogi Berra-esque.  

I'm a big proponent of unit testing my TSQL code.  The questions become

  • What tools are available for the data professional?  
  • Which tool is the "best"?  

Tool Availability

First off, there are a lot of unit testing tools that handle database testing, but are not necessarily TSQL unit testing tools.  An example would be DbFit.  This tool allows your Fit/Fitnesse tests to actually execute against your database.  This is cool, and valuable, but really doesn't help me unit test a stored procedure.  Another example is psunit.  This tool allows you to write unit tests for your PoSH code.  I've seen lots of people, myself included, use psunit as their basic assertion framework to test tsql.  Again, this is cool, but it isn't a tool designed specifically for testing tsql.  Frankly, any tool that follows the "Arrange/Act/Assert" model can be retrofitted to call a database and unit test your SQL.  

So, for this blog post, I'm not going to include unit testing tools that only peripherally touch the database; I only want to compare unit testing tools that specifically focus on TSQL.  A very strong argument can be made that db unit tests should be done in something like DbFit so that the db unit tests are stored and executed with the Java tests and code.  This is exactly how DataDude handles this.  I understand the rationale, I am simply questioning whether this is always the best approach.  In many organizations the db experts are not the Java experts, so keeping the tests separate makes a lot of sense in some circumstances too.  Further, I want the best tsql unit testing tool, not the best unit testing tool for my development environment that also handles tsql.  

What Makes a Good TSQL/database TDD Tool

These are just my opinions:  

  • adheres to the AAA (Arrange, Act, Assert) Model for unit testing
  • Has a robust assertion engine that is specific to the needs of a data developer.  For example, support for asserting that a result set matches the expected result set, down to the column ordinals, names, and datatypes is important.  
  • Does not require injecting special assertion or debugging code into your stored procedure.  We don't need that stuff in our production code.  
  • Mocking data should be simple.  
  • Mocking views and stored procs so they return a given output for testing.  
  • has support for automatically setting up (Arranging) and tearing down test data without being destructive to existing data in the database.  So basically I don't need to backup/restore a test database to run my tests.  (Never run unit tests on a production system).  
  • Shouldn't require you to learn another language or tool, such as PoSH to write or run tests.  It should just work in SSMS.  
  • Generates a standard output that is human-readable and automatable so other CI Tools can call it.  Bonus points if it can return HTML/XML so I can plug the output into a build automation tool's standard reporting.  
  • Good documentation and tutorials.  Most data professionals have no idea how to start with TDD and unit testing their TSQL code.  (You could probably make the same case for Java Jocks too).  

Evaluating The Available Tools

The only two tools I've found that are specifically unit testing tools for tsql and stored procedures are TSQLUnit and TSQLT.  

| Feature | TSQLUnit | TSQLT |
| --- | --- | --- |
| Open Source (so it's easy to look at the source code and extend it for your needs) | Yes | Mostly.  The table assertion functionality is in CLR code that is not open source.  The rest of the product is written entirely in tsql.  You are using tsql to test tsql. |
| Mocks for data | Yes | Yes.  tsqlt.FakeTable |
| Stored proc and view mocking | No | Yes.  tsqlt.SpyProcedure.  This is very ingenious. |
| Has the ability to "arrange" and "cleanup" test data | Yes | Yes |
| Can compare result sets | Yes | Yes |
| Test output to XML/HTML | No | Yes |
| Tests can be written and executed entirely in SSMS | No, requires python | Yes.  There is even a plugin helper available from RedGate. |
| Support from industry vendors | Not that I'm aware of | RedGate's release of SQLTest is fully supported |
| Documentation | Kinda sparse | Lots of documentation and tutorials |
| Support (and on-going viability) | Doesn't seem to have much new development lately | The two authors actively respond to bug and feature requests.  (Full disclosure: I was a co-worker with one of the authors.) |
| Deployment | Test code is in the dbo schema with your production objects | Uses separate schemas that map closely to test classes, just like Java.  They are only deployed if your tests are run.  If you follow the test class creation pattern in their tutorial you can run tests and have no remaining testing artifacts in your database. |


Clearly I think the world of TSQLT.  Even if you do not do TDD or don't understand unit testing, you really should download the tsqlt code and peruse it.  I know that I learned quite a bit about transact-sql by looking through their code.  It's just totally ingenious.  If you are a database guy and are not familiar with TDD or unit tests then try using it when you write your next stored proc.  Your unit tests will likely be amateurish but your code will be much easier to maintain and will have fewer bugs.  In a few years you'll look back and laugh at your early unit tests.  That's ok, it's a journey.  
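To give a flavor of what a tSQLt test looks like, here is a minimal sketch (the proc under test, its parameters, and the table columns are hypothetical):

```sql
-- Create a test class (a schema under the covers)
EXEC tSQLt.NewTestClass 'BatchTotTests';
GO
CREATE PROCEDURE BatchTotTests.[test usp_GetBatchTotal sums amounts]
AS
BEGIN
    -- Arrange: swap the real table for an empty fake, then seed it
    EXEC tSQLt.FakeTable 'dbo.BatchTot';
    INSERT INTO dbo.BatchTot (BatchId, BatchAmt) VALUES (1, 10.5), (1, 4.5);

    -- Act
    DECLARE @total decimal(19,4);
    EXEC dbo.usp_GetBatchTotal @BatchId = 1, @Total = @total OUTPUT;

    -- Assert
    EXEC tSQLt.AssertEquals @Expected = 15.0, @Actual = @total;
END;
GO
EXEC tSQLt.Run 'BatchTotTests';
```

Everything above, including the mocking and the assertion, is plain tsql you can run in SSMS.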

Autonomous Transactions

IMO the most obnoxious shortcoming of Transact-SQL (after maybe the goofy error-handling that won't allow you to rethrow any and every error) is the lack of native autonomous transactions.  I wrote about this here and here.  I've only ever found 2 good reasons for autonomous transactions, but they are HUGE:  

  1. The ability to ROLLBACK a long-running transaction but maintain any error logging for later analysis.  The workaround, and it has a lot of obnoxious limitations, is to log to a @TableVariable and persist that to an actual table after you have done your ROLLBACK.  
  2. The ability to spawn off a separate transaction to get more keys from a central keystore.  In SQL 2012 we have SEQUENCEs, which function the same way; unfortunately, not all ORMs and custom-built in-house code will be able to use them, at least not right away.  
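The workaround for Number 1 can be sketched as follows; it works because table variables are not affected by ROLLBACK, so the captured log rows survive (the target ErrorLog table is hypothetical):

```sql
-- Table variables survive a ROLLBACK; real tables do not
DECLARE @ErrorLog TABLE (LoggedAt datetime2 DEFAULT SYSDATETIME(), Msg nvarchar(2048));

BEGIN TRAN;
BEGIN TRY
    -- ... long-running work ...
    RAISERROR('something went wrong', 16, 1);
END TRY
BEGIN CATCH
    INSERT INTO @ErrorLog (Msg) VALUES (ERROR_MESSAGE());
    ROLLBACK;   -- undoes the work, but @ErrorLog keeps its rows
END CATCH;

-- Persist the captured errors outside the rolled-back transaction
INSERT INTO dbo.ErrorLog (LoggedAt, Msg)
SELECT LoggedAt, Msg FROM @ErrorLog;
```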

The problem with Number 2 is that we can't write stored procedures that spawn the separate transaction needed to get more keys.  Obviously there are more uses for autonomous transactions, but let's continue the demo.  In this screenshot I have a call to a key generator from within a transaction.  I ask for 10 keys and I can see my LastUsedObjId has been incremented to 645 as expected.  

Now let's assume that call is a long-running query.  Assume that another similar call is made from another connection/transaction while the first spid is executing.  What will happen?  Blocking...


OK.  Let's issue a ROLLBACK on both connections.  It turns out that SQL 2008 and above can get you an autonomous transaction if you are OK with using a Loopback linked server.  Here I create a loopback linked server and ensure that RPC OUT is set.  The magic is 'remote proc transaction promotion'.  This setting ensures that MSDTC does not enlist any calls to this linked server into a distributed transaction.  Essentially, it just spawned off an autonomous transaction for you.  
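The linked server setup looks roughly like this (the server name is hypothetical):

```sql
-- Loopback linked server pointing at this same instance
DECLARE @self sysname = @@SERVERNAME;
EXEC sp_addlinkedserver @server = N'LOOPBACK',
    @srvproduct = N'', @provider = N'SQLNCLI', @datasrc = @self;

-- Required so we can EXEC stored procs over the linked server
EXEC sp_serveroption @server = N'LOOPBACK',
    @optname = N'rpc out', @optvalue = N'true';

-- The magic: calls over this linked server are NOT promoted to a
-- distributed transaction, so they run as autonomous transactions
EXEC sp_serveroption @server = N'LOOPBACK',
    @optname = N'remote proc transaction promotion', @optvalue = N'false';
```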

So, let's demo this.  All we need to do is alter our original code slightly to call the linked server.  Here is how I do it:  
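The altered call looks roughly like this (the key-generator proc name and its parameters are hypothetical):

```sql
-- Build the four-part name so nothing is hard-coded per environment
DECLARE @proc nvarchar(400) =
    N'LOOPBACK.' + QUOTENAME(DB_NAME()) + N'.dbo.usp_GetNextObjIds';

BEGIN TRAN;
    DECLARE @FirstId bigint;
    -- Runs in its own (autonomous) transaction via the loopback server,
    -- so other spids asking for keys are not blocked by this tran
    EXEC @proc @HowMany = 10, @FirstId = @FirstId OUTPUT;
    -- ... use the reserved keys inside the outer transaction ...
COMMIT;
```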

Notice I assemble the four-part name of the stored proc call in a variable.  I have nothing "hard-coded" with this...it will work in a test, prod, or dev environment, which aids supportability.  Note that my LastUsedObjId was 655 and it was incremented correctly to 665.  Now if I open a new window/spid and paste the code and execute it:  

Note that the spid is NOT blocked this time.  That's because the call to the sequence generator was made in an autonomous transaction.  Note that the new spid correctly saw the first spid's updated LastUsedObjId and properly incremented its own LastUsedObjId. 

That's what we wanted.  

