Zero Downtime Initiative Skunkworks Project

Everybody claims they need five 9's of availability for their systems.  Frankly, I doubt it.  A global ecommerce operation like Amazon certainly does, but most systems that claim to need zero downtime can really tolerate some downtime without losing a nickel of revenue.  A few years ago I worked on an accounting application where the initial requirement was 100% uptime.  I chuckled...by the end of a few meetings the requirement had changed to allow up to one week of downtime.  Wow, big difference, and it meant huge savings in hardware for things like clustered HADR solutions, warm spares, SAN replication, etc.  As a side note, when SarbOx was passed this changed.  Yes, we could still tolerate one week of downtime, unless the downtime occurred during the accelerated reporting period at quarter end.  During that time publicly traded companies, according to SarbOx, must provide accurate, timely, complete information for shareholders.  The CEO and CFO are legally responsible for financial disclosure reports during the accelerated reporting period, and an outage during that time could (we thought at the time) mean penalties including jail time.  So, we were back to five 9's.

Still, some systems do legitimately have zero/reduced-downtime requirements, and not just from a regulatory perspective; I've worked on those systems too.  For instance, HIPAA does not require five 9's of uptime, but there are requirements around EMR (Electronic Medical Records) that state, very simply, that you must ensure the availability of data.  HIPAA does define "availability": information must be accessible and usable upon demand, within 5 days of a request in an emergency.

How do you upgrade a database with zero downtime?

That brings me to "Zero Downtime Initiatives".  For this post I don't care about HADR solutions; we all know about those.  I'm more concerned with how to upgrade your database without incurring downtime.  Frankly, I find this fascinating.  How exactly does Amazon upgrade its system when it releases a new version?  Sure, releasing the web code is simple: pull machines out of the farm, upgrade them, and insert them back into the farm.  (Yes, I know upgrades to caching technologies, etc. may make that harder; I'm just generalizing here.)  But how do you do database upgrades when, theoretically, you may only have one big database (or, in Amazon's case, a few that each handle particular purposes)?

Background

Assume a hospital database.  If I take the database offline for an upgrade, the interfaces (pharmacy, scheduling, physician order entry, and really any type of "automation" or asynchronous processing) back up and messages go unprocessed.  That could be a problem, so traditionally we have tried to separate the database upgrade process into stages: those that can be performed while the system is "live" and not down, and those that cannot.  We didn't get too cute here for "Version 1" of the Reduced Downtime Initiative..."live" system upgrades simply meant adding "new" features to the database that did not affect existing features.  This included:

  • New tables
  • New stored procs/views
  • Adding NULLable columns to tables (we don't allow SELECT * or INSERT statements without column lists, so this is safe; see the sketch after this list)
  • Certain additional indexes
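
To make that concrete, here is a minimal sketch of a "live-safe" change script.  The table and procedure names are hypothetical stand-ins, not our actual schema:

    -- Adding a NULLable column with no default is a metadata-only change
    -- in SQL Server, so it does not rewrite the table or hold a long
    -- blocking lock while the system is live.
    ALTER TABLE dbo.PatientOrder
        ADD DiscontinueReason varchar(200) NULL;
    GO

    -- Brand-new objects are invisible to the old application code,
    -- so they are also safe to create on a live system.
    CREATE PROCEDURE dbo.GetDiscontinuedOrders
        @PatientID int
    AS
    BEGIN
        SET NOCOUNT ON;
        SELECT OrderID, DiscontinueReason
        FROM dbo.PatientOrder
        WHERE PatientID = @PatientID
          AND DiscontinueReason IS NOT NULL;
    END
    GO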

It did not include items that would impact the "live" system (these changes required downtime; an example follows the list):

  • changing a column's datatype
  • certain index operations that would cause blocking/deadlocking
  • changing a stored procedure's inputs or outputs (hence changing its data contract)
  • data modification scripts

Version 2

Here is where we did start to get cute.  For Version 2 we decided to experiment with locking mechanisms and versioning metadata to make the upgrades work.  I don't want to spill all of the beans, but here is an example of what we did.  Assume we needed to radically change the inputs/outputs of a stored procedure (change its data contract) for Version 2.  Here is how we handled it:

  1. During the pre-downtime phase of the upgrade we installed a new version of the stored procedure that wrapped the old procedure's logic in a big switch statement.  A new NULLable parameter called @Version was added.  The new version of the application passed @Version; the old version did not, and that is how we overcame the logic differences.  (A sketch follows this list.)
  2. A lot of thought went into how to handle the "interim" period when both versions of a stored procedure were actively processing data.  In many cases we wrote simple data migration scripts that ran post-upgrade to generate the missing data.  Data defaults were always the easiest solution.
  3. A post-upgrade script was run that removed the giant switch statement from the stored proc, since the old code path was no longer needed.  Why do this?  For procedure plan cache optimization.
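
Here is a stripped-down sketch of the pattern.  The procedure, parameters, and table are hypothetical; our real procedures were considerably bigger:

    -- Installed during the pre-downtime phase.  The old application calls
    -- this procedure without @Version; the new application passes @Version = 2.
    ALTER PROCEDURE dbo.SaveOrderStatus
        @OrderID    int,
        @StatusCode char(1),
        @Version    tinyint = NULL   -- NULLable, so old callers need no change
    AS
    BEGIN
        SET NOCOUNT ON;

        IF @Version IS NULL
        BEGIN
            -- Old code path: the original logic, untouched.
            UPDATE dbo.PatientOrder
            SET StatusCode = @StatusCode
            WHERE OrderID = @OrderID;
        END
        ELSE
        BEGIN
            -- New code path: the new data contract, here also stamping an
            -- audit column that the new schema added (hypothetical example).
            UPDATE dbo.PatientOrder
            SET StatusCode = @StatusCode,
                LastChangedUtc = GETUTCDATE()
            WHERE OrderID = @OrderID;
        END
    END
    GO

    -- Post-upgrade data migration (step 2): backfill rows written by the
    -- old code path during the interim period.  A default was the easiest answer.
    UPDATE dbo.PatientOrder
    SET LastChangedUtc = GETUTCDATE()
    WHERE LastChangedUtc IS NULL;

The cleanup in step 3 was then just another ALTER PROCEDURE that kept only the new code path, so the optimizer wasn't compiling and caching plans around a branch that could never execute again.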

In reality, Version 2 ended up being common sense more than anything else.  A careful analysis of the order of events in an upgrade showed that upgrading a database actually required almost zero downtime, if we thought through each of these issues carefully.  Operational discipline is important: everything must be checklisted and done in the correct order.

Version 3

We put together a committee to research Version 3, which we defined as using 3rd party solutions to overcome the remaining downtime.  Why use 3rd party solutions?  We knew whatever downtime remained during upgrades was downtime we couldn't easily eliminate without introducing a lot of risk.  It was smarter to look at tools others were already using for the same purposes; these tools would not be specific to our application or industry.  Some examples of what we researched:

  1. RDBMS-specific solutions...for SQL Server this meant log shipping, database mirroring, and 3rd-mirror SAN solutions.  We were not positive how any of these solutions might help us meet our goals, but we decided that thoroughly researching and understanding these technologies was important to our decisions.  For instance, perhaps all read-only activity could continue unabated while writes were routed to a transaction log and replayed later, after the database upgrade.  Yes, I understand transactional consistency would be a concern, but our focus was interfaces that were handled much like message queuing technologies, where this would not be an issue.  (A sketch of this idea follows the list.)
  2. When migrating to new hardware or performing SQL Server version upgrades, could we leverage log shipping or active/active clustering to accomplish what we needed?  For instance, upgrading SQL versions on a cluster prior to SQL 2008 potentially required a lot of downtime.  I wrote about a way to get around this [[Upgrading a SQL Server 2005 Cluster to 2008|here]].  That method should no longer be needed with SQL 2008, where MS has nicely given us a "one node cluster" method specifically to overcome the upgrade problem.
  3. We already have various HADR solutions in place for business continuity planning; can we leverage these somehow?  Specifically, could we get the business units to agree that some data latency for reads was acceptable, and then simply read off of our existing HADR solutions?
  4. We researched very specialized 3rd party data integration tools, such as GoldenGate (since purchased by Oracle).  Our GoldenGate POC was very, very promising.  GoldenGate essentially captures data changes at a source and synchronizes a destination, performing basic ETL in the middle if needed.  GoldenGate was not marketed as a solution for reduced-downtime initiatives; it was meant for BI solutions with zero-latency requirements.  Same basic concept, though.
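
To make item 1 a little more concrete, here is roughly how we pictured the write-buffering idea.  This is a sketch only, with a hypothetical queue table we never actually built; the real interfaces would use proper message queuing:

    -- During the upgrade window, interface writes land in a queue table
    -- (or a real message queue) instead of the live tables.
    CREATE TABLE dbo.UpgradeWriteQueue
    (
        QueueID     int IDENTITY(1,1) PRIMARY KEY,
        ReceivedUtc datetime NOT NULL DEFAULT GETUTCDATE(),
        MessageType varchar(50) NOT NULL,
        Payload     xml NOT NULL    -- the original interface message
    );
    GO

    -- After the upgrade, a replay job drains the queue in arrival order,
    -- calling the normal (now upgraded) procedures for each message,
    -- then deleting the processed rows.
    SELECT QueueID, MessageType, Payload
    FROM dbo.UpgradeWriteQueue
    ORDER BY QueueID;

Reads stay on the live (or mirrored) copy the whole time, and because the interfaces were already asynchronous, the idea was that nothing downstream should be able to tell the difference between this and an ordinary busy day.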