
Data Architecture

TSQL Autonomous Transactions or Another Reason I Like PLSQL over TSQL

Since I hold both an MCDBA and an OCP, I often get asked which RDBMS I like better: Oracle or SQL Server.  The stock answer is "it depends" (what a cop-out).  But one thing is certain: PL/SQL is light years ahead of TSQL.  Some of my complaints with TSQL:

  • terrible error handling.  Yes I know all about TRY/CATCH.  I also know that there are still plenty of error conditions that occur before TRY/CATCH is invoked.
  • When assigning potentially multiple values to a variable, no error is thrown.  Consider this example:

      DECLARE @variable varchar(200)

      SELECT @variable = Name
      FROM
            (VALUES
                  (1,'Dave Wentzel'),
                  (1,'Angie Wentzel')
            ) AS tbl (ID,Name)
      WHERE ID = 1

      -- no error is raised: @variable simply holds whichever qualifying row was returned last
      SELECT @variable

If you've never seen the goofy VALUES syntax, it is row constructor syntax.  Don't get caught up in the weeds; the point is that the variable is simply set to the last value returned.  In this example it will likely be Angie Wentzel every time, but that can't be guaranteed, and it should generate an error.  Oracle handles this with the TOO_MANY_ROWS exception (or OTHERS at a minimum).  We should not be allowed to silently assign a non-scalar query result to a scalar variable (a workaround is shown after this list).

  • No autonomous transactions.  You can also file this under inconsistent temporary object behavior.
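
As promised above, the workaround for the multi-row assignment complaint is to assign through a scalar subquery rather than the SELECT @variable = Name form.  When the subquery returns more than one row, SQL Server raises error 512 ("Subquery returned more than 1 value..."), which is exactly the behavior I wish the assignment form had:

      DECLARE @variable varchar(200)

      -- fails at runtime with error 512 because two rows qualify
      SET @variable = (SELECT Name
                       FROM (VALUES
                                   (1,'Dave Wentzel'),
                                   (1,'Angie Wentzel')
                            ) AS tbl (ID,Name)
                       WHERE ID = 1)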

Oracle has these things called PRAGMAs, which is a fancy term for a "compiler directive".  In Oracle they are often misused and misunderstood.  There's really only one good use for them...logging error messages.  Assume a long-running, multi-step batch process running in a single transaction, where I want to log status messages as I go along.  Now assume that near the end of the process it fails and I need to ROLLBACK.  I can of course catch the error and write out informational messages, but I cannot persist anything to a logging table, especially those intermediate steps, because the transaction that rolls back will also roll back the logging.  Darn.  In Oracle we can write our logging routine with PRAGMA AUTONOMOUS_TRANSACTION, which essentially means that routine does not participate in any existing transaction.  Problem solved.
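
Here is a minimal sketch of such a logging routine in PL/SQL; the procedure and table names (log_message, process_log) are just illustrative:

      CREATE OR REPLACE PROCEDURE log_message (p_msg IN VARCHAR2)
      IS
            PRAGMA AUTONOMOUS_TRANSACTION;   -- this routine runs in its own transaction
      BEGIN
            INSERT INTO process_log (logged_at, msg)
            VALUES (SYSDATE, p_msg);
            COMMIT;   -- commits only the autonomous transaction;
                      -- the caller's transaction is untouched and can still ROLLBACK
      END log_message;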

For the longest time TSQL had no equivalent...until the introduction of table variables.  This "feature" is not well documented.  You can in fact do this:

      DECLARE @log TABLE (LogID int IDENTITY(1,1), Msg varchar(200))

      BEGIN TRANSACTION

      INSERT INTO @log (Msg) VALUES ('step 1 complete')
      INSERT INTO @log (Msg) VALUES ('step 2 failed, rolling back')

      ROLLBACK TRANSACTION

      -- both rows are still here: the table variable did not participate in the transaction
      SELECT * FROM @log

Well, that's pretty cool and solves our logging problem, but if you don't fully understand that behavior and continue to reuse a table variable after a ROLLBACK, you will get unexpected results.

It also violates the ACID properties of a transaction.  Some argue that a table variable isn't really part of the database, that it is merely a temporary object, but then why do temp tables, which also live in tempdb, honor the ROLLBACK?  TSQL needs more consistency here.
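
To see the inconsistency side by side (the table names here are arbitrary):

      CREATE TABLE #log (Msg varchar(200))
      DECLARE @log TABLE (Msg varchar(200))

      BEGIN TRANSACTION
      INSERT INTO #log (Msg) VALUES ('hello')
      INSERT INTO @log (Msg) VALUES ('hello')
      ROLLBACK TRANSACTION

      SELECT COUNT(*) FROM #log   -- 0: the temp table honored the ROLLBACK
      SELECT COUNT(*) FROM @log   -- 1: the table variable did not

      DROP TABLE #log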

What is this NoSQL thing?

You've probably seen NoSQL being bandied about on the web and in the IT press.  Whenever we see a "new" technology trend get repeated press coverage, we know we need to bone up on that hot new thing.  First of all, there is an actual RDBMS named NoSQL that runs on Unix, but that isn't what is being hyped right now.  The "NoSQL" being hyped is really just a concept: that we can store data without using an RDBMS, the SQL language, or even relational modeling!

What is it?

Doing Database CIT via TFS

Continuous Integration Testing (CIT) is an agile concept where you are constantly testing your code check-ins to ensure they build, compile, and pass their unit tests.  Notification of failures in smaller shops is usually just an email to an SCM, who then tracks down what broke using the version control system and notifies that person.  In larger shops a special CIT machine will sit in the scrum room with a big red or green light indicating whether the builds are passing CIT.  When it goes red, someone usually steps up to determine the cause.

In most shops this CIT concept is a no-brainer for the application code (tools like Apache Ant and CruiseControl really help with the automation and scheduling), but it is rarely done for the database scripts.  One reason is that most applications have a database build process that doesn't lend itself to CIT-style processing easily (for instance, requiring db-level object diffs to determine what needs to be applied).  My database build process is more of an evolutionary process, so by definition it lends itself to CIT and change quite nicely.

This post is the next in a series showing useful TFS tips for the data professional.  I'll cover how I created a very simple script that handles database CIT using TFS as the source control repository.  You could also plug in a different VCS quite easily.  It uses vbscript and batch files, but PowerShell or a compiled application would work just as well; the concept is the important thing.

I have the actual files available here for download.  Here is the process:

A controlling batch file (called RunCIT.bat) is scheduled using AT.exe or a similar scheduler; this handles the automation component.  Let's look at some of the interesting points of the batch file...

Lines 1 - 14...the batch file has a series of parameters that tell it which server to run the db scripts against, user/pwd, location of the logs, and the TFS connection string.  

It then builds a loop (line 19) and checks "branches.txt" (a CSV file) for the list of branches that we want to run CIT against.  In our environment we have at least 5 active code/release branches at any time, so I want to run CIT against all of them.  Our SCM can change this file whenever a branch is created or deprecated.

As a side note, if you've ever tried to do looping in a batch file, it's a real PITA.  If you've ever tried to expand your DOS variables within that loop you know it's almost impossible.  I found a nifty workaround to make this happen using "SETLOCAL ENABLEDELAYEDEXPANSION" and alternative variable naming schemes.  See Lines 3 and 27 for examples. 

Within the loop we then run a series of tf.exe commands to Get Latest for the source code for that branch.  This means that your build machine needs to have Visual Studio loaded on it (unless someone knows how to get the tf.exe commands to run without it).  

The scripts for the given branch are then executed, first against a "net new" database and then against a "ported" (upgraded) database; each is set up via manipulate_db.v-b-s.  I've found that naming a vbscript file as anything other than .vbs and then calling it with cscript/wscript is an easy way to bypass UAC prompts on newer Windows OSs.

Why run CIT against a "new" and "upgraded" database?

Unlike C# code, where I simply need to test that it compiles and my unit tests succeed, a db build is a little more tricky because I need to be concerned about data history and upgrading data.  If I'm changing a UNIQUE INDEX from ColA,ColB to just ColA, I can't just CREATE INDEX WITH DROP_EXISTING, since it's possible an existing db will have violations.  I need to test for that.  If I'm adding ColC to a table, I need to remember to add it to the CREATE TABLE statement (for new databases) and write an appropriate ALTER for upgraded dbs.  In each case the "new" and "upgraded" databases take different codepaths through the sql scripts.
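
For the UNIQUE INDEX example, the pre-check against the upgraded db is as simple as looking for duplicates on the narrower key (dbo.SomeTable and ColA are illustrative names):

      -- any rows returned mean the narrower unique index cannot be created as-is
      SELECT ColA, COUNT(*) AS DupeCount
      FROM dbo.SomeTable
      GROUP BY ColA
      HAVING COUNT(*) > 1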

So your upgraded or "ported" db should probably be a customer db with some substantial data.  You can even run against multiple upgraded dbs, perhaps one from each version of your software that you may have to upgrade. 



But the key is Line 73, GenerateDiscrepancyReport.vbs.  This compares the netnew vs the ported db and reports on missing (or schematically different) tables, views, and columns; index discrepancies (different cluster keys or columns); and it will even report data differences in any system or master data tables that you ship with your db.  This is handled via 2 stored procs (here and here).
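
I won't reproduce the stored procs here, but the flavor of the schema comparison is roughly this (the database names are illustrative):

      -- columns that exist in the net-new db but are missing from the ported db
      SELECT n.TABLE_NAME, n.COLUMN_NAME
      FROM NetNewDB.INFORMATION_SCHEMA.COLUMNS AS n
      LEFT JOIN PortedDB.INFORMATION_SCHEMA.COLUMNS AS p
            ON  p.TABLE_NAME  = n.TABLE_NAME
            AND p.COLUMN_NAME = n.COLUMN_NAME
      WHERE p.COLUMN_NAME IS NULL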

Email is sent on success and failure with the logs attached. 
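
The notification can come from anywhere; if you'd rather send it from SQL Server itself, Database Mail works fine (the profile, recipient, and file path below are placeholders):

      EXEC msdb.dbo.sp_send_dbmail
            @profile_name     = 'CITMailProfile',        -- placeholder
            @recipients       = 'scm@yourcompany.com',   -- placeholder
            @subject          = 'Database CIT results',
            @body             = 'See the attached log for details.',
            @file_attachments = 'C:\CIT\logs\latest.log' -- placeholder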

Happy Testing!!

Data Virtualization

For some reason "data virtualization" is one of the new hot buzzwords in my industry.  Data virtualization is really just combining a bunch of data sources (databases, unstructured data, DWs, weblogs, whatever) into a single virtual data tier layer that provides access to applications.  Haven't we been doing this for years.  I remember when linked servers came out for SQL Server.  I thought they were the coolest thing at the time.  I could now connect to Oracle from my SQL Server and manipulate the data in the more familiar TSQL, realtime, without the need of an ETL process.

column-oriented databases

I briefly wrote about these a few weeks ago in this blog post.  I believe Michael Stonebraker is credited with "inventing" this technology, although it's really been around since the 1970s.  I was tasked with evaluating column-oriented databases after management read about them in a trade journal and thought they might be applicable to our fast-growing analytics system.  They may be right: fast data access is the cornerstone.

What are these things?

Whatever it is, I'm against it

I'm told at work that I'm averse to new technologies.  Well, here's what me and Groucho have to say about that.

