DaveWentzel.com            All Things Data

June 2011

A Pattern To Create @PrintOnlyMode or @DebugMode Procedures

Have you ever needed to create a complex stored procedure that executes lots of other TSQL commands, yet been scared of executing it directly due to its destructive nature?  Wouldn't it be nice to write stored procedures with a flag like @PrintOnlyMode that would PRINT the commands they would normally execute, without executing them?  Those PRINT statements could then be copy/pasted into another SSMS window and executed, after verifying their safety.  

I recently had a need to do just this.  We had to build a routine that looked at a custom application metadata table to determine if replication needed to be installed, and if it was installed, whether it was "up-to-date" with all of the latest pubs/articles/subscribers required for our application to function properly.  With replication I knew I wouldn't have a high degree of confidence that a single procedure called, for instance, "dbo.Replication" would be able to do all of these things safely, even if I wrote it.  

So I created a mechanism where I can either EXECUTE the commands directly (the default behavior) or PRINT the commands to the console to be double-checked by a human and executed manually later.  Surprisingly, this is harder to do than you might think, but a little framework and some helper procedures make the code easier to work with.  Here is the basic framework:

Add a Parameter to Turn On/Off PrintOnly Mode

,@PrintOnly bit = 0

This is the mechanism to turn on/off PrintOnly Mode.  The default will be "execution" mode. 

Add an OUTPUT Parameter to Your Stored Proc

,@PrintMsg varchar(max) = '' OUTPUT

The OUTPUT keyword is optional; @PrintMsg itself will be needed, but it can be a standard variable.  @PrintMsg will contain the string we either execute or PRINT.  The benefit of making this an OUTPUT parameter is that we can write unit tests (I love tSQLt for this) where we capture the output and execute it to ensure our PRINT statements work as advertised.  
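Because @PrintMsg is an OUTPUT parameter, a tSQLt test can capture it and assert on the generated command.  Here is a sketch; the proc under test (dbo.PurgeStagingData), the test class, and the expected DELETE pattern are all hypothetical names for illustration:

```sql
EXEC tSQLt.NewTestClass 'PrintOnlyTests';
GO
CREATE PROCEDURE PrintOnlyTests.[test PrintOnly mode returns the command instead of running it]
AS
BEGIN
    DECLARE @msg varchar(max);

    --run the proc in PrintOnly mode and capture the command it would have executed
    EXEC dbo.PurgeStagingData @PrintOnly = 1, @PrintMsg = @msg OUTPUT;

    --the captured string should be the DELETE we expect
    EXEC tSQLt.AssertLike @ExpectedPattern = 'DELETE FROM dbo.StagingTable%', @Actual = @msg;
END
GO
```

A follow-up test could then EXEC the captured string inside a transaction and roll it back, proving the printed command is actually runnable.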

Add a PRINT/EXECUTE Block at the End of Your Proc

Done:

IF @PrintOnly = 0
BEGIN
   EXEC (@PrintMsg)
END
ELSE
BEGIN
   PRINT (@PrintMsg)
END

Your proc will be building a string...you need to determine if you want to PRINT or EXECUTE.  
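Putting the three pieces together, a skeleton proc using the pattern might look like this (dbo.PurgeStagingData and dbo.StagingTable are hypothetical names for illustration):

```sql
CREATE PROCEDURE dbo.PurgeStagingData
     @DaysToKeep int = 30
    ,@PrintOnly bit = 0
    ,@PrintMsg varchar(max) = '' OUTPUT
AS
BEGIN
    --build the (potentially destructive) command as a string
    SELECT @PrintMsg = 'DELETE FROM dbo.StagingTable WHERE LoadDate < '''
        + CONVERT(varchar(23), DATEADD(DAY, -@DaysToKeep, GETDATE()), 121) + ''''

    --either run it or hand it back to a human for inspection
    IF @PrintOnly = 0
    BEGIN
        EXEC (@PrintMsg)
    END
    ELSE
    BEGIN
        PRINT (@PrintMsg)
    END
END
GO
```

Calling it with @PrintOnly = 1 prints the DELETE statement so it can be reviewed and pasted into another SSMS window; the default runs it immediately.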

So far, these are the simple steps.  Now comes the complicated portions...building strings that can be PRINT'd or EXEC'd.  I'll cover that in the next post.  


What is a Data Retention Plan (Part 3)

This is Part 3 of my "Data Retention Plan" series.  If you are a data architect you really need to understand data retention from many different perspectives.  In the first post I covered technology considerations for data retention.  These are probably no-brainers for most data architects.  In the second post I covered some of the functional issues around data retention that I have seen.

What is a Data Retention Plan (Part 2)

This is Part 2 of my What is a Data Retention Plan series.  If you are a data architect you really need to understand data retention from many different perspectives.  In the last post I covered technology considerations for data retention.  These are probably no-brainers for most data architects.  Now I want to cover some of the functional issues around data retention that I have seen.  Hopefully you will find this information useful.  

Replication: Determining When My Distribution or Snapshot Agent Has "Caught Up"

I needed a quick way to determine when my Snapshot Agent completed and all of its subsequent commands were delivered to my subscribers.  Said another way, how can I tell when my Distribution Agent has "caught up" and all transactions have been delivered to my subscribers?  Surprisingly a quick Google search on variants of this question yielded nothing.  So I wrote my own.  First, when might this be useful?:

  • I need to stop replication before an upgrade event occurs and I want to be sure all outstanding transactions have flowed from the publisher to the subscribers. 
  • If for any reason I need to re-snapshot my large publication I have a quick method to know that my subscribers are back in sync so I can open up my system to my data consumers.  
  • In my case, I need a mechanism to quickly serialize my data so I can swap tables from one schema to another (more why you would want to do this in the future).  

The easiest way to do this is to use the built-in tracer token functionality.  These routines are documented as being used to measure latency between publisher-distributor and distributor-subscriber.  Essentially we simply want to know when our tracer token, which is a fancy way of saying our LSN (log sequence number), has arrived at the subscriber.  

My code waits and does not return control to the client until the tracer token has arrived at the subscribers.  It requires a pub name, but really any pub name will work.  You could actually change the proc to look at the TOP 1 pub from syspublications.  Remember, the tracer token is really just an LSN so we only need to do this once for our topology.  

CREATE PROC ReplRunSnapshotJobs_Tests.TracerTokenWait
(
   @publication varchar(200)
)
AS
BEGIN
   --executes a tracer token on the publisher, waits until any and all subscribers get it before returning control to the caller.
   WAITFOR DELAY '00:00:05'

   DECLARE @tracer_token_id INT, @i INT
   SELECT @i = 0

   WHILE @i < 10
   BEGIN
       BEGIN TRY
           BEGIN TRAN
               EXEC sp_posttracertoken @publication = @publication, @tracer_token_id = @tracer_token_id OUTPUT
           COMMIT TRAN
           SELECT @i = 10
       END TRY
       BEGIN CATCH
           --if sp_posttracertoken failed, don't leave the transaction open
           IF @@TRANCOUNT > 0 ROLLBACK TRAN

           IF ERROR_NUMBER() = 21488
           BEGIN
               --No active subscriptions were found. The publication must have active subscriptions in order to post a tracer token.
               --not sure why this happens.  We should wait and try again.
               PRINT 'No active subscriptions were found.  Waiting for agents to spin up.'
               WAITFOR DELAY '00:00:05'
               SELECT @i = @i + 1
           END
           ELSE
           BEGIN
               PRINT ERROR_NUMBER()
               PRINT ERROR_MESSAGE()
               RETURN
           END
       END CATCH
   END;

   CREATE TABLE #tokenhistory (distributor_latency bigint, subscriber sysname, subscriber_db sysname, subscriber_latency bigint, overall_latency bigint)

   SELECT @i = 1
   WHILE @i = 1
   BEGIN
       BEGIN TRAN
           insert into #tokenhistory
           exec sp_helptracertokenhistory @publication = @publication, @tracer_id = @tracer_token_id
       COMMIT TRAN

       IF EXISTS (select * from #tokenhistory WHERE overall_latency IS NULL)
       BEGIN
           waitfor delay '00:00:05'
           PRINT 'waiting for distribution agent to distribute data changes...'
           delete from #tokenhistory
       END
       ELSE
       BEGIN
           SELECT @i = 0
       END
   END
END
GO
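Calling the proc is straightforward.  A usage sketch, assuming the proc above is deployed and a publication named 'MyPub' exists in the current published database:

```sql
--Post a tracer token and block until every subscriber has received it
EXEC ReplRunSnapshotJobs_Tests.TracerTokenWait @publication = 'MyPub'
```

Control does not return until the Distribution Agent has "caught up", so this works well as a gating step in an upgrade or snapshot script.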



QR Codes vs UPCs

QR codes (QR is an acronym for "Quick Response") are those blocky UPC-like symbols that have become ubiquitous in newspaper and magazine ads over the past year or so.  Any barcode reader with QR software installed (and QR software is freely available and open-source) can read these things.  It is also an ISO standard.  The real benefit is that camera-equipped smart phones can read these things too, without needing to purchase a Bluetooth barcode scanner.  You can even create your own QR codes for free using one of the many online generators.  

The UPC was developed at IBM in the early 1970s, and the first UPC barcode was scanned in the mid 1970s.  The QR code was created in 1994 by Denso Wave, a Toyota subsidiary, and has been ubiquitous in Japan for many years.  

Why Use a QR over a standard UPC barcode?

  • I can store more data:  A standard UPC-A barcode stores just 12 digits of data.  A QR code can store far more (up to 7,089 numeric digits, or roughly 3KB of binary data, in the largest version), mostly because QR stores data both horizontally and vertically.  The less data you store, the lower the resolution of the resulting QR code; the more data you want to store, the higher the resolution of the resulting QR code must be.  This also means that many scanners may not be able to interpret the densest codes, at least as of today.  

An example: the QR code on the left simply embeds my URL; the QR code on the right embeds my URL, phone number, email address, and a small string of text.  Both images are 155x155.  The image on the right would need to be 355x355 to show the information at the same fidelity.  As you can see, as scanning technology improves we can embed far more information within a QR code without changing the QR specification at all.  The same cannot be said for UPC.  

 

  • There is error correction built in.  So, if your QR code is damaged, perhaps torn or even very dusty, you can probably still retrieve at least some of your data.  Realistically you can lose up to 30% of a QR code (at the highest error-correction level) before you can no longer accurately reproduce its message.  UPC has only a simple check digit, which is far more rudimentary and is simply there to catch misreads, not to recover data.  
  • Barcodes are a more closed system.  You can't just create a barcode, slap it on your product, and expect it to work if you sell your product at the local grocery store.  Barcodes must be obtained from GS1 US, a standards organization that provides UPC numbers and barcodes to businesses.  Applying for a barcode is a cumbersome process starting with the registration of your business and any product you wish to barcode.  There are many barcode creation websites that will charge you a small fee (about $10) and create a barcode for you under their business registration.  This saves you some time and steps, but is not feasible for companies wishing to quickly produce many variants of a new product.  The libertarian in me does not like that a single central organization controls UPCs.  A more open system, like QR, will allow flexibility over time.  
  • Embedded encryption possibilities:  I should be able to encrypt information in the QR, and I should be able to use any encryption algorithm I want.  Yes, UPC does this, but like UPC checksumming, is rudimentary.  Here is a Dutch coin that began minting earlier THIS MONTH:

 

 

Note the embedded QR code.  Right now this code points directly to the Dutch mint's website and is a static QR code.  Why couldn't the QR code be dynamic, with embedded encryption, that could be used to identify the coin and determine its authenticity?  This opens up lots of new possibilities.  Granted, it would add some cost, but counterfeiting possibilities could be radically reduced.  In the US we rely on those special yellow highlighters at checkout counters to determine if newer, larger denomination bills are phonies.  And this technology has probably been broken by counterfeiters already.  Using QR codes could potentially work much better, and as counterfeiters break "QR encryption" newer mechanisms could be embedded quickly.  

What is a Data Retention Plan (Part 1)

Every data architect must have been given a handbook that I never received that says something like, "you must have a data retention plan."  If you ask the typical data architect what a good data retention plan is, I'd be surprised if you got an answer more advanced than "I work with the business units to determine what the business needs, then I architect my solution accordingly."  Most simply say, "we purge our transactional data every x years."  I think these answers are lacking quite a bit, but they are the canned answers I hear from data architect candidates during interviews.

SQL CLR Security

In my last post I covered a little bit about a DOS shell command utility that I wrote that was a replacement for xp_cmdshell.  I wrote this after looking at the alternatives (sp_OA procs, PowerShell, calling sp_start_job) to run DOS commands.  I learned a little bit about the intricacies of creating a CLR routine that I thought I would document.  Do a quick google search and you'll find tons of xp_cmdshell replacements using the CLR.


DOS Shell Commands in SQL Server

There are lots of ways to shell out to run a DOS command from SQL Server, each with pros and cons.  This post will show some methods that I use, some other methods I avoid, and my new best friend, the CLR method. 

xp_cmdshell


UPDATE FROM syntax and LEFT JOINs and ambiguity

If you use SQL Server you've probably seen code like this:

UPDATE dbo.tab SET SomeCol = table2.SomeOtherCol
FROM dbo.tab
JOIN dbo.table2 ON tab.ID = table2.ID

To us SQL Server people this code makes perfect sense...update some values in the first table with rows from a second table.  It surprises many SQL Server people when exposed to Oracle for the first time that this is not valid ANSI SQL syntax.  I don't want to make this an Oracle post, but there are solutions to the "update one table with values from a second" pattern that work in Oracle.  They will usually perform just fine, my only complaint would be that the Oracle syntax is not nearly as easy to convert from a SELECT to an UPDATE so I can "test before executing".  What I mean is...in my UPDATE statement above I could change the first line to "SELECT table2.SomeOtherCol, tab.ID" and execute it to see exactly what rows I will be changing.  
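Concretely, the "test before executing" conversion is just swapping the UPDATE clause for a SELECT while keeping the FROM/JOIN intact, a sketch against the same hypothetical tables:

```sql
--Run this first to see exactly which rows the UPDATE would touch
--and which values they would receive:
SELECT table2.SomeOtherCol, tab.ID
FROM dbo.tab
JOIN dbo.table2 ON tab.ID = table2.ID
```

Once the SELECT looks right, you change the first line back to the UPDATE and run it for real.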

That is not a jab against Oracle.  An Oracle purist will counter that the UPDATE...FROM TSQL syntax is fraught with danger.  I agree, and that is what this post is about.  Let's look at some code:

CREATE TABLE #Foo (ID int, colA varchar(20))

CREATE TABLE #Bar (ID int, FooID int, colB varchar(20))



INSERT INTO #Foo VALUES (1,'FooRow1')

INSERT INTO #Foo VALUES (2,'FooRow2')

INSERT INTO #Foo VALUES (3,'FooRow3')

INSERT INTO #Foo VALUES (4,'FooRow4')

INSERT INTO #Bar VALUES (1,1,'BarRow1')

INSERT INTO #Bar VALUES (2,2,'BarRow2')

INSERT INTO #Bar VALUES (3,2,'BarRow3')

INSERT INTO #Bar VALUES (4,2,'BarRow4')

Requirement:  Set #Foo.colA equal to *any value* from #Bar.colB, if there is a value available for the given key.  A contrived example to be sure.

BEGIN TRAN

UPDATE #Foo SET colA = Bar.colB
FROM #Foo Foo
LEFT JOIN #Bar Bar ON Foo.ID = Bar.FooID

select * from #Foo

ROLLBACK TRAN

It should be obvious that #Foo.ID = 1 will be set to 'BarRow1' since there is only one lookup row.  It should also be obvious that #Foo.ID rows 3 and 4 will be set to NULL since there are no corresponding keys in #Bar for those FooIDs.  

What is less clear is what will happen to #Foo.ID = 2 ... will it be set to 'BarRow2', 'BarRow3' or 'BarRow4'?  There is ambiguity here.  In Oracle-land, the equivalent query (written of course without a LEFT JOIN) would throw "ORA-01427: single-row subquery returns more than one row."  Exactly.  And really, isn't that what SQL Server should be throwing?  Many people have petitioned Microsoft to deprecate UPDATE...FROM (and DELETE...FROM, which has the same issues) syntax for this reason (as well as some other reasons).  It's become something akin to a religious war, with people on both sides of the debate.    
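For comparison, here is a sketch of the ANSI-style form of that update, written as a correlated subquery; this is the shape that triggers the error when the subquery returns more than one row (ORA-01427 in Oracle; SQL Server raises error 512, "Subquery returned more than 1 value", at execution time):

```sql
--ANSI-style equivalent of UPDATE...FROM (hypothetical tables from earlier):
UPDATE dbo.tab
SET SomeCol = (SELECT table2.SomeOtherCol
               FROM dbo.table2
               WHERE table2.ID = tab.ID)
WHERE EXISTS (SELECT * FROM dbo.table2 WHERE table2.ID = tab.ID)
```

Note the WHERE EXISTS clause: without it, rows with no match in table2 would be set to NULL rather than left alone.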

I personally like the UPDATE...FROM syntax, mostly because it's simple to convert to SELECT and test.  It's the UPDATE...FROM...LEFT JOIN syntax that gets tricky.  There are two reasons we might use LEFT JOIN in an UPDATE statement:

  • I don't want to restrict the rows I wish to act on, rather I want to assume NULL if the LEFT JOIN was not satisfied.  This is safe and unambiguous in my mind.  This is also the requirement listed above for this blog post.  
  • I want to LEFT JOIN but I may have *many* LEFT JOIN table rows for each *one* key row from the FROM table.  In other words, a 1:M relationship.  This is also part of my requirement above.  

In that case, which is my example above, what will be the final value for colA for ID 2?  Here you go:

BEGIN TRAN

UPDATE #Foo SET colA = Bar.colB
FROM #Foo Foo
LEFT JOIN #Bar Bar ON Foo.ID = Bar.FooID

select * from #Foo

ROLLBACK TRAN

Before you jump to the conclusion that "I'll always get the first matching row back", remember that there is no concept of row ordering in an RDBMS unless there is a specific ORDER BY clause applied.  Never assume order.  Don't believe me?  Let's modify our code to add a CLUSTERED INDEX strategically:

BEGIN TRAN

CREATE CLUSTERED INDEX tst ON #Bar (FooID, colB DESC)

UPDATE #Foo SET colA = Bar.colB
FROM #Foo Foo
LEFT JOIN #Bar Bar ON Foo.ID = Bar.FooID

select * from #Foo

ROLLBACK TRAN

Oops.  Note that colA's value has changed due to the new access path introduced by the clustered index!

So far, so good.  I still don't see the big deal with UPDATE...FROM syntax *if* you understand the issues above.  Ah, but that's the rub.  Too often I see code like this:  

BEGIN TRAN

UPDATE #Foo SET colA = (SELECT TOP 1 colB FROM #Bar Bar WHERE Bar.FooID = Foo.ID)
FROM #Foo Foo

select * from #Foo

ROLLBACK TRAN

What is this code trying to do?  Very simply, it is bringing back the FIRST row from Bar that has the given FooID, using a correlated subquery.  I'm not a mind reader, but the developer was probably thinking, "I have a situation where I am bringing back possibly more than one row for the given key, I better restrict it since I don't really understand what SQL Server will do.  I can accomplish that by using a TOP clause."  

I have problems with this approach:

  • You still haven't guaranteed *which* row's value will be used (although that's not part of our requirement, it's still ambiguous and you shouldn't code ambiguity if you can avoid it).
  • You've taken a FROM...LEFT JOIN syntax that the optimizer will be able to handle in a performant manner and replaced it with a TOP clause in a correlated subquery.  Correlated subqueries *often* perform far worse than straight JOINs (especially against larger tables...this is the RBAR effect).  Also, a TOP clause paired with an ORDER BY incurs the overhead of a SORT operator.  
  • It's not as easy to read or convert from a SELECT statement that I can test to an UPDATE.   

If the requirement were changed to return "the most recent entry from #Bar" then the TOP solution might be acceptable over smaller result sets, assuming of course the addition of an ORDER BY clause.  
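If "most recent" were really the requirement, a deterministic rewrite using ROW_NUMBER() would remove the ambiguity entirely.  A sketch against the temp tables above, assuming the highest #Bar.ID counts as the "most recent" entry for a given FooID:

```sql
;WITH RankedBar AS
(
    SELECT FooID, colB,
           ROW_NUMBER() OVER (PARTITION BY FooID ORDER BY ID DESC) AS rn
    FROM #Bar
)
UPDATE Foo SET colA = RankedBar.colB
FROM #Foo Foo
LEFT JOIN RankedBar ON Foo.ID = RankedBar.FooID AND RankedBar.rn = 1
```

This is still UPDATE...FROM, but the rn = 1 predicate guarantees at most one #Bar row per key, so the 1:M ambiguity is gone.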

Takeaways

  1. If you use UPDATE...FROM syntax be aware of ambiguities.  
  2. If you use UPDATE...FROM...LEFT JOIN be aware that you have ZERO control over which row from the LEFT JOIN'd table will be returned if you have a 1:M situation.  
