DaveWentzel.com            All Things Data

Service Broker Demystified - Monitoring the Canaries

Service Broker has no native set of monitoring scripts and no GUI to show you the health of your queues.  In this post I'll show you what I monitor and why.  I'll even cover some interesting "canaries" I watch that tell me when an SSB design is nearing an imminent failure.

People tell me all the time that they don't want to use Service Broker because it is too confusing. I started a blog series called Service Broker Demystified because SSB really isn't that tough if you don't get lost in the weeds.  One of the biggest reasons Service Broker seems so scary is that there is no (good) GUI to set it up or monitor it.  In the next post, Service Broker Demystified - How to Radically Simplify SB, I'll cover how to set up your SSB infrastructure reliably without a GUI.  But today I'll show you how to monitor your SSB infrastructure without a (native) GUI.  

However, monitoring for the sake of monitoring is rarely beneficial.  Likewise, it's not much help if your monitoring only alerts you after your SSB queues have become deactivated (probably due to an error somewhere).  What you want is to look for patterns in your monitoring data that are indicative of imminent failure.  I call these the canaries.  Canaries are the birds coal miners used in the old days to warn them when toxic gases were accumulating in the mine.  If the canary died, the miner knew to get out of the mine posthaste.  

Another issue that scares those new to SSB: if errors are thrown during processing, who will see them, and where will they get logged?  With SSB it's really important to make your process as bullet-proof and resilient to errors as possible.  

Lastly, SSB has its own lingo that confuses the uninitiated.  Terms such as "poison messages", "dropped queue monitors", and "conversation population explosion" do nothing but add to the esoteric nature of Service Broker.  

I have a script that you can download that monitors both error conditions and canaries.  I strongly recommend implementing this script on BOTH your dev and prod instances where Service Broker is being used.  I'm sure it doesn't cover all failure conditions, but the code has never let me down.  

I've recently written an entire blog post on what my monitor code does, so I'll just refer you to that:  Monitoring Service Broker.  Instead, let's actually take a look at some of the more interesting code.  

Queue Monitors Stuck in a NOTIFIED State

This query will show you queue monitors that are seeing new messages in your queue but are not properly processing them.  A queue monitor exists for any queue with an activator procedure attached to it.  You'll see this more in dev environments, where your activator code is more volatile.  If your activator is not properly issuing a RECEIVE, you'll see this.  A queue monitor may temporarily be in a NOTIFIED state if all queue readers are busy working existing messages...but you should not see this state for more than a few seconds.  Other causes of this error:  permissions changed, queue names changed, etc.  
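A sketch of this check, using the sys.dm_broker_queue_monitors DMV joined to sys.service_queues in the current database (the exact columns you surface are up to you; a monitor that stays NOTIFIED for more than a few seconds is the symptom to alarm on):

```sql
-- Queue monitors stuck in NOTIFIED: messages are waiting, but no
-- activator is issuing a RECEIVE.  NOTIFIED should be transient.
SELECT DB_NAME(qm.database_id)      AS database_name,
       q.name                       AS queue_name,
       qm.state                     AS monitor_state,
       qm.last_activated_time,      -- last time the activator was invoked
       qm.tasks_waiting             -- tasks waiting on a RECEIVE
FROM   sys.dm_broker_queue_monitors AS qm
JOIN   sys.service_queues           AS q
       ON qm.queue_id = q.object_id
WHERE  qm.database_id = DB_ID()
  AND  qm.state = 'NOTIFIED';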

Disabled Queues

Queues can become disabled for various reasons.  Generally this will happen on activated queues, and the likely cause is that your activator stopped working.  The possible reasons are endless, but usually either your activator code changed (in a dev environment) or something in your infrastructure changed (permissions, maybe) in a prod environment.  In this case activation is still enabled but the queue itself is disabled. 
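A simple way to spot this state is the sys.service_queues catalog view, which must be run in each database that hosts SSB queues (a rough sketch; filter further to taste):

```sql
-- Queues that can no longer receive or enqueue messages, alongside
-- their activation settings -- a disabled queue with activation still
-- enabled matches the scenario described above.
SELECT name,
       is_receive_enabled,
       is_enqueue_enabled,
       is_activation_enabled,
       activation_procedure
FROM   sys.service_queues
WHERE  is_receive_enabled = 0
   OR  is_enqueue_enabled = 0;
```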

Poison Message Detection

Actually, this test looks for queues that are not able to receive messages.  In general this is almost always due to poison messages.  A "poison message" is any message that causes your activator to ROLLBACK five times in a row.  There may be nothing wrong with the message itself; it could be that your activator doesn't handle some edge case you didn't code for.  Once you fix the problem, simply run ALTER QUEUE WITH STATUS = ON;.  The most common cause in production is improperly handled deadlocks that you didn't catch during testing.  
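When a queue has been disabled this way, it helps to peek at the pending messages before turning the queue back on. A sketch (dbo.TargetQueue is a placeholder name; the XML cast assumes a well-formed XML payload, which is typical but not guaranteed):

```sql
-- Peek at messages on the disabled queue without removing them, to see
-- whether one message keeps triggering the rollbacks.
SELECT TOP (10)
       conversation_handle,
       message_type_name,
       CAST(message_body AS XML) AS message_body  -- assumes XML payloads
FROM   dbo.TargetQueue WITH (NOLOCK);

-- Once the root cause is fixed, re-enable the queue:
ALTER QUEUE dbo.TargetQueue WITH STATUS = ON;
```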

The Canaries -- Conversation Population Explosion

This isn't a failure condition, but it's a good indicator that failure is imminent.  If you start to see a bunch of conversations in a queue that are not in a CLOSED state, then either you have a really busy/bursty system, or one side of a dialog isn't performing END CONVERSATION properly.
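One way to watch for this is to aggregate sys.conversation_endpoints by state (run per database; the alarm threshold on the count is something you'd tune to your own workload):

```sql
-- Conversation counts by state and far-end service.  A steadily growing
-- count of non-CLOSED conversations usually means one side of the dialog
-- never calls END CONVERSATION.
SELECT far_service,
       state_desc,
       COUNT(*) AS conversation_count
FROM   sys.conversation_endpoints
GROUP BY far_service, state_desc
ORDER BY conversation_count DESC;
```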

The Canaries -- Various PerfMon Counters

There are various PerfMon counters for SSB that I find invaluable.  Here is the list.  I've included the Alarm Threshold so you can plug the query into whatever tool you use for monitoring...such as Nagios.  

Counter Name / Purpose / Alarm Threshold

Activation Errors Total -- an aggregate count of errors against all activated queues.  Alarm threshold: > 0

Corrupted Messages Total -- something is sending your service a corrupted message.  Alarm threshold: > 0

Broker Transaction Rollbacks -- this could be innocuous, but it's always worth investigating.  Alarm threshold: > 0

Task Limit Reached -- this means that max_queue_readers activators were running but there were still messages in the queue that could have been serviced if another activator was allowed to run.  As this number rises you know that your SSB activity is increasing and you may need to revisit your design.  Alarm threshold: up to you

SQL SENDs/sec and SQL RECEIVEs/sec -- these values are good to benchmark for trending purposes.  Again, as activity increases you may need to revisit your design if you begin to see bottlenecks.  Alarm threshold: none; trend only
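All of these counters are exposed through sys.dm_os_performance_counters, so you can collect them with plain T-SQL instead of PerfMon. A sketch (the LIKE on object_name is an assumption to cope with named instances, where the "SQLServer:" prefix varies; note the /sec counters are cumulative in this DMV, so trending them requires sampling twice and differencing):

```sql
-- Pull the Service Broker counters from the perfmon DMV.
SELECT RTRIM(object_name)  AS object_name,
       RTRIM(counter_name) AS counter_name,
       cntr_value
FROM   sys.dm_os_performance_counters
WHERE  object_name LIKE '%Broker Statistics%'   -- SENDs, RECEIVEs, rollbacks...
   OR  object_name LIKE '%Broker Activation%'   -- activation errors, task limit...
ORDER BY object_name, counter_name;
```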

Summary

One reason SSB is intimidating is that there are no good out-of-the-box monitoring tools.  My monitor scripts are one way to get you up and running quickly.  Whatever you decide to use for monitoring, you should at least make sure you are looking for failure conditions.  When your queues stop processing you may not know until users start complaining.  So, along with failure conditions, make sure you monitor the canaries...those conditions which let you know that your SSB design is experiencing "problems".  


You have just read "Service Broker Demystified - Monitoring the Canaries" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

2 comments

Comment: 
Hey Dave...thank you for your SB troubleshooting queries. Very thorough and helpful! Quick question: we have an SB setup that works well most of the time. The majority of the time the SB implementation sends a single row as an XML message payload from an INSERT or UPDATE operation no problem. Occasionally, there are operations that are executed which cause massive changes (in comparison to the normal operations) to the tune of 1000s or 10,000s of rows. When this happens the SB implementation appears to have no problem creating the XML result, creating the message payload, and successfully sending the message from initiator to target. I can verify this because no rows remain in the initiator's sys.transmission_queue for any significant amount of time. The problem appears to be on the target side. The messages are successfully received and pulled off the queue as expected, but it appears to take forever to consume, shred, process, and act on the message received. I say 'appears' because I have not yet been able to find any proof (the silver bullet that is) to back this up, only experience and anecdotal evidence. This situation does not happen often, but when this happens the process that normally works very quickly (read: near real time) instead slows to a crawl. It will take hours or days to complete if not otherwise acted on and resolved. Do you know where I can look when this happens to troubleshoot more deeply? Thanks! Alex

Comment: 
When it happens take a peek at the performance DMVs, sys.sysprocesses, or maybe the blocked process threshold report.  Without looking at your activator I can't really guess where it is, but that should narrow it down.  
