Service Broker has neither a native set of monitoring scripts nor a GUI to show you the health of your queues. In this post I'll show you what I monitor and why. I even have some interesting "canaries" that I monitor that will tell me when my SSB design is nearing failure.
People tell me all the time that they don't want to use Service Broker because it is too confusing. I started a blog series called Service Broker Demystified because SB really isn't that tough if you don't get lost in the weeds. One of the biggest reasons Service Broker is so scary is that there is no (good) GUI to set it up or monitor it. In the next post, [[Service Broker Demystified - How to Radically Simplify SB]], I'll cover how to set up your SSB infrastructure reliably without a GUI. But today I'll show you how to monitor your SSB infrastructure without a (native) GUI.
However, monitoring for the sake of monitoring is rarely beneficial. Likewise, it's not great if your monitoring only alerts you after your SSB queues have become deactivated (probably due to an error somewhere). What you want to be able to do is look for patterns in your monitoring data that are indicative of imminent failure. I call these the canaries. Canaries are the birds coal miners used in the Ol' Days to alert them if toxic gases were accumulating in the mine. If the canary died, the miner knew to get out of the mine posthaste.
Another issue that is scary for those new to SSB is this: if errors are thrown during processing, who will see them and where will they get logged? With SSB it's really important to make your process as bullet-proof and resilient to errors as possible.
Lastly, SSB has its own lingo that confuses the uninitiated. Terms such as "poison messages", "dropped queue monitors", and "conversation population explosion" do nothing but add to the esoteric nature of Service Broker.
I have a script that you can download that monitors both error conditions and canaries. I strongly recommend implementing this script on BOTH your dev and prod instances where Service Broker is being used. I'm sure it doesn't cover all failure conditions, but the code has never let me down.
I've recently written an entire blog post on what my monitor code does, so for the details I'll just refer you to that: [[Monitoring Service Broker]]. Here, let's take a look at some of the more interesting code.
Queue Monitors Stuck in a NOTIFIED State
This query will show you queue monitors that are seeing new messages in your queue but are not properly processing them. A queue monitor exists for any queue with an activator procedure attached to it. You'll see this more in dev environments where your activator code is more volatile. If your activator is not properly issuing a RECEIVE you'll see this. A queue monitor may temporarily be in a NOTIFIED state if all queue readers are busy working existing messages...but you should not see this state for more than a few seconds. Other causes of this error: permissions changed, queue names changed, etc.
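This is not the downloadable monitor script, just a minimal sketch of the idea. It assumes you run it in the database that owns the queues and that your login has VIEW SERVER STATE, which `sys.dm_broker_queue_monitors` requires. The column aliases are my own.

```sql
-- Queue monitors in the current database that are sitting in the NOTIFIED state.
-- A single hit can be normal (readers are busy); a row that is still here
-- several seconds later suggests the activator is not issuing RECEIVE.
SELECT
    SCHEMA_NAME(q.schema_id)  AS queue_schema,
    q.name                    AS queue_name,
    q.activation_procedure,
    qm.state,
    qm.tasks_waiting,
    qm.last_activated_time,
    qm.last_empty_rowset_time
FROM sys.dm_broker_queue_monitors AS qm
JOIN sys.service_queues AS q
    ON q.object_id = qm.queue_id
WHERE qm.database_id = DB_ID()
  AND qm.state = 'NOTIFIED';
```

Because a brief NOTIFIED state is expected, a monitoring job would probably sample this twice a few seconds apart and alert only on queues that show up both times.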
Disabled Queues
Queues can become disabled for various reasons. Generally this will happen on activated queues. The likely cause is that your activator stopped working. The possible reasons are many, but usually this means that either your activator code changed (in a dev environment) or something in your infrastructure changed (permissions, maybe) in a prod environment. In this case activation is still enabled but the queue itself is disabled.
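One way to look for this condition, sketched under the assumption that "disabled" means the queue's enqueue or receive flag is off in `sys.service_queues` while activation is still on:

```sql
-- User queues where activation is still enabled but the queue itself is disabled.
SELECT
    SCHEMA_NAME(q.schema_id)  AS queue_schema,
    q.name                    AS queue_name,
    q.activation_procedure,
    q.is_activation_enabled,
    q.is_enqueue_enabled,
    q.is_receive_enabled
FROM sys.service_queues AS q
WHERE q.is_ms_shipped = 0
  AND q.is_activation_enabled = 1
  AND (q.is_enqueue_enabled = 0 OR q.is_receive_enabled = 0);
```

Any row returned here is worth an alert, since a healthy activated queue should have all three flags set to 1.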
Poison Message Detection
Actually, this test looks for queues that are not able to receive messages. This is almost always due to poison messages. A "poison message" is any message that causes your activator to ROLLBACK 5 times in a row. There may be nothing wrong with the message; it could be that your activator isn't handling an edge case you didn't code for. The most common cause in production is improperly handled deadlocks that you didn't catch during testing. Once you fix the problem, simply run ALTER QUEUE <your queue name> WITH STATUS = ON; to re-enable the queue.
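As a rough sketch (not the downloadable script), the check below lists user queues whose receive flag is off and builds the matching re-enable statement for each one. The `reenable_statement` column is my own convenience. Run the generated statement only after the underlying problem is fixed, otherwise the same message will just disable the queue again.

```sql
-- Queues that cannot receive messages, plus a ready-made statement to turn each back on.
SELECT
    SCHEMA_NAME(q.schema_id)  AS queue_schema,
    q.name                    AS queue_name,
    N'ALTER QUEUE ' + QUOTENAME(SCHEMA_NAME(q.schema_id)) + N'.' + QUOTENAME(q.name)
        + N' WITH STATUS = ON;' AS reenable_statement
FROM sys.service_queues AS q
WHERE q.is_ms_shipped = 0
  AND q.is_receive_enabled = 0;
```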
The Canaries -- Conversation Population Explosion
This isn't a failure condition, but it's a good indicator that failure is imminent. If you start to see a bunch of conversations in a queue that are not in a CLOSED state then you may either have a really busy/bursty system, or you have one side of a dialog that isn't performing END CONVERSATION properly.
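A minimal sketch of this canary, assuming a simple count per far service and state is enough for trending. What counts as "a bunch" is up to you and your workload.

```sql
-- Conversations in the current database that are not CLOSED, grouped by far service and state.
-- A count that keeps climbing between samples suggests one side is not ending its conversations.
SELECT
    ce.far_service,
    ce.state_desc,
    COUNT(*) AS conversation_count
FROM sys.conversation_endpoints AS ce
WHERE ce.state_desc <> 'CLOSED'
GROUP BY ce.far_service, ce.state_desc
ORDER BY conversation_count DESC;
```

Storing the result on a schedule and comparing it against the previous sample makes it easier to tell a one-off burst from steady growth.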
The Canaries -- Various PerfMon Counters
There are various PerfMon counters for SSB that I find invaluable. Here is the list. I've included the Alarm Threshold so you can plug the query into whatever tool you use for monitoring...such as Nagios.
|Counter Name||Purpose||Alarm Threshold|
|Activation Errors Total||An aggregate count for errors against all activated queues.||> 0|
|Corrupted Messages Total||Something is sending your service a corrupted message.||> 0|
|Broker Transaction Rollbacks||This could be innocuous, but it's always worth investigating.||> 0|
|Task Limit Reached||This means that max_queue_readers were running but there were still messages in the queue that could have been serviced if another activator was allowed to run. As this number rises you know that your SSB activity is increasing and you may need to revisit your design.||Up to you|
|These values are good to benchmark for trending purposes. Again, as activity increases you may need to revisit your design if you begin to see bottlenecks.||--|
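The named counters in the table can be read from inside SQL Server, which is handy if your monitoring tool can run a query but can't read PerfMon directly. This is a sketch only. It assumes the counters are exposed through `sys.dm_os_performance_counters` under an object name containing "Broker" (the prefix differs between default and named instances) and that your login has VIEW SERVER STATE.

```sql
-- Current values of the Service Broker counters named in the table above.
-- These are cumulative values, so compare samples over time rather than reading one in isolation.
SELECT
    RTRIM(pc.object_name)    AS perf_object,
    RTRIM(pc.counter_name)   AS counter_name,
    RTRIM(pc.instance_name)  AS instance_name,
    pc.cntr_value
FROM sys.dm_os_performance_counters AS pc
WHERE pc.object_name LIKE N'%Broker%'
  AND pc.counter_name IN (N'Activation Errors Total',
                          N'Corrupted Messages Total',
                          N'Broker Transaction Rollbacks',
                          N'Task Limit Reached');
```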
One reason SSB is intimidating is that there are no good out-of-the-box monitoring tools. My monitor scripts are one way to get you up and running quickly. Whatever you decide to use for monitoring, you should at least make sure you are looking for failure conditions. When your queues stop processing you may not know until users start complaining. So, along with failure conditions, make sure you monitor the canaries...those conditions which let you know that your SSB design is experiencing "problems".