Blocking and Contention with sysjobhistory

IMHO Microsoft has a "bug" where whenever a job completes and status is logged to sysjobhistory a check is done to see if "purging" is enabled and if so a purge is performed to retain the configured amount of job history.  For most jobs this works just great.  However, the configuration by MS is often too coarse-grained for many customers, especially customers with MANY jobs that execute every minute or less, for instance.  In these situations that purging of sysjobhistory takes an excessive amount of resources.  Usually in these cases you don't really need to maintain any of this history at all.  

In these cases you may want finer-grain control over which jobs you maintain history for, as well as for how long we maintain it.  That's what I have written to overcome this problem.  But first, how do you know if you are affected by this "bug"?  

Blocking on sysjobhistory

If you are monitoring blocking on your system you will see contention against sysjobhistory coming from the procedure called sp_jobhistory_row_limiter.  This code will run whenever a job completes execution and you have job history logging enabled.  Within that stored procedure is the following code block:

SELECT @current_rows_per_job = COUNT(*)
FROM msdb.dbo.sysjobhistory with (TABLOCKX)
WHERE (job_id = @job_id)  

Notice the TABLOCKX?  That is the problem.  It essentially serializes all access against that table.  So, if you have a lot of frequently-executed jobs you can see that this will become a bottleneck and will block.  

The code I have created (available here for download) works around these issues by:  

  1. turning off job history log limits if they are enabled.  This means YOU are responsible for purging.  
  2. Creates a new table that allows you to specify, at the JobName or Category level, exactly how many rows you wish to maintain
  3. You can also specify a (default) value that will be applied when the JobName or Category does not have an entry.   
  4. We don't actually target a specific number of rows to maintain, rather, a number of days of history to maintain.  Logically, I think it makes more sense to maintain "days of history" vs number of rows.  This is just my opinion.  It's also a little bit more efficient code since I can just purge by date vs getting a target number of rows to maintain.  
  5. We can target different "days to maintain" for success vs failure rows.  This is important.  If I run a job every minute I may not care about saving ANY successes, but I may want to see multiple years' worth of failures.  
  6. Runs as its own sqlagent job.  I would have rather made this a Service Broker "task" but oh well.  

I have also included a series of TSQLT tests.  These are database unit tests.  I feel that all code should be unit tested not just to prove that it works, but also as a self-documenting tool.  There are a lot of nuances in this code and I have them well-documented in the unit tests.  

I hope someone finds this code useful.