
Stability Issues

 

Application Saturation
Load testing can identify bottlenecks in the application under load, and can measure its capacity to see whether it meets service level requirements. 
 
As load increases, throughput increases until maximum resource utilization is reached on the bottleneck device.  That is the point of maximum possible throughput; beyond it the application saturates and requests begin to queue.  Queuing manifests itself as a degradation in response times.  This can be expressed by Little's Law:
 
Q = X*R
 
Q is the number of bytes in the system, X is the throughput, and R is the response time. 
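
For example (the numbers are made up, not from a real test): if the measured throughput is X = 500,000 bytes/sec and the average response time is R = 0.5 sec, then Q = 500,000 * 0.5 = 250,000 bytes are in the system on average.  The same arithmetic as a tiny Python sketch:

    # Little's Law: Q = X * R  (hypothetical numbers, not from a real LR test)
    X = 500_000          # throughput, bytes/sec
    R = 0.5              # average response time, seconds
    Q = X * R            # amount of data "in the system" (in flight)
    print(f"Work in the system: {Q:,.0f} bytes")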
 
In LR we can see saturation as a degradation in response times as load (the number of simultaneous Vusers) increases, using a graph the LRA tool can create.  See Figure A. 
 
Figure A
 
 
Queuing does NOT always manifest itself as a degradation in response times.  If a message queuing technology like MSMQ is used to de-couple the user presentation from the back-end processing, the user will not detect response time issues.  Figure B shows this: the purple line is the number of Vusers, the yellow line is the number of transactions.  Note the lag between the two. 
 
Figure B
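
One way to quantify the lag shown in Figure B is a simple shift-and-correlate check.  This is only a sketch: it assumes the Vuser count and the transaction rate have been exported from LRA as two aligned, equal-length per-interval series, and the sample data below is hypothetical.

    import numpy as np

    def best_lag(vusers, transactions, max_lag=5):
        """Find the shift (in sampling intervals) at which the transaction
        series best tracks the Vuser series."""
        v = np.asarray(vusers, dtype=float)
        t = np.asarray(transactions, dtype=float)
        best_lag_found, best_r = 0, -2.0
        for lag in range(max_lag + 1):
            a, b = v[: len(v) - lag], t[lag:]
            if len(a) < 2 or a.std() == 0 or b.std() == 0:
                continue  # not enough variation left to correlate
            r = np.corrcoef(a, b)[0, 1]
            if r > best_r:
                best_lag_found, best_r = lag, r
        return best_lag_found, best_r

    # Hypothetical one-minute samples: transactions trail Vusers by ~3 intervals.
    vusers       = [10, 20, 30, 40, 50, 50, 50, 50, 50, 50]
    transactions = [ 5,  5,  5, 10, 20, 30, 40, 50, 50, 50]
    print(best_lag(vusers, transactions))   # reports a lag of 3 intervals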
 
 
Figure C shows a little flowchart that I use to detect application saturation issues using PerfMon counters from an LR test. 
 
Figure C
 
Thrashing
Sometimes applications perform poorly due to misconfigured hardware or problems in the design of the software that result in "thrashing". 
 
One way to find this is with PerfMon counters, and Context Switches/sec is a good starting point.  MS has different recommendations scattered throughout its documentation, but a good guideline is that a host under load should not sustain more than 15,000 context switches/sec/processor. 
 
So, assume you have an app where the counter is showing 200,000 and there are eight CPUs.  In this case the context switches/sec/processor would be 25,000.  This would be worth investigating further. 
 
So what does a high Context Switches/sec mean?  It depends.  If there is also a high rate of Interrupts/sec there could be a hardware issue.  Specifically, I've seen this where the NIC is configured in PIO mode and not in DMA mode.  So what should Interrupts/sec be in a properly configured system?  MS gives no guidance; this is a gut call.  Intel suggests >5,000/sec is high. 
 
System Calls/sec can also be correlated with context switches.  When both are high we have a possible software bottleneck: look at the processes running and whether some can be moved elsewhere, and check counters such as Available MBytes (we could have a memory leak). 
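
As a minimal sketch, the rules of thumb above can be turned into a quick triage script in Python.  It assumes the PerfMon counters have already been averaged over the steady-state portion of the test; the System Calls/sec figure is a placeholder since no published guidance exists for it.

    def thrashing_hints(ctx_switches_per_sec, num_cpus,
                        interrupts_per_sec, system_calls_per_sec):
        """Apply the rough thresholds discussed above to averaged PerfMon
        counters and return a list of things worth investigating further."""
        hints = []
        per_cpu = ctx_switches_per_sec / num_cpus
        if per_cpu > 15_000:                  # guideline quoted above
            hints.append(f"context switches: {per_cpu:,.0f}/sec/processor")
            if interrupts_per_sec > 5_000:    # Intel's rough "high" mark
                hints.append("Interrupts/sec also high: check hardware "
                             "(e.g. a NIC in PIO rather than DMA mode)")
            # No published threshold exists for System Calls/sec;
            # 20,000/sec is just a placeholder for "high".
            if system_calls_per_sec > 20_000:
                hints.append("System Calls/sec also high: possible software "
                             "bottleneck; review processes and Available MBytes")
        return hints

    # Hypothetical averages: the 200,000-on-8-CPUs example from above.
    for hint in thrashing_hints(200_000, 8, 7_500, 250_000):
        print(hint)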
 
In the real world an application can be "too parallel", i.e., it has too many threads active simultaneously. 
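
One common mitigation is to put an explicit ceiling on concurrency instead of spawning a thread per work item.  A minimal Python sketch (handle_request and the work items are hypothetical, and the 2x-CPU-count cap is just an illustrative starting point):

    from concurrent.futures import ThreadPoolExecutor
    import os

    def handle_request(item):
        # Hypothetical work-item handler standing in for real request processing.
        return item * 2

    work_items = range(1_000)

    # Cap worker threads at a small multiple of the CPU count instead of
    # creating one thread per request, to keep context switching in check.
    with ThreadPoolExecutor(max_workers=(os.cpu_count() or 1) * 2) as pool:
        results = list(pool.map(handle_request, work_items))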
 
Software Aging
As software continues to run for long periods without a restart it tends to accumulate errors and memory leaks that lead to performance problems or failure.  Other problems include unreleased locks, I/O handle leaks, data corruption, unterminated threads, and fragmentation of memory and storage. 
 
What's the fix?  Restart the app or reboot.  If the degraded performance disappears and an LR test can resume, then you have a software aging problem. 
 
How can we measure this, diagnose it, and avoid it?  First off, this may not happen purely under load...it can be a time-based problem, and load tests generally only run for a few hours.  This is another reason why an occasional "24 hour load test" can be a good thing.  Software aging can be measured through load testing by running tests at different levels of load and plotting the results; the slope of the line gives the load-dependent portion of the aging rate.  A diagnosis flowchart I use is available in Figure D. 
 
Figure D
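
To make the aging-rate measurement concrete, here is a minimal Python sketch.  It assumes we sampled some resource (say, the process's private bytes, in MB) hourly during each run; all numbers below are hypothetical.  The inner fit gives the aging rate for one run, and the outer fit separates the load-dependent portion from the time-based baseline.

    import numpy as np

    def aging_rate(times_hours, resource_samples):
        """Slope of resource consumption over time (e.g. MB/hour) for one run."""
        slope, _intercept = np.polyfit(times_hours, resource_samples, 1)
        return slope

    # Hypothetical runs: {Vusers: hourly private-bytes samples in MB}.
    runs = {
        50:  [500, 520, 545, 565, 590],
        100: [500, 540, 585, 625, 670],
        200: [500, 585, 665, 750, 835],
    }

    hours = [0, 1, 2, 3, 4]
    loads = sorted(runs)
    rates = [aging_rate(hours, runs[load]) for load in loads]

    # Slope of aging rate vs. load = load-dependent portion of the aging rate;
    # the intercept approximates the purely time-based portion.
    load_slope, baseline = np.polyfit(loads, rates, 1)
    print(f"Aging rate per run (MB/hour): {dict(zip(loads, np.round(rates, 1)))}")
    print(f"Load-dependent aging: {load_slope:.3f} MB/hour per Vuser, "
          f"baseline: {baseline:.1f} MB/hour")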
 
 
Garbage Collection Issues
In .NET, garbage collection can cause spikes in all performance counters, possibly across all hosts.  Temporary increases in response time usually occur as well.  A quick Google search can yield the causes and effects. 
 
How can we fix this?  Avoid use of the Finalize() method, which forces objects to survive two garbage collection passes instead of the usual one before their memory is reclaimed.  Figure E has the diagnosis flowchart I use for garbage collection issues. 
 
Figure E
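
A rough way to confirm the diagnosis from counter data is to look for response time spikes that coincide with heavy collection.  This is only a sketch: it assumes the ".NET CLR Memory\% Time in GC" counter and the transaction response times have been exported as aligned per-interval samples, and the 10% and 2x thresholds are arbitrary.

    def gc_suspect_intervals(pct_time_in_gc, response_times,
                             gc_threshold=10.0, spike_factor=2.0):
        """Return indexes of sampling intervals where a response time spike
        coincides with heavy garbage collection."""
        baseline = sorted(response_times)[len(response_times) // 2]  # median
        return [i for i, (gc, rt) in enumerate(zip(pct_time_in_gc, response_times))
                if gc > gc_threshold and rt > spike_factor * baseline]

    # Hypothetical 30-second samples.
    gc_pct = [2, 3, 2, 35, 4, 2, 40, 3]
    resp_s = [0.4, 0.5, 0.4, 1.6, 0.5, 0.4, 1.9, 0.5]
    print(gc_suspect_intervals(gc_pct, resp_s))   # -> [3, 6]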
 
General Stability Errors
Sometimes you get IIS Error 500 errors (internal server errors).  The error doesn't point to a cause, so what can we do to diagnose and correct these errors based on PerfMon and LR data? 
 
If they are random and not load-related then they are stability errors.  These can be classified as:
  • Race conditions
  • Boundary/limit problems (buffer overruns)
  • Resource Leaks
  • Deadlocks
  • Timeouts

In this case the LR test functions as the "error detector"; increasing the load usually increases the probability of detecting the error.  So if you experience these errors in the real world, you might get lucky and be able to use LR as the tool to generate and detect them.  The question becomes: are the stability errors caused by the load, or merely detected by LR at higher loads? 

One way to tell is to vary the load by +/- 10%.  If the error rate scales linearly with the load, this indicates the errors are not load-induced.  If the error rate accelerates with increased load, then the errors may be load-induced.  Figure F shows my flowchart, and below it is a quick way to run the numbers. 

 

Figure F
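
To make the +/- 10% check above concrete, here is a sketch in Python that normalizes errors per transaction at each load level.  The load levels, counts, and the 1.5x growth cutoff are all hypothetical.

    def classify_errors(results):
        """results: {relative_load: (transactions, errors)} for the -10%,
        baseline, and +10% runs.  If errors per transaction stay roughly
        constant, the errors scale linearly with load (detected, not induced);
        if the per-transaction rate grows with load, they look load-induced."""
        per_txn = {load: errors / txns
                   for load, (txns, errors) in sorted(results.items())}
        rates = list(per_txn.values())
        growth = rates[-1] / rates[0] if rates[0] > 0 else float("inf")
        verdict = "load-induced?" if growth > 1.5 else "not load-induced"
        return per_txn, verdict

    # Hypothetical runs at 90%, 100%, and 110% of the reference load.
    print(classify_errors({0.9: (45_000, 9), 1.0: (50_000, 10), 1.1: (55_000, 11)}))
    print(classify_errors({0.9: (45_000, 9), 1.0: (50_000, 30), 1.1: (55_000, 120)}))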

 

In the flowchart I mention Poisson clumping; what's that?  It is the tendency of purely random, independent events to arrive in bursts or clumps, so a clump of errors by itself doesn't prove they are load-induced.  You can diagnose load-induced errors by observing the pattern of the errors (if any).  If the load tests are designed correctly then the peaks in the error rate should correspond fairly closely with the peaks in the throughput of the application when the two values are plotted together.  Figure G shows the Errors/sec (an LR metric) plotted against the Vusers.  Note the correlation.  Why do they correlate?  Up until a certain level of load is reached where significant queuing occurs, throughput increases with load.  In Figure G the errors didn't start until 50 minutes into the test. 
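
Here is a sketch of that correlation check in Python, assuming Errors/sec and the transaction throughput have been exported from LRA as aligned per-interval series; the 0.7 cutoff is arbitrary, not vendor guidance.

    import numpy as np

    def errors_track_throughput(errors_per_sec, throughput, cutoff=0.7):
        """Pearson correlation between the error-rate and throughput series.
        A strong positive correlation suggests the errors are load-induced
        rather than random background failures."""
        r = np.corrcoef(errors_per_sec, throughput)[0, 1]
        return r, r > cutoff

    # Hypothetical per-minute samples once errors begin appearing.
    errors     = [0, 0, 1, 3, 6, 9, 9, 10]
    throughput = [50, 80, 120, 200, 310, 400, 410, 420]
    print(errors_track_throughput(errors, throughput))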

We should then vary the load by +/- 10% around the reference load to avoid raising it to a level where any load-induced errors start occurring.  With the test running, we can diagnose the cause of the errors with various profiling tools. 

 

Figure G

 

 
