
Performance Spotlight: Detecting Multithreaded Bottlenecks in .NET

A case study from my time as hands-on software architect on a large three-tier business application for a global services and consulting firm.

Takeaways: Quickly detect performance bottlenecks in multithreaded .NET applications by adding the “processor queue length” and “thread contention” performance counters to your monitoring practice and analyzing them.

Metric: Frequent or sustained processor queue lengths of more than two per processor may indicate a bottleneck.  Coupled with high thread contention or high CPU utilization, they may point to locking or thread-synchronization issues.

Situation

Performance is a critical aspect of successful project delivery and usability.

Recently, I was appointed software architect for a large business-critical web application with a global audience.  The application was having performance problems, and the team was trying to get it to pass its performance measurement test (PMT).  Earlier in the year, the application had been the subject of a month-long review by an outside team of scalability consultants.

The web application is a classic three-tier Microsoft stack with an ASP.NET front end, a WCF business services tier, and a SQL Server backend. It spans roughly one million lines of code including test suites, and the larger and more complex an application is, the harder it generally is to find problems.  On the whole, I found the solution very well coded, with advanced use of templates, interfaces, design patterns, and layering.

Solution network diagram

I took a look at the monitoring data they were gathering during performance stress test sessions.  There was a basic set of performance counters, and not much stood out: for the front-end web and middle-tier services, CPU was only moderately loaded at 25-40%, and all other major subsystems – memory, disk, network – were barely taxed at all.  The SQL Server appeared to be barely turning over, with single-digit CPU usage and no stress on other subsystems.

Where was the problem?  

Only one performance counter really stood out as spiky and hyperactive: System\Processor Queue Length.  This is one of those counters that many developers have heard about but never really used or looked at.  PQL was regularly spiking to 150-200 and maxing out at 250:

Performance stress test showing spiking processor queue length
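If you want to watch this counter programmatically rather than in PerfMon, a minimal sketch using System.Diagnostics.PerformanceCounter looks something like the following. The category and counter names are the standard Windows ones; the sampling loop and one-second interval are just an example, not what our monitoring actually used:

using System;
using System.Diagnostics;
using System.Threading;

class PqlSampler
{
    static void Main()
    {
        // System\Processor Queue Length is a single, system-wide count of
        // threads that are ready to run but waiting for a processor.
        var pql = new PerformanceCounter("System", "Processor Queue Length");
        int logicalProcessors = Environment.ProcessorCount;

        for (int i = 0; i < 60; i++)   // sample once a second for a minute
        {
            float queued = pql.NextValue();
            Console.WriteLine("PQL: {0,4:F0}  (~{1:F1} per logical processor)",
                queued, queued / logicalProcessors);
            Thread.Sleep(1000);
        }
    }
}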

The general guidance from Microsoft is that a sustained or recurring PQL of more than two per processor indicates a bottleneck:

“A collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue. The clearest symptom of a processor bottleneck is a sustained or recurring queue of more than two threads.”

What this means is that on a 12-core hyper-threaded server (24 logical processors), a system-wide queue of 150-250 threads works out to threads stacked roughly 10 to 20 deep per physical core (6 to 10 per logical processor), all just waiting for the CPU.   This is the opposite of high performance: more specifically, it’s a bottleneck.

Advancing the analysis

Based on this and further research, I compiled a more detailed set of performance counters that included .NET thread contention levels: the Contention Rate / sec counter in the .NET CLR LocksAndThreads category.  These counters were collected in a subsequent PMT run:

Detail of contention rate / sec

This shows that on average there are ~90 contentions per second, instances where a thread is waiting on or blocked by a lock, with spikes of up to 300 per second.   By itself, 300 contentions per second might not be an issue, but coupled with a high PQL it seemed suspect.

Finding the culprit

When troubleshooting systemic issues, start with the core/common code structures and move outward.

How do you find the cause of excessive contention in a million lines of code that’s studded with locks, much less on a new project you’re not familiar with?  Answer: very strategically. 

I’ve had many years of experience as a consultant troubleshooting clients’ solutions and reviewing new codebases.  If you’re lucky, you can reproduce the issue in a development environment where profiling tools will identify performance hotspots or concurrency bottlenecks.  For the harder issues, the first thing I do is code-review any core/common code related to the suspected areas of functionality.  In this case, searching for Parallel.For statements would have been a great bet…

On this application, I had in fact already flagged a piece of code in a core component.  Caches are essential to high performance, and the business logic service tier was home to more than 20 in-memory caches, all of which derived from a common base class, SynchronizedCache.

Sample cache declaration showing use of common base class

In this base class the developers parallelized the cache search functionality, presumably to take advantage of nice things like a 72-core production environment.  Unfortunately, there’s a lock statement in the heart of it:

Parallel.For with a lock in its heart!
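The screenshot above shows the real code; since I can’t reproduce it here, the sketch below is a hypothetical reconstruction of the same shape of problem: a parallelized search over cache entries that takes a shared lock for every match. The class, field, and method names are mine, not the project’s.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the project's SynchronizedCache base class;
// names and structure are illustrative, not the actual code.
public class SynchronizedCache<T>
{
    private readonly List<T> _entries = new List<T>();
    private readonly object _matchLock = new object();

    public void Add(T item)
    {
        _entries.Add(item);
    }

    // The anti-pattern: the parallel search takes a shared lock for every
    // match, so the worker threads spend their time waiting, not searching.
    public List<T> Search(Predicate<T> isMatch)
    {
        var matches = new List<T>();

        Parallel.For(0, _entries.Count, i =>
        {
            if (isMatch(_entries[i]))
            {
                lock (_matchLock)    // every matching thread serializes here
                {
                    matches.Add(_entries[i]);
                }
            }
        });

        return matches;
    }
}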

Developers experienced with high-performance multithreading know to avoid locking within parallelized code: it’s a real performance killer.   Here the code, with many threads combing through a cache that could contain a million entities, locks the shared matches list every time it finds a match, causing all the other matching threads to pile up in a traffic jam behind that lock.

A search-heavy run through the .NET profiler in Visual Studio 2012 supported this analysis.  The little “flame” icon marks this line as a hotspot, which by itself doesn’t conclusively indicate a bottleneck; one might reasonably expect this line to be “hot” simply because it’s doing a lot of search work:

Visual Studio profiler showing hotspot on Parallel.For

But this is a classic multithreading anti-pattern.  Locking (and blocking) on a shared resource, in this case the matched-results list, negates the fundamental reason for multithreading: multithreading splits the work up among processors, while locking and synchronization pull them back together.  It also adds context-switching work for the processors, since each match that has to wait on the lock typically forces a context switch out and back.

Fixing it

Best practice in multithreading: minimize locking on shared resources.

So we’d like to search a large cache, dividing the work among many threads in parallel, but without locking on a central list; how do we accomplish that?   The solution is straightforward: give each thread its own results list, then merge each thread’s results into the master list when it completes.  The “penalty” for this, creating one extra list per thread, is almost zero.  There’s even a variant of Parallel.For that takes a delegate for thread-local initialization (create the local results list) and another for completion (lock once and merge into the master): perfect!
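A sketch of the reworked search, slotting into the hypothetical SynchronizedCache<T> shown earlier (again, the names are mine, not the project’s). It uses the Parallel.For overload that takes localInit, body, and localFinally delegates:

// Replaces the Search method in the hypothetical SynchronizedCache<T> above.
public List<T> Search(Predicate<T> isMatch)
{
    var matches = new List<T>();
    var mergeLock = new object();

    Parallel.For(
        0, _entries.Count,
        // localInit: each worker thread starts with its own private results list.
        () => new List<T>(),
        // body: only thread-local state is touched, so no lock and no contention.
        (i, loopState, localMatches) =>
        {
            if (isMatch(_entries[i]))
            {
                localMatches.Add(_entries[i]);
            }
            return localMatches;
        },
        // localFinally: each thread locks exactly once to merge into the master list.
        localMatches =>
        {
            lock (mergeLock)
            {
                matches.AddRange(localMatches);
            }
        });

    return matches;
}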

After reworking this tiny stretch of code, the results were dramatic: PQL dropped to zero, thread contention dropped to zero, and, most surprisingly, CPU utilization fell from its 25-40% range to around 8-10%.  To me, this says the CPU had been doing a great deal of extra work juggling all the threads, locks, and context switches caused by the anti-pattern.

With the fix in, a 90-minute performance measurement test shows the radical improvement:

Performance counters after fix

Conclusion

Multithreading can be a curse as well as a blessing; monitoring processor queue length and thread contention will reveal how effectively your application is using its processor threads and cores.  A single statement in a million lines of code can dramatically impact performance and throughput.

Add the following performance counters to your monitoring bag of tricks to identify bottlenecks in multithreaded .NET solutions (a minimal sampling sketch follows the list):

  • System\Processor Queue Length.  Look for any spikes or sustained queue lengths of more than one or two per processor.
  • .NET CLR LocksAndThreads\Contention Rate / sec.   Look for spikes or sustained contention, and make sure you understand what in your code is causing it.
  • Processor\% Processor Time.   I am sure you’re already monitoring this!
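As a wiring example, here is a minimal console monitor that samples all three counters and flags the rule-of-thumb threshold discussed above. The “.NET CLR LocksAndThreads” counters are per-process, so the instance name (the “_Global_” aggregate here, rather than a specific process such as your IIS or WCF host) and the one-second interval are assumptions to adapt to your own environment:

using System;
using System.Diagnostics;
using System.Threading;

class BottleneckMonitor
{
    static void Main()
    {
        var pql = new PerformanceCounter("System", "Processor Queue Length");
        var contention = new PerformanceCounter(
            ".NET CLR LocksAndThreads", "Contention Rate / sec", "_Global_");
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        int logicalProcessors = Environment.ProcessorCount;

        while (true)
        {
            float queuePerProcessor = pql.NextValue() / logicalProcessors;
            float contentionRate = contention.NextValue();   // rate counters report real values
            float cpuPercent = cpu.NextValue();              // from the second sample onward

            Console.WriteLine("PQL/processor: {0:F1}   Contention/sec: {1:F0}   CPU: {2:F0}%",
                queuePerProcessor, contentionRate, cpuPercent);

            // Rule of thumb from this article: a sustained queue of more than two
            // per processor, especially alongside high contention, points to
            // locking or thread-synchronization issues.
            if (queuePerProcessor > 2)
            {
                Console.WriteLine("  -> possible locking / synchronization bottleneck");
            }

            Thread.Sleep(1000);
        }
    }
}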

More Reading