Costly Performance Monitoring Events Intel - performance

I was recently able to set up Intel performance monitoring for processors using Sandy Bridge micro-architecture to monitor for split-lock errors, which can be highly detrimental to performance and speed in code that runs frequently. Now that I have been able to use this to locate and fix these errors, I was curious of any other types of events that I could monitor for that could negatively effect performance.
Which events take the biggest toll on code-speed / efficiency?
List of events available to me can be found here in Chapter 19:
https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf
Thanks!

Related

Difference between profiling and dignostics(performance counters and monitoring)

I am working on implementing prototype performance monitoring system, I went through multiple documents and resources for understanding the concept but am still confused between profiling and dignostics. Can somebody provide an explanation of these two terms, their relation and when/where do we use them?
"Profiling" usually means mapping things happening in the system (e.g., performance monitoring events) to processes, or to functions (or instructions) within processes. Examples of profiling tools in the Unix/Linux world include "gprof" and "oprofile". Intel's "VTune Amplifier" is another commonly used profiler. Some profilers are limited to looking at the performance of a single process, while others (usually requiring elevated privileges) monitor all processes (including the kernel) operating on the system during the measurement period.
"Diagnostics" is not a term I see very often in performance monitoring, but from the context I would assume that this means looking for evidence of "trouble" in the overall operation of the system. As an example, the performance monitoring system at https://github.com/TACC/tacc_stats collects hardware and software performance monitoring data on each server. In TACC's operation, the data is reviewed automatically to look for matches to a variety of heuristics related to known patterns of poor performance (e.g., all memory accesses being made to one socket in a 2-socket system). The data is also used by human performance analysts in response to user queries and is aggregated to provide an overview of performance-related characteristics by application area.

Idle network filter driver performance on windows

I have come across a strange issue regarding network driver filters on Windows.
It seems that merely installing a network driver filter will cause a degradation in performance.
I am testing different scenarios of 1 Gigabit bandwidth connections and experience an increase in CPU interrupts and lower overall network utilization.
The installed driver in question is completely in packet passthrough mode (No packet reaches usermode).
Is the driver to blame, or will every installed network filter driver cause degradation even when it is not doing anything rather than passing on the packets in kernel mode to the next driver on the stack?
What will be the effects of such a driver on a virtual machine?
After searching all over I have come to no conclusions.
I would be very grateful for any advice whatsoever!
The OS has a fast-path for the case when NDIS filters are not present. Even if the filter does very little, its mere presence can inhibit the fast-path. There is another fast-path for when no WFP filter is installed. The WFP fast-path has an even more significant effect on performance. So it's not too surprising that installing a no-op filter (WFP or NDIS) will have a small effect on performance.
The effect should be so small that it should be difficult to measure. For NDIS, I expect much less than 1% impact to key metrics. For WFP, I expect less than 1% in small-scale (1Gbps), and possibly a little more at larger scale (10Gbps+). In no case should a typical PC struggle to operate a full line rate of 1Gbps, using synthetic workload data.
I issue a generic caution that performance measurement is subtle. It's way too easy to generate convincing graphs that are spoiled by some external factor. Be wary of drawing conclusions until you've thoroughly "debugged" your data itself.

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which allows to measure how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only whether the memory bandwidth is being saturated or not.
If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.
it would be hard to find a tool that measured memory bandwidth utilization for your application.
But since the issue you face is a suspected memory bandwidth problem, you could try and measure if your application is generating a lot of page faults / sec, which would definitely mean that you are no where near the theoretical memory bandwidth.
You should also measure how cache friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" on good sources that tells you how to do this.
It isn't possible to properly measure memory bus utilisation with any kind of software-only solution. (it used to be, back in the 80's or so. But then we got piplining, cache, out-of-order execution, multiple cores, non-uniform memory architectues with multiple busses, etc etc etc).
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specficially for intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for linux) which has a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably-trained computer engineer as well as quite high performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy, with some busses - eg old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are used usually between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even it it normally occasionally triggers, can be monitored for excessively 'long' levels: For the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB fifo hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
You monitor utiliation either way by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity which is occuring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements also.
This isn't so staightfoward as just looking for periods of no activity - all DRAM needs regular refresh cycles at the very least, which may or may not happen along with obvious bus activity (some DRAM's do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc).
So the instrument needs to be able to analyse the data deeply enough for you extract how busy it is.
Your best, and simplest bet is to find a PC hardware (CPU) vendor who has tools which do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows - there are windows guest drivers for it), and also pin down your VM to specific CPUs, whilst you also configure linux to avoid putting other jobs on those same CPUs.

Optimal CPU utilization thresholds

I have built software that I deploy on Windows 2003 server. The software runs as a service continuously and it's the only application on the Windows box of importance to me. Part of the time, it's retrieving data from the Internet, and part of the time it's doing some computations on that data. It's multi-threaded -- I use thread pools of roughly 4-20 threads.
I won't bore you with all those details, but suffice it to say that as I enable more threads in the pool, more concurrent work occurs, and CPU use rises. (as does demand for other resources, like bandwidth, although that's of no concern to me -- I have plenty)
My question is this: should I simply try to max out the CPU to get the best bang for my buck? Intuitively, I don't think it makes sense to run at 100% CPU; even 95% CPU seems high, almost like I'm not giving the OS much space to do what it needs to do. I don't know the right way to identify best balance. I guessing I could measure and measure and probably find that the best throughput is achived at a CPU avg utilization of 90% or 91%, etc. but...
I'm just wondering if there's a good rule of thumb about this??? I don't want to assume that my testing will take into account all kinds of variations of workloads. I'd rather play it a bit safe, but not too safe (or else I'm underusing my hardware).
What do you recommend? What is a smart, performance minded rule of utilization for a multi-threaded, mixed load (some I/O, some CPU) application on Windows?
Yep, I'd suggest 100% is thrashing so wouldn't want to see processes running like that all the time. I've always aimed for 80% to get a balance between utilization and room for spikes / ad-hoc processes.
An approach i've used in the past is to crank up the pool size slowly and measure the impact (both on CPU and on other constraints such as IO), you never know, you might find that suddenly IO becomes the bottleneck.
CPU utilization shouldn't matter in this i/o intensive workload, you care about throughput, so try using a hill climbing approach and basically try programmatically injecting / removing worker threads and track completion progress...
If you add a thread and it helps, add another one. If you try a thread and it hurts remove it.
Eventually this will stabilize.
If this is a .NET based app, hill climbing was added to the .NET 4 threadpool.
UPDATE:
hill climbing is a control theory based approach to maximizing throughput, you can call it trial and error if you want, but it is a sound approach. In general, there isn't a good 'rule of thumb' to follow here because the overheads and latencies vary so much, it's not really possible to generalize. The focus should be on throughput & task / thread completion, not CPU utilization. For example, it's pretty easy to peg the cores pretty easily with coarse or fine-grained synchronization but not actually make a difference in throughput.
Also regarding .NET 4, if you can reframe your problem as a Parallel.For or Parallel.ForEach then the threadpool will adjust number of threads to maximize throughput so you don't have to worry about this.
-Rick
Assuming nothing else of importance but the OS runs on the machine:
And your load is constant, you should aim at 100% CPU utilization, everything else is a waste of CPU. Remember the OS handles the threads so it is indeed able to run, it's hard to starve the OS with a well behaved program.
But if your load is variable and you expect peaks you should take in consideration, I'd say 80% CPU is a good threshold to use, unless you know exactly how will that load vary and how much CPU it will demand, in which case you can aim for the exact number.
If you simply give your threads a low priority, the OS will do the rest, and take cycles as it needs to do work. Server 2003 (and most Server OSes) are very good at this, no need to try and manage it yourself.
I have also used 80% as a general rule-of-thumb for target CPU utilization. As some others have mentioned, this leaves some headroom for sporadic spikes in activity and will help avoid thrashing on the CPU.
Here is a little (older but still relevant) advice from the Weblogic crew on this issue: http://docs.oracle.com/cd/E13222_01/wls/docs92/perform/basics.html#wp1132942
If you feel your load is very even and predictable you could push that target a little higher, but unless your user base is exceptionally tolerant of periodic slow responses and your project budget is incredibly tight, I'd recommend adding more resources to your system (adding a CPU, using a CPU with more cores, etc.) over making a risky move to try to squeeze out another 10% CPU utilization out of your existing platform.

Open source profiler for analyzing low-level architectural inefficiencies?

Modern processors use all sorts of tricks to bridge the gap between the large speed of their processing elements and the tardiness of the external memory. In performance-critical applications the way you structure your code can often have a considerable influence on its efficiency. For instance, researchers using the SLO analyzer were able to fix cache locality problems and double the execution speed of several SPEC2000 benchmark programs. I'm looking for recommendations for an open source tool that utilizes a processor's performance monitoring support to locate and analyze architectural inefficiencies, such as cache misses, branch mispredicts, front end stalls, cache pollution through address aliasing, long latency instructions, and TLB misses. I'm aware of Intel's VTune (commercial), AMD's CodeAnalysist (free, but not open source), and Cachegrind (relies on simulation).
For linux, oprofile works well. Actually AMD's CodeAnalysist uses oprofile as its backend.
Oprofile uses processor's intenal performance tunning mechanism to analyze architectural inefficiency.

Resources