CPU cycles vs. total CPU time - Windows

On Windows, GetProcessTimes() and QueryProcessCycleTime() can be used to get totals for all threads of an app. I expected (apparently naively) to find a proportional relationship between the total number of cycles and the total processor time (user + kernel). When converted to the same units (seconds) and expressed as a percentage of the app's running time, they're not even close, and the ratio between them varies greatly.
Right after an app starts, they're fairly close.
CPU cycles: 3.6353%
CPU time: 5.2000%
Ratio: 0.79
But this ratio increases as an app remains idle (below, after 11 hours, mostly idle).
CPU cycles: 0.0474%
CPU time: 0.0039%
Ratio: 12.16
Apparently, cycles are counted that don't contribute to user or kernel time. I'm curious about how it works. Please enlighten me.
Thanks.
Vince

The GetProcessTimes and QueryProcessCycleTime values are calculated in different ways. GetProcessTimes/GetThreadTimes are updated in response to timer interrupts, while the QueryProcessCycleTime values are based on tracking actual thread execution. These different measurement methods can produce vastly different results when the two APIs are compared, especially since GetThreadTimes only accounts for time in whole timer intervals (see http://blog.kalmbachnet.de/?postid=28), which usually results in inaccurate timings.
Since GetProcessTimes will in general report less time than was actually spent (short bursts of execution that end before a timer interrupt fires are never charged), it makes sense that its CPU time percentage falls further and further below the cycle-based percentage: a mostly idle app runs in many such short bursts, which the timer-based accounting misses but the cycle counter still records.
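To see the two counters side by side, here is a minimal sketch (C, Windows Vista or later) that reads both for the current process. The CYCLES_PER_SECOND constant is an assumption you have to supply yourself; Windows does not tell you how to convert cycle counts into seconds.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* assumption: nominal CPU frequency used to turn cycles into seconds */
        const double CYCLES_PER_SECOND = 3.0e9;
        HANDLE proc = GetCurrentProcess();

        ULONG64 cycles = 0;
        FILETIME ftCreate, ftExit, ftKernel, ftUser;

        if (!QueryProcessCycleTime(proc, &cycles) ||
            !GetProcessTimes(proc, &ftCreate, &ftExit, &ftKernel, &ftUser)) {
            fprintf(stderr, "query failed: %lu\n", GetLastError());
            return 1;
        }

        /* FILETIME values are in 100 ns units, split across two 32-bit halves */
        ULONGLONG kernel100ns = ((ULONGLONG)ftKernel.dwHighDateTime << 32) | ftKernel.dwLowDateTime;
        ULONGLONG user100ns   = ((ULONGLONG)ftUser.dwHighDateTime   << 32) | ftUser.dwLowDateTime;

        double cpu_time_s   = (double)(kernel100ns + user100ns) * 100e-9;
        double cycle_time_s = (double)cycles / CYCLES_PER_SECOND;

        printf("CPU time (user+kernel): %.6f s\n", cpu_time_s);
        printf("cycle time (approx.):   %.6f s\n", cycle_time_s);
        return 0;
    }

Dividing each value by the app's wall-clock running time gives the two percentages being compared above.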

Related

What is an appropriate model for detecting CPU usage abnormality?

I am currently working on a model for detecting CPU usage abnormality in our regression test cases. The goal is to keep tracking the CPU usage day after day and raise an alarm if the CPU usage of any particular test case rockets up, so that the relevant developer can step in promptly to investigate.
The model I've figured out is like this:
Measure the CPU usage of the target process every second: read /proc/<pid>/stat and divide the CPU time consumed during the elapsed second by the wall time (which, by construction, is 1 second).
When the test case finishes, we get an array of CPU usage data (how long the array is depends on how long the test case runs). Go through this array and get a summary array with 5 elements, which represents the distribution of CPU usage -- [0~20%, 20~40%, 40~60%, 60~80%, +80%], such as [0.1, 0.2, 0.2, 0.4, 0.1].
0~20%: the fraction of all measurements with CPU usage between 0 and 20%
20~40%: the fraction of all measurements with CPU usage between 20% and 40%
40~60%: the fraction of all measurements with CPU usage between 40% and 60%
60~80%: the fraction of all measurements with CPU usage between 60% and 80%
+80%: the fraction of all measurements with CPU usage above 80% (the usage can exceed 100% if the process is multi-threaded)
The sum of this summary array should be 1.
Then I compute a score from the summary array with this formula:
summary[0]*1 + summary[1]*1 + summary[2]*2 + summary[3]*4 + summary[4]*8
The principle is: higher CPU usage gets higher penalty.
The higher the final score is, the higher the overall CPU usage is. For any particular test case, I can get one score every day. If the score rockets up some day, an alarm is raised.
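For concreteness, here is a minimal sketch of the bucketing and scoring step described above. The sample values in main() are made up, and cpu_usage_score() is a hypothetical helper, not part of any existing tool.

    /* Sketch: turn per-second CPU usage samples into the 5-bucket summary
       and the weighted score described above. Samples may exceed 1.0 when
       the process is multi-threaded. */
    #include <stdio.h>

    double cpu_usage_score(const double *samples, int n)
    {
        double summary[5] = {0};                 /* [0~20%, 20~40%, 40~60%, 60~80%, +80%] */
        static const double weights[5] = {1, 1, 2, 4, 8};

        for (int i = 0; i < n; i++) {
            double u = samples[i];
            int bucket = u < 0.2 ? 0 : u < 0.4 ? 1 : u < 0.6 ? 2 : u < 0.8 ? 3 : 4;
            summary[bucket] += 1.0 / n;          /* fraction of all measurements */
        }

        double score = 0.0;
        for (int i = 0; i < 5; i++)
            score += summary[i] * weights[i];    /* higher usage, higher penalty */
        return score;
    }

    int main(void)
    {
        /* hypothetical per-second samples from one test-case run */
        double samples[] = {0.05, 0.30, 0.55, 0.72, 0.95, 1.40, 0.10, 0.65};
        printf("score = %.3f\n", cpu_usage_score(samples, 8));
        return 0;
    }

The score always lands between 1 (everything in the lowest bucket) and 8 (everything above 80%).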
To verify this model, I've repeatedly run a randomly selected set of test cases a few hundred times on different machines. I found that on any particular host the scores fluctuate, but within a reasonable range; however, different hosts settle into different ranges.
This is a diagram showing the scores of 600 runs for a particular test case. Different colors indicate different hosts.
Here are a few questions:
Does it make sense that the scores on different machines fall in different ranges?
If I want to track CPU usage for a test case, should I run it on just one dedicated machine day after day?
Is there a better model for tracking CPU usage and raising alarms?
You should have a big-O model for every function, based on what you believe it should be doing. Then generate your regression cases against it by varying n. If the function does not show the expected performance as n increases, something is wrong. If it is too good, the problem is in your expectations and your test cases.
Another approach is to come up with a theoretical ideal-run metric. Ideal runs tend to save time by hard-coding results or looking them up in massive tables; the additional error checking and so on that real code requires will detract from its efficiency compared to the ideal run.
The only way to prove that CPU use is bad is to come up with a better alternative. If you could do that, then you should have designed your dev process to produce the better algorithm in the first place!

How to interpret ETW graphs for sampled and precise CPU usage when they contradict

I'm having difficulties pinning down where our application is spending its time.
Looking at the flame graphs of an ETW trace for sampled and precise CPU usage, they contradict each other. Below are the graphs for a 1-second duration.
According to the "CPU Usage (Sampled)" graph
vbscript.dll!COleScript::ParseScriptText is a big contributor in the total performance.
ws2_32.dll!recv is a small contributor.
According to the "CPU Usage (Precise)" graph
Essentially, this shows it's the other way around?
vbscript.dll!COleScript::ParseScriptText is a small contributor, only taking up 3.95 ms of CPU.
ws2_32.dll!recv is a big contributor, taking up 915.09 ms of CPU.
What am I missing or misinterpreting?
CPU Usage (Sampled)
CPU Usage (Precise)
There is a simple explanation:
CPU Usage (Precise) is based on context-switch data, so it can give an extremely accurate measure of how much time a thread spends on-CPU - how much time it is running. However, because it is based on context-switch data, it only knows what the call stack is when a thread goes on or off the CPU; it has no idea what the thread is doing in between. Therefore, CPU Usage (Precise) knows how much time a thread is using, but it has no idea where that time is spent.
CPU Usage (Sampled) is less accurate with regard to how much CPU time a thread consumes, but it is quite good (statistically speaking, absent systematic bias) at telling you where that time is spent.
With CPU Usage (Sampled) you still need to be careful about inclusive versus exclusive time (time spent in a function versus spent in its descendants) in order to interpret data correctly but it sounds like that data is what you want.
For more details see https://randomascii.wordpress.com/2015/09/24/etw-central/, which has documentation of all of the columns for these two tables, as well as case studies of using them both (one to investigate CPU usage scenarios, the other to investigate CPU idle scenarios).

Is it preferable to use the total time taken for a canonical workload as a benchmark or count the cycles/time taken by the individual operations?

I'm designing a benchmark for a critical system operation. Ideally the benchmark can be used to detect performance regressions. I'm debating between using the total time for a large workload passed into the operation or counting the cycles taken by the operation as the measurement criterion for the benchmark.
Each iteration of the operation in question is fast, perhaps 300-500 nanoseconds.
A total time is much easier to measure accurately / reliably, and the measurement overhead is irrelevant. It's what I'd recommend, as long as you're sure you can stop your compiler from optimizing across iterations of whatever you're measuring. (Check the generated asm if necessary).
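As an illustration, here is a minimal sketch of the total-time approach (POSIX clock_gettime here; do_operation() is a made-up stand-in for the real operation, and the empty asm statement is a GCC/Clang compiler barrier that keeps the loop from being collapsed or reordered away):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* stand-in for the real 300-500 ns operation */
    static uint64_t do_operation(uint64_t x) { return x * 2654435761u + 1; }

    int main(void)
    {
        const long ITERS = 10 * 1000 * 1000;
        uint64_t acc = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++) {
            acc = do_operation(acc);
            /* compiler barrier: forces acc to be materialized each iteration */
            __asm__ __volatile__("" : "+r"(acc));
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg %.1f ns/iter (acc=%llu)\n", total_ns / ITERS,
               (unsigned long long)acc);
        return 0;
    }

The per-iteration overhead of the two clock_gettime calls is amortized over millions of iterations, which is exactly why the total-time approach is the easy one.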
If you think your runtime might be data-dependent and want to look into variation across iterations, then you might consider recording timestamps somehow. But 300 ns is only ~1k clock cycles on a 3.3GHz CPU, and recording a timestamp takes some time. So you definitely need to worry about measurement overhead.
Assuming you're on x86, raw rdtsc around each operation is pretty lightweight, but out-of-order execution can reorder the timestamps with the work. See the Stack Overflow questions "Get CPU cycle count?" and "clflush to invalidate cache line via C function" for details.
Using lfence; rdtsc; lfence to keep the timestamps from reordering with each iteration of the workload will also block out-of-order execution across the steps of the workload itself, distorting things. (The out-of-order execution window on Skylake is a ROB of 224 uops; at 4 per clock that's a small fraction of 1k clock cycles, but in lower-throughput code with stalls for cache misses there could be significant overlap between independent iterations.)
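If per-iteration timestamps are really needed, here is a minimal sketch of that lfence; rdtsc; lfence pattern using the GCC/Clang intrinsics; do_operation() is again a made-up stand-in, and remember that the fences add overhead and block overlap between iterations:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang on x86) */

    static inline uint64_t fenced_rdtsc(void)
    {
        _mm_lfence();              /* drain earlier instructions before reading the TSC */
        uint64_t t = __rdtsc();
        _mm_lfence();              /* keep later work from starting before the read */
        return t;
    }

    /* stand-in for the real 300-500 ns operation */
    static uint64_t do_operation(uint64_t x) { return x * 2654435761u + 1; }

    int main(void)
    {
        uint64_t acc = 0;
        uint64_t t0 = fenced_rdtsc();
        acc = do_operation(acc);
        uint64_t t1 = fenced_rdtsc();
        /* note: the TSC counts reference cycles, not core clock cycles under
           turbo or power saving */
        printf("one iteration: %llu reference cycles (acc=%llu)\n",
               (unsigned long long)(t1 - t0), (unsigned long long)acc);
        return 0;
    }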
Any standard timing function like C++ std::chrono will normally call library functions that ultimately use rdtsc, but with many extra instructions. Or worse, it will make an actual system call, taking well over a hundred clock cycles to enter/leave the kernel, and more with Meltdown+Spectre mitigations enabled.
However, one thing that might work is using Intel PT (https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing) to record timestamps on taken branches. Without blocking out-of-order exec at all, you can still get timestamps for when the loop branch of your repeat loop executed. This may well be independent of your workload and able to run soon after it's issued into the out-of-order part of the core, but that can only happen a limited distance ahead of the oldest not-yet-retired instruction.

GPU affects core calculation and/or RAM access (high jitter)?

I have a kthread which runs alone on one core of a multi-core CPU. This kthread disables all IRQs for that core, runs a loop as fast as possible, and measures the maximum loop duration with the help of the TSC. All the ACPI power-management features are disabled (no frequency scaling, no power saving, etc.).
My problem is that the maximum loop duration apparently depends on the GPU.
When the system is in normal use (a bit of office, Internet, and programming work; not really busy), the maximum loop duration is around 5 us :-(
The same situation, but with a stressed CPU (the other three cores 100% busy), leads to a maximum loop duration of approximately 1 us :-|
But when the GPU switches into idle mode (the screen turns off), the maximum loop duration drops to less than 300 ns :-)
Why is that? And how can I influence this behavior? I thought the CPU and the RAM were directly connected. I noticed that the maximum loop duration improves in the first situation on a system with an external graphics card; for the second and third cases I couldn't see a difference. I also tested AMD and Intel systems without success - always the same :-(
I'm fine with the second case, but is it possible to achieve that without additionally stressing the CPU?
Many thanks in advance!
Billy
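For readers unfamiliar with the technique, here is a heavily simplified user-space analogue of the measurement loop described above (the real setup is a kthread with local IRQs disabled, which is not reproduced here); it just records the worst gap seen between two consecutive TSC reads:

    /* Sketch: user-space analogue of the max-loop-duration measurement.
       Assumes an invariant TSC; pin the process to one core for results
       that are even loosely comparable to the kthread version. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    int main(void)
    {
        const uint64_t ITERATIONS = 500ULL * 1000 * 1000;
        uint64_t prev = __rdtsc();
        uint64_t max_gap = 0;

        for (uint64_t i = 0; i < ITERATIONS; i++) {
            uint64_t now = __rdtsc();
            uint64_t gap = now - prev;   /* cycles since the previous loop pass */
            if (gap > max_gap)
                max_gap = gap;           /* worst-case jitter observed so far */
            prev = now;
        }

        printf("max gap: %llu TSC cycles\n", (unsigned long long)max_gap);
        return 0;
    }

In user space the gap also includes interrupts and preemption, which is exactly what the original kthread setup is designed to exclude.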

Questions on Measuring Time Using the CPU Clock

I'm aware of the standard methods of getting time deltas using CPU clock counters on various operating systems. My question is, how do such operating systems account for changes in CPU frequency made for power-saving purposes? I initially thought this could be explained by the fact that OSes use specific calls to measure the frequency and get a corrected value based on which core is being used, what frequency it is currently set to, and so on. But then I realized: wouldn't that make any time delta inaccurate if the CPU frequency was lowered and then raised back to its original value in between the two clock queries?
For example take the following scenario:
1. Query the CPU cycle count.
2. The operating system lowers the CPU frequency for power saving.
3. Some other code runs here.
4. The operating system raises the CPU frequency for performance.
5. Query the CPU cycle count again.
6. Calculate the delta as the cycle difference divided by the frequency.
This would yield an inaccurate delta since the CPU frequency was not constant between the two queries. How is this worked around by the operating system or programs that have to work with time deltas using CPU cycles?
See this related question: wrong clock cycle measurements with rdtsc.
There are several ways to deal with it:
Set the CPU clock to its maximum: read the link above to see how to do it.
Use the PIT instead of RDTSC: the PIT is the programmable interval timer (Intel 8253/8254); it has been present on PC motherboards since the original IBM PC, but its base clock is only ~1.19 MHz and not every OS gives you access to it.
Combine the PIT and RDTSC: measure the CPU clock with the PIT repeatedly; once it is stable enough, start your measurement (and keep scanning for CPU clock changes). If the CPU clock changes during a measurement, throw the measurement away and start again (see the sketch below).
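As a rough illustration of that last approach, here is a sketch with the PIT swapped for a portable monotonic clock (clock_gettime), since most operating systems will not let user code program the PIT directly: estimate the TSC rate against the reference clock before and after the measurement, and discard the run if the two estimates disagree. (On CPUs with an invariant TSC the rate will not actually change, but the structure matches the approach described.)

    /* Sketch of the "calibrate, measure, re-check" idea: RDTSC for the actual
       measurement, a reference clock to detect CPU clock changes. The PIT of
       the original answer is replaced by clock_gettime(CLOCK_MONOTONIC). */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <x86intrin.h>   /* __rdtsc */

    static double tsc_hz_estimate(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t c0 = __rdtsc();
        usleep(50 * 1000);                   /* 50 ms calibration window */
        uint64_t c1 = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return (double)(c1 - c0) / secs;
    }

    int main(void)
    {
        double hz_before = tsc_hz_estimate();

        uint64_t start = __rdtsc();
        usleep(200 * 1000);                  /* stand-in for the code being measured */
        uint64_t cycles = __rdtsc() - start;

        double hz_after = tsc_hz_estimate();

        if (fabs(hz_after - hz_before) / hz_before > 0.01) {
            fprintf(stderr, "clock rate changed during measurement, discard and retry\n");
            return 1;
        }
        printf("elapsed: %.6f s (%llu cycles at ~%.0f Hz)\n",
               cycles / hz_before, (unsigned long long)cycles, hz_before);
        return 0;
    }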

Resources