How do I obtain CPU cycle count in Win32? - winapi

In Win32, is there any way to get a unique CPU cycle count or something similar that would be uniform across multiple processes/languages/systems/etc.?
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to @Greg, who put me on the track of QueryPerformanceCounter, which did the trick.

Here's an interesting article! It says not to use RDTSC, but to use QueryPerformanceCounter instead.
Conclusion:
Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operating systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]
However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.
Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).
For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).
For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code in the linked article.
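The article's sample code isn't reproduced here, but a minimal sketch of the hybrid scheme it describes could look like the following (Win32/C++; the function name precise_delay_us is my own, and timeBeginPeriod(1) is assumed to have been called at program startup):

```cpp
// Hybrid delay: Sleep() off the bulk of the wait, then busy-wait the last
// sub-millisecond remainder on QueryPerformanceCounter.
#include <windows.h>

void precise_delay_us(long long microseconds)   // hypothetical helper name
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    const long long target =
        start.QuadPart + microseconds * freq.QuadPart / 1000000;

    // Sleep away whole milliseconds (minus one for safety); this relies on
    // timeBeginPeriod(1) having been called earlier, as the article advises.
    if (microseconds > 2000)
        Sleep(static_cast<DWORD>(microseconds / 1000 - 1));

    // Finish the remainder in a 100%-busy loop for sub-millisecond accuracy.
    do {
        QueryPerformanceCounter(&now);
    } while (now.QuadPart < target);
}
```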

You can use the RDTSC CPU instruction (assuming x86). This instruction gives the CPU cycle counter, but be aware that it will increase very quickly to its maximum value, and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.
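If you do want the raw counter, here is a minimal sketch assuming MSVC and its __rdtsc intrinsic (gcc/clang expose the same intrinsic via <x86intrin.h>), alongside the QueryPerformanceCounter alternative:

```cpp
#include <windows.h>
#include <intrin.h>   // __rdtsc on MSVC
#include <cstdio>

int main()
{
    unsigned long long cycles = __rdtsc();   // raw CPU timestamp counter

    LARGE_INTEGER qpc;
    QueryPerformanceCounter(&qpc);           // OS-chosen high-resolution counter

    printf("rdtsc = %llu, qpc = %lld\n", cycles, qpc.QuadPart);
    return 0;
}
```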

System.Diagnostics.Stopwatch.GetTimestamp() returns the number of CPU cycles since a time origin (maybe when the computer starts, but I'm not sure), and I've never seen it fail to increase between two calls.
The CPU cycles will be specific to each computer, so you can't use it to merge log files between two computers.

RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.
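Windows doesn't expose the HPET directly to user mode, but the same idea can be sketched with QueryPerformanceCounter standing in as the high-resolution delta source: take one wall-clock reading (which can be NTP-disciplined) and extend it with counter ticks. This is only an illustrative sketch; the names TimeBase, calibrate and now100ns are my own.

```cpp
#include <windows.h>

struct TimeBase {
    FILETIME      wall;   // system time at calibration (100 ns units)
    LARGE_INTEGER qpc;    // performance counter at (roughly) the same instant
    LARGE_INTEGER freq;   // counter ticks per second
};

TimeBase calibrate()
{
    TimeBase tb;
    QueryPerformanceFrequency(&tb.freq);
    GetSystemTimeAsFileTime(&tb.wall);
    QueryPerformanceCounter(&tb.qpc);
    return tb;
}

// Current time in FILETIME units: calibration time plus elapsed counter ticks.
ULONGLONG now100ns(const TimeBase& tb)
{
    LARGE_INTEGER qpc;
    QueryPerformanceCounter(&qpc);
    ULONGLONG base = (ULONGLONG(tb.wall.dwHighDateTime) << 32) | tb.wall.dwLowDateTime;
    ULONGLONG elapsed =
        ULONGLONG(qpc.QuadPart - tb.qpc.QuadPart) * 10000000ULL / tb.freq.QuadPart;
    return base + elapsed;
}
```

Recalibrating periodically against the NTP-disciplined system time keeps counter drift from accumulating.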

Use GetTickCount and add another counter as you merge the log files. It won't give you perfect sequencing between the different log files, but it will at least keep all logs from each file in the correct order.
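A minimal sketch of that merge step: a stable sort on the tick value keeps each file's internal order whenever tick counts collide (the LogEntry type and field names here are illustrative only).

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct LogEntry {
    unsigned long tick;   // GetTickCount() captured when the line was written
    std::string   text;   // the rest of the log line
};

void merge_logs(std::vector<LogEntry>& combined)
{
    // 'combined' holds the entries of file A followed by those of file B;
    // std::stable_sort preserves the per-file order for equal tick values.
    std::stable_sort(combined.begin(), combined.end(),
                     [](const LogEntry& a, const LogEntry& b) {
                         return a.tick < b.tick;
                     });
}
```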

Related

Is it preferable to use the total time taken for a canonical workload as a benchmark or count the cycles/time taken by the individual operations?

I'm designing a benchmark for a critical system operation. Ideally the benchmark can be used to detect performance regressions. I'm debating between using the total time for a large workload passed into the operation or counting the cycles taken by the operation as the measurement criterion for the benchmark.
The time to run each iteration of the operation in question is fast, perhaps 300-500 nanoseconds.
A total time is much easier to measure accurately / reliably, and the measurement overhead is irrelevant. It's what I'd recommend, as long as you're sure you can stop your compiler from optimizing across iterations of whatever you're measuring. (Check the generated asm if necessary).
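A minimal sketch of that total-time approach, assuming a hypothetical operation_under_test(); the volatile sink is one simple way to stop the compiler from optimizing across iterations (checking the generated asm is still a good idea):

```cpp
#include <chrono>
#include <cstdio>

volatile long long g_sink;   // forces the loop's result to be kept

static long long operation_under_test(long long x)   // stand-in for the real operation
{
    return x * 2654435761LL + 12345;
}

int main()
{
    const int iterations = 10000000;
    long long acc = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        acc += operation_under_test(i);
    auto stop = std::chrono::steady_clock::now();

    g_sink = acc;   // keep the result observable

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    printf("total: %.0f ns, per iteration: %.2f ns\n", ns, ns / iterations);
    return 0;
}
```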
If you think your runtime might be data-dependent and want to look into variation across iterations, then you might consider recording timestamps somehow. But 300 ns is only ~1k clock cycles on a 3.3GHz CPU, and recording a timestamp takes some time. So you definitely need to worry about measurement overhead.
Assuming you're on x86, raw rdtsc around each operation is pretty lightweight, but out-of-order execution can reorder the timestamps with the work (see "Get CPU cycle count?" and "clflush to invalidate cache line via C function").
Using lfence; rdtsc; lfence to stop the timestamps from reordering with each iteration of the workload will also block out-of-order execution of the steps of the workload itself, distorting things. (The out-of-order execution window on Skylake is a ROB size of 224 uops. At 4 per clock that's a small fraction of 1k clock cycles, but in lower-throughput code with stalls for cache misses there could be significant overlap between independent iterations.)
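For reference, a minimal sketch of that lfence; rdtsc; lfence pattern using compiler intrinsics (MSVC's <intrin.h>; gcc/clang provide the same names via <x86intrin.h>); do_operation() is a stand-in for the 300-500 ns workload:

```cpp
#include <intrin.h>   // _mm_lfence, __rdtsc (MSVC)

extern void do_operation();   // hypothetical: the code under test

unsigned long long timed_iteration()
{
    _mm_lfence();                          // wait for earlier work to complete
    unsigned long long start = __rdtsc();
    _mm_lfence();                          // keep the workload from starting early

    do_operation();

    _mm_lfence();                          // wait for the workload to finish
    unsigned long long stop = __rdtsc();
    return stop - start;                   // TSC ticks (reference cycles, not core cycles)
}
```

As noted above, this fencing also prevents independent iterations from overlapping, so it changes what you are measuring.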
Any standard timing functions like C++ std::chrono will normally call library functions that ultimately use rdtsc, but with many extra instructions. Or worse, will make an actual system call taking well over a hundred clock cycles to enter/leave the kernel, and more with Meltdown+Spectre mitigation enabled.
However, one thing that might work is using Intel-PT (https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing) to record timestamps on taken branches. Without blocking out-of-order exec at all, you can still get timestamps on when the loop branch in your repeat loop executed. This may well be independent of your workload and able to run soon after it's issued into the out-of-order part of the core, but that can only happen a limited distance ahead of the oldest not-yet-retired instruction.

If a CPU is always executing instructions how do we measure its work?

Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing up than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question is what it means for the program to "stop". Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives a measure of how busy the CPU is.
So in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in power-idle states. The oscillator is still producing cycles, but no instructions execute because certain circuitry is idled and the cycles never reach it. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute instructions before your program (i.e. the IP, or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, for example because the processor guessed wrong and had to flush them after a branch mispredict. So a better metric is counting only those instructions that actually complete. Instructions that complete are termed "retired".
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for "work" is CPI, or cycles per instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE counts cycles used to execute actual instructions (vs. those "wasted" in an idle state). INST_RETIRED.ANY counts those instructions that complete (vs. those that don't due to something like a branch mispredict).
Trying to get a more specific metric, such as counting only the instructions that contribute to the solution of a matrix multiply and excluding instructions that don't directly contribute to computing the solution (such as control instructions), is very subjective and difficult to gather statistics on. (There are some that you can, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which measures how many SIMD vector elements (from instructions such as SSE or AVX) are active per vector instruction executed. These instructions are more likely to directly contribute to the mathematical solution, as that is their primary purpose.)
Now that I've talked your ear off, check out some of the optimization resources at your local friendly Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I'm not suggesting you need to get VTune, though you can get a free or very discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process in case it hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
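On Windows, for instance, that per-process accounting can be read back with GetProcessTimes; a minimal sketch reporting a process's own kernel and user CPU time:

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(GetCurrentProcess(),
                        &creationTime, &exitTime, &kernelTime, &userTime)) {
        ULARGE_INTEGER k, u;                    // FILETIME values are in 100 ns units
        k.LowPart = kernelTime.dwLowDateTime;  k.HighPart = kernelTime.dwHighDateTime;
        u.LowPart = userTime.dwLowDateTime;    u.HighPart = userTime.dwHighDateTime;
        printf("kernel: %llu ms, user: %llu ms\n",
               k.QuadPart / 10000, u.QuadPart / 10000);
    }
    return 0;
}
```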
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not running it anymore).

Go - System Time in pico-second

Is it possible in Go to get the system time at better than nanosecond resolution, i.e. in picoseconds or so? Actually, I want to measure the time gap between two consecutive events, which I can't catch at nanosecond resolution on our fast system.
The cost of calling profiling functions/instructions on modern hardware is larger (and more spread out and prone to deviation) than the interval you're going to measure. So even if you try, you'll get erroneous results.
Consider tracking the elapsed time over 100 events instead, if that's at all possible.
The time resolution is hardware and operating system dependent. The Go time.Nanoseconds function provides up to nanosecond time resolution. On a PC, it's usually 1000 nanoseconds (1 microsecond) or 100 nanoseconds at best.

Is 16 milliseconds an unusually long length of time for an unblocked thread running on Windows to be waiting for execution?

Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast, using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only cleared a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time just to set a flag, as in the latter case I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum, audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation-prevention mechanism: the so-called Balance Set Manager wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds; if it finds one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior; threads may not be run for seconds.
You may need to elevate priorities, or otherwise look at which other threads are competing for the CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for some of the high-precision timers. Some details are in this MSDN link.
First of all, I am not sure Delphi's Now is a good choice for millisecond-precision measurements; the GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision on a critical section lock, everything runs pretty fast. However, if you try to enter a critical section that is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (a mutex or event), which involves yielding control of the thread and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted in your test: the overall CPU load at the time of testing. The higher the load, the lower the chance that the thread resumes execution soon. 16 ms still looks perhaps within reasonable tolerance, and all in all it may depend on your actual implementation.

What is the clock source for the count returned by QueryPerformanceCounter

I was under the impression that QueryPerformanceCounter was actually accessing the counter that feeds the HPET (High Precision Event Timer)---the difference of course being that the HPET is a timer which sends an interrupt when the counter value matches the desired interval, whereas to make a timer "out of" QueryPerformanceCounter you have to write your own loop in software.
The only reason I had assumed the hardware behind the two was the same is because somewhere I had read that QueryPerformanceCounter was accessing a counter on the chipset.
http://www.gamedev.net/reference/programming/features/timing/ claims that QueryPerformanceCounter uses chipset timers which apparently have a specified clock rate. However, I can verify that QueryPerformanceFrequency returns wildly different numbers on different machines, and in fact, the number can change slightly from boot to boot.
The numbers returned can sometimes be totally ridiculous---implying ticks in the nanosecond range. Of course when put together it all works; that is, writing timer software using QueryPerformanceCounter/QueryPerformanceFrequency allows you to get proper timing and latency is pretty low.
A software timer using these functions can be pretty good. For example, with an interval of 1 millisecond, over 30 seconds it's easy to get nearly 100% of the ticks to fall within 10% of the intended interval. With an interval of 100 microseconds, you still get a high success rate (99.7%), but the worst ticks can be way off (200 microseconds).
I'm wondering if the clock behind the HPET is the same. Supposedly HPET should still increase accuracy since it is a hardware timer, but as of yet I don't know how to access it in Windows.
Sounds like Microsoft has made these functions use "whatever best timer there is":
http://www.microsoft.com/whdc/system/sysinternals/mm-timer.mspx
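Whatever hardware ends up backing it, the usual pattern is to treat the counter as opaque ticks and always normalize by QueryPerformanceFrequency, which is why the wildly different frequencies per machine don't matter in practice. A minimal sketch:

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // ticks per second; varies by machine and boot
    QueryPerformanceCounter(&start);

    Sleep(100);                         // the interval being measured

    QueryPerformanceCounter(&stop);
    double us = (stop.QuadPart - start.QuadPart) * 1000000.0 / freq.QuadPart;
    printf("elapsed: %.1f microseconds\n", us);
    return 0;
}
```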
Did you try updating your CPU driver for your AMD multicore system? Did you check whether your motherboard chipset is on the "bad" list? Did you try setting the CPU affinity?
One can also use the RTC-based time functions and/or a skip-detecting heuristic to eliminate trouble with QPC.
This has some hints: CPU clock frequency and thus QueryPerformanceCounter wrong?
