So I have been working on the source code of a complex application (written by hundreds of programmers) for a while now. And among other things, I have created some time checking functions, along with suitable data structures to measure execution periods of different segments of the main loop and run some analysis on these measurements.
Here's some pseudocode that helps explain:
void FunctionA()
//Prints the difference between each slice and the slice before it,
//starting from slice number 1.
Most measurements were very reasonable; for instance, assigning a value to a local variable costs a fraction of a microsecond. Most functions execute from start to finish in a few microseconds and rarely reach one millisecond.
I then ran a few tests for half an hour or so and found some strange results that I couldn't quite understand. For certain function calls, the time measured from the moment of the call (the last line in the 'calling' code) to the first line inside the 'called' function is very long, up to 30 milliseconds. That's happening in a loop that would otherwise complete a full iteration in less than 8 milliseconds.
To picture that, in the pseudocode I included, this is the time between slice number 0 and slice number 1, or between slice number 3 and slice number 4. This is the sort of period I am referring to: the measured time between calling a function and running the first line inside the called function.
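The slice scheme described above can be sketched roughly as follows. This is only an illustration under assumptions: SliceRecorder, Mark(), DeltasMicros() and MeasureOneCall() are made-up names, std::chrono::steady_clock stands in for the real time function, and the body of FunctionA is a placeholder.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: records a timestamp ("slice") each time Mark() is called.
class SliceRecorder {
public:
    using Clock = std::chrono::steady_clock;

    // Store the current time as the next slice.
    void Mark() { slices_.push_back(Clock::now()); }

    // Time in microseconds between slice i-1 and slice i, for i = 1..N-1.
    std::vector<double> DeltasMicros() const {
        std::vector<double> deltas;
        for (std::size_t i = 1; i < slices_.size(); ++i) {
            std::chrono::duration<double, std::micro> d = slices_[i] - slices_[i - 1];
            deltas.push_back(d.count());
        }
        return deltas;
    }

private:
    std::vector<Clock::time_point> slices_;
};

// Hypothetical callee: slice 1 is its first line, slice 2 its last line.
void FunctionA(SliceRecorder& rec) {
    rec.Mark();                 // slice 1: first line inside the called function
    volatile int work = 0;      // placeholder for the real body
    for (int i = 0; i < 1000; ++i) work = work + i;
    rec.Mark();                 // slice 2: last line inside the called function
}

// Hypothetical caller: slice 0 just before the call, slice 3 just after it.
std::vector<double> MeasureOneCall() {
    SliceRecorder rec;
    rec.Mark();                 // slice 0: last line in the calling code
    FunctionA(rec);
    rec.Mark();                 // slice 3: first line after the call returns
    return rec.DeltasMicros();  // [0->1 call entry, 1->2 body, 2->3 return]
}
```

With this layout the first delta is the "call entry" period in question and the last delta is the "return" period.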
QuestionA. Could this behavior be due to thread or process switching by the OS? Is calling a function a uniquely vulnerable spot for that? The OS I am working on is Windows 10.
Interestingly enough, the reverse direction never showed the problem: going from the last line in a function back to the first line after the call in the 'calling' code (the periods from slice number 2 to 3, or from 5 to 6, in the pseudocode) always measured less than 5 microseconds.
QuestionB. Could this be, in any way, due to the time measurement method I am using? Could switching between different cores give the illusion of context switching that is slower than it actually is, due to clock differences? (I have never found a single negative delta time so far, which seems to refute this hypothesis altogether.) Again, the OS I am working on is Windows 10.
My time measuring function looks like this:
FORCEINLINE double Seconds()
{
    Windows::LARGE_INTEGER Cycles;
    // read the high-resolution performance counter
    Windows::QueryPerformanceCounter(&Cycles);
    // add big number to make bugs apparent where return value is being passed to float
    return Cycles.QuadPart * GetSecondsPerCycle() + 16777216.0;
}
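As a side note on the comment in Seconds(): 16777216.0 is 2^24, and a float has a 24-bit significand, so once the offset is added a float can no longer hold fractions of a second. The following minimal sketch shows the effect; DeltaInDouble and DeltaThroughFloat are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Offset copied from the Seconds() function (2^24).
constexpr double kOffset = 16777216.0;

// Intended usage: keep both timestamps in double and subtract.
double DeltaInDouble(double start, double end) {
    return end - start;
}

// Buggy usage: each timestamp is narrowed to float before subtracting.
// Near 2^24 adjacent floats are 2.0 apart, so a 5 ms difference vanishes.
double DeltaThroughFloat(double start, double end) {
    float s = static_cast<float>(start);
    float e = static_cast<float>(end);
    return static_cast<double>(e - s);
}
```

For two timestamps 5 ms apart, the double version reports about 0.005 seconds, while the float version reports exactly zero, which makes the mistake hard to miss.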

QuestionA. Could this behavior be due to thread, or process switching by the OS?
Yes. Thread switches can happen at any time (e.g. when a device sends an IRQ that causes a different higher priority thread to unblock and preempt your thread immediately) and this can/will cause unexpected time delays in your thread.
Is calling a function a uniquely vulnerable spot for that?
There's nothing particularly special about calling your own functions that makes them uniquely vulnerable. If the function involves the kernel's API a thread switch can be more likely, and some things (e.g. calling "sleep()") are almost guaranteed to cause a thread switch.
Also there's potential interaction with virtual memory management: often things (e.g. your executable file, your code, your data) use "memory mapped files", where accessing a page for the first time may cause the OS to fetch the code or data from disk (and your thread can be blocked until the code or data it wanted arrives from disk). Rarely used code or data can also be sent to swap space and need to be fetched back.
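One way to check whether the stalls are really tied to function calls is to time a tight loop that calls nothing except the clock itself and record the largest gap between two consecutive readings. This is a rough sketch, not a rigorous test; MaxGapMicros is a made-up name and std::chrono::steady_clock stands in for whatever timer you use.

```cpp
#include <cassert>
#include <chrono>

// Reads the clock `samples` times back to back and returns the largest gap
// between two consecutive reads, in microseconds.
double MaxGapMicros(int samples) {
    using Clock = std::chrono::steady_clock;
    double maxGap = 0.0;
    Clock::time_point prev = Clock::now();
    for (int i = 0; i < samples; ++i) {
        Clock::time_point now = Clock::now();
        std::chrono::duration<double, std::micro> gap = now - prev;
        if (gap.count() > maxGap) maxGap = gap.count();
        prev = now;
    }
    return maxGap;
}
```

If a long enough run of this loop also shows occasional multi-millisecond gaps, the delays are coming from preemption or paging rather than from the call itself.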
QuestionB. Could this be, in any way, due to the time measurement method I am using?
In practice it's likely that Windows' QueryPerformanceCounter() is implemented with an RDTSC instruction (assuming 80x86 CPU/s) and doesn't involve the kernel at all, and for modern hardware it's likely that this is monotonic. In theory Windows could emulate RDTSC and/or implement QueryPerformanceCounter() in another way to guard against security problems (timing side channels), as has been recommended by Intel for about 30 years now, but this is unlikely (modern operating systems, including but not limited to Windows, tend to care more about performance than security). Also in theory, your hardware/CPU could be so old (about 10+ years old) that Windows has to implement QueryPerformanceCounter() in a different way, or you could be using some other CPU (e.g. ARM and not 80x86).
In other words; it's unlikely (but not impossible) that the time measurement method you're using is causing any timing problems.


Runtime of GPU-based simulation unexplainable?

I am developing a GPU-based simulation using OpenGL and GLSL shaders, and I found that performance increases when I add additional (unnecessary) GL commands.
The simulation runs entirely on the GPU without any transfers and basically consists of a loop performing 2500 algorithmically identical time steps. I carefully implemented caching of GLSL uniform locations and removed any GL state requests (glGet* etc.) to maximize speed. To measure wall-clock time I've put a glFinish after the main loop and take the elapsed time afterwards.
Normal total runtime for all iterations is 490ms (case A).
Now, if I add a single additional glGetUniformLocation(...) command at the end of EACH time step, it requires only 475ms in total (case B), which is 3 percent faster. (Please note that this is relevant to me since later I will perform a lot more time steps.)
I've looked at a timeline captured with NVIDIA Nsight and found that, in case A, all OpenGL commands are issued within the first 140ms and the glFinish takes 348ms until completion of all GPU work. In case B the issuing of OpenGL commands is spread out over a significantly longer time (410ms) and the glFinish only takes 64ms, yielding the faster 475ms in total.
I also noticed that the hardware command queue holds many more work packets most of the time in case B, whereas in case A there is only one item waiting most of the time (however, there are no visible idle times).
So my questions are:
Why is B faster?
Why are the command packets issued more uniformly to the hardware queue over time in case B?
How can speed be enhanced without adding additional commands?
I am using Visual C++, VS2008 on Win7 x64.
IMHO this question cannot be answered definitively. For what it's worth, I determined experimentally that glFinish (and …SwapBuffers for that matter) have weird runtime behavior. I'm currently developing my own VR rendering library, and prior to that I spent significant time profiling the timelines of OpenGL commands and their interaction with the graphics system. What I found was that the only consistent thing is that glFinish and …SwapBuffers have very inconsistent timing behavior.
What could happen is that this glGetUniformLocation call pulls the OpenGL driver into a "busy" state. If you call glFinish immediately afterwards, it may use a different method of waiting for the GPU (for example, spinning in a while loop waiting for a flag) than if you just call glFinish (in which case it may, for example, wait for a signal or a condition variable and is thus subject to the kernel's scheduling behavior).

Programmatically determine amount of time remaining before preemption

I am trying to implement some custom lock-free structures. One operates similarly to a stack, so it has a take() and a free() method and operates on a pointer and an underlying array. It uses optimistic concurrency. free() writes a dummy value to pointer+1, increments the pointer, and writes the real value to the new address. take() reads the value at the pointer in a spin/sleep style until it doesn't read the dummy value, and then decrements the pointer. In both operations, changes to the pointer are done with compare-and-swap, and if that fails, the whole operation starts again. The purpose of the dummy value is to ensure consistency, since the write operation can be preempted after the pointer is incremented.
This situation leads me to wonder whether it is possible to prevent preemption in that critical place by somehow determining how much time is left before the thread will be preempted by the scheduler for another thread. I'm not worried about hardware interrupts. I'm trying to eliminate the possible sleep from my reading function so that I can rely on a pure spin.
Is this at all possible?
Are there other means to handle this situation?
EDIT: To clarify how this may be helpful: if the critical operation is interrupted, it will effectively be like taking out an exclusive lock, and all other threads will have to sleep before they can continue with their operations.
EDIT: I am not hell-bent on having it solved like this; I am merely trying to see if it's possible. It is extremely unlikely that the operation will be interrupted in that location for a very long time, and if it does happen it will be OK if all the other operations need to sleep so that it can complete.
Some regard this as premature optimization, but this is just my pet project. Regardless, that does not exclude research and science from attempting to improve techniques. Even though computer science has reasonably matured and every new technology we use today is just an implementation of what was already known 40 years ago, we should not stop being creative in addressing even the smallest of concerns, like trying to make a reasonable set of operations atomic without too many performance implications.
Such information surely exists somewhere, but it is of no use to you.
Under "normal conditions", you can expect upwards of a dozen DPCs and upwards of 1,000 interrupts per second. These do not respect your time slices, they occur when they occur. Which means, on the average, you can expect 15-16 interrupts within a time slice.
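The "15-16 interrupts per slice" figure is plain arithmetic, sketched below. The 15.625 ms slice length is an assumption on my part (it matches the common default 64 Hz Windows clock interval); InterruptsPerSlice is a hypothetical helper name.

```cpp
#include <cassert>

// Expected number of interrupts that land inside one time slice,
// given an interrupt rate per second and a slice length in milliseconds.
constexpr double InterruptsPerSlice(double interruptsPerSecond, double sliceMillis) {
    return interruptsPerSecond * (sliceMillis / 1000.0);
}

// Assumed inputs: 1,000 interrupts per second and a 15.625 ms slice
// give roughly 15.6 interrupts per slice.
constexpr double kTypical = InterruptsPerSlice(1000.0, 15.625);
```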
Also, scheduling does not strictly go quantum by quantum. The scheduler under present Windows versions will normally let a thread run for 2 quantums, but may change its opinion in the middle if some external condition changes (for example, if an event object is signalled).
So even if you know that you still have so-and-so many nanoseconds left, whatever you think you know might not be true at all.
Cannot be done without time travel. You're stuffed.

Is 16 milliseconds an unusually long length of time for an unblocked thread running on Windows to be waiting for execution?

Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
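For readers who want to try the same diagnostic, here is a rough C++ sketch of such a timed critical section. It is not the Delphi class from the question: TimedSection and its members are made-up names, std::recursive_mutex stands in for the Windows critical section, and the exception is thrown from Release() after the lock has been dropped.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <stdexcept>

// Hypothetical diagnostic lock: throws from the outermost Release() when the
// time since the first Acquire() exceeds the configured limit.
class TimedSection {
public:
    using Clock = std::chrono::steady_clock;

    explicit TimedSection(std::chrono::milliseconds limit) : limit_(limit) {}

    void Acquire() {
        mutex_.lock();
        // depth_ and firstAcquire_ are only touched while the lock is held.
        if (depth_++ == 0) firstAcquire_ = Clock::now();
    }

    void Release() {
        Clock::duration held = Clock::duration::zero();
        const bool outermost = (--depth_ == 0);
        if (outermost) held = Clock::now() - firstAcquire_;
        mutex_.unlock();  // unlock before reporting, so the lock is never leaked
        if (outermost && held > limit_) {
            throw std::runtime_error("critical section held longer than the limit");
        }
    }

private:
    std::recursive_mutex mutex_;
    std::chrono::milliseconds limit_;
    Clock::time_point firstAcquire_;
    int depth_ = 0;
};
```

Note that the measured duration includes any time the owning thread spent preempted while holding the lock, which is exactly what makes the time-outs appear in trivially short sections.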
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code to switch out an unblocked thread on a fairly frequent basis (about once every few seconds) and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation prevention mechanism. A so-called Balance Set Manager wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds; if there is one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior; threads may not be run for seconds.
You may need to elevate priorities or otherwise consider other threads that are competing for the CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for some of the high precision timers. Some details in this MSDN link
First of all, I am not sure if Delphi's Now is a good choice for millisecond precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast. However, if you are trying to enter a critical section that is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (mutex or event), which involves yielding control of the thread and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted from your test: the overall CPU load at the time of testing. The higher the load, the lower the chance that the thread continues execution soon. A 16 ms delay is perhaps still within reasonable tolerance, and all in all it might depend on your actual implementation.

Who calls delay_tsc() on Linux

I used OProfile to profile my Linux box. During profiling, I found that besides the "native_safe_halt" function, "delay_tsc" is the second most significant function consuming CPU cycles (around 10%). It seems delay_tsc() is a busy loop. But who calls it, and what is its function?
Nobody calls it directly since it's a local function inside that piece of source you link to. The way to call it is by the published __delay() function.
When you call __delay(), this will use the delay_fn function pointer (also local to that file) to select one of several delay functions. By default, the one selected is delay_loop(), which uses x86 instructions to try and mark time.
However, if use_tsc_delay() has been called (at boot time), it switches the function pointer to delay_tsc(), which uses the time stamp counter (a CPU counter) to mark time.
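The dispatch pattern described here (one published entry point, a file-local function pointer, and a one-time switch) can be sketched in user-space C++ as below. This is an analogy, not the kernel source: DelayLoop, DelayClock, UseClockDelay, Delay and delayFn are illustrative stand-ins for delay_loop(), delay_tsc(), use_tsc_delay(), __delay() and delay_fn, and std::chrono::steady_clock replaces the time stamp counter.

```cpp
#include <cassert>
#include <chrono>

using DelayFn = void (*)(unsigned long);

// Stand-in for delay_loop(): burn time by counting iterations.
void DelayLoop(unsigned long loops) {
    volatile unsigned long i = 0;
    while (i < loops) i = i + 1;
}

// Stand-in for delay_tsc(): busy-wait until a counter has advanced far enough.
// The argument is interpreted as nanoseconds in this sketch.
void DelayClock(unsigned long nanos) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point end = Clock::now() + std::chrono::nanoseconds(nanos);
    while (Clock::now() < end) { /* spin */ }
}

// File-local pointer selecting the active implementation; the loop is the default.
static DelayFn delayFn = DelayLoop;

// Stand-in for use_tsc_delay(): called once at start-up to switch implementations.
void UseClockDelay() { delayFn = DelayClock; }

// Stand-in for __delay(): the only entry point callers are meant to use.
void Delay(unsigned long amount) { delayFn(amount); }
```

Because callers only ever see Delay(), a profiler attributes the spinning to whichever implementation the pointer currently selects, which is why delay_tsc shows up rather than __delay.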
It's called by any kernel code that wants a reasonably reliable, high-resolution delay function. You can see all the code in the kernel that references __delay here (quite a few places).
I think it's probably pretty safe, in terms of profiling, to ignore the time spent in that function since its intent is to delay. In other words, it's not useful work that's taking a long time to perform - if callers didn't want to delay, they wouldn't call it.
Some examples from that list:
A watchdog timer uses it to pace the cores so that their output is not mixed up with each other, by delaying for some multiple of the current core ID.
The ATI frame buffer driver appears to use it for delays between low-level accesses to the hardware. In fact, it's used quite a bit for that purpose in many device drivers.
It's used during start-up to figure out the relationship between jiffies and the actual hardware speeds.

How do I obtain CPU cycle count in Win32?

In Win32, is there any way to get a unique CPU cycle count or something similar that would be uniform for multiple processes/languages/systems/etc.?
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to #Greg, who put me on track to QueryPerformanceCounter, which did the trick.
Here's an interesting article that says not to use RDTSC, but to use QueryPerformanceCounter instead.
Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operation systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]
However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.
Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).
For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).
For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code
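The article's own code is not reproduced here. As a rough, portable sketch of the hybrid scheme it describes, assuming std::this_thread::sleep_for in place of Sleep(1) and std::chrono::steady_clock in place of QueryPerformanceCounter (the Windows-only timeBeginPeriod(1) call is left out), something like the following could work. HybridDelay and its spinWindow parameter are made-up names; the default window is 2 ms rather than 1 ms to leave some margin for a sleep that overshoots.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Waits for `total`: sleeps in 1 ms steps while more than `spinWindow`
// remains, then busy-waits on the clock for the final stretch.
void HybridDelay(std::chrono::microseconds total,
                 std::chrono::microseconds spinWindow = std::chrono::microseconds(2000)) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point deadline = Clock::now() + total;

    // Coarse phase: give the CPU away while plenty of time remains.
    while (deadline - Clock::now() > spinWindow) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // Fine phase: spin until the deadline has passed.
    while (Clock::now() < deadline) { /* busy-wait */ }
}
```

The trade-off is the one the quote describes: a larger spin window burns more CPU but is less sensitive to how long the sleep call really takes.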
You can use the RDTSC CPU instruction (assuming x86). This instruction reads the CPU's 64-bit time-stamp counter, but be aware that if you keep only the low 32 bits, the value climbs to its maximum very quickly and then wraps around to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.
System.Diagnostics.Stopwatch.GetTimestamp() returns the current tick count of the underlying timer (not necessarily CPU cycles) since some time origin (maybe when the computer started, but I'm not sure), and I've never seen it fail to increase between two calls.
The count is specific to each computer, so you can't use it to merge log files from 2 computers.
RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.
Use GetTickCount and add another counter as you merge the log files. It won't give you a perfect sequence between the different log files, but it will at least keep all entries from each file in the correct order.
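A minimal sketch of that idea, assuming each log line carries a tick value plus a per-file sequence counter. LogEntry, Before and MergeLogs are hypothetical names, and the tie-break by source is an arbitrary choice made only so the ordering is deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <tuple>
#include <vector>

struct LogEntry {
    std::uint32_t tick;   // e.g. a GetTickCount value
    int source;           // which log file the entry came from
    std::uint32_t seq;    // per-file counter, incremented on every write
    std::string text;
};

// Order by tick first; ties fall back to source, then to the per-file counter.
bool Before(const LogEntry& a, const LogEntry& b) {
    return std::tie(a.tick, a.source, a.seq) < std::tie(b.tick, b.source, b.seq);
}

// Merges two logs that are each already in write order.
std::vector<LogEntry> MergeLogs(const std::vector<LogEntry>& a,
                                const std::vector<LogEntry>& b) {
    std::vector<LogEntry> out;
    out.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out), Before);
    return out;
}
```

Entries from the same file never swap places, even when many of them share one tick value; entries from different files with equal ticks are only ordered by convention.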
