When I wait on a non-signaled Event using the WaitForSingleObject function, I find that in some cases the call will return WAIT_TIMEOUT in less than the specified timeout period. Simply looping on the call with a timeout set to 1000ms, I've seen the call return in periods as low as 990ms (running on WinXP). I'm using QueryPerformanceCounter to get a system-clock independent time measurement, so I don't think clock drift is likely to be an answer.
This behavior doesn't present any practical problems for me, but I'd like to understand it better. It looks like it may be working at roughly the resolution of a timer tick. Does Microsoft publish any further details on the precision of this function? Should I expect greater precision in Vista?
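For reference, a minimal sketch of the kind of measurement loop described above (a never-signaled event, a 1000 ms timeout, and QueryPerformanceCounter around each wait):

    // Sketch of the measurement described above: wait on a never-signaled event
    // with a 1000 ms timeout and time each call with QueryPerformanceCounter.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        // Manual-reset event, initially non-signaled, never set: every wait times out.
        HANDLE hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);

        for (int i = 0; i < 10; ++i)
        {
            QueryPerformanceCounter(&start);
            DWORD result = WaitForSingleObject(hEvent, 1000);   // expect WAIT_TIMEOUT
            QueryPerformanceCounter(&stop);

            double elapsedMs =
                (stop.QuadPart - start.QuadPart) * 1000.0 / (double)freq.QuadPart;
            printf("result=%lu  elapsed=%.3f ms\n", result, elapsedMs);
        }

        CloseHandle(hEvent);
        return 0;
    }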
Yes, WaitForSingleObject works at the timer tick resolution; it does not use a high-resolution timer like QueryPerformanceCounter.
The MSDN article on "Wait Functions" (http://msdn.microsoft.com/en-us/library/ms687069(VS.85).aspx) expands on this:
The accuracy of the specified time-out interval depends on the resolution of the system clock. The system clock "ticks" at a constant rate. If the time-out interval is less than the resolution of the system clock, the wait may time out in less than the specified length of time. If the time-out interval is greater than one tick but less than two, the wait can be anywhere between one and two ticks, and so on.
This article also explains how to use timeBeginPeriod to increase system clock resolution - but this is not recommended.
I can think of several reasons why. First, nearly all use cases of WaitForSingleObject simply don't need higher resolution. Using a high-resolution timer would require the kernel either to constantly poll the timer (not feasible, since kernel code is not guaranteed to be always running) or to reprogram it frequently to generate an interrupt (since there could be multiple outstanding WaitForSingleObject calls and most likely only a single programmable interrupt).
On the other hand, there already is a timing source that is constantly updated, at a resolution that is more than good enough for WaitForSingleObject, SetWaitableTimer, and Sleep.
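For completeness, a rough sketch of the timeBeginPeriod usage the article describes (again, generally not recommended, since it raises the interrupt rate system-wide):

    // Rough sketch: temporarily raise the system timer resolution to 1 ms so
    // that wait timeouts and Sleep are honoured with finer granularity.
    // Link against winmm.lib; the change is system-wide, so undo it promptly.
    #include <windows.h>
    #pragma comment(lib, "winmm.lib")

    void WaitWithFinerGranularity(HANDLE hEvent)
    {
        timeBeginPeriod(1);                   // request 1 ms scheduler/timer ticks
        WaitForSingleObject(hEvent, 50);      // the timeout is now honoured to ~1-2 ms
        timeEndPeriod(1);                     // always pair with timeBeginPeriod
    }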
Sorry for my weak English; by preemption I mean a forced context (process) switch applied to my process.
My question is:
If I write and run my own game program in such a way that it does 20 milliseconds of work, then sleeps for 5 milliseconds, and then runs the Windows message pump (PeekMessage/DispatchMessage) in a loop, again and again - will it ever be forcibly preempted by Windows, or does this preemption not occur?
I suppose this preemption would occur if I did not voluntarily give control back to the system via sleep or peek/dispatch for a longer stretch of time. But in this case, will it occur or not?
The short answer is: Yes, it can be, and it will be preempted.
Not only can driver events (interrupts) preempt your thread at any time; it may also happen due to a temporary priority boost, for example when a waitable object on which a thread is blocked becomes signaled, or when another window becomes the topmost window. Or, another process might simply adjust its priority class.
There is no way (short of giving your process realtime priority, and that is a very bad idea - forget about it immediately) to guarantee that no "normal" thread will preempt you. Even then, hardware interrupts will preempt you, and certain threads, such as those handling disk I/O and the mouse, will compete with you for time quantums. So even if you run at realtime priority (which is not truly "real-time"), you still have no guarantee, and you seriously interfere with important system services.
On top of that, sleeping for 5 milliseconds is imprecise at best, and unreliable at worst.
Sleeping will make your thread ready on the next scheduler tick (ready does not mean "it will run"; it merely means that it may run - if and only if a time slice becomes available and no other ready thread is first in line). This effectively means that the amount of time you sleep is rounded to the granularity of the system timer resolution (see the timeBeginPeriod function), plus some unknown extra time.
By default, the timer resolution is 15.6 ms, so your 5 ms will be about 7.8 ms on average (assuming the best, uncontended case), but possibly a lot more. If you adjust the system timer resolution to 1 ms (which is often the lowest possible, though some systems allow 0.5 ms), it's somewhat better, but still not precise or reliable. Plus, making the scheduler run more often burns a considerable number of CPU cycles in interrupts, and more power. Therefore, it is not something that is generally advisable.
To make things even worse, you cannot even rely on Sleep's rounding mode, since Windows 2000/XP round differently from Windows Vista/7/8.
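A small sketch to see this for yourself (using QueryPerformanceCounter purely for measurement):

    // Observe how long Sleep(5) actually sleeps under the default timer
    // resolution. Expect values clustered well above 5 ms unless
    // timeBeginPeriod(1) is in effect.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        for (int i = 0; i < 20; ++i)
        {
            QueryPerformanceCounter(&t0);
            Sleep(5);
            QueryPerformanceCounter(&t1);
            printf("Sleep(5) took %.2f ms\n",
                   (t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart);
        }
        return 0;
    }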
It can be interrupted by a driver at any time. The driver may signal another thread and then ask the OS to schedule/dispatch. The newly-ready thread may well run instead of yours.
Desktop operating systems like Windows do not provide any real-time guarantees - they were not designed to.
Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast, using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise, I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only cleared a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time just to set a flag, as in the latter example I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 quad core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer, in milliseconds, I have chosen in my DirectShow filter graphs. To keep latency at a minimum, audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation prevention mechanism: the so-called Balance Set Manager wakes up every second and looks for ready threads that haven't run for about 3 or 4 seconds; if it finds one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior: threads may not run for seconds at a time.
You may need to elevate your priorities, or otherwise look at which other threads are competing for CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for one of the high-precision timers. Some details are in this MSDN link.
First of all, I am not sure Delphi's Now is a good choice for millisecond-precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision on the critical section lock, everything runs pretty fast; however, if you try to enter a critical section that is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (a mutex or event), which involves yielding control of the thread and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted from your test: the overall CPU load at the time of testing. The higher the load, the lower the chances that your thread gets to continue execution soon. A duration of 16 ms still looks like it is within reasonable tolerance, and all in all it may depend on your actual implementation.
Is there any way at all in the Windows environment to sleep for ~1 microsecond? After researching and reading many threads and various websites, I have not been able to see that this is possible. Since the scheduler appears to be the limiting factor, and it operates at the 1 millisecond level, I believe it can't be done without going to a real-time OS.
It may not be the most portable approach, and I've not used these functions myself, but it might be possible to block for the interval using the information in the High-Resolution Timer section of this link: QueryPerformanceCounter
Despite the fact that Windows is claimed not to be a "real-time" OS, events can be generated at microsecond resolution. The use of a combination of the system time (file time) and the performance counter frequency has been described elsewhere. However, a careful implementation, taking care of processor affinity and process/thread priorities, opens the door to timed events at microsecond resolution.
Since the Windows scheduler and the Windows timer services rely on the system's interrupt mechanism, the microsecond can only be reached by polling. Particularly on multicore systems, polling is not so ugly anymore. And the polling only has to last for the shortest possible interrupt period. The multimedia timer interface allows the interrupt period to be brought down to about 1 ms, so one can get near the desired time (microsecond resolution) and the polling will last for 1 ms at most.
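A simplified sketch of that polling approach (the multimedia timer period is assumed to have been set to 1 ms via timeBeginPeriod, and the function name is mine):

    // Sleep to within roughly one timer tick of the deadline, then busy-wait
    // on QueryPerformanceCounter for the final stretch. Assumes
    // timeBeginPeriod(1) has been called; link with winmm.lib for that.
    #include <windows.h>

    void DelayMicroseconds(LONGLONG microseconds)
    {
        LARGE_INTEGER freq, now;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&now);
        LONGLONG target = now.QuadPart + (microseconds * freq.QuadPart) / 1000000;

        // Sleep away whole milliseconds, leaving ~2 ms of margin for the
        // coarse granularity of Sleep itself.
        if (microseconds > 3000)
            Sleep((DWORD)(microseconds / 1000 - 2));

        // Poll for the remainder; on a multicore machine this occupies one
        // core for at most a millisecond or two.
        do {
            QueryPerformanceCounter(&now);
        } while (now.QuadPart < target);
    }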
My implementation of microsecond-resolution time services for Windows, test code, and an extensive description can be found at the Windows Timestamp Project located at windowstimestamp.com
I was under the impression that QueryPerformanceCounter was actually accessing the counter that feeds the HPET (High Precision Event Timer)---the difference of course being that HPET is a timer which sends an interrupt when the counter value matches the desired interval, whereas to make a timer "out of" QueryPerformanceCounter you have to write your own loop in software.
The only reason I had assumed the hardware behind the two was the same is because somewhere I had read that QueryPerformanceCounter was accessing a counter on the chipset.
http://www.gamedev.net/reference/programming/features/timing/ claims that QueryPerformanceCounter uses chipset timers which apparently have a specified clock rate. However, I can verify that QueryPerformanceFrequency returns wildly different numbers on different machines, and in fact, the number can change slightly from boot to boot.
The numbers returned can sometimes be totally ridiculous---implying ticks in the nanosecond range. Of course when put together it all works; that is, writing timer software using QueryPerformanceCounter/QueryPerformanceFrequency allows you to get proper timing and latency is pretty low.
A software timer using these functions can be pretty good. For example, with an interval of 1 millisecond, over 30 seconds it's easy to get nearly 100% of ticks to fall within 10% of the intended interval. With an interval of 100 microseconds, you still get a high success rate (99.7%) but the worst ticks can be way off (200 microseconds).
I'm wondering if the clock behind the HPET is the same. Supposedly HPET should still increase accuracy since it is a hardware timer, but as of yet I don't know how to access it in Windows.
Sounds like Microsoft has made these functions use whatever the best available timer is:
http://www.microsoft.com/whdc/system/sysinternals/mm-timer.mspx
Did you try updating your CPU driver for your AMD multicore system? Did you check whether your motherboard chipset is on the "bad" list? Did you try setting the CPU affinity?
One can also use the RTC-based time functions and/or a skip-detecting heuristic to eliminate trouble with QPC.
This has some hints: CPU clock frequency and thus QueryPerformanceCounter wrong?
In Win32, is there any way to get a unique CPU cycle count or something similar that would be uniform across multiple processes/languages/systems/etc.?
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to #Greg, who put me on the track to QueryPerformanceCounter, which did the trick.
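A trivial sketch of that approach (the helper name is hypothetical): the counter is monotonic and shared by every module in the process, so the native side and the hosted .NET side (where Stopwatch.GetTimestamp typically reads the same source) see a common timeline.

    // A monotonically increasing timestamp for ordering log entries across
    // the native and hosted-.NET log files of the same process.
    #include <windows.h>

    unsigned long long LogTimestamp()
    {
        LARGE_INTEGER counter;
        QueryPerformanceCounter(&counter);
        return (unsigned long long)counter.QuadPart;
    }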
Here's an interesting article! It says not to use RDTSC, but to instead use QueryPerformanceCounter.
Conclusion:
Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operating systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]
However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.
Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).
For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).
For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code above.
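The code the article refers to is not reproduced here, but a minimal sketch of the hybrid scheme it describes (Sleep(1) while more than a millisecond remains, then a QueryPerformanceCounter busy loop for the rest) might look like this; timeBeginPeriod(1) is assumed to have been called at program start:

    // Hybrid delay: yield the CPU with Sleep(1) for the bulk of the wait,
    // then spin on QueryPerformanceCounter for the final sub-millisecond part.
    #include <windows.h>

    void HybridDelay(double seconds)
    {
        LARGE_INTEGER freq, start, now;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        LONGLONG target = start.QuadPart + (LONGLONG)(seconds * freq.QuadPart);

        for (;;)
        {
            QueryPerformanceCounter(&now);
            LONGLONG remaining = target - now.QuadPart;
            if (remaining <= 0)
                break;                            // delay complete
            if (remaining > freq.QuadPart / 1000)
                Sleep(1);                         // more than 1 ms left: yield
            // else: loop again immediately (busy-wait for the last < 1 ms)
        }
    }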
You can use the RDTSC CPU instruction (assuming x86). This instruction gives the CPU cycle counter, but be aware that it will increase very quickly to its maximum value, and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.
System.Diagnostics.Stopwatch.GetTimestamp() returns the number of CPU cycles since a time origin (maybe when the computer starts, but I'm not sure), and I've never seen it fail to increase between two calls.
The CPU cycle count will be specific to each computer, so you can't use it to merge log files between two computers.
RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.
Use GetTickCount and add another counter as you merge the log files. It won't give you a perfect sequence between the different log files, but it will at least keep all the entries from each file in the correct order.
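A small sketch of that idea (hypothetical function): tag each entry with GetTickCount plus a per-file sequence number, so entries that share a tick still sort in the order they were written.

    // Coarse timestamp plus a sequence number that breaks ties within a file.
    #include <windows.h>
    #include <stdio.h>

    static volatile LONG g_sequence = 0;

    void LogLine(FILE* logFile, const char* message)
    {
        DWORD tick = GetTickCount();                     // ~10-16 ms granularity
        LONG  seq  = InterlockedIncrement(&g_sequence);  // strictly increasing per file
        fprintf(logFile, "%lu\t%ld\t%s\n", tick, seq, message);
    }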