Sorry for my weak English. By preemption I mean a forced context (process) switch applied to my process.
My question is: if I write and run my own game in such a way that it does 20 milliseconds of work, then a 5 millisecond sleep, and then the Windows message pump (PeekMessage/DispatchMessage), in a loop again and again - is it ever preempted by force in Windows, or does this preemption not occur?
I suppose this preemption would occur if I did not voluntarily give control back to the system via the sleep or PeekMessage/DispatchMessage for a longer amount of time. In this case, will it occur or not?
The short answer is: Yes, it can be, and it will be preempted.
Not only can driver events (interrupts) preempt your thread at any time; such a thing may also happen due to a temporary priority boost, for example when a waitable object that a thread is blocked on is signalled, or when another window becomes the topmost window. Or, another process might simply adjust its priority class.
There is no way (short of giving your process realtime priority, and this is a very bad idea -- forget about it immediately) to guarantee that no "normal" thread will preempt you, and even then hardware interrupts will preempt you, and certain threads such as the one handling disk I/O and the mouse will compete with you over time quantums. So, even if you run with realtime priority (which is not truly "realtime"), you still have no guarantee, but you seriously interfere with important system services.
On top of that, sleeping for 5 milliseconds is imprecise at best, and unreliable otherwise.
Sleeping will make your thread ready (ready does not mean "it will run", it merely means that it may run -- if and only if a time slice becomes available and no other ready thread is first in line) on the next scheduler tick. This effectively means that the amount of time you sleep is rounded to the granularity of the system timer resolution (see timeBeginPeriod function), plus some unknown time.
By default, the timer resolution is 15.6ms, so your 5ms will be 7.8ms on average (assuming the best, uncontended case), but possibly a lot more. If you adjust the system timer resolution to 1ms (which is often the lowest possible, though some systems allow 0.5ms), it's somewhat better, but still not precise or reliable. Plus, making the scheduler run more often burns a considerable amount of CPU cycles in interrupts, and power. Therefore, it is not something that is generally advisable.
To make things even worse, you cannot even rely on Sleep's rounding mode, since Windows 2000/XP round differently from Windows Vista/7/8.
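A small illustration (not from the original answer) of how to observe this; it assumes the program is linked against winmm.lib so that timeBeginPeriod is available:

#include <windows.h>
#include <stdio.h>

int main() {
    timeBeginPeriod(1);                        // request ~1 ms timer resolution (winmm)
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    Sleep(5);                                  // ask for 5 ms...
    QueryPerformanceCounter(&t1);
    printf("Sleep(5) actually took %.2f ms\n", // ...and print what you really got
           (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
    timeEndPeriod(1);
    return 0;
}

Running it with and without the timeBeginPeriod call typically shows the difference in rounding that the answer describes.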
It can be interrupted by a driver at any time. The driver may signal another thread and then ask the OS to schedule/dispatch. The newly-ready thread may well run instead of yours.
Desktop operating systems like Windows do not provide any real-time guarantees - they were not designed to provide them.
I've wondered about this for a very long time.
I understand that GUI programming is event-driven. I understand that most GUI programs will feature an event loop which loops through all the events on the message queue. I also understand that it does so by calling some kind of Operating System method like "get_message()", which will block the thread until a message is received. In this sense, when no events are happening, the thread is sleeping peacefully.
My question, however, is: how does the Operating System check for available messages? Somewhere down the stack I assume there must be a loop which is continually checking for new events. Such a loop cannot possibly feature any blocking, because if so, there must be another looping thread which is 'always awake', ready to wake the first. However, I also appreciate that this cannot be true, because otherwise I would expect to see at least one processor core at 100% use at all times, checking over and over and over and over....
I have considered that perhaps the checking thread has a small sleep between each iteration. This would certainly explain why an idle system isn't using 100% CPU. But then I recalled how events are usually responded to immediately. Take a mouse movement for example: the cursor is being constantly redrawn, in sync with the physical movements.
Is there something fundamental, perhaps, in CPU architectures that allows threads to be woken at the hardware level, when certain memory addresses change value?
I'm otherwise out of ideas! Could anyone please help to explain what's really happening?
Yes there is: hardware interrupts.
When a key is pressed or the mouse is moved, or a network packet arrives, or data is read from some other device, or a timer elapses, the OS receives a hardware interrupt.
Threads or applications wanting to do I/O have to call a function in the OS, which returns the requested data, or, suspends the calling thread if the data is not available yet. This suspension simply means the thread is not considered for scheduling, until some condition changes - in this case, the requested data must be available. Such threads are said to be 'IO blocked'.
When the OS receives an interrupt indicating some device has some data, it looks through its list of suspended threads to see if there is one that is suspended because it is waiting for that data, and then removes the suspension, making it eligible for scheduling again.
In this interrupt-driven way, no CPU time is wasted 'polling' for data.
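In Win32 terms, this is why a plain GetMessage loop costs essentially nothing while idle: the thread is blocked in the kernel until an interrupt-driven event produces a message. A minimal sketch:

MSG msg;
// GetMessage blocks; the thread consumes no CPU time while the queue is empty.
while (GetMessage(&msg, NULL, 0, 0) > 0) {
    TranslateMessage(&msg);    // e.g. turns key presses into WM_CHAR messages
    DispatchMessage(&msg);     // calls the window procedure for the target window
}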
Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation prevention mechanism: the so-called Balance Set Manager wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds, and if there is one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops to its base value. You can find out more in the Windows Internals book(s).
So, what you observe is pretty normal behavior; threads may not be run for seconds.
You may need to elevate priorities or otherwise consider other threads that are competing for the CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for some of the high-precision timers. Some details are in this MSDN link.
First of all, I am not sure if Delphi's Now is a good choice for millisecond-precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast; however, if you are trying to enter a critical section which is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (a mutex or event), which involves the thread yielding control and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted in your test - the overall CPU load at the time of your testing. The higher the load, the lower the chance that your thread continues execution soon. 16 ms still looks perhaps within reasonable tolerance, and all in all it might depend on your actual implementation.
Is there any way at all in the Windows environment to sleep for ~1 microsecond? After researching and reading many threads and various websites, I have not been able to see that this is possible. Since the scheduler appears to be the limiting factor, and it operates at the 1 millisecond level, I believe it can't be done without going to a real-time OS.
It may not be the most portable, and I've not used these functions myself, but it might be possible to use the information in the High-Resolution Timer section of this link and block: QueryPerformanceCounter
Despite the fact that Windows is claimed not to be a "real-time" OS, events can be generated at microsecond resolution. The use of a combination of system time (file_time) and the performance counter frequency has been described elsewhere. However, a careful implementation that takes care of processor affinity and process/thread priorities opens the door to timed events at microsecond resolution.
Since the Windows scheduler and the Windows timer services rely on the system's interrupt mechanism, microsecond precision can only be obtained by polling. Particularly on multicore systems, polling is not so ugly anymore, and the polling only has to last for the shortest possible interrupt period. The multimedia timer interface allows the interrupt period to be set down to about 1 ms, so one can sleep until near the desired (microsecond-resolution) time and the polling will last for 1 ms at most.
My implementation of microsecond-resolution time services for Windows, test code, and an extensive description can be found at the Windows Timestamp Project, located at windowstimestamp.com.
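A minimal sketch of that sleep-then-poll idea (this is not the project's code, and it assumes timeBeginPeriod(1) has already been called so that Sleep has roughly 1 ms granularity):

#include <windows.h>

// Wait until an absolute QueryPerformanceCounter value is reached.
void WaitUntil(LONGLONG targetQpc) {
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    for (;;) {
        QueryPerformanceCounter(&now);
        LONGLONG remaining = targetQpc - now.QuadPart;
        if (remaining <= 0)
            return;                                     // target time reached
        double remainingMs = remaining * 1000.0 / freq.QuadPart;
        if (remainingMs > 2.0)
            Sleep((DWORD)(remainingMs - 2.0));          // coarse wait via the scheduler
        // otherwise: busy-poll the last couple of milliseconds
    }
}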
A simple game loop runs somewhat like this:
MSG msg;
while (running) {
    if (PeekMessage(&msg, hWnd, 0, 0, PM_REMOVE)) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    else try {
        onIdle();
    }
    catch (std::exception& e) {
        onError(e.what());
        close();
    }
}
(taken from this question)
I'm using Windows here for the sake of the example; it could be any platform. Correct me if I'm mistaken, but such a loop would use 100% of a CPU core (it uses 50% of one core on my computer) since it's always checking the state of the running variable.
My question is, would it be better (performance-wise) to implement the game loop using the OS' (in the example Windows) timer functions, setting the required interval according to the desired number of ticks of game logic wanted per second? I ask this because I assume the timer functions use the CPU's RTC interrupts.
Typically, a game will keep drawing new frames all the time, even when the user does not do anything. This would normally happen where you have your onIdle() call. If your game only updates the window/screen when the user presses a button or such, or sporadically in between, then MsgWaitForMultipleObjects is a good option.
However, in a continuous-animation game, you would normally not want to block or sleep in the render thread at all if you can help it; instead you want to render at maximum speed and rely on vertical sync as a throttle. The reason for that is that timing, blocking, and sleeping are imprecise at best and unreliable at worst, and they will quite possibly add disturbing artefacts to your animations.
What you normally want to do is push everything belonging to a frame to the graphics API (most likely OpenGL since you said "any platform") as fast as you can, signal a worker thread to start doing the game logic and physics etc, and then block on SwapBuffers.
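A rough sketch of that pattern under Win32/OpenGL (RenderFrame and the worker event are illustrative names, not from the original answer):

#include <windows.h>

void RenderFrame();                         // assumed: issues all draw calls for one frame

void FrameLoop(HDC hdc, HANDLE workerWakeEvent, volatile bool& running) {
    while (running) {
        RenderFrame();                      // push everything belonging to the frame
        SetEvent(workerWakeEvent);          // wake the worker doing game logic / physics
        SwapBuffers(hdc);                   // blocks on the buffer swap; vsync acts as the throttle
    }
}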
All timers, timeouts, and sleep are limited by the scheduler's resolution, which is 15ms under Windows (can be set to 1ms using timeBeginPeriod). At 60fps, a frame is 16.666ms, so blocking for 15ms is catastrophic, but even 1ms is still a considerable time. There is nothing you can do to get a better resolution (this is considerably better under Linux).
Sleep, or any blocking function that has a timeout, guarantees that your process sleeps for at least as long as you asked for (in fact, on Posix systems, it may sleep less if an interrupt occurred). It does not give you a guarantee that your thread will run as soon as the time is up, or any other guarantees.
Sleep(0) under Windows is even worse. The documentation says "the thread will relinquish the remainder of its time slice but remain ready. Note that a ready thread is not guaranteed to run immediately". Reality has it that it works kind of OK most of the time, blocking anywhere from "not at all" to 5-10ms, but on occasions I've seen Sleep(0) block for 100ms too, which is a disaster when it happens.
Using QueryPerformanceCounter is asking for trouble -- just don't do it. On some systems (Windows 7 with a recent CPU) it will work just fine because it uses a reliable HPET, but on many other systems you will see all kinds of strange "time travel" or "reverse time" effects which nobody can explain or debug. That is because on those systems, the result of QPC reads the TSC counter which depends on the CPU's frequency. CPU frequency is not constant, and TSC values on multicore/multiprocessor systems need not be consistent.
Then there is the synchronization issue. Two separate timers, even when ticking at the same frequency, will necessarily become unsynchronized (because there is no such thing as "same"). There is a timer in your graphics card doing the vertical sync already. Using this one for synchronisation will mean everything will work, using a different one will mean things will eventually be out of sync. Being out of sync may have no effect at all, may draw a half finished frame, or may block for one full frame, or whatever.
While missing a frame is usually not that much of a big deal (if it's just one and happens rarely), in this case it is something you can totally avoid in the first place.
Depends on whose performance you want to optimize for. If you want the fastest frame rate for your game, then use the loop you posted. If your game is full screen, this is likely the right thing.
If you want to limit the frame rate and allow the desktop and other apps to get some cycles, then a timer approach would suffice. In Windows, you would simply use a hidden window, a SetTimer call, and change PeekMessage to GetMessage. Just remember that WM_TIMER intervals are not precise. When you need to figure out how much real time has elapsed since the last frame, call GetTickCount instead of assuming that you woke up exactly on the time interval.
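For example (hWnd and UpdateGame are placeholders), the timer-driven variant might look like this:

SetTimer(hWnd, 1, 16, NULL);              // ~60 ticks/second; WM_TIMER is not precise
DWORD lastTick = GetTickCount();
MSG msg;
while (GetMessage(&msg, NULL, 0, 0) > 0) {
    if (msg.message == WM_TIMER) {
        DWORD now = GetTickCount();
        UpdateGame(now - lastTick);       // use the real elapsed time, not the nominal interval
        lastTick = now;
    }
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}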
A hybrid approach that works well with respect to games and desktop apps is to use the loop you posted above, but insert a Sleep(0) after your OnIdle function returns.
MsgWaitForMultipleObjects should be used in a Windows game message loop, as you can get it to automatically stop waiting for messages according to a definable timeout. This can throttle your OnIdle calls a lot while ensuring that the window responds to messages quickly.
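A sketch of such a loop (the 16 ms timeout and RunOneGameTick are illustrative):

MSG msg;
bool running = true;
while (running) {
    // Wake up when a message arrives or after ~16 ms, whichever comes first.
    MsgWaitForMultipleObjects(0, NULL, FALSE, 16, QS_ALLINPUT);
    while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
        if (msg.message == WM_QUIT)
            running = false;
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    RunOneGameTick();                     // throttled: runs roughly once per timeout or message burst
}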
If your window is not full screen, then a SetTimer call driving a timer proc or posting WM_TIMER messages to your game's WindowProc is required - unless you don't mind the animation stalling whenever the user, for example, clicks on your window's title bar, or you pop up a modal dialog of some kind.
On Windows there is no way to access the actual timer interrupts. That stuff is buried away under many layers of hardware abstraction. Timer interrupts are handled by drivers, which signal kernel objects that user-mode code can use in calls to WaitForMultipleObjects. This means the kernel will wake up your thread to handle a kernel timer associated with your thread when you called SetTimer(), at which point GetMessage/PeekMessage will, if the message queue is empty, synthesize a WM_TIMER message.
Never use GetTickCount or WM_TIMER for timing in a game, they have horrible resolution. Instead, use QueryPerformanceCounter.
I have seen a question on why "polling is bad". In terms of minimizing the amount of processor time used by one thread, would it be better to do a spin wait (i.e. poll for a required change in a while loop) or wait on a kernel object (e.g. a kernel event object in windows)?
For context, assume that the code would be required to run on any type of processor, single core, hyperthreaded, multicore, etc. Also assume that a thread that would poll or wait can't continue until the polling result is satisfactory if it polled instead of waiting. Finally, the time between when a thread starts waiting (or polling) and when the condition is satisfied can potentially vary from a very short time to a long time.
Since the OS is likely to more efficiently "poll" in the case of "waiting", I don't want to see the "waiting just means someone else does the polling" argument, that's old news, and is not necessarily 100% accurate.
Provided the OS has reasonable implementations of these type of concurrency primitives, it's definitely better to wait on a kernel object.
Among other reasons, this lets the OS know not to schedule the thread in question for additional timeslices until the object being waited-for is in the appropriate state. Otherwise, you have a thread which is constantly getting rescheduled, context-switched-to, and then running for a time.
You specifically asked about minimizing the processor time for a thread: in this example the thread blocking on a kernel object would use ZERO time; the polling thread would use all sorts of time.
Furthermore, the "someone else is polling" argument needn't be true. When a kernel object enters the appropriate state, the kernel can look to see at that instant which threads are waiting for that object...and then schedule one or more of them for execution. There's no need for the kernel (or anybody else) to poll anything in this case.
Waiting is the "nicer" way to behave. When you are waiting on a kernel object your thread won't be granted any CPU time as it is known by the scheduler that there is no work ready. Your thread is only going to be given CPU time when it's wait condition is satisfied. Which means you won't be hogging CPU resources needlessly.
I think a point that hasn't been raised yet is that if your OS has a lot of work to do, blocking yields your thread to another process. If all processes use the blocking primitives where they should (such as kernel waits, file/network I/O, etc.), you're giving the kernel more information to choose which threads should run. As such, it will do more work in the same amount of time. If your application could be doing something useful while waiting for that file to open or the packet to arrive, then yielding will even help your own app.
Waiting does involve more resources and means an additional context switch. Indeed, some synchronization primitives like CLR monitors and Win32 critical sections use a two-phase locking protocol - some spin waiting is done before actually doing a true wait.
I imagine doing the two-phase thing yourself would be very difficult, and would involve lots of testing and research. So, unless you have the time and resources, stick to the Windows primitives... they already did the research for you.
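In fact, Windows exposes that two-phase behavior directly: a critical section can be given a spin count, so you get a short spin before the kernel wait without writing it yourself (the 4000 below is just an illustrative value):

CRITICAL_SECTION cs;
InitializeCriticalSectionAndSpinCount(&cs, 4000);  // spin briefly on contention, then do a kernel wait

EnterCriticalSection(&cs);
// ... short protected work ...
LeaveCriticalSection(&cs);

DeleteCriticalSection(&cs);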
There are only a few places, usually within the OS low-level internals (interrupt handlers/device drivers), where spin-waiting makes sense/is required. General-purpose applications are always better off waiting on synchronization primitives like mutexes/condition variables/semaphores.
I agree with Darksquid: if your OS has decent concurrency primitives, then you shouldn't need to poll. Polling usually comes into its own on real-time systems or restricted hardware that doesn't have an OS; there you need to poll, because you might not have the option to wait(), but also because it gives you fine-grained control over exactly how long you want to wait in a particular state, as opposed to being at the mercy of the scheduler.
Waiting (blocking) is almost always the best choice ("best" in the sense of making efficient use of processing resources and minimizing the impact to other code running on the same system). The main exceptions are:
When the expected polling duration is small (similar in magnitude to the cost of the blocking syscall).
Mostly in embedded systems, when the CPU is dedicated to performing a specific task and there is no benefit to having the CPU idle (e.g. some software routers built in the late '90s used this approach.)
Polling is generally not used within OS kernels to implement blocking system calls - instead, events (interrupts, timers, actions on mutexes) result in a blocked process or thread being made runnable.
There are four basic approaches one might take:
Use some OS waiting primitive to wait until the event occurs
Use some OS timer primitive to check at some defined rate whether the event has occurred yet
Repeatedly check whether the event has occurred, but use an OS primitive to yield a time slice for an arbitrary and unknown duration any time it hasn't.
Repeatedly check whether the event has occurred, without yielding the CPU if it hasn't.
When #1 is practical, it is often the best approach unless delaying one's response to the event might be beneficial. For example, if one is expecting to receive a large amount of serial port data over the course of several seconds, and if processing data 100ms after it is sent will be just as good as processing it instantly, periodic polling using one of the latter two approaches might be better than setting up a "data received" event.
Approach #3 is rather crude, but may in many cases be a good one. It will often waste more CPU time and resources than would approach #1, but it will in many cases be simpler to implement and the resource waste will in many cases be small enough not to matter.
Approach #2 is often more complicated than #3, but has the advantage of being able to handle many resources with a single timer and no dedicated thread.
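On Windows, approach #2 might be sketched with a waitable timer (DataIsReady is an illustrative placeholder):

HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);   // auto-reset waitable timer
LARGE_INTEGER due;
due.QuadPart = -1000000LL;                               // first fire in 100 ms (relative, 100-ns units)
SetWaitableTimer(timer, &due, 100, NULL, NULL, FALSE);   // then fire every 100 ms
while (!DataIsReady())
    WaitForSingleObject(timer, INFINITE);                // sleeps between checks; no busy loop
CloseHandle(timer);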
Approach #4 is sometimes necessary in embedded systems, but it is generally very bad unless one is directly polling hardware and the CPU won't have anything useful to do until the event in question occurs. In many circumstances, it won't be possible for the condition being waited upon to occur until the thread waiting for it yields the CPU. Yielding the CPU as in approach #3 will in fact allow the waiting thread to see the event sooner than hogging it would.