Handling the game loop with OS timers - Windows

A simple game loop runs somewhat like this:
MSG msg;
while (running)
{
    if (PeekMessage(&msg, hWnd, 0, 0, PM_REMOVE))
    {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    else
    {
        try
        {
            onIdle();
        }
        catch (std::exception& e)
        {
            onError(e.what());
            close();
        }
    }
}
(taken from this question)
I'm using Windows as an example here for the sake of illustration; it could be any platform. Correct me if I'm mistaken, but such a loop would use 100% of a CPU core (it uses 50% of one core on my computer), since it is always checking the state of the running variable.
My question is: would it be better (performance-wise) to implement the game loop using the OS's timer functions (Windows in this example), setting the interval according to the desired number of game-logic ticks per second? I ask this because I assume the timer functions use the CPU's RTC interrupts.

Typically, a game will keep drawing new frames all the time, even when the user does not do anything. This would normally happen where you have your onIdle() call. If your game only updates the window/screen when the user presses a button or such, or only sporadically in between, then MsgWaitForMultipleObjects is a good option (a sketch follows below).
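A minimal sketch of that style of loop, reusing the running flag and onIdle() from the question (the 16ms tick interval is just an illustrative value):

// Sketch only: wake up either when a message arrives or when the tick
// interval elapses, so onIdle() runs periodically without busy-waiting.
const DWORD tickMs = 16;   // assumed ~60 logic ticks per second
MSG msg;
while (running)
{
    DWORD r = MsgWaitForMultipleObjects(0, nullptr, FALSE, tickMs, QS_ALLINPUT);
    while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE))
    {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    if (r == WAIT_TIMEOUT)
        onIdle();          // the game-tick call from the question
}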
However, in a continuous-animation game, you would normally not want to block or sleep in the render thread at all if you can help it; instead you want to render at maximum speed and rely on vertical sync as a throttle. The reason is that timing, blocking, and sleeping are imprecise at best and unreliable at worst, and they will quite possibly add disturbing artefacts to your animations.
What you normally want to do is push everything belonging to a frame to the graphics API (most likely OpenGL since you said "any platform") as fast as you can, signal a worker thread to start doing the game logic and physics etc, and then block on SwapBuffers.
All timers, timeouts, and sleeps are limited by the scheduler's resolution, which is roughly 15.6ms by default under Windows (it can be lowered to 1ms using timeBeginPeriod). At 60fps, a frame is 16.666ms, so blocking for 15ms is catastrophic, and even 1ms is still a considerable time. There is nothing you can do on Windows to get a better resolution (the situation is considerably better under Linux).
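For reference, raising the resolution looks roughly like this (a sketch; timeBeginPeriod/timeEndPeriod live in winmm, and the 1ms setting affects the whole system):

#include <windows.h>
#pragma comment(lib, "winmm.lib")   // timeBeginPeriod/timeEndPeriod

void runWithFineTimers()
{
    timeBeginPeriod(1);   // request ~1ms scheduler granularity (system-wide)
    // ... run the game loop; Sleep() and timers now resolve to ~1ms instead of ~15.6ms ...
    timeEndPeriod(1);     // always pair with timeEndPeriod to restore the default
}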
Sleep, or any blocking function with a timeout, only guarantees that your thread sleeps for at least as long as you asked for (in fact, on POSIX systems it may sleep less if an interrupt occurs). It does not guarantee that your thread will run as soon as the time is up, or anything else.
Sleep(0) under Windows is even worse. The documentation says "the thread will relinquish the remainder of its time slice but remain ready. Note that a ready thread is not guaranteed to run immediately". In reality it works reasonably well most of the time, blocking anywhere from "not at all" to 5-10ms, but on occasion I've seen Sleep(0) block for 100ms too, which is a disaster when it happens.
Using QueryPerformanceCounter is asking for trouble -- just don't do it. On some systems (Windows 7 with a recent CPU) it will work just fine because it uses a reliable HPET, but on many other systems you will see all kinds of strange "time travel" or "reverse time" effects which nobody can explain or debug. That is because on those systems QPC is backed by the TSC, which depends on the CPU's frequency; CPU frequency is not constant, and TSC values on multicore/multiprocessor systems need not be consistent.
Then there is the synchronization issue. Two separate timers, even when ticking at the same frequency, will necessarily become unsynchronized (because there is no such thing as "same"). There is a timer in your graphics card doing the vertical sync already. Using this one for synchronisation will mean everything will work, using a different one will mean things will eventually be out of sync. Being out of sync may have no effect at all, may draw a half finished frame, or may block for one full frame, or whatever.
While missing a frame is usually not that much of a big deal (if it's just one and happens rarely), in this case it is something you can totally avoid in the first place.

It depends on whose performance you want to optimize for. If you want the fastest frame rate for your game, then use the loop you posted. If your game is full screen, this is likely the right thing.
If you want to limit the frame rate and allow the desktop and other apps to get some cycles, then a timer approach would suffice. In Windows, you would simply use a hidden window, a SetTimer call, and change PeekMessage to GetMessage. Just remember that WM_TIMER intervals are not precise: when you need to figure out how much real time has elapsed since the last frame, call GetTickCount instead of assuming that you woke up exactly on the timer interval.
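A sketch of that timer-driven variant (onTick is a hypothetical per-frame update function, and the hidden window hWnd is assumed to exist already):

#include <windows.h>

void onTick(DWORD elapsedMs);   // hypothetical game-update function

void runTimerLoop(HWND hWnd)
{
    SetTimer(hWnd, 1, 33, nullptr);   // ask for roughly 30 ticks/second (imprecise)
    DWORD last = GetTickCount();

    MSG msg;
    while (GetMessage(&msg, nullptr, 0, 0) > 0)   // blocks; the CPU stays idle between ticks
    {
        if (msg.message == WM_TIMER)
        {
            DWORD now = GetTickCount();
            onTick(now - last);        // pass the real elapsed time, don't assume 33ms
            last = now;
        }
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    KillTimer(hWnd, 1);
}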
A hybrid approach that works well with respect to games and desktop apps is to use the loop you posted above, but insert a Sleep(0) after your OnIdle function returns.

MsgWaitForMultipleObjects should be used in a Windows game message loop, as you can get it to automatically stop waiting for messages after a definable timeout. This can throttle your OnIdle calls a lot while ensuring that the window still responds to messages quickly.
If your window is not full screen, then a SetTimer call (invoking a timer proc or posting WM_TIMER messages to your game's WindowProc) is required, unless you don't mind the animation stalling whenever the user, for example, clicks on your window's title bar, or you pop up a modal dialog of some kind.
On Windows there is no way to access actual timer interrupts; that stuff is buried away under many layers of hardware abstraction. Timer interrupts are handled by drivers, which signal kernel objects that user-mode code can wait on via WaitForMultipleObjects. This means that the kernel will wake up your thread to handle the kernel timer associated with your thread when you called SetTimer(), at which point GetMessage/PeekMessage will, if the message queue is empty, synthesize a WM_TIMER message.

Never use GetTickCount or WM_TIMER for timing in a game; they have horrible resolution. Instead, use QueryPerformanceCounter.
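A small sketch of what this answer means, using the high-resolution counter to measure the time between frames (note the earlier answer's caveats about QPC on older hardware):

#include <windows.h>

// Returns the seconds elapsed since the previous call (sketch only).
double frameSeconds()
{
    static LARGE_INTEGER freq = {};
    static LARGE_INTEGER last = {};
    if (freq.QuadPart == 0)
    {
        QueryPerformanceFrequency(&freq);   // counter ticks per second, fixed at boot
        QueryPerformanceCounter(&last);
    }

    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    double dt = double(now.QuadPart - last.QuadPart) / double(freq.QuadPart);
    last = now;
    return dt;
}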

Related

How are sleeping threads woken, at the lowest level?

I've wondered about this for a very long time.
I understand that GUI programming is event-driven. I understand that most GUI programs feature an event loop which loops through all events on the message queue. I also understand that it does so by calling some kind of Operating System method like "get_message()", which blocks the thread until a message is received. In this sense, when no events are happening, the thread is sleeping peacefully.
My question, however, is: how does the Operating System check for available messages? Somewhere down the stack I assume there must be a loop which is continually checking for new events. Such a loop cannot possibly feature any blocking, because if so, there must be another looping thread which is 'always awake', ready to wake the first. However, I also appreciate that this cannot be true, because otherwise I would expect to see at least one processor core at 100% use at all times, checking over and over and over and over....
I have considered that perhaps the checking thread has a small sleep between each iteration. This would certainly explain why an idle system isn't using 100% CPU. But then I recalled how events are usually responded to immediately. Take a mouse movement for example: the cursor is being constantly redrawn, in sync with the physical movements.
Is there something fundamental, perhaps, in CPU architectures that allows threads to be woken at the hardware level, when certain memory addresses change value?
I'm otherwise out of ideas! Could anyone please help to explain what's really happening?
Yes there is: hardware interrupts.
When a key is pressed or the mouse is moved, or a network packet arrives, or data is read from some other device, or a timer elapses, the OS receives a hardware interrupt.
Threads or applications wanting to do I/O have to call a function in the OS, which returns the requested data or suspends the calling thread if the data is not available yet. This suspension simply means the thread is not considered for scheduling until some condition changes - in this case, the requested data must be available. Such threads are said to be 'I/O blocked'.
When the OS receives an interrupt indicating some device has data, it looks through its list of suspended threads to see if one of them is suspended because it is waiting for that data, and then removes the suspension, making it eligible for scheduling again.
In this interrupt-driven way, no CPU time is wasted 'polling' for data.

Forced preemption on Windows (does it occur here or not?)

Sorry for my weak English; by preemption I mean a forced context (process) switch applied to my process.
My question is:
If I write and run my own game program in such a way that it does 20 milliseconds of work, then a 5 millisecond sleep, and then the Windows pump (PeekMessage/DispatchMessage) in a loop, again and again - will it ever be forcibly preempted by Windows, or does this preemption not occur?
I suppose that this preemption would occur if I did not voluntarily give control back to the system via sleep or peek/dispatch for a longer amount of time. Here, will it occur or not?
The short answer is: Yes, it can be, and it will be preempted.
Not only can driver events (interrupts) preempt your thread at any time; such a thing may also happen due to a temporary priority boost, for example when a waitable object on which a thread is blocked becomes signalled, or due to another window becoming the topmost window. Or another process might simply adjust its priority class.
There is no way (short of giving your process realtime priority, and this is a very bad idea -- forget about it immediately) to guarantee that no "normal" thread will preempt you, and even then hardware interrupts will preempt you, and certain threads, such as the ones handling disk I/O and the mouse, will compete with you over time quanta. So even if you run with realtime priority (which is not truly "realtime"), you still have no guarantee, but you seriously interfere with important system services.
On top of that, sleeping for 5 milliseconds is imprecise at best, and unreliable otherwise.
Sleeping will make your thread ready (ready does not mean "it will run", it merely means that it may run -- if and only if a time slice becomes available and no other ready thread is first in line) on the next scheduler tick. This effectively means that the amount of time you sleep is rounded to the granularity of the system timer resolution (see timeBeginPeriod function), plus some unknown time.
By default, the timer resolution is 15.6ms, so your 5ms sleep will be about 7.8ms on average (assuming the best, uncontended case), but possibly a lot more. If you adjust the system timer resolution to 1ms (which is often the lowest possible, though some systems allow 0.5ms), it's somewhat better, but still not precise or reliable. Plus, making the scheduler run more often burns a considerable amount of CPU cycles in interrupts, and power, so it is not something that is generally advisable.
To make things even worse, you cannot even rely on Sleep's rounding mode, since Windows 2000/XP round differently from Windows Vista/7/8.
It can be interrupted by a driver at any time. The driver may signal another thread and then ask the OS to schedule/dispatch. The newly-ready thread may well run instead of yours.
These desktop OS, like Windows, do not provide any real-time guarantees - they were not designed to provide it.

Win32 game loop that doesn't spike the CPU

There are plenty of examples in Windows of applications triggering code at fairly high and stable framerates without spiking the CPU.
WPF/Silverlight/WinRT applications can do this, for example. So can browsers and media players. How exactly do they do this, and what API calls would I make to achieve the same effect from a Win32 application?
Clock polling doesn't work, of course, because that spikes the CPU. Neither does Sleep(), because you only get around 50ms granularity at best.
They are using multimedia timers. You can find information on MSDN here
Only the view is invalidated (e.g. with InvalidateRect) on each multimedia timer event. Drawing happens in the WM_PAINT / OnPaint handler.
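A sketch of that pattern: timeSetEvent is the classic multimedia-timer call, g_hWnd is an assumed global window handle, and the callback only invalidates, it never draws.

#include <windows.h>
#pragma comment(lib, "winmm.lib")

HWND g_hWnd;   // assumed: the window to animate

// Runs on a timer thread; it only schedules a repaint, drawing stays in WM_PAINT.
void CALLBACK tickProc(UINT, UINT, DWORD_PTR, DWORD_PTR, DWORD_PTR)
{
    InvalidateRect(g_hWnd, nullptr, FALSE);
}

void startAnimation()
{
    timeBeginPeriod(1);                               // 1ms timer resolution
    timeSetEvent(16, 1, tickProc, 0, TIME_PERIODIC);  // ~60 callbacks per second
}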
Actually, there's nothing wrong with sleep.
You can use a combination of QueryPerformanceCounter/QueryPerformanceFrequency to obtain very accurate timings, and on average you can create a loop which ticks forward exactly when it's supposed to.
I have never seen a sleep miss its deadline by as much as 50 ms; however, I've seen plenty of naive timers that drift, i.e. accumulate a small delay and consequently update at noticeably irregular intervals. This is what causes uneven frame rates.
If you play a very short beep on every n:th frame, this is very audible.
Also, logic and rendering can be run independently of each other. The CPU might not appear to be that busy, but I bet you the GPU is hard at work.
Now, about not hogging the CPU. CPU usage is just a breakdown of CPU time spent by a process within a given sample (the thread scheduler actually tracks this). If you have a target of 30 Hz for your game, you're limited to 33ms per frame, otherwise you'll be lagging behind (too slow a CPU or too slow code); if you can't hit this target you won't be running at 30 Hz, and if you finish under 33ms you can yield processor time, effectively freeing up resources.
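Sketched out, that 30 Hz budget could look like this (updateAndRender and the running flag are assumed to be supplied by the game):

#include <windows.h>

void runAt30Hz(volatile bool& running, void (*updateAndRender)())
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    const double target = 1.0 / 30.0;   // 33.3ms frame budget

    while (running)
    {
        QueryPerformanceCounter(&start);
        updateAndRender();              // the frame's worth of work
        QueryPerformanceCounter(&end);

        double spent = double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
        if (spent < target)
            Sleep(DWORD((target - spent) * 1000.0));   // give the surplus back (coarse!)
    }
}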
This might be an interesting read for you as well.
On a side note, instead of yielding time you could effectively be doing prep work for future computations. Some games, when they are not under the heaviest of loads, actually do things such as sorting and memory defragmentation, a little bit here and there; it adds up in the end.

How can I tell Windows XP/7 not to switch threads during a certain segment of my code?

I want to prevent a thread switch by Windows XP/7 in a time critical part of my code that runs in a background thread. I'm pretty sure I can't create a situation where I can guarantee that won't happen, because of higher priority interrupts from system drivers, etc. However, I'd like to decrease the probability of a thread switch during that part of my code to the minimum that I can. Are there any create-thread flags or Window API calls that can assist me? General technique tips are appreciated too. If there is a way to get this done without having to raise the threads priority to real-time-critical that would be great, since I worry about creating system performance issues for the user if I do that.
UPDATE: I am adding this update after seeing the first responses to my original post. The concrete application that motivated the question has to do with real-time audio streaming. I want to eliminate every bit of delay I can. I found after coding up my original design that a thread switch can cause a delay of 70ms or more at times. Since my app sits between two sockets, acting as a middleman for delivering audio, the instant I receive an audio buffer I want to immediately turn around and push it out to the destination socket. My original design used two cooperating threads and a semaphore, since there was one thread managing the source socket and another thread for the destination socket. This architecture evolved from the fact that the two devices behind the sockets are disparate entities.
I realized that if I combined the two sockets onto the same thread I could write a code block that reacted immediately to the socket-data-received message and turned it around to the destination socket in one shot. Now if I can do my best to avoid an intervening thread switch, that would be the optimal coding architecture for minimizing delay. To repeat, I know I can't guarantee this situation, but I am looking for tips/suggestions on how to write a block of code that does this and minimizes as best as I can the chance of an intervening thread switch.
Note, I am aware that O/S code behind the sockets introduces (potential) delays of its own.
AFAIK there are no such flags in CreateThread or the like (this also doesn't make sense, IMHO). You may suspend other threads in your process during critical situations (by enumerating them and using SuspendThread), and you could theoretically enumerate & suspend threads in other processes as well.
OTOH, suspending threads is generally not a good idea: eventually you may call some 3rd-party code that implicitly waits for something that should be accomplished in another thread, which you have suspended.
IMHO, you should use what's suggested for this case: playing with thread/process priorities (you may also consider SetThreadPriorityBoost). The OS also tends to raise the priority of threads that don't use the CPU aggressively. That is, threads that work often but for short durations (before calling one of the waiting functions that suspend them until some condition) are considered to behave "nicely", and they get prioritized.
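A sketch of that suggestion for the audio-forwarding thread (priorities only shift the odds, they do not prevent preemption):

#include <windows.h>

void prepareLatencySensitiveThread()
{
    // Raise this thread's priority within its priority class (short of REALTIME).
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    // FALSE leaves the OS's dynamic priority boosting enabled for this thread.
    SetThreadPriorityBoost(GetCurrentThread(), FALSE);
}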

Is it better to poll or wait?

I have seen a question on why "polling is bad". In terms of minimizing the amount of processor time used by one thread, would it be better to do a spin wait (i.e. poll for a required change in a while loop) or wait on a kernel object (e.g. a kernel event object in Windows)?
For context, assume that the code would be required to run on any type of processor, single core, hyperthreaded, multicore, etc. Also assume that a thread that would poll or wait can't continue until the polling result is satisfactory if it polled instead of waiting. Finally, the time between when a thread starts waiting (or polling) and when the condition is satisfied can potentially vary from a very short time to a long time.
Since the OS is likely to more efficiently "poll" in the case of "waiting", I don't want to see the "waiting just means someone else does the polling" argument, that's old news, and is not necessarily 100% accurate.
Provided the OS has reasonable implementations of these type of concurrency primitives, it's definitely better to wait on a kernel object.
Among other reasons, this lets the OS know not to schedule the thread in question for additional timeslices until the object being waited for is in the appropriate state. Otherwise, you have a thread which is constantly getting rescheduled, context-switched to, and then run for a time, only to find that the condition it cares about still hasn't changed.
You specifically asked about minimizing the processor time for a thread: in this example the thread blocking on a kernel object would use ZERO time; the polling thread would use all sorts of time.
Furthermore, the "someone else is polling" argument needn't be true. When a kernel object enters the appropriate state, the kernel can look to see at that instant which threads are waiting for that object...and then schedule one or more of them for execution. There's no need for the kernel (or anybody else) to poll anything in this case.
Waiting is the "nicer" way to behave. When you are waiting on a kernel object, your thread won't be granted any CPU time, as the scheduler knows there is no work ready. Your thread is only given CPU time when its wait condition is satisfied, which means you won't be hogging CPU resources needlessly.
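As a concrete sketch of waiting on a kernel event object in Win32 (the names here are illustrative only):

#include <windows.h>

HANDLE dataReady = CreateEvent(nullptr, FALSE, FALSE, nullptr);   // auto-reset event

// Consumer: descheduled, using zero CPU, until the event is signalled.
void consumer()
{
    WaitForSingleObject(dataReady, INFINITE);
    // ... process the data ...
}

// Producer: signalling wakes one waiter of the auto-reset event.
void producer()
{
    // ... make the data available ...
    SetEvent(dataReady);
}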
I think a point that hasn't been raised yet is that if your OS has a lot of work to do, blocking yields your thread to another process. If all processes use the blocking primitives where they should (such as kernel waits, file/network I/O, etc.), you're giving the kernel more information to choose which threads should run. As such, it will do more work in the same amount of time. If your application could be doing something useful while waiting for that file to open or that packet to arrive, then yielding will even help your own app.
Waiting does involve more resources and means an additional context switch. Indeed, some synchronization primitives like CLR monitors and Win32 critical sections use a two-phase locking protocol - some spin-waiting is done before actually doing a true wait.
I imagine doing the two-phase thing would be very difficult, and would involve lots of testing and research. So, unless you have the time and resources, stick to the Windows primitives... they already did the research for you.
There are only a few places, usually within the OS low-level internals (interrupt handlers/device drivers), where spin-waiting makes sense/is required. General-purpose applications are always better off waiting on synchronization primitives like mutexes/condition variables/semaphores.
I agree with Darksquid: if your OS has decent concurrency primitives, then you shouldn't need to poll. Polling usually comes into its own on real-time systems or restricted hardware that doesn't have an OS; there you need to poll, because you might not have the option to wait(), but also because it gives you fine-grained control over exactly how long you want to wait in a particular state, as opposed to being at the mercy of the scheduler.
Waiting (blocking) is almost always the best choice ("best" in the sense of making efficient use of processing resources and minimizing the impact to other code running on the same system). The main exceptions are:
When the expected polling duration is small (similar in magnitude to the cost of the blocking syscall).
Mostly in embedded systems, when the CPU is dedicated to performing a specific task and there is no benefit to having the CPU idle (e.g. some software routers built in the late '90s used this approach.)
Polling is generally not used within OS kernels to implement blocking system calls - instead, events (interrupts, timers, actions on mutexes) result in a blocked process or thread being made runnable.
There are four basic approaches one might take:
Use some OS waiting primitive to wait until the event occurs
Use some OS timer primitive to check at some defined rate whether the event has occurred yet
Repeatedly check whether the event has occurred, but use an OS primitive to yield a time slice for an arbitrary and unknown duration any time it hasn't.
Repeatedly check whether the event has occurred, without yielding the CPU if it hasn't.
When #1 is practical, it is often the best approach unless delaying one's response to the event might be beneficial. For example, if one is expecting to receive a large amount of serial port data over the course of several seconds, and if processing data 100ms after it is sent will be just as good as processing it instantly, periodic polling using one of the latter two approaches might be better than setting up a "data received" event.
Approach #3 is rather crude, but may in many cases be a good one. It will often waste more CPU time and resources than would approach #1, but it will in many cases be simpler to implement and the resource waste will in many cases be small enough not to matter.
Approach #2 is often more complicated than #3, but has the advantage of being able to handle many resources with a single timer and no dedicated thread.
Approach #4 is sometimes necessary in embedded systems, but is generally very bad unless one is directly polling hardware and the CPU won't have anything useful to do until the event in question occurs. In many circumstances, it won't be possible for the condition being waited upon to occur until the thread waiting for it yields the CPU. Yielding the CPU as in approach #3 will in fact allow the waiting thread to see the event sooner than it would by hogging the CPU.
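For completeness, approach #3 from the list above might be sketched like this (eventOccurred is a hypothetical flag set by another thread):

#include <windows.h>

volatile LONG eventOccurred = 0;   // hypothetical: set to 1 by the producing thread

void waitByYielding()
{
    // Approach #3: keep checking, but give up the time slice whenever the
    // condition has not been met yet, so the producer can make progress.
    while (InterlockedCompareExchange(&eventOccurred, 0, 0) == 0)
        SwitchToThread();
}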
