MSG::time is later than timeGetTime - windows

After noticing some timing discrepancies with events in my code, I boiled the problem all the way down to my Windows message loop.
Basically, unless I'm doing something strange, I'm experiencing this behaviour:
MSG message;
while (PeekMessage(&message, _applicationWindow.Handle, 0, 0, PM_REMOVE))
{
    int timestamp = timeGetTime();
    bool strange = message.time > timestamp; //strange == true!!!
    TranslateMessage(&message);
    DispatchMessage(&message);
}
The only rational conclusion I can draw is that MSG::time uses a different timing mechanism than timeGetTime() and is therefore free to produce differing results.
Is this the case, or am I missing something fundamental?

Could this be a signed/unsigned issue? You are comparing a signed int (timestamp) to an unsigned DWORD (msg.time).
Also, the tick count wraps around roughly every 49.7 days - when that happens, strange could well be true.
As an aside, if you don't have a great reason to use timeGetTime, you can use GetTickCount here - it saves you bringing in winmm.
The code below shows how you should go about using times - you should never compare times directly, because clock wrap-around messes that up. Instead, always subtract the start time from the current time and look at the interval.
// This is roughly equivalent code; here strange should never be true.
// Note the cast to a signed type: an unsigned difference can never be
// negative, so the subtraction must be interpreted as signed.
DWORD timestamp = GetTickCount();
bool strange = ((int)(timestamp - msg.time) < 0);

I don't think it's advisable to expect or rely on any particular relationship between the absolute values of timestamps returned from different sources. For one thing, the multimedia timer may have a different resolution from the system timer. For another, the multimedia timer runs in a separate thread, so you may encounter synchronisation issues. (I don't know if each CPU maintains its own independent tick count.) Furthermore, if you are running any sort of time synchronisation service, it may be making its own adjustments to your local clock and affecting the timestamps you are seeing.

Are you by any chance running an AMD dual core? There is an issue where, because each core has a separate timer that can run at a different speed, the timers can diverge from each other. This can manifest itself as negative ping times, for example.
I had similar issues when measuring timeouts in different threads using GetTickCount().
Install this driver (IIRC) to resolve the issue.

MSG.time is based on GetTickCount(), and timeGetTime() uses the multimedia timer, which is completely independent of GetTickCount(). I would not be surprised to see that one timer has 'ticked' before the other.
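If you want to measure how long a message sat in the queue, the safe sketch is to take "now" from the same clock MSG::time is based on and use a signed difference (hwnd here stands in for your window handle):
MSG msg;
while (PeekMessage(&msg, hwnd, 0, 0, PM_REMOVE))
{
    // Same time source as msg.time; the signed difference handles wrap-around.
    LONG ageMs = (LONG)(GetTickCount() - msg.time);
    if (ageMs > 100) { /* message waited unusually long in the queue */ }
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}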

Related

glutMainLoop() vs glutTimerFunc()?

I know that glutMainLoop() is used to call display over and over again, maintaining a constant frame rate. At the same time, I can also have glutTimerFunc(), which calls glutPostRedisplay() at the end of its callback, so it can maintain a different frame rate.
When they are working together, what really happens? Does the timer function add to the frame rate of the main loop and make it faster? Or does it change the default refresh rate of the main loop? How do they work in conjunction?
I know that glutMainLoop() is used to call display over and over again, maintaining a constant frame rate.
Nope! That's not what glutMainLoop does. The purpose of glutMainLoop is to pull operating system events, check if timers have elapsed, see if windows have to be redrawn, and then call the respective callback functions registered by the user. This happens in a loop, and usually this loop is started from the main entry point of the program, hence the name "main loop".
When they are working together, what really happens? Does the timer function add to the frame rate of the main loop and make it faster? Or does it change the default refresh rate of the main loop? How do they work in conjunction?
As already mentioned, dispatching timers is part of glutMainLoop's responsibility, so you can't have GLUT timers without it. More importantly, if no events happened, no redisplay was posted, and no idle function is registered, glutMainLoop will "block" the program until something interesting happens (i.e. no CPU cycles are consumed).
Essentially it goes like this:
void glutMainLoop(void)
{
    for(;;) {
        /* ... */
        foreach(t in timers) {
            if( t.elapsed() ) {
                t.callback(…);
                continue;
            }
        }
        /* ... */
        if( display.posted ) {
            display.callback();
            display.posted = false;
            continue;
        }
        idle.callback();
    }
}
At the same time, I can also have glutTimerFunc(), which calls glutPostRedisplay() at the end of its callback, so it can maintain a different frame rate.
The timers provided by GLUT make no guarantees about their precision and jitter. Hence they're not particularly well suited for framerate limiting.
Normally the frame rate is limited by v-sync (or it should be), but blocking on v-sync means you cannot use that time to do something useful, because the process is blocked. A better approach is to register an idle function, in which you poll a high-resolution timer (on POSIX-compliant systems clock_gettime(CLOCK_MONOTONIC, …), on Windows QueryPerformanceCounter) and perform a glutPostRedisplay once one display refresh interval, minus the time required for rendering the frame, has elapsed.
Of course it's hard to predict exactly how long rendering will take, so the usual approach is to keep a sliding-window average and deviation and adjust with that. You also want to align that timer with v-sync.
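A minimal sketch of that idle-function approach, using std::chrono::steady_clock as a portable stand-in for the platform timers just named, and assuming a fixed 60 Hz refresh:
#include <chrono>
#include <GL/glut.h>

static std::chrono::steady_clock::time_point last_post =
    std::chrono::steady_clock::now();

void idle(void)
{
    using namespace std::chrono;
    // Post a redisplay once per refresh interval; otherwise return at once,
    // so glutMainLoop keeps calling us instead of blocking.
    if (steady_clock::now() - last_post >= duration<double>(1.0 / 60.0)) {
        last_post = steady_clock::now();
        glutPostRedisplay();
    }
}
// Register with: glutIdleFunc(idle); after creating the window.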
This is of course a solved problem (at least in electrical engineering) which can be addressed by a Phase Locked Loop. Essentially you have a "phase comparator" (i.e. something that compares if your timer runs slower or faster than something you want synchronize to), a "charge pump" (a variable you add to or subtract from the delta from the phase comparator), a "loop filter" (sliding window average) and an "oscillator" (a timer) controlled by the loop filtered value in the charge pump.
So you poll the status of the v-sync (not possible with GLUT functions, and not even possible with core OpenGL or even some of the swap control extensions – you'll have to use OS-specific functions for that) and check whether your timer lags behind or runs fast compared to it. You add that delta to the "charge pump", filter it, and feed the result back into the timer. The nice thing about this approach is that it will automatically adjust to, and filter out, the time spent rendering frames as well.
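Purely as an illustration of that structure (not a tuned controller), here is a sketch with a hypothetical get_last_vsync_time() standing in for the OS-specific v-sync query, and placeholder gains that a real implementation would have to tune:
// Phase comparator -> charge pump -> loop filter -> controlled oscillator.
double get_last_vsync_time();   // hypothetical OS-specific query, not a real API

double charge_pump = 0.0;   // accumulates the comparator output
double avg_error   = 0.0;   // loop filter: exponential sliding average

double next_timer_period(double now, double nominal_period)
{
    double phase_error = now - get_last_vsync_time();  // phase comparator
    charge_pump += 0.1 * phase_error;                  // charge pump (placeholder gain)
    avg_error   += 0.05 * (charge_pump - avg_error);   // loop filter
    return nominal_period - avg_error;                 // nudge the oscillator
}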
From the glutMainLoop doc pages:
glutMainLoop enters the GLUT event processing loop. This routine should be called at most once in a GLUT program. Once called, this routine will never return. It will call as necessary any callbacks that have been registered. (emphasis mine)
That means that the idea of glutMainLoop is just to process events, calling whatever callbacks are installed. Indeed, I do not believe it keeps calling display over and over, but only when an event requests a redisplay.
This is where glutTimerFunc() comes into play. It registers a timer event callback to be called by glutMainLoop when the timer event is triggered. Note that this is one of several possible event callbacks that can be registered, which explains why the docs use the expression at least:
(...) glutTimerFunc registers the timer callback func to be triggered in at least msecs milliseconds. (...)
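For concreteness, a minimal sketch of the usual pattern - since a GLUT timer fires only once, the callback re-registers itself (the 16 ms value approximates 60 Hz, subject to the "at least" caveat above):
#include <GL/glut.h>

void on_timer(int value)
{
    glutPostRedisplay();            // ask the main loop to call display()
    glutTimerFunc(16, on_timer, 0); // re-arm for the next tick
}

// In main(), after glutCreateWindow(...):
//   glutTimerFunc(16, on_timer, 0);
//   glutMainLoop();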

determine the time taken for my code to execute in MFC

I need a way to find out the amount of time taken by a function, or a section of my code inside a function, to execute.
Does Visual Studio provide any mechanism for doing this, or is it possible to do so from the program using MFC functions? I am new to MFC, so I am not sure how this can be done. I thought this would be a pretty straightforward operation, but I cannot find any examples of how it may be done either.
A quick way, but quite imprecise, is by using GetTickCount():
DWORD time1 = GetTickCount();
// Code to profile
DWORD time2 = GetTickCount();
DWORD timeElapsed = time2-time1;
The problem is GetTickCount() uses the system timer, which has a typical resolution of 10 - 15 ms. So it is only useful with long computations.
It can't tell the difference between a function that takes 2 ms to run and one that takes 9 ms. But if you are in the seconds range, it may well be enough.
If you need more resolution, you can use the performance counter, as RedEye explains.
Or you can try a profiler (maybe this is what you were looking for?). See this question.
There may be better ways, but I do it like this :
// At the start of the function
LARGE_INTEGER lStart;
QueryPerformanceCounter(&lStart);
LARGE_INTEGER lFreq;
QueryPerformanceFrequency(&lFreq);

// At the end of the function
LARGE_INTEGER lEnd;
QueryPerformanceCounter(&lEnd);
// Use QuadPart rather than LowPart so the subtraction and the
// multiplication by 1000 don't wrap or overflow.
TRACE("FunctionName t = %lldms\n", (1000 * (lEnd.QuadPart - lStart.QuadPart)) / lFreq.QuadPart);
I use this method quite a lot for optimising graphics code, finding time taken for screen updates etc. There are other methods of doing the same or similar, but this is quick and simple.
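If you do this in many places, one convenience (my own sketch, not an MFC facility) is to wrap the same pattern in a small RAII class, so a scope times itself:
class ScopedTimer
{
public:
    explicit ScopedTimer(LPCTSTR name) : m_name(name)
    {
        QueryPerformanceFrequency(&m_freq);
        QueryPerformanceCounter(&m_start);
    }
    ~ScopedTimer()
    {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        // Reports elapsed milliseconds when the scope exits.
        TRACE(_T("%s t = %lldms\n"), m_name,
              (1000 * (end.QuadPart - m_start.QuadPart)) / m_freq.QuadPart);
    }
private:
    LPCTSTR       m_name;
    LARGE_INTEGER m_freq, m_start;
};

// Usage: { ScopedTimer t(_T("UpdateScreen")); /* code to profile */ }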

How to check which index a loop is executing without slowing down the process?

What is the best way to check which index a loop is currently executing, without slowing the process down too much?
For example, I want to find all long fancy numbers and have a loop like
for( long i = 1; i > 0; i++){
    //block
}
and I want to see which i is executing in real time.
Several ways I know of are printing i in the block on every iteration, checking if(i % 10000 == 0) and only reporting then, or adding a listener.
Which of these ways is the fastest? Or what do you do in similar cases? Is there any way to access the value of i manually?
Most of my recent experience is with Java, so I'd write something like this:
import java.util.concurrent.atomic.AtomicLong;

public class Example {
    public static void main(String[] args) {
        AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
        LoopMonitor lm = new LoopMonitor(atomicLong);
        Thread t = new Thread(lm);
        t.start(); // start LoopMonitor
        while(atomicLong.get() > 0) {
            long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
            //block
        }
    }

    private static class LoopMonitor implements Runnable {
        private final AtomicLong atomicLong;
        public LoopMonitor(AtomicLong atomicLong) {
            this.atomicLong = atomicLong;
        }
        public void run() {
            while(true) {
                try {
                    System.out.println(atomicLong.longValue()); // print the counter's current value
                    Thread.sleep(1000); // sleep for one second
                } catch (InterruptedException ex) {}
            }
        }
    }
}
Most AtomicLong implementations can be read and written atomically even on 32-bit platforms, which is why I used one here instead of a primitive long (you don't want to inadvertently print a half-written long); look into your compiler / platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified loop doesn't take much of an efficiency hit - you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory from another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might probe the value once a minute, sleep()ing in between, allowing the first thread to run uninterrupted.
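A sketch of that idea in portable C++11 atomics rather than pthreads (the once-a-second reporting interval is arbitrary):
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<long> counter{1};   // the loop variable, shared with the monitor

int main()
{
    std::thread monitor([] {
        for (;;) {
            std::printf("i = %ld\n", counter.load(std::memory_order_relaxed));
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });
    monitor.detach();   // runs alongside the loop below

    for (long i = 1; i > 0; ++i) {                    // mirrors the loop from the question
        counter.store(i, std::memory_order_relaxed);  // cheap publish: no lock
        // block
    }
}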
To get reporting at essentially zero cost (on multi-CPU machines): make your index a "global" property (class-wide, for instance), and have a separate thread read and report the index value.
The report could be timer-based (five times per second or so).
Note: you may also need a boolean stating "are we in the loop?".
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to use a register for the loop variable. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
Real real time constraint?
So that begs the question of just how real-time your loop actually is. I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles, for two reasons: a) no modern OS will ever come close to guaranteeing that - you'd have to be running on the bare metal; b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real-time interval.
Feature rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
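Sketched in C++11 terms (a mutex standing in for the raw OS semaphore; the structure's fields and the 1000-iteration update interval are illustrative):
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

struct Progress { long index = 0; double partial_result = 0.0; };

std::mutex mtx;
Progress   shared;

void monitor()
{
    for (;;) {
        Progress snapshot;
        {
            std::lock_guard<std::mutex> lock(mtx); // hold the lock briefly:
            snapshot = shared;                     // copy, then release
        }
        std::printf("i = %ld, partial = %f\n",
                    snapshot.index, snapshot.partial_result);
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}

int main()
{
    std::thread t(monitor);
    t.detach();
    for (long i = 1; i > 0; ++i) {
        // block ...
        if (i % 1000 == 0) {                   // update the shared state infrequently
            std::lock_guard<std::mutex> lock(mtx);
            shared.index = i;
            shared.partial_result = i * 0.5;   // stand-in for real loop output
        }
    }
}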

Atomic operations in ARM

I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture, even after referring to the ARM ARM and the Linux source.
Atomic operations.
The ARM ARM says that Load and Store instructions are atomic and that their execution is guaranteed to complete before the interrupt handler executes. Verified by looking at arch/arm/include/asm/atomic.h:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using the CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc.), which use LDREX and STREX for ARMv7 (my target).
The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and the STREX. The thing it does mention is locking the memory bus, which I guess only helps MP systems, where more CPUs can try to access the same location at the same time. But for UP (and possibly MP), if a timer interrupt (or IPI for SMP) fires in this small window between LDREX and STREX, the exception handler executes, possibly changes the CPU context, and returns to a new task - and here is the shocking part: it executes CLREX, hence removing any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?
I did read something about an exclusive lock monitor, so I have a possible theory: when the thread resumes and executes the STREX, the OS monitor causes the call to fail, which can be detected, and the loop can be re-executed using the new value in the process (branch back to LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value requires significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // Failure
    } while(__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and the __STREXW].
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, I got the answer from ARM's website:
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

How can I get a pulse in win32 Assembler (specifically nasm)?

I'm planning on making a clock. An actual clock, not something for Windows. However, I would like to be able to write most of the code now. I'll be using a PIC16F628A to drive the clock, and it has a timer I can access (actually, it has 3, in addition to the clock it has built in). Windows, however, does not appear to have this function. Which makes making a clock a bit hard, since I need to know how long it's been so I can update the current time. So I need to know how I can get a pulse (1Hz, 1KHz, doesn't really matter as long as I know how fast it is) in Windows.
There are many timer objects available in Windows. Probably the easiest to use for your purposes would be the Multimedia Timer, but that's been deprecated. It would still work, but Microsoft recommends using one of the new timer types.
I'd recommend using a threadpool timer if you know your application will be running under Windows Vista, Server 2008, or later. If you have to support Windows XP, use a Timer Queue timer.
There's a lot to those APIs, but general use is pretty simple. I showed how to use them (in C#) in my article Using the Windows Timer Queue API. The code is mostly API calls, so I figure you won't have trouble understanding and converting it.
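As a hedged illustration, here is a minimal C-style sketch of a Timer Queue timer producing a 100 ms pulse; the callback name and period are my own choices:
#include <windows.h>
#include <stdio.h>

VOID CALLBACK tick(PVOID param, BOOLEAN timerOrWaitFired)
{
    (void)param; (void)timerOrWaitFired;
    printf("pulse\n");   // your once-per-tick work goes here
}

int main(void)
{
    HANDLE timer = NULL;
    // Use the default timer queue: first fire after 100ms, then every 100ms.
    if (!CreateTimerQueueTimer(&timer, NULL, tick, NULL, 100, 100, WT_EXECUTEDEFAULT))
        return 1;
    Sleep(2000);   // let it pulse for a couple of seconds
    // Blocks until any in-flight callback completes before cleanup.
    DeleteTimerQueueTimer(NULL, timer, INVALID_HANDLE_VALUE);
    return 0;
}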
The LARGE_INTEGER is just an 8-byte block of memory that's split into a high part and a low part. In assembly, you can define it as:
MyLargeInt equ $
MyLargeIntLow dd 0
MyLargeIntHigh dd 0
If you're looking to learn ASM, just do a Google search for [x86 assembly language tutorial]. That'll get you a whole lot of good information.
You could use a waitable timer object. Since Windows is not a real-time OS, you'll need to make sure you set the period long enough that you won't miss pulses. A tenth of a second should be safe most of the time.
Additional:
The const LARGE_INTEGER you need to pass to SetWaitableTimer is easy to define in NASM - it's just an eight-byte constant. Note that this argument is the due time, expressed in 100-nanosecond units, with a negative value meaning "relative to now"; the repeat period is a separate argument, given in milliseconds:
duetime: dq -1000000 ; first fire after 100ms (100ns units; negative = relative)
Pass the address of duetime as the second argument to SetWaitableTimer, and 100 (milliseconds) as the period argument, to get ten pulses a second.
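If it helps to see the whole call sequence, here is a minimal C-style sketch of the same approach (the NASM version makes exactly these API calls; the 100 ms figures match the constant above):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL); // auto-reset timer
    LARGE_INTEGER due;
    due.QuadPart = -1000000LL;  // first fire in 100ms: 100ns units, negative = relative
    SetWaitableTimer(timer, &due, 100 /* period in ms */, NULL, NULL, FALSE);
    for (;;) {
        WaitForSingleObject(timer, INFINITE);
        printf("pulse\n");      // update your clock here
    }
}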
