Faster Than 1 ms But Slower Than a Loop

I need to execute some lines of code multiple times (say, 300 times) in rapid succession, each time incrementing some variable and then using it to do a task (let's assume it's a task that takes negligible time to complete).
I tried doing it with a timer set to 1 ms, but it runs too slowly. I then tried a While loop, but that was much too fast. I could use Threading.Sleep, but I really hate using that, not to mention it can only sleep as short as 1 ms anyway. I also thought of using Environment.TickCount, but I believe that counts in milliseconds as well.
While this program isn't important to me, it got me wondering if such a thing was possible. A loop that could run with "faster than 1 ms intervals," but slower than "as fast as the program can execute it."

One thing that comes to mind is calculating the wait for every iteration yourself with a high-precision clock such as Java's System.nanoTime().
The call itself is relatively costly, though, and will not let you do nanosecond-precision waiting. But for waits shorter than 1 ms and longer than, say, 1 ns, this might help.
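The question is about .NET, but the idea is language-agnostic: read a high-resolution monotonic clock and busy-wait until the next deadline. A minimal sketch in C++, where std::chrono::steady_clock plays the role of System.nanoTime() and work() is a hypothetical stand-in for the per-iteration task:

#include <chrono>

// Hypothetical per-iteration task; stands in for "increment a variable and use it".
static void work(int /*i*/) { /* negligible work */ }

void run_paced(int iterations, std::chrono::microseconds interval)
{
    using clock = std::chrono::steady_clock;
    auto next = clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        work(i);
        next += interval;               // e.g. 100 us: faster than 1 ms, slower than free-running
        while (clock::now() < next)     // busy-wait; burns one core but gives sub-millisecond pacing
            ;
    }
}

int main()
{
    run_paced(300, std::chrono::microseconds(100));   // 300 iterations, roughly 100 us apart
}

The cost of this approach is that the CPU spins between iterations instead of yielding, which is exactly the trade-off the question is asking about.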

Related

Why does the execution time of Goroutines differ significantly?

I'm just measuring the execution time of a set of goroutines. That means:
I start measuring, then start 20 goroutines and stop measuring as soon as they finish. I repeat that process 4 times and then compare the 4 execution times.
Sometimes, these execution times differ significantly:
1st run of the 20 goroutines: 1.2 ms
2nd run of the 20 goroutines: 1.9 ms
3rd run of the 20 goroutines: 1.4 ms
4th run of the 20 goroutines: 17.0 ms!
Why does it sometimes differ so significantly? Is there any way to avoid it?
Why does it sometimes differ so significantly?
Execution time will always be unpredictable to some extent, as mentioned in the comments to your question (CPU load, disk, memory, etc.).
Is there any way to avoid it?
There is a way to make your measurements more useful. Go has a built-in benchmark tool (here is a guide on how to use it properly). This tool runs your code just enough times to determine a somewhat deterministic execution time.
In addition to showing average execution time for your code, it can also show useful memory information.
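Go's built-in benchmark tool is the right answer here. Purely as an illustration of the principle such tools automate (not of Go's testing API), here is a rough sketch in C++ of the same idea: keep doubling the iteration count until the measured window is long enough that per-run noise averages out, then report time per iteration.

#include <chrono>
#include <cstdio>

// Repeat a workload until the measured window is at least 100 ms, so that
// scheduler noise averages out, then report nanoseconds per iteration.
template <typename F>
double benchmark_ns(F f)
{
    using clock = std::chrono::steady_clock;
    long long iters = 1;
    for (;;)
    {
        auto start = clock::now();
        for (long long i = 0; i < iters; ++i)
            f();
        auto elapsed = clock::now() - start;
        if (elapsed >= std::chrono::milliseconds(100))
            return std::chrono::duration<double, std::nano>(elapsed).count() / (double)iters;
        iters *= 2;   // window too short to trust; double the work and try again
    }
}

int main()
{
    volatile long long sink = 0;
    printf("%.1f ns/op\n", benchmark_ns([&] { sink = sink + 1; }));
}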

Avoiding cPickle in Ipython's parallel

I have some code that I have parallelized successfully, in the sense that it gets an answer, but it is still kind of slow. Using cProfile.run(), I found that 121 seconds (57% of the total time) were spent in cPickle.dumps, despite a per-call time of 0.003 s. I don't use this function anywhere else, so it must be occurring due to IPython's parallel machinery.
The way my code works is: it does some serial work, then runs many simulations in parallel; then some more serial work, then another batch of simulations in parallel. It has to repeat this many, many times. Each simulation requires a very large dictionary that I pull in from a module I wrote. I believe this is what is getting pickled many times and slowing the program down.
Is there a way to push a large dictionary to the engines in such a way that it stays there permanently? I think it's getting physically pushed every time I call the parallel function.

How precise is WinAPI's Sleep()?

And I don't mean precision to one millisecond; I'm asking about a situation where I want a delay of an hour using Sleep(60 * 60 * 1000). Will it be an hour, and not something like 55 or 70 minutes? Is the thread guaranteed to wake up and not sleep forever?
Over an hour, the accuracy of a Sleep() call is not that bad (it's fairly easy to test as well). A Sleep() call will return sufficiently close to the hour that it is not possible to determine any error with a manual stopwatch (tried it on XP - no reason for it to be any different now, AFAIK).
Errors with respect to wall time will, of course, accumulate if consecutive calls to Sleep(3600*1000) are made, especially if the operations performed at the end of each interval are themselves lengthy and/or the box is seriously overloaded (i.e. many more ready threads than cores).
Why would the thread sleep forever if you ask it to sleep for an hour? If you call Sleep(3600*1000), it will become ready after that time. If it does not, the OS is stuffed anyway and you're on your way to a reboot.
The reason why such a Sleep() call might be preferred over a timer is that it's a one-liner and will work anywhere on the caller's stack - no need for a message handler and/or state machine to handle a timer callback.
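As a rough sketch of the "fairly easy to test" part, one can bracket the Sleep() call with QueryPerformanceCounter readings and print the drift; nothing is assumed beyond the Win32 calls shown:

#include <windows.h>
#include <cstdio>

// Measure how far a one-hour Sleep() drifts from the requested interval.
int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    Sleep(3600 * 1000);                                  // request one hour
    QueryPerformanceCounter(&t1);
    double actual = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    printf("asked for 3600 s, slept %.3f s (drift %+.3f s)\n", actual, actual - 3600.0);
    return 0;
}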

Why does the sleep time of Sleep(1) seem to be variable in Windows?

Last week I needed to test some different algorithmic functions, and to make it easy on myself I added some artificial sleeps and simply measured the wall-clock time. Something like this:
clock_t start = clock();
for (int i = 0; i < 10000; ++i)
{
    ...
    Sleep(1);   // expected to add roughly 1 ms per iteration
    ...
}
clock_t end = clock();
Since the argument of Sleep is expressed in milliseconds, I expected a total wall-clock time of about 10 seconds (a bit higher because of the algorithms, but that's not important now), and that was indeed my result.
This morning I had to reboot my PC because of new Microsoft Windows hot fixes and to my surprise Sleep(1) didn't take 1 millisecond anymore, but about 0.0156 seconds.
So my test results were completely screwed up, since the total time grew from 10 seconds to about 156 seconds.
We tested this on several PCs, and apparently on some PCs the result of one Sleep was indeed 1 ms. On other PCs it was 0.0156 seconds.
Then, suddenly, after a while, the time of Sleep dropped to 0.01 second, and then an hour later back to 0.001 second (1 ms).
Is this normal behavior in Windows?
Is Windows 'sleepy' for the first hours after a reboot and then gradually switches to a finer sleep granularity after a while?
Or are there any other aspects that might explain the change in behavior?
In all my tests no other application was running at the same time (or: at least not taking any CPU).
Any ideas?
OS is Windows 7.
I've not heard about the resolution jumping around like that on its own, but in general the resolution of Sleep follows the clock tick of the task scheduler. So by default it's usually 10 or 15 ms, depending on the edition of Windows. You can set it manually to 1 ms by calling timeBeginPeriod(1).
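A minimal sketch of that effect (assuming winmm.lib is linked for timeBeginPeriod/timeEndPeriod): time a batch of Sleep(1) calls before and after requesting a 1 ms timer period. With the default 10-15 ms tick the first batch takes roughly 10-15 s; the second should come out close to 1-2 s.

#include <windows.h>
#include <cstdio>
#pragma comment(lib, "winmm.lib")   // timeBeginPeriod / timeEndPeriod

// Time n Sleep(1) calls and return the elapsed seconds.
static double time_sleeps(int n)
{
    DWORD start = GetTickCount();
    for (int i = 0; i < n; ++i)
        Sleep(1);
    return (GetTickCount() - start) / 1000.0;
}

int main()
{
    printf("default tick: %.2f s\n", time_sleeps(1000));
    timeBeginPeriod(1);                  // request 1 ms scheduler granularity
    printf("1 ms tick:    %.2f s\n", time_sleeps(1000));
    timeEndPeriod(1);                    // always undo the request when done
    return 0;
}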
I'd guess it's the scheduler. Each OS has a certain amount of granularity. If you ask it to do something lower than that, the results aren't perfect. By asking to sleep 1ms (especially very often) the scheduler may decide you're not important and have you sleep longer, or your sleeps may run up against the end of your time slice.
The sleep call is an advisory call. It tells the OS you want to sleep for an amount of time X. It can be less than X (due to a signal or something else), or it can be more (as you are seeing).
Another Stack Overflow question has a way to do it, but you have to use winsock.
When you call Sleep, the scheduler stops that thread until it can resume at a time >= the requested sleep time. Sometimes, due to thread priority (which in some cases causes Sleep(0) to make your program hang indefinitely), your program may resume later because processor cycles were allocated to another thread's work (OS threads in particular have higher priority).
I just wrote some words about the sleep() function in the thread Sleep Less Than One Millisecond. The characteristics of the sleep() function depend on the underlying hardware and on the setting of the multimedia timer interface.
A Windows update may change the behavior; i.e. Windows 7 does treat things differently compared to Vista. See my comment in that thread and its links to learn more about the sleep() function.
Most likely the sleep timer does not have enough resolution.
What kind of resolution do you get when you call the timeGetDevCaps function as explained in the documentation of the Sleep function?
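For reference, querying the supported range takes only a few lines (again assuming winmm.lib is linked); wPeriodMin is the finest period that timeBeginPeriod can request:

#include <windows.h>
#include <cstdio>
#pragma comment(lib, "winmm.lib")

// Print the range of timer periods the machine supports.
int main()
{
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == TIMERR_NOERROR)
        printf("timer period: min %u ms, max %u ms\n", tc.wPeriodMin, tc.wPeriodMax);
    return 0;
}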
Windows' Sleep granularity is normally 16 ms; you get this unless your program or some other program changes it. When you got 1 ms granularity on some days and 16 ms on others, some other program had probably changed the timer period (which affects your program as well). I think LabVIEW, for example, does this.

How do I obtain CPU cycle count in Win32?

In Win32, is there any way to get a unique CPU cycle count or something similar that would be uniform across multiple processes/languages/systems/etc.?
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to Greg, who put me on the track to QueryPerformanceCounter, which did the trick.
Here's an interesting article! It says not to use RDTSC, but to instead use QueryPerformanceCounter.
Conclusion:
Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operating systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]
However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.
Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).
For higher-precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).
For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code above.
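The code the article refers to isn't reproduced here, but a sketch of the hybrid scheme it describes might look like the following (assuming timeBeginPeriod(1) has already been called at program start; the 2 ms threshold is an arbitrary safety margin):

#include <windows.h>

// Hybrid delay: Sleep(1) while plenty of time remains (cheap, ~1 ms accurate),
// then spin on QueryPerformanceCounter for the final stretch.
void precise_delay_us(long long microseconds)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    const long long target = start.QuadPart + (microseconds * freq.QuadPart) / 1000000;

    for (;;)
    {
        QueryPerformanceCounter(&now);
        long long remaining_us = ((target - now.QuadPart) * 1000000) / freq.QuadPart;
        if (remaining_us <= 0)
            break;                  // deadline reached
        if (remaining_us > 2000)
            Sleep(1);               // coarse wait, almost no CPU used
        // otherwise fall through and keep polling the counter (busy-wait)
    }
}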
You can use the RDTSC CPU instruction (assuming x86). This instruction gives the CPU cycle counter, but be aware that it will increase very quickly to its maximum value, and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.
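For the log-merging use case in the question, one hedged sketch is to stamp every record with a QueryPerformanceCounter reading converted to microseconds; the counter is monotonic and, as far as I know, consistent across processes on the same machine, so sorting merged records by this value preserves the real call order:

#include <windows.h>

// Monotonic timestamp in microseconds for tagging log records.
long long log_timestamp_us()
{
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);   // fixed after boot; cheap enough to re-query
    QueryPerformanceCounter(&now);
    long long secs = now.QuadPart / freq.QuadPart;
    long long rem  = now.QuadPart % freq.QuadPart;
    return secs * 1000000LL + (rem * 1000000LL) / freq.QuadPart;
}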
System.Diagnostics.Stopwatch.GetTimestamp() returns the number of CPU cycles since a time origin (maybe when the computer starts, but I'm not sure), and I've never seen it fail to increase between two calls.
The CPU cycles will be specific to each computer, so you can't use them to merge log files between two computers.
RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.
Use GetTickCount and add another counter as you merge the log files. That won't give you a perfect sequence across the different log files, but it will at least keep all entries from each file in the correct order.
