After delving into VBA benchmarking (see also), I'm not satisfied those answers go into sufficient detail. From a similar question about timing in Go, I see there is a difference between measuring absolute time and changes in time. For absolute time, a "wall clock" should be used, which can be synchronised between machines using the Network Time Protocol, for example.
Meanwhile, "monotonic clocks" should be used to measure differences in time, as these are not subject to leap seconds or (according to that linked Go answer) changes in the clock's frequency.
Have I got that right? Is there anything else to consider?
Assuming those definitions are correct, which category does each of these clocks belong to; in other words, which of these clocks will give me the most accurate measurement of changes in time:
VBA Time
VBA Timer
WinApi GetTickCount
WinApi GetSystemTimePreciseAsFileTime
WinApi QueryPerformanceFrequency + QueryPerformanceCounter
Or is it something else?
I may be overlooking other approaches. I say this because some languages like Java get time in nanoseconds rather than microseconds; how is this possible? Surely the Windows API will tap into the most accurate hardware timer available, which gives microsecond resolution. What's Java doing, and can I copy that?
PS, I have no idea how to tag this, please add as you think appropriate
QueryPerformanceFrequency and QueryPerformanceCounter are going to give you the highest-resolution timer. They basically wrap the rdtsc instruction, which counts elapsed CPU cycles.
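To make that concrete, here's a minimal C++ sketch of timing an interval with those two calls (the Sleep is just a stand-in for the work being measured); the same pair of functions can also be declared and called from VBA:

    #include <windows.h>
    #include <cstdio>

    int main() {
        LARGE_INTEGER freq, start, finish;
        QueryPerformanceFrequency(&freq);    // ticks per second, fixed at boot
        QueryPerformanceCounter(&start);

        Sleep(100);                          // placeholder for the work being timed

        QueryPerformanceCounter(&finish);
        double elapsedMs =
            1000.0 * (finish.QuadPart - start.QuadPart) / freq.QuadPart;
        printf("elapsed: %.3f ms\n", elapsedMs);
        return 0;
    }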
There are lots of questions (here, here, here) about mechanisms for getting monotonic time on Windows and their various gotchas and pitfalls. I'm particularly interested in the accuracy (not precision) of the main options.
I'm looking to measure elapsed time on a single machine, where the interval is on the order of multiple minutes to an hour. What I know so far:
QueryPerformanceCounter is great for short time intervals, but QPF can have error on the order of 500 PPM, which translates to an error of roughly 2 seconds over an hour.
More concerning is that even on fairly recent processors, folks are seeing QPC misbehavior.
Microsoft recommends QPC above all else for short-term duration measurements. But short-term isn't defined in any absolute numbers.
GetTickCount64 is often cited as a nice and reliable, less precise alternative to QPC.
I've not found any good details about the accuracy of GetTickCount64. While it is less precise than QPC, how does its accuracy compare? What kind of error might I expect over an hour?
Some programs play with its resolution by using timeBeginPeriod, although I don't think this affects accuracy?
The docs talk about how GetTickCount64's resolution is not affected by adjustments made by the GetSystemTimeAdjustment function. Hopefully this means GetTickCount64 is monotonic and not adjusted ever? It is unusual wording...
GetSystemTimePreciseAsFileTime is an option for same-machine time deltas if I disable automatic time adjustment via SetSystemTimeAdjustment. It is backed by QPC. Is there any benefit to using this over QPC directly? (Perhaps it does sanitization or thread affinity tricks to avoid some of the issues encountered by direct QPC calls?)
One SO Q&A I found linked to this blog post, which has been particularly useful to read. While it doesn't answer my question directly, it dives into how QPC works on Windows, and how the common Linux monotonic time source basically uses the same thing.
The gist is that both of them use rdtsc when an invariant TSC is available on modern hardware.
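For concreteness, this is roughly the experiment I have in mind: run QPC and GetTickCount64 side by side over a long interval and compare the two elapsed-time estimates. The sketch below is mine; the one-minute interval is just for illustration, the real run would be an hour.

    #include <windows.h>
    #include <cstdio>

    int main() {
        LARGE_INTEGER freq, qpc0, qpc1;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&qpc0);
        ULONGLONG tick0 = GetTickCount64();

        Sleep(60 * 1000);                    // one minute here, an hour for the real test

        QueryPerformanceCounter(&qpc1);
        ULONGLONG tick1 = GetTickCount64();

        double qpcMs  = 1000.0 * (qpc1.QuadPart - qpc0.QuadPart) / freq.QuadPart;
        double tickMs = (double)(tick1 - tick0);
        printf("QPC: %.3f ms  GetTickCount64: %.0f ms  difference: %.3f ms\n",
               qpcMs, tickMs, qpcMs - tickMs);
        return 0;
    }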
Most of my limited experience with profiling native code is on a GPU rather than on a CPU, but I see some CPU profiling in my future...
Now, I've just read this blog post:
How profilers lie: The case of gprof and KCacheGrind
about how what profilers measure and what they show you is likely not what you expect if you're interested in discerning between different call paths and the time spent in them.
My question is: Is this still the case today (5 years later)? That is, do sampling profilers (i.e. those who don't slow execution down terribly) still behave the way gprof used to (or callgrind without --separate-callers=N)? Or do profilers nowadays customarily record the entire call stack when sampling?
No, many modern sampling profilers don't exhibit the problem described regarding gprof.
In fact, even when that was written, the specific problem was actually more a quirk of the way gprof uses a mix of instrumentation and sampling and then tries to reconstruct a hypothetical call graph based on limited caller/callee information and combine that with the sampled timing information.
Modern sampling profilers, such as perf, VTune, and various language-specific profilers for languages that don't compile to native code, can capture the full call stack with each sample, which provides accurate times with respect to that issue. Alternatively, you might sample without collecting call stacks (which greatly reduces the sampling cost) and then present the information without any caller/callee breakdown, which would still be accurate.
This was largely true even in the past, so I think it's fair to say that sampling profilers never, as a group, really exhibited that problem.
Of course, there are still various ways in which profilers can lie. For example, getting results accurate to the instruction level is a very tricky problem, given modern CPUs with hundreds of instructions in flight at once, possibly across many functions, and complex performance models where instructions may have a very different in-context cost as compared to their nominal latency and throughput values. Even those tricky issues can be helped with "hardware assist", such as PEBS support on recent x86 chips and later related features, which help you pinpoint an instruction in a less biased way.
Regarding gprof, yes, it's still the case today. This is by design, to keep the profiling overhead small. From the up-to-date documentation:
Some of the figures in the call graph are estimates—for example, the
children time values and all the time figures in caller and subroutine
lines.
There is no direct information about these measurements in the profile
data itself. Instead, gprof estimates them by making an assumption
about your program that might or might not be true.
The assumption made is that the average time spent in each call to any
function foo is not correlated with who called foo. If foo used 5
seconds in all, and 2/5 of the calls to foo came from a, then foo
contributes 2 seconds to a’s children time, by assumption.
Regarding KCacheGrind, little has changed since the article was written. You can check out the change log and see that the latest version was published on April 5, 2013, and the changes it includes are unrelated to this issue. You can also refer to Josef Weidendorfer's comments under the article (Josef is the author of KCacheGrind).
If you noticed, I contributed several comments to the post you referenced. But it's not just that profilers give you bad information; it's that people fool themselves about what performance actually is.
What is your goal? Is it to A) find out how to make the program as fast as possible? Or is it to B) measure time taken by various functions, hoping that will lead to A? (Hint - it doesn't.) Here's a detailed list of the issues.
To illustrate: You could, for example, be calling a teeny innocent-looking little function somewhere that just happens to invoke nine yards of system code including reading a .dll to extract a string resource in order to internationalize it. This could be taking 50% of wall-clock time and therefore be on the stack 50% of wall-clock time. Would a "CPU-profiler" show it to you? No, because practically all of that 50% is doing I/O. Do you need many many stack samples to know to 3 decimal places exactly how much time it's taking? Of course not. If you only got 10 samples it would be on 5 of them, give or take. Once you know that teeny routine is a big problem, does that mean you're out of luck because somebody else wrote it? What if you knew what the string was that it was looking up? Does it really need to be internationalized, so much so that you're willing to pay a factor of two in slowness just for that? Do you see how useless measurements are when your real problem is to understand qualitatively what takes time?
I could go on and on with examples like this...
I'm trying to find a decent replacement on Windows for clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time.
GetTickCount is annoying because it's deliberately biased: when the system resumes from suspend or hibernation, it seems Windows works out how long it was suspended (from the motherboard clock) and then adds that to the tick count, to make it look like the count carried on while the system was powered down.
QueryUnbiasedInterruptTime is friendlier: it's a simple, seemingly-reliable counter that just returns the time the system has been running since boot.
Now, I've experimented with QueryUnbiasedInterruptTime to see what happens when the computer enters each of sleep, hybrid sleep, and hibernation. The value appears to be handled correctly so that it stays monotonic (i.e., the system stashes it before hibernation and restores it afterwards so that applications aren't confused).
But, there's a note here on the Python dev pages (which, by the way, are a very helpful resource for time function reference!):
QueryUnbiasedInterruptTime() is not monotonic.
I presume someone did a test to determine that. Is that information accurate? If so, out of interest I wonder what you have to do to get the return value to go backwards.
QueryUnbiasedInterruptTime is guaranteed to be monotonic. The interrupt time has no skew added to it, so it always ticks forward although the rate may sometimes differ a little from the wall clock.
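A minimal sketch of using it for a time delta (the values are 100-nanosecond units of time since boot, excluding periods spent suspended; the Sleep is just a stand-in for the interval being measured):

    #include <windows.h>
    #include <cstdio>

    int main() {
        ULONGLONG t0, t1;                    // 100-ns units since boot, suspend excluded
        QueryUnbiasedInterruptTime(&t0);

        Sleep(250);                          // placeholder for the interval of interest

        QueryUnbiasedInterruptTime(&t1);
        printf("elapsed: %.1f ms\n", (t1 - t0) / 10000.0);
        return 0;
    }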
I couldn't find any good references for this in MSDN, but I'm now convinced that Hans is right in his comment, the CPython developers must have simply got confused.
The main problem with GetTickCount is that the value is only 32 bits, i.e. it wraps back to zero after roughly 49.7 days of uptime.
You probably need the QueryPerformanceCounter API.
The timeGetTime function documentation says:
The default precision of the timeGetTime function can be five milliseconds or more, depending on the machine. You can use the timeBeginPeriod and timeEndPeriod functions to increase the precision of timeGetTime.
So the precision is system-dependent. But what if I don't want to increase the precision and just want to know what it is on the current system? Is there a standard way (e.g. an API) to get it? Or should I just poll timeGetTime for a while, look at what comes out, and deduce it from there?
I'd suggest using the GetSystemTimeAsFileTime function. This function has low overhead and reflects the system clock. See this answer for some more details about the granularity of time and the APIs for querying timer resolutions (e.g. NtQueryTimerResolution). Code to find out how the system file time increments can be found there too (a small sketch is also included below).
Windows 8 and Server 2012 provide the new GetSystemTimePreciseAsFileTime function, which is supposed to be more accurate. MSDN states it reports the time "with the highest possible level of precision (<1us)". However, this only works on Windows 8 and Server 2012, and there is very little documentation about how this additional accuracy is obtained. It seems Microsoft is going a Linux-like (gettimeofday) route, combining the performance counter frequency with the system clock.
This post may be of interest to you too.
Edit: As of February 2014 there is some more detailed information about time matters on MSDN: Acquiring high-resolution time stamps.
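To illustrate the earlier point about how the system file time increments, here's a small sketch of my own (not the code from the linked answer): busy-wait until GetSystemTimeAsFileTime reports a new value and print the step between two consecutive values.

    #include <windows.h>
    #include <cstdio>

    int main() {
        FILETIME ft;
        ULARGE_INTEGER prev, cur;

        GetSystemTimeAsFileTime(&ft);
        prev.LowPart  = ft.dwLowDateTime;
        prev.HighPart = ft.dwHighDateTime;

        // Spin until the reported value changes; the step is the update granularity.
        do {
            GetSystemTimeAsFileTime(&ft);
            cur.LowPart  = ft.dwLowDateTime;
            cur.HighPart = ft.dwHighDateTime;
        } while (cur.QuadPart == prev.QuadPart);

        // FILETIME counts 100-ns units, so divide by 10,000 for milliseconds.
        printf("system time increment: %.4f ms\n",
               (cur.QuadPart - prev.QuadPart) / 10000.0);
        return 0;
    }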
So I know that clock() measures clock cycles and thus isn't very good for measuring time, and I know there are functions like omp_get_wtime() for getting the wall time, but it's frustrating that the wall time varies so much, and I was wondering if there is some way to measure distinct clock cycles (counting a cycle only once even if more than one thread executed during it). It has to be something relatively simple/native. Thanks
Are you sure that taking time measurements will not work for you? Keep in mind you can only measure to so many milliseconds, depending upon the OS.
See FreeMemory's answer to this question for RDTSC if you're using x86; I've tested it and it seems to work fine on my system (a Mac), but see my answer to this question as well. Also see the criticism of RDTSC here.
It's not usually worth it to get down to too low a level of detail, though; other bits and pieces of work the computer needs to do will use up clock cycles, so results will vary depending on load. I find omp_get_wtime() sufficient, though I need to put my code in an extra loop to make sure it takes about a second, to ensure consistent results from run to run.
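For what it's worth, this is the shape of the wrap-it-in-a-loop approach I mean (the iteration count and the placeholder work are arbitrary; tune the count so the whole run takes about a second):

    #include <omp.h>
    #include <cstdio>

    int main() {
        const int iterations = 10000000;     // tune so the run takes ~1 second
        volatile double sink = 0.0;          // keeps the loop from being optimized away

        double start = omp_get_wtime();
        for (int i = 0; i < iterations; ++i) {
            sink += i * 0.5;                 // placeholder for the work being timed
        }
        double finish = omp_get_wtime();

        printf("total: %.6f s  per iteration: %.9f s\n",
               finish - start, (finish - start) / iterations);
        return 0;
    }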