I can't find any clear indication of if/when the 64-bit value returned by QueryPerformanceCounter() gets reset, or overflows and resets back to zero. Hopefully it never overflows because the 64 bits gives space for decades worth of counting at gigahertz rates. However... is there anything other than a computer restart that will reset it?
Empirically, QPC is reset at system startup.
Note that you should not depend on this behavior, since Microsoft does not explicitly state what the "zero point" is for QPC, merely that it is a monotonically increasing value (mod 2^64) that can be used for high-precision timing.
Hence they are quite within their rights to modify its behavior at any time. They could, for example, make it return values that match the FILETIME values that would be produced by a call to GetSystemTimeAsFileTime(), with the same resolution (a 100 ns tick rate). Under those circumstances it would never reset. At least not in your or my lifetimes.
That said, the following program, when run on Windows 10 [Version 6.3.16299], produces pairs of identical values that are the system uptime in seconds.
#include <windows.h>
#include <mmsystem.h>   // timeGetTime()
#include <iostream>

#pragma comment(lib, "winmm.lib")   // timeGetTime() requires linking winmm

int main()
{
    LARGE_INTEGER performanceCount;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);
    for (;;)
    {
        QueryPerformanceCounter(&performanceCount);
        DWORD const systemTicks = timeGetTime();   // milliseconds since boot
        DWORD const systemSeconds = systemTicks / 1000;
        __int64 const performanceSeconds = performanceCount.QuadPart / performanceFrequency.QuadPart;
        std::cout << systemSeconds << " " << performanceSeconds << std::endl;
        Sleep(1000);
    }
    return 0;
}
Standard disclaimers apply, your actual mileage may vary, etc. etc. etc.
It seems that some Windows guests running inside VirtualBox may reset QueryPerformanceCounter every 20 minutes or so: see here.
QPC has become more reliable over successive Windows versions, but for better portability a lower-precision timer such as GetTickCount64 should be used.
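For reference, a minimal sketch of reading the uptime with GetTickCount64 (available since Windows Vista):

#include <windows.h>
#include <iostream>

int main()
{
    // GetTickCount64() returns the number of milliseconds since boot as a
    // ULONGLONG, so unlike GetTickCount() it does not wrap after 49.7 days.
    ULONGLONG const uptimeMs = GetTickCount64();
    std::cout << "Uptime: " << uptimeMs / 1000 << " seconds\n";
    return 0;
}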
I'm involved in an effort to integrate CUDA into some existing software. The software I'm integrating into is pseudo real-time, so it has a memory manager library that manually hands out pointers from a single large memory allocation made up front. CUDA's Unified Memory is attractive to us, since in theory we'd be able to change this large memory chunk to Unified Memory, have the existing CPU code still work, and add GPU kernels with very few changes to the existing data I/O stream.
Parts of our existing CPU processing code require memory to be aligned to a specific boundary. cudaMallocManaged() does not allow me to specify the alignment for the memory, and I feel like having to copy between "managed" and strict CPU buffers for these CPU sections almost defeats the purpose of UM. Is there a known way to address this issue that I'm missing?
I found this link on Stack Overflow that seems to solve it in theory, but I've been unable to produce good results with this method. Using CUDA 9.1, Tesla M40 (24GB):
#include <stdio.h>
#include <malloc.h>
#include <cuda_runtime.h>   // runtime API: cudaMallocManaged, cudaHostRegister, ...

#define USE_HOST_REGISTER 1

int main (int argc, char **argv)
{
    int num_float = 10;
    int num_bytes = num_float * sizeof(float);
    float *f_data = NULL;

#if (USE_HOST_REGISTER > 0)
    printf(
        "%s: Using memalign + cudaHostRegister..\n",
        argv[0]);
    f_data = (float *) memalign(32, num_bytes);
    cudaHostRegister(
        (void *) f_data,
        num_bytes,
        cudaHostRegisterDefault);
#else
    printf(
        "%s: Using cudaMallocManaged..\n",
        argv[0]);
    cudaMallocManaged(
        (void **) &f_data,
        num_bytes);
#endif

    struct cudaPointerAttributes att;
    cudaPointerGetAttributes(
        &att,
        f_data);
    printf(
        "%s: ptr is managed: %i\n",
        argv[0],
        att.isManaged);   // CUDA 9.x attribute field: 1 for managed memory
    fflush(stdout);

    return 0;
}
When using memalign() + cudaHostRegister() (USE_HOST_REGISTER == 1), the last print statement prints 0. Device accesses via kernel launches in larger files unsurprisingly report illegal accesses.
When using cudaMallocManaged() (USE_HOST_REGISTER == 0), the last print statement prints 1 as expected.
Edit: cudaHostRegister() and cudaMallocManaged() do return successful error codes for me. I left the error checking out of the sample I shared, but I did check them during my initial integration work. I've just added the checking code, and both still return CUDA_SUCCESS.
Thanks for your insights and suggestions.
There is no method currently available in CUDA to take an existing host memory allocation and convert it into a managed memory allocation.
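If the hard requirement is really the alignment rather than reusing the existing host block, one possible workaround is to go the other way: over-allocate with cudaMallocManaged() and align the pointer inside that managed block. A minimal sketch (mallocManagedAligned() is a made-up helper, not a CUDA API):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Over-allocate managed memory and round the returned pointer up to the
// requested power-of-two alignment. Keep the raw pointer for cudaFree().
static void *mallocManagedAligned(size_t bytes, size_t alignment, void **raw_out)
{
    void *raw = NULL;
    if (cudaMallocManaged(&raw, bytes + alignment) != cudaSuccess)
        return NULL;
    uintptr_t p = (uintptr_t) raw;
    p = (p + alignment - 1) & ~(uintptr_t)(alignment - 1);
    *raw_out = raw;                 // this is what must be passed to cudaFree()
    return (void *) p;
}

int main()
{
    void *raw = NULL;
    float *f_data = (float *) mallocManagedAligned(10 * sizeof(float), 64, &raw);
    printf("aligned managed pointer: %p\n", (void *) f_data);
    cudaFree(raw);
    return 0;
}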
I feel like I'm missing something obvious here. I can easily make a time_duration with milliseconds, microseconds, or seconds by doing:
time_duration t1 = seconds(1);
time_duration t2 = milliseconds(1000);
time_duration t3 = microseconds(1000000);
But there's no function for nanoseconds. What's the trick to converting a plain integer nanoseconds value to a time_duration?
I'm on amd64 architecture on Debian Linux. Boost version 1.55.
boost::posix_time::microseconds is actually subsecond_duration<boost::posix_time::time_duration, 1000000>. So...
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

using nanoseconds = boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000000000>;

int main() {
    boost::posix_time::time_duration t = nanoseconds(1000000000);
    std::cout << t << "\n";
}
Prints
00:00:01
UPDATE
Indeed, in the Compile Options for the Boost DateTime library you can see that there's an option to select nanosecond resolution:
By default the posix_time system uses a single 64 bit integer
internally to provide a microsecond level resolution. As an
alternative, a combination of a 64 bit integer and a 32 bit integer
(96 bit resolution) can be used to provide nano-second level
resolutions. The default implementation may provide better performance
and more compact memory usage for many applications that do not
require nano-second resolutions.
To use the alternate resolution (96 bit nanosecond) the variable
BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG must be defined in the library
users project files (ie Makefile, Jamfile, etc). This macro is not
used by the Gregorian system and therefore has no effect when building
the library.
Indeed, you can check it using:
Live On Coliru
#define BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

int main() {
    using namespace boost::posix_time;
    std::cout << nanoseconds(1000000000) << "\n";
}
I am using Alien for Lua to reference the WaitForSingleObject function in the Windows Kernel32.dll.
I am pretty new to Windows programming, so the question I have is about the following #defined constants referenced by the WaitForSingleObject documentation:
If dwMilliseconds is INFINITE, the function will return only when the object is signaled.
What is the INFINITE value? I would naturally assume it to be -1, but I cannot find this to be documented anywhere.
Also, the documentation's table of return values lists them in hexadecimal, but I am confused as to why they have an L character after the last digit. Could this be something as simple as marking the value as a long?
The reason I ask is that Lua uses a Number data type, so I am not sure whether I should be checking for this return value using hex digits (0-F) or decimal digits (0-9).
The thought crossed my mind to just open a C++ application and print out these values, so I did just that:
#include <windows.h>
#include <process.h>
#include <iostream>

int main()
{
    // Print each constant on its own line so the values don't run together.
    std::cout << INFINITE << std::endl;
    std::cout << WAIT_OBJECT_0 << std::endl;
    std::cout << WAIT_ABANDONED << std::endl;
    std::cout << WAIT_TIMEOUT << std::endl;
    std::cout << WAIT_FAILED << std::endl;
    system("pause");
    return 0;
}
The final Lua results based on my findings are:
local INFINITE = 4294967295
local WAIT_OBJECT_0 = 0
local WAIT_ABANDONED = 128
local WAIT_TIMEOUT = 258
local WAIT_FAILED = 4294967295
I tried to Google for the same information. Eventually, I found this Q&A.
I found two sources with: #define INFINITE 0xFFFFFFFF
https://github.com/tpn/winsdk-10/blob/master/Include/10.0.10240.0/um/WinBase.h#L704
https://github.com/Alexpux/mingw-w64/blob/master/mingw-w64-tools/widl/include/winbase.h#L365
For function WaitForSingleObject, parameter dwMilliseconds has type DWORD.
From here: https://learn.microsoft.com/en-us/windows/win32/winprog/windows-data-types
I can see: "DWORD: A 32-bit unsigned integer."
Thus, @RemyLebeau's comment above looks reasonable and valid:
`4294967295` is the same as `-1` when interpreted as a signed integer type instead.
In short: ((DWORD) -1) == INFINITE
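A quick compile-time check (C++11) of that equivalence:

#include <windows.h>

// Both spellings name the same 32-bit pattern, 0xFFFFFFFF:
static_assert(INFINITE == (DWORD)-1, "INFINITE is a DWORD with all bits set");
static_assert(INFINITE == 4294967295UL, "i.e. 4294967295 in decimal");

int main() { return 0; }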
Last comment: ironically, this "infinite" feels similar to the Boeing 787 problem where the plane needed to be rebooted once every 51 days. Eerily close!
Unix has a variety of sleep APIs (sleep, usleep, nanosleep). The only Win32 function I know of for sleep is Sleep(), which is in units of milliseconds.
I've seen that most sleeps, even on Unix, get rounded up significantly (i.e., typically to about 10 ms). I've also seen that on Solaris, if you run as root, you can get sub-10 ms sleeps, and I know this is possible on HP-UX too, provided the fine-grained timers kernel parameter is enabled. Are finer-granularity timers available on Windows and, if so, what are the APIs?
The sad truth is that there is no good answer to this. Multimedia timers are probably the closest you can get -- they only let you set periods down to 1 ms, but (thanks to timeBeginPeriod) they do actually provide precision of around 1 ms, whereas most of the others only manage about 10-15 ms as a rule.
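For illustration, here's a minimal sketch of the timeBeginPeriod() effect on Sleep(); the measured numbers vary by Windows version and hardware:

#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod / timeEndPeriod
#include <cstdio>
#pragma comment(lib, "winmm.lib")

int main()
{
    // Ask the scheduler for 1 ms timer resolution while doing timing-sensitive
    // work, then restore the previous resolution afterwards.
    timeBeginPeriod(1);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    Sleep(1);                       // typically wakes after ~1-2 ms instead of ~15 ms
    QueryPerformanceCounter(&t1);
    printf("Sleep(1) took %lld us\n",
           (t1.QuadPart - t0.QuadPart) * 1000000 / freq.QuadPart);

    timeEndPeriod(1);               // must match the earlier timeBeginPeriod(1)
    return 0;
}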
There are a lot of other candidates. At first glance, CreateWaitableTimer and SetWaitableTimer probably seem like the closest equivalent, since their due times are set in 100 ns intervals. Unfortunately, you can't really depend on anywhere close to that resolution, at least in my testing. In the long term, they probably do provide the best possibility, since they at least let you specify a time of less than 1 ms, even though you can't currently depend on the implementation to provide (anywhere close to) that resolution.
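A minimal sketch of that waitable-timer route, bearing in mind the caveat above about the resolution you actually get:

#include <windows.h>

int main()
{
    // Relative due times are negative and expressed in 100 ns units,
    // so -5000 asks for a 500 microsecond wait. Whether you actually get
    // sub-millisecond resolution depends on the system.
    HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);
    if (!timer) return 1;

    LARGE_INTEGER due;
    due.QuadPart = -5000;           // 500 us, relative to now
    SetWaitableTimer(timer, &due, 0, NULL, NULL, FALSE);
    WaitForSingleObject(timer, INFINITE);

    CloseHandle(timer);
    return 0;
}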
NtDelayExecution seems to be roughly the same as SetWaitableTimer, except that it's undocumented. Unless you're set on using/testing undocumented functions, CreateWaitableTimer/SetWaitableTimer seems like the better choice simply on the basis of being documented.
If you're using thread pools, you could try using CreateThreadPoolTimer and SetThreadPoolTimer instead. I haven't tested them enough to have any certainty about the resolution they really provide, but I'm not particularly optimistic.
Timer queues (CreateTimerQueue, CreateTimerQueueTimer, etc.) are what MS recommends as the replacement for multimedia timers, but (at least in my testing) they don't really provide much better resolution than Sleep.
If you merely want resolution in the nanoseconds range, there's NtDelayExecution in ntdll.dll:
NTSYSAPI NTSTATUS NTAPI NtDelayExecution(BOOLEAN Alertable, PLARGE_INTEGER DelayInterval);
It measures time in 100-nanosecond intervals.
HOWEVER, this probably isn't what you want:
It can delay for much longer than that—as long as a thread time slice (0.5 - 15ms) or two.
Here's code you can use to observe this:
#ifdef __cplusplus
extern "C" {
#endif
#ifdef _M_X64
typedef long long intptr_t;
#else
typedef int intptr_t;
#endif
int __cdecl printf(char const *, ...);
int __cdecl _unloaddll(intptr_t);
intptr_t __cdecl _loaddll(char *);
int (__cdecl * __cdecl _getdllprocaddr(intptr_t, char *, intptr_t))(void);
typedef union _LARGE_INTEGER *PLARGE_INTEGER;
typedef long NTSTATUS;
typedef NTSTATUS __stdcall NtDelayExecution_t(unsigned char Alertable, PLARGE_INTEGER Interval);
NtDelayExecution_t *NtDelayExecution = 0;
typedef NTSTATUS __stdcall NtQueryPerformanceCounter_t(PLARGE_INTEGER PerformanceCounter, PLARGE_INTEGER PerformanceFrequency);
NtQueryPerformanceCounter_t *NtQueryPerformanceCounter = 0;
#ifdef __cplusplus
}
#endif

int main(int argc, char *argv[]) {
    long long delay = 1 * -(1000 / 100) /* relative 100-ns intervals */, counts_per_sec = 0;
    long long counters[2];
    intptr_t ntdll = _loaddll("ntdll.dll");
    NtDelayExecution = (NtDelayExecution_t *)_getdllprocaddr(ntdll, "NtDelayExecution", -1);
    NtQueryPerformanceCounter = (NtQueryPerformanceCounter_t *)_getdllprocaddr(ntdll, "NtQueryPerformanceCounter", -1);
    for (int i = 0; i < 10; i++) {
        NtQueryPerformanceCounter((PLARGE_INTEGER)&counters[0], (PLARGE_INTEGER)&counts_per_sec);
        NtDelayExecution(0, (PLARGE_INTEGER)&delay);
        NtQueryPerformanceCounter((PLARGE_INTEGER)&counters[1], (PLARGE_INTEGER)&counts_per_sec);
        printf("Slept for %lld microseconds\n", (counters[1] - counters[0]) * 1000000 / counts_per_sec);
    }
    return 0;
}
My output:
Slept for 9455 microseconds
Slept for 15538 microseconds
Slept for 15401 microseconds
Slept for 15708 microseconds
Slept for 15510 microseconds
Slept for 15520 microseconds
Slept for 1248 microseconds
Slept for 996 microseconds
Slept for 984 microseconds
Slept for 1010 microseconds
The MinGW answer in long form:
MinGW and Cygwin provide a nanosleep() implementation under <pthread.h>. Source code:
In Cygwin and MSYS2: signal.cc and cygwait.cc (LGPLv3+; with linking exception)
This is based on NtCreateTimer and WaitForMultipleObjects.
In MinGW-W64: nanosleep.c and thread.c (Zope Public License)
This is based on WaitForSingleObject and Sleep.
In addition, gnulib (GPLv3+) has a higher-precision implementation in nanosleep.c. This performs a busy-loop over QueryPerformanceCounter for short (<1s) intervals and Sleep for longer intervals.
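That idea translates roughly into the following sketch (not gnulib's actual code; preciseSleepMicroseconds() is a made-up name, and the busy-wait keeps a core spinning):

#include <windows.h>

// Coarse-sleep most of the interval with Sleep(), then spin on
// QueryPerformanceCounter() for the remainder to absorb scheduler slop.
static void preciseSleepMicroseconds(long long usec)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    long long const target = start.QuadPart + usec * freq.QuadPart / 1000000;

    if (usec > 2000)                            // leave ~2 ms for the busy-wait
        Sleep((DWORD)((usec - 2000) / 1000));

    do {
        QueryPerformanceCounter(&now);
    } while (now.QuadPart < target);
}

int main()
{
    preciseSleepMicroseconds(250);              // ~250 us, at the cost of CPU time
    return 0;
}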
You can use the usual timeBeginPeriod trick that ethanpil linked to with all the underlying NT timers.
Windows provides Multimedia timers that are higher resolution than Sleep(). The actual resolution supported by the OS can be obtained at runtime.
You may want to look into
timeBeginPeriod / timeEndPeriod
and/or
QueryPerformanceCounter
See here for more information: http://www.geisswerks.com/ryan/FAQS/timing.html
particularly the section towards the bottom: High-precision 'Sleeps'
Yeah, there is, under 'pthread.h' with the MinGW compiler.
I'm trying to extract the boot time by getting the current time as a SYSTEMTIME structure, converting it to FILETIME, which I then convert to ULARGE_INTEGER, from which I subtract GetTickCount64(), and then converting everything back to SYSTEMTIME.
I'm comparing this function against 'NET STATISTICS WORKSTATION' and for some reason my output is off by several hours, which doesn't seem to match any time-zone difference.
Here's Visual Studio example code:
#include "stdafx.h"
#include <windows.h>
#include <tchar.h>
#include <strsafe.h>
#define KILOBYTE 1024
#define BUFF KILOBYTE
int _tmain(int argc, _TCHAR* argv[])
{
ULARGE_INTEGER ticks, ftime;
SYSTEMTIME current, final;
FILETIME ft, fout;
OSVERSIONINFOEX osvi;
char output[BUFF];
int retval=0;
ZeroMemory(&osvi, sizeof(OSVERSIONINFOEX));
ZeroMemory(&final, sizeof(SYSTEMTIME));
GetVersionEx((OSVERSIONINFO *) &osvi);
if (osvi.dwBuildNumber >= 6000) ticks.QuadPart = GetTickCount64();
else ticks.QuadPart = GetTickCount();
//Convert miliseconds to 100-nanosecond time intervals
ticks.QuadPart = ticks.QuadPart * 10000;
//GetLocalTime(¤t); -- //doesn't really fix the problem
GetSystemTime(¤t);
SystemTimeToFileTime(¤t, &ft);
printf("INITIAL: Filetime lowdatetime %u, highdatetime %u\r\n", ft.dwLowDateTime, ft.dwHighDateTime);
ftime.LowPart=ft.dwLowDateTime;
ftime.HighPart=ft.dwHighDateTime;
//subtract boot time interval from current time
ftime.QuadPart = ftime.QuadPart - ticks.QuadPart;
//Convert ULARGE_INT back to FILETIME
fout.dwLowDateTime = ftime.LowPart;
fout.dwHighDateTime = ftime.HighPart;
printf("FINAL: Filetime lowdatetime %u, highdatetime %u\r\n", fout.dwLowDateTime, fout.dwHighDateTime);
//Convert FILETIME back to system time
retval = FileTimeToSystemTime(&fout, &final);
printf("Return value is %d\r\n", retval);
printf("Current time %d-%.2d-%.2d %.2d:%.2d:%.2d\r\n", current.wYear, current.wMonth, current.wDay, current.wHour, current.wMinute, current.wSecond);
printf("Return time %d-%.2d-%.2d %.2d:%.2d:%.2d\r\n", final.wYear, final.wMonth, final.wDay, final.wHour, final.wMinute, final.wSecond);
return 0;
}
I ran it and found that it works correctly when using GetLocalTime as opposed to GetSystemTime, which is expressed in UTC. So it would make sense that GetSystemTime would not necessarily match the "clock" on the PC.
Other than that, though, the issue could possibly be the call to GetVersionEx. As written, I think it will always return zeros for all values. You need this line prior to calling it:
osvi.dwOSVersionInfoSize = sizeof( osvi );
Otherwise that dwBuildNumber will be zero and it will call GetTickCount, which is only good for 49 days or so. On the other hand, if that were the case, I think you would get a result with a much larger difference.
I'm not completely sure that (as written) the check is necessary to choose which tick count function to call. If GetTickCount64 doesn't exist, the app would not load due to the missing entrypoint (unless perhaps delay loading was used ... I'm not sure in that case). I believe that it would be necessary to use LoadLibrary and GetProcAddress to make the decision dynamically between those two functions and have it work on an older platform.
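A sketch of that dynamic approach (GetTickCountCompat() is a made-up wrapper name; kernel32.dll is always loaded, so GetModuleHandle suffices in place of LoadLibrary):

#include <windows.h>
#include <stdio.h>

typedef ULONGLONG (WINAPI *GetTickCount64_t)(void);

// Resolve GetTickCount64 at run time so the binary still loads on
// pre-Vista systems, falling back to GetTickCount if it is absent.
static ULONGLONG GetTickCountCompat(void)
{
    GetTickCount64_t pGetTickCount64 = (GetTickCount64_t)
        GetProcAddress(GetModuleHandleW(L"kernel32.dll"), "GetTickCount64");
    if (pGetTickCount64)
        return pGetTickCount64();
    return GetTickCount();   // wraps after ~49.7 days
}

int main(void)
{
    printf("Uptime: %llu ms\n", GetTickCountCompat());
    return 0;
}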