Huge difference between execution times of the same CUDA kernel

I'm launching a set of kernels multiple (30) times.
Each of these 30 tests is deterministic: in every test a set of kernels is called 10 times, and this number is fixed. At the beginning of each test I call cudaSetDevice(0) and everything gets malloc'd and memcpy'd.
When the test is done and the execution time has been taken, everything is cudaFree'd.
Here's a sample output from my program:
avg: 81.7189
times:
213.0105 202.8020 196.8834 202.4001 197.7123 215.4658 199.5302 198.6519 200.8467 203.7865
20.2014 20.1881 21.0537 20.8805 20.1986 20.6036 20.9458 20.9473 20.2929 20.9167
21.0686 20.4563 24.5359 21.1530 21.7075 23.3320 20.5921 20.6506 19.9331 20.8211
The first 10 kernels take about 200 ms, while the others take about 20 ms.
Apparently every kernel calculates the same values; they all print the correct ones. But since I malloc in the same order in every test, couldn't the GPU memory still hold the same values from the previous execution?
Also, the kernels aren't returning errors, because I'm checking for them. Every kernel launch is followed by cudaThreadSynchronize() for debugging purposes and by error checking with this macro:
#define CUDA_ERROR_CHECK if( (error = cudaGetLastError()) != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(error));
Why is this happening?
I'm getting the execution times with these Windows functions:
void StartCounter()
{
    LARGE_INTEGER li;
    if (!QueryPerformanceFrequency(&li))
        cout << "QueryPerformanceFrequency failed!\n";
    PCFreq = double(li.QuadPart) / 1000.0;
    QueryPerformanceCounter(&li);
    CounterStart = li.QuadPart;
}

void StopCounter()
{
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    double time = double(li.QuadPart - CounterStart) / PCFreq;
    v.push_back(time);
}
Edit:
The mallocs, copies and other stuff aren't being timed. I only time the execution (kernel launch and sync).
Visual Studio 2010's optimizations are turned on. Everything is set to maximize speed. CUDA's optimizations are on as well.

Measuring kernel execution time with QueryPerformanceCounter alone is wrong, because the host launches the kernel asynchronously and then both run in parallel; you are probably measuring only the launch overhead.
To measure kernel execution time, use cudaEvents, as ahmad mentioned:
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
...
cudaEventRecord(start, 0);
yourkernel <<< n_blocks, block_size >>> (a_d, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
...
cudaEventElapsedTime(&time, start, stop);
printf ("Time for the kernel: %f ms\n", time);
If you want to use QueryPerformanceCounter, you have to call
cudaDeviceSynchronize();
after the kernel call. It will block until the kernel has finished.
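For instance, reusing the StartCounter/StopCounter helpers and the kernel launch from the example above, a minimal sketch of host-side timing with the synchronization added:
StartCounter();                                   // QueryPerformanceCounter-based start
yourkernel <<< n_blocks, block_size >>> (a_d, N);
cudaDeviceSynchronize();                          // block until the kernel has actually finished
StopCounter();                                    // the interval now includes the kernel execution time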

Related

Trigger Countdown with 433 MHz transmission

I would like to use an Arduino to read 433 MHz transmissions from multiple Soil Moisture Sensors. Since I can never be sure all transmissions reach the receiver, I'd like to set a countdown from the moment the first transmission is received. If another transmission is received, the countdown starts again.
After a defined amount of time (e.g. 10 minutes) without any more signals, or if all signals have been received (e.g. 4 sensors), the receiving unit should stop and come to a decision based on the data it has received up to that point.
For transmitting and receiving I am using the <RCSwitch.h> library.
The loop of the receiving unit and one Sensor looks like this:
#include <RCSwitch.h>

RCSwitch mySwitch = RCSwitch();

void setup() {
  Serial.begin(9600);
  mySwitch.enableReceive(4);
}

void loop() {
  if (mySwitch.available()) {
    int value = mySwitch.getReceivedValue();
    if (value == 0) {
      lcd.clear();                          // lcd is declared and initialized in the full code
      Serial.print("Unknown encoding");
    }
    else {
      Serial.print(mySwitch.getReceivedValue());
      Serial.print("%");
    }
  }
}
The full code includes some differentiation mechanism for all sensors but I figured that might not be relevant for my question.
Question:
What's the best way to do this without a real-time clock module? As far as I know I can't wait by using delay(...), since then I won't receive any data while the processor is waiting.
You can use millis() as a clock. It returns the number of milliseconds since the Arduino started.
#define MINUTES(x) ((x) * 60000UL)

unsigned long countStart = 0;

void loop()
{
    if (/* read from module ok */)
    {
        countStart = millis();
        // sanity check, since millis() eventually rolls over
        if (countStart == 0)
            countStart = 1;
    }
    if (countStart && ((millis() - countStart) > MINUTES(10)))
    {
        countStart = 0;
        // trigger event
    }
}
Arduino's internal timers can also be used in this situation. If a long time period is needed, it's better to use a 16-bit counter (usually Timer1) with a 1024 prescaler (the largest available). If the required time is longer than the largest interval the timer can produce, a software counter has to be added to keep track of the full interval.
For example, for a 1-second compare interrupt (OCR1A is only 16 bits wide, so a full minute does not fit at this prescaler and is counted in software instead), initialize the registers as:
TCCR1A = 0;              // initially set every register to 0
TCCR1B = 0;
TCNT1 = 0;
OCR1A = 15624;           // compare match register: 16MHz / 1024 / 1Hz - 1 (one interrupt per second)
TCCR1B |= (1 << WGM12);  // CTC (timer compare) mode
TCCR1B |= (1 << CS12) | (1 << CS10); // 1024 prescaler
TIMSK1 |= (1 << OCIE1A); // enable timer compare interrupt
This configuration fires an interrupt every second, and on each compare match the ISR TIMER1_COMPA_vect is run; counting 60 of those interrupts gives the 1-minute interval. You can play around with the value of OCR1A (and the software count) for different interrupt periods.
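A minimal sketch of that software counter in the ISR (the variable names are just illustrative; the flag is checked and cleared in loop()):
volatile uint8_t secondsElapsed = 0;
volatile bool oneMinuteElapsed = false;

ISR(TIMER1_COMPA_vect)
{
    // runs once per second with the configuration above
    if (++secondsElapsed >= 60) {
        secondsElapsed = 0;
        oneMinuteElapsed = true;  // picked up and cleared in loop()
    }
}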
The main advantage of using interrupts is that they don't block any other task and are executed almost instantaneously (as long as interrupts are not disabled explicitly).

do_gettimeofday() in Beaglebone giving wrong time

I am trying to measure the time period of a square wave on a Beaglebone running Angstrom OS. I have written a kernel driver to register an ISR in which I'm timing the pulses. Everything is working fine, but the time interval being measured is completely wrong. I'm using do_gettimeofday() function to measure the time. When I do the same in userspace program using poll() function, I'm able to achieve correct values (it shows around 1007 us for a 1000us wave), but when I use the driver to measure the pulse, I get the interval as 1923us. I have no idea why the time interval in the kernel is higher than that in user space. I have attached my code below.
I would be grateful if someone can find the mistake in my program.
kernel ISR:
static irqreturn_t ISR(int irq, void *dev_id)
{
    prev = c;
    do_gettimeofday(&c);
    printk(KERN_ALERT "%ld", (c.tv_usec - prev.tv_usec));
    return IRQ_HANDLED;
}
userspace prog:
while (1) {
    prev = start;
    gettimeofday(&start, NULL);
    rc = poll(&fdset, 1, 20000);
    if (rc < 0) {
        printf("Error in rc\n");
        return -1;
    }
    if (rc == 0) {
        printf("Timed out\n");
        return -1;
    }
    if (fdset.revents & POLLPRI) {
        len = read(fdset.fd, buf, 2);
        printf("%ld\n", (start.tv_usec - prev.tv_usec));
    }
}
For profiling interrupt latency, I find it quite useful to be lazy and to set a GPIO pin then measure the time with an oscilloscope. Probably not the answer you want, but it might help you over a hurdle quickly.

Resolution of WaitForSingleObject with timeSetEvent & SetWaitableTimer

I am using a Win32 multimedia timer to put a delay between the dispatch of large numbers of UDP packets, but I am finding that the resulting delay is substantially longer than it should be. Delays of ~40ms are sometimes nearer 1000ms, even when using Windows Multimedia timers and upping the timer resolution. Below is a simplified version of the code I used:
if( timeGetDevCaps(&tc, sizeof(TIMECAPS)) == TIMERR_NOERROR )
{
    timeRes = min( max(tc.wPeriodMin, 1), tc.wPeriodMax );
    timeBeginPeriod(timeRes);
    printf("Timer Res: %u\n", timeRes);
}

/* ... */

while( ptrHead )
{
    NALU_t *ptrLink = ptrHead;
    unsigned long tsNALU = ptrLink->timestamp - tsFirst;
    printf("Timestamp: %umsec\n", ptrLink->timestamp / 90 );

    int idxPort;
    for( idxPort = 0; idxPort < 12; idxPort++ )
    {
        ip4Addr.sin_port = htons( 60000 + idxPort );
        struct sockaddr *saAddr = (struct sockaddr*)&ip4Addr;
        sendto(fdSocket, (char*)ptrLink->ptrData, ptrLink->lenData,
               0, saAddr, lenAddr);
    }

    if( 1 )
    {
        unsigned long millis = (tsNALU - tsPrev) / 90;
        valTime.QuadPart = 10000;
        valTime.QuadPart *= millis;
        valTime.QuadPart *= -1;
        if( SetWaitableTimer(hdlTimer, &valTime, 0, NULL, NULL, TIME_ONESHOT) )
            WaitForSingleObject(hdlTimer, INFINITE);
    }

    tsPrev = tsNALU;
    ptrHead = ptrLink->next;
    free( ptrLink );
}
I suspect the problem is that Windows 7 no longer guarantees the resolution of timers when signalled by events as opposed to callbacks, but I am loath to use the latter. Does anyone know why even supposedly high-resolution timers in single-threaded test cases are so wildly inaccurate?
If timing is critical it's best to run in a busy loop (you can give up a timeslice every iteration using Sleep(0) if you want), using the QueryPerformanceCounter() API to measure elapsed time.
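A minimal sketch of such a busy-wait delay (the helper name is just illustrative):
#include <windows.h>

// Busy-wait for roughly 'ms' milliseconds using the performance counter.
void DelayMillis(double ms)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    const LONGLONG target = (LONGLONG)(ms * freq.QuadPart / 1000.0);
    do
    {
        Sleep(0);                       // give up the rest of the timeslice
        QueryPerformanceCounter(&now);
    } while (now.QuadPart - start.QuadPart < target);
}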
From subsequent experiments, my best guess is that Windows moving threads between CPU cores (possibly for load-balancing reasons - this is on a Quad-core i7) is disruptive to timing functions. I used SetThreadAffinityMask() to lock my timing-critical thread to one CPU (and my non-timing threads to all other cores), and that has sorted out the problems.
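For reference, pinning the current thread to the first core looks roughly like this (which core/mask you pick is up to you):
SetThreadAffinityMask(GetCurrentThread(), 1);   // bit 0 of the affinity mask = CPU 0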

cudaEventRecord returning zero

I'm running an image filter on the GPU and I need to measure the time each part of the program takes for comparison. First I tried the time.h library, but it always returned zero. Then I read this post
and used the same code in my program before and after calling the kernel, but it still returns zero. Can anyone tell me what the problem could be?
This is my code:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float Elapsed = 0, Cycle;
while (count)
{
    cudaEventRecord(start, 0);
    ImgFilter<<<dimGrid,dimBlock>>>...
    cudaEventRecord(stop, 0);
    cudaEventElapsedTime(&Cycle, start, stop);
    Elapsed += Cycle;
}
printf("Time = %f", Elapsed);
I also tried printing 'Cycle' but it's always zero.
You are missing a call to the cudaEventSynchronize function:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float Elapsed = 0, Cycle;
while (count)
{
    cudaEventRecord(start, 0);
    ImgFilter<<<dimGrid,dimBlock>>>...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&Cycle, start, stop);
    Elapsed += Cycle;
}
printf("Time = %f", Elapsed);
Note that a kernel launch returns before all CUDA threads have finished execution, so you need to use cudaThreadSynchronize (cudaDeviceSynchronize in newer CUDA versions) after calling the kernel.

How can I get the Windows system time with millisecond resolution?

How can I get the Windows system time with millisecond resolution?
If the above is not possible, then how can I get the operating system start time? I would like to use this value together with timeGetTime() in order to compute a system time with millisecond resolution.
Try this article from MSDN Magazine. It's actually quite complicated.
Implement a Continuously Updating, High-Resolution Time Provider for Windows
(archive link)
This is an elaboration of the above comments to explain some of the whys.
First, the GetSystemTime* calls are the only Win32 APIs providing the system's time. This time has a fairly coarse granularity, as most applications do not need the overhead required to maintain a higher resolution. Time is (likely) stored internally as a 64-bit count of milliseconds. Calling timeGetTime gets the low order 32 bits. Calling GetSystemTime, etc requests Windows to return this millisecond time, after converting into days, etc and including the system start time.
There are two time sources in a machine: the CPU's clock and an on-board clock (e.g., real-time clock (RTC), Programmable Interval Timer (PIT), and High Precision Event Timer (HPET)). The first has a resolution of ~0.5 ns (at 2 GHz) and the second is generally programmable down to a period of 1 ms (though newer chips (HPET) have higher resolution). Windows uses these periodic ticks to perform certain operations, including updating the system time.
Applications can change this period via timeBeginPeriod; however, this affects the entire system. The OS will check / update regular events at the requested frequency. Under low CPU loads / frequencies, there are idle periods for power savings. At high frequencies, there isn't time to put the processor into low power states. See Timer Resolution for further details. Finally, each tick has some overhead and increasing the frequency consumes more CPU cycles.
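For example, requesting 1 ms tick granularity around a timing-sensitive section might look like this (remember it is a system-wide setting and every timeBeginPeriod must be paired with a timeEndPeriod):
timeBeginPeriod(1);   // request 1 ms timer granularity (affects the whole system)
// ... timing-sensitive work ...
timeEndPeriod(1);     // end the request when you no longer need it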
As for higher-resolution time: the system time is not maintained to that accuracy, any more than Big Ben has a second hand. Using QueryPerformanceCounter (QPC) or the CPU's tick counter (rdtsc) can provide resolution between the system time ticks; such an approach was used in the MSDN Magazine article Kevin cited. These approaches may drift, though (e.g., due to frequency scaling), and therefore need to be synced to the system time.
In Windows, the base of all time is a function called GetSystemTimeAsFileTime.
It returns a structure that is capable of holding a time with 100ns resolution.
It is kept in UTC.
The FILETIME structure records the number of 100ns intervals since January 1, 1601, meaning its resolution is limited to 100ns.
This forms our first function:
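Something along these lines (a minimal sketch; the wrapper name is just illustrative):
// A thin wrapper: returns the current UTC time as a 100ns-resolution FILETIME.
FILETIME GetCurrentUtcFileTime()
{
    FILETIME ftNow;
    GetSystemTimeAsFileTime(&ftNow);
    return ftNow;
}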
A 64-bit number of 100ns ticks since January 1, 1601 is somewhat unwieldy. Windows provides a handy helper function, FileTimeToSystemTime, that can decode this 64-bit integer into useful parts:
typedef struct _SYSTEMTIME {
    WORD wYear;
    WORD wMonth;
    WORD wDayOfWeek;
    WORD wDay;
    WORD wHour;
    WORD wMinute;
    WORD wSecond;
    WORD wMilliseconds;
} SYSTEMTIME;
Notice that SYSTEMTIME has a built-in resolution limitation of 1ms
Now we have a way to go from FILETIME to SYSTEMTIME, so we could write a function to get the current system time as a SYSTEMTIME structure:
SYSTEMTIME GetSystemTimeAsSystemTime()
{
    // Get the current system time (UTC) in its native 100ns FILETIME format
    FILETIME ftNow;
    GetSystemTimeAsFileTime(&ftNow);

    // Decode the 100ns intervals into a 1ms-resolution SYSTEMTIME
    SYSTEMTIME stNow;
    FileTimeToSystemTime(&ftNow, &stNow);

    return stNow;
}
Except Windows already wrote such a function for you: GetSystemTime
Local, rather than UTC
Now what if you don't want the current time in UTC? What if you want it in your local time? Windows provides a function to convert a UTC FILETIME into your local time: FileTimeToLocalFileTime.
You could write a function that returns you a FILETIME in local time already:
FILETIME GetLocalTimeAsFileTime()
{
    FILETIME ftNow;
    GetSystemTimeAsFileTime(&ftNow);

    // convert UTC to local time
    FILETIME ftNowLocal;
    FileTimeToLocalFileTime(&ftNow, &ftNowLocal);

    return ftNowLocal;
}
And let's say you want to decode the local FILETIME into a SYSTEMTIME. That's no problem; you can use FileTimeToSystemTime again.
Fortunately, Windows already provides you a function that returns that value: GetLocalTime.
Precise
There is another consideration. Before Windows 8, the clock had a resolution of around 15ms. In Windows 8 they improved the clock to 100ns (matching the resolution of FILETIME).
GetSystemTimeAsFileTime (legacy, 15ms resolution)
GetSystemTimePreciseAsFileTime (Windows 8, 100ns resolution)
This means we should always prefer the new value:
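For example, a sketch that prefers the precise variant but still runs on pre-Windows 8 systems by loading it dynamically (the wrapper name is just illustrative):
// Returns the most precise UTC FILETIME available:
// GetSystemTimePreciseAsFileTime on Windows 8+, GetSystemTimeAsFileTime otherwise.
FILETIME GetBestSystemTimeAsFileTime()
{
    typedef VOID (WINAPI *PreciseFn)(LPFILETIME);
    static PreciseFn pPrecise = (PreciseFn)GetProcAddress(
        GetModuleHandleW(L"kernel32.dll"), "GetSystemTimePreciseAsFileTime");

    FILETIME ft;
    if (pPrecise)
        pPrecise(&ft);
    else
        GetSystemTimeAsFileTime(&ft);
    return ft;
}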
You asked for the time
You asked for the time, but you have some choices.
The timezone:
UTC (system native)
Local timezone
The format:
FILETIME (system native, 100ns resolution)
SYSTEMTIME (decoded, 1ms resolution)
Summary
100ns resolution: FILETIME
UTC: GetSystemTimePreciseAsFileTime (or GetSystemTimeAsFileTime)
Local: (roll your own)
1ms resolution: SYSTEMTIME
UTC: GetSystemTime
Local: GetLocalTime
GetTickCount will not get it done for you.
Look into QueryPerformanceFrequency / QueryPerformanceCounter. The only gotcha here is CPU scaling though, so do your research.
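A minimal elapsed-time sketch with those two calls:
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
// ... work to be measured ...
QueryPerformanceCounter(&t1);
double elapsedMs = (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;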
Starting with Windows 8, Microsoft introduced the new API GetSystemTimePreciseAsFileTime.
Unfortunately you can't use that if you create software which must also run on older operating systems.
My current solution is as follows, but be aware: the determined time is not exact, it is only close to the real time. The result should always be smaller than or equal to the real time, with a fixed error (unless the computer went to standby). The result has millisecond resolution. For my purpose it is exact enough.
void GetHighResolutionSystemTime(SYSTEMTIME* pst)
{
static LARGE_INTEGER uFrequency = { 0 };
static LARGE_INTEGER uInitialCount;
static LARGE_INTEGER uInitialTime;
static bool bNoHighResolution = false;
if(!bNoHighResolution && uFrequency.QuadPart == 0)
{
// Initialize performance counter to system time mapping
bNoHighResolution = !QueryPerformanceFrequency(&uFrequency);
if(!bNoHighResolution)
{
FILETIME ftOld, ftInitial;
GetSystemTimeAsFileTime(&ftOld);
do
{
GetSystemTimeAsFileTime(&ftInitial);
QueryPerformanceCounter(&uInitialCount);
} while(ftOld.dwHighDateTime == ftInitial.dwHighDateTime && ftOld.dwLowDateTime == ftInitial.dwLowDateTime);
uInitialTime.LowPart = ftInitial.dwLowDateTime;
uInitialTime.HighPart = ftInitial.dwHighDateTime;
}
}
if(bNoHighResolution)
{
GetSystemTime(pst);
}
else
{
LARGE_INTEGER uNow, uSystemTime;
{
FILETIME ftTemp;
GetSystemTimeAsFileTime(&ftTemp);
uSystemTime.LowPart = ftTemp.dwLowDateTime;
uSystemTime.HighPart = ftTemp.dwHighDateTime;
}
QueryPerformanceCounter(&uNow);
LARGE_INTEGER uCurrentTime;
uCurrentTime.QuadPart = uInitialTime.QuadPart + (uNow.QuadPart - uInitialCount.QuadPart) * 10000000 / uFrequency.QuadPart;
if(uCurrentTime.QuadPart < uSystemTime.QuadPart || abs(uSystemTime.QuadPart - uCurrentTime.QuadPart) > 1000000)
{
// The performance counter has been frozen (e. g. after standby on laptops)
// -> Use current system time and determine the high performance time the next time we need it
uFrequency.QuadPart = 0;
uCurrentTime = uSystemTime;
}
FILETIME ftCurrent;
ftCurrent.dwLowDateTime = uCurrentTime.LowPart;
ftCurrent.dwHighDateTime = uCurrentTime.HighPart;
FileTimeToSystemTime(&ftCurrent, pst);
}
}
GetSystemTimeAsFileTime gives the best precision of any Win32 function for absolute time. QPF/QPC as Joel Clark suggested will give better relative time.
Since we all come here for quick snippets instead of boring explanations, I'll write one:
FILETIME t;
GetSystemTimeAsFileTime(&t); // unusable as is

ULARGE_INTEGER i;
i.LowPart = t.dwLowDateTime;
i.HighPart = t.dwHighDateTime;
int64_t ticks_since_1601 = i.QuadPart; // now usable

int64_t us_since_1601  = i.QuadPart / 10;
int64_t ms_since_1601  = i.QuadPart / 10000;
int64_t sec_since_1601 = i.QuadPart / 10000000;

// unix epoch
int64_t unix_us  = i.QuadPart / 10    - 11644473600LL * 1000000;
int64_t unix_ms  = i.QuadPart / 10000 - 11644473600LL * 1000;
double  unix_sec = i.QuadPart / 1e7   - 11644473600LL;

// i.QuadPart is the # of 100ns ticks since 1601-01-01T00:00:00Z
// the difference to the Unix epoch is 11644473600 seconds (attention to units!)
No idea how the drifting performance-counter-based answers got upvoted; don't introduce slippage bugs, guys.
QueryPerformanceCounter() is built for fine-grained timer resolution.
It is the highest-resolution timer the system has to offer that you can use in your application code to identify performance bottlenecks.
Here is a simple implementation for C# devs:
[DllImport("kernel32.dll")]
extern static short QueryPerformanceCounter(ref long x);

[DllImport("kernel32.dll")]
extern static short QueryPerformanceFrequency(ref long x);

private long m_endTime;
private long m_startTime;
private long m_frequency;

public Form1()
{
    InitializeComponent();
}

public void Begin()
{
    QueryPerformanceCounter(ref m_startTime);
}

public void End()
{
    QueryPerformanceCounter(ref m_endTime);
}

private void button1_Click(object sender, EventArgs e)
{
    QueryPerformanceFrequency(ref m_frequency);
    Begin();
    for (long i = 0; i < 1000; i++) ;
    End();
    MessageBox.Show((m_endTime - m_startTime).ToString());
}
If you are a C/C++ dev, then take a look here: How to use the QueryPerformanceCounter function to time code in Visual C++
Well, this one is very old, but there is another useful function in the Windows C library, _ftime, which returns a structure with the local time as time_t, milliseconds, the timezone, and a daylight saving time flag.
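A small sketch using the safer _ftime_s variant (Windows CRT only; older CRTs use plain _ftime):
#include <stdio.h>
#include <sys/timeb.h>

int main(void)
{
    struct _timeb tb;
    _ftime_s(&tb); // _ftime(&tb) on older CRTs
    printf("seconds: %lld  milliseconds: %hu  DST flag: %d\n",
           (long long)tb.time, tb.millitm, tb.dstflag);
    return 0;
}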
In C11 and above (or C++17 and above) you can use timespec_get() to portably get the time with higher precision:
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);

    char buff[100];
    strftime(buff, sizeof buff, "%D %T", gmtime(&ts.tv_sec));
    printf("Current time: %s.%09ld UTC\n", buff, ts.tv_nsec);
}
If you're using C++, then since C++11 you can use std::chrono::high_resolution_clock, std::chrono::system_clock (wall clock), or std::chrono::steady_clock (monotonic clock) in the new <chrono> header. No need to use Windows-specific APIs anymore:
auto start1 = std::chrono::high_resolution_clock::now();
auto start2 = std::chrono::system_clock::now();
auto start3 = std::chrono::steady_clock::now();

// do some work

auto end1 = std::chrono::high_resolution_clock::now();
auto end2 = std::chrono::system_clock::now();
auto end3 = std::chrono::steady_clock::now();

std::chrono::duration<double, std::milli> diff1 = end1 - start1;
std::chrono::duration<double, std::milli> diff2 = end2 - start2;
auto diff3 = std::chrono::duration_cast<std::chrono::milliseconds>(end3 - start3);

std::cout << diff1.count() << ' ' << diff2.count() << ' ' << diff3.count() << '\n';
