Can the 'nop' instruction save execution time on AArch64? - performance

.text
.globl asm_f1
.globl asm_f2
.globl asm_f3
.globl asm_f4
.globl asm_f5
asm_f1:                 // mov + two movk: builds 0x100010001 in x0
mov x0, #1
movk x0, #1, lsl #16
movk x0, #1, lsl #32
ret
asm_f2:                 // three nops, no data processing
nop
nop
nop
ret
asm_f3:                 // mov + movk, then one nop
mov x0, #1
movk x0, #1, lsl #16
nop
ret
asm_f4:                 // mov, one nop, then movk
mov x0, #1
nop
movk x0, #1, lsl #16
ret
asm_f5:                 // mov, then two nops
mov x0, #1
nop
nop
ret
int main(int argc, char **argv) {
clock_t start;
clock_t end;
start = clock();
for (int i = 0; i < COUNT; i++) {
asm_f1();
}
end = clock();
printf("asm_f1 took %ld cycles on avg\n", (end - start) );
start = clock();
for (int i = 0; i < COUNT; i++) {
asm_f2();
}
end = clock();
printf("asm_f2 took %ld cycles on avg\n", (end - start) );
start = clock();
for (int i = 0; i < COUNT; i++) {
asm_f3();
}
end = clock();
printf("asm_f3 took %ld cycles on avg\n", (end - start) );
start = clock();
for (int i = 0; i < COUNT; i++) {
asm_f4();
}
end = clock();
printf("asm_f4 took %ld cycles on avg\n", (end - start) );
start = clock();
for (int i = 0; i < COUNT; i++) {
asm_f5();
}
end = clock();
printf("asm_f5 took %ld cycles on avg\n", (end - start) );
}
asm_f1 took 345961 cycles on avg
asm_f2 took 325725 cycles on avg
asm_f3 took 327267 cycles on avg
asm_f4 took 324958 cycles on avg
asm_f5 took 349501 cycles on avg
Comparing f1 and f2, it seems that 'nop' saves time compared to 'mov'. Since f3 and f4 each use only one 'nop', they are a little slower than f2, which is expected. But f5 uses two 'nop's and is still slower than f3 and f4 - how can this case be explained?
$> lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
$> uname -p
aarch64
Some docs from the AArch64 instruction set reference:
NOP
No Operation does nothing, other than advance the value of the program counter by 4. This instruction can be used for instruction alignment purposes.
The timing effects of including a NOP instruction in a program are not guaranteed. It can increase execution time, leave it unchanged, or even reduce it. Therefore, NOP instructions are not suitable for timing loops.
Update:
Following @Siguza's comment, I changed the loop a little; the performance gap is now more obvious.
void loop(void *f, char *fname) {
clock_t start;
clock_t end;
start = clock();
for (int i = 0; i < COUNT; i++) {
((void(*)())f)();
}
end = clock();
printf("%s took %ld cycles on avg\n", fname, (end - start) );
}
int main(int argc, char **argv) {
loop(asm_f1, "asm_f1");
loop(asm_f2, "asm_f2");
loop(asm_f3, "asm_f3");
loop(asm_f4, "asm_f4");
loop(asm_f5, "asm_f5");
return 0;
}
asm_f1 took 344089 cycles on avg
asm_f2 took 259436 cycles on avg
asm_f3 took 246766 cycles on avg
asm_f4 took 245166 cycles on avg
asm_f5 took 354827 cycles on avg
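A side note on units: clock() counts CLOCKS_PER_SEC ticks of CPU time per second, not CPU cycles, so the numbers above are raw tick deltas for the whole loop. A minimal sketch of turning such a delta into a per-call average (hypothetical helper, reusing the COUNT macro from the code above):

#include <time.h>

/* Hypothetical helper, not part of the original program: converts a
 * clock() delta over `count` calls into an average time per call in
 * nanoseconds. clock() measures CPU time in CLOCKS_PER_SEC ticks per
 * second, not in CPU cycles. */
static double avg_ns_per_call(clock_t start, clock_t end, long count)
{
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)count;
}

/* usage: printf("%s took %.2f ns per call on avg\n", fname,
 *                avg_ns_per_call(start, end, COUNT)); */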

Related

Stupid Sort of uint16_t halfwords?

This is the first code for sorting whole signed words (int32_t):
sort: SUB R1,R1,#1 //--n void StupidSort(int a[], int n)
ADD R12,R0,#4 //start address+4 {
ADD R1,R0,R1,LSL#2 //end address int tmp, i = 0;
L1: LDR R3,[R0] //*a do {
LDR R2,[R0,#4]! //*++a ___ if (a[i] > a[i+1]) {
CMP R3,R2 // \ tmp = a[i+1];
BLE L2 // \ a[i+1] = a[i];
STMDA R0,{R2,R3} // \ a[i] = tmp;
CMP R0,R12 // \ if (i) i--;
SUBHI R0,R0,#8 // \___ } else i++;
L2: CMP R0,R1 // } while (i < n - 1);
BLT L1 // }
BX LR
This is what I have done until now:
sort: SUB R1,R1,#1 //--n void StupidSort(int a[], int n)
ADD R12,R0,#2 //start address+2 {
ADD R1,R0,R1,LSL#1 //end address int tmp, i = 0;
L1: LDRH R3,[R0] //*a do {
LDRH R2,[R0,#2]! //*++a ___ if (a[i] > a[i+1]) {
CMP R3,R2 // \ tmp = a[i+1];
BLE L2 // \ a[i+1] = a[i];
STRH R2,[R0,#2]
STRH R3,[R0] // \ a[i] = tmp;
CMP R0,R12 // \ if (i) i--;
SUBHI R0,R0,#4 // \___ } else i++;
L2: CMP R0,R1 // } while (i < n - 1);
BLT L1 // }
BX LR
I tried to change it to sort uint16_t unsigned halfwords. I'm almost done, but something is missing in the code. The problem is the sort; the architecture is ARM (in ARM mode, not Thumb). Also, I don't know what the ! sign behind the LDRH instruction does, and I think R3 and R2 in the two loads should swap places.
The ! stands for pre-increment, exactly as in the comment *++a.
Thus, the offset #2 is added to the base r0 before the memory access, and the register r0 itself is also updated (write-back).
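In C terms, the pre-indexed write-back form behaves roughly like the sketch below (p standing in for R0):

#include <stdint.h>

/* Rough C equivalent of "LDRH R2,[R0,#2]!" (pre-indexed, write-back):
 * the base pointer is advanced first, then the load uses the updated
 * address. p stands in for R0, the return value for R2. */
static uint16_t ldrh_preindexed(const uint16_t **p)
{
    *p += 1;      /* R0 = R0 + 2 (one halfword) */
    return **p;   /* load from the updated R0   */
}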
L1: LDRH R3,[R0] //*a do {
LDRH R2,[R0,#2]! //*++a ___ if (a[i] > a[i+1]) {
CMP R3,R2 // \ tmp = a[i+1];
BLE L2 // \ a[i+1] = a[i];
STRH R2,[R0,#-2] // *** \
STRH R3,[R0] // *** \ a[i] = tmp;
CMP R0,R12 // \ if (i) i--;
SUBHI R0,R0,#4 // \___ } else i++;
L2: CMP R0,R1 // } while (i < n - 1);
The sections marked with *** have been modified to compensate for the effect of *++a.
Generally there are probably better ways to implement an insertion or bubble sort; this algorithm, for example, re-reads on the next iteration the element that was written on the previous one. It would also probably make more sense to use a conditional swap/move instead of an explicit branch (see the sketch after the next paragraph).
The corresponding 32-bit code writes the swapped values in a clever (obfuscated) way with STMDA R0,{R2,R3}, which relies on R2 and R3 having been read in the order R3, R2. This instruction stands for STore Multiple, Decrement After, and writes R2 to R0 - 4 and R3 to R0. But because that variant is not available for 16-bit halfwords, one needs to write R2 to R0 - 2 and R3 to R0.
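To illustrate the conditional-swap idea mentioned above, a branchless C sketch (just the shape of the idea, not a drop-in replacement for the assembly):

#include <stdint.h>

/* Branchless compare-and-swap of two adjacent halfwords. A compiler
 * targeting ARM can lower the two conditional assignments to
 * conditionally executed moves/stores instead of a branch. */
static int swap_if_greater(uint16_t *lo, uint16_t *hi)
{
    uint16_t a = *lo, b = *hi;
    int swapped = a > b;
    *lo = swapped ? b : a;
    *hi = swapped ? a : b;
    return swapped;   /* the caller still needs this to decide whether to step back */
}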

Understanding response time of memory accesses

I performed, as a part of an academic research, the following experiment:
buff = mmap(NULL, BUFFSIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | HUGEPAGES, -1, 0);
lineAddr = buff;
for (int i = 0; i < BUFFSIZE; i++)
clflush(&(buff[i]));
for (int i = 0; i < LINES; i ++){
srand(rdtscp());
result = memaccesstime(lineAddr);
lineAddr = (void*)((uint64_t)lineAddr + (rand()%20+3)*(8*sizeof(void*)));
resultArr[i] = result;
}
The memaccesstime() function returns the response time in CPU ticks.
static inline uint32_t memaccesstime(void *v) {
uint32_t rv;
asm volatile (
"mfence\n"              /* order all earlier loads and stores            */
"lfence\n"              /* wait for preceding instructions before rdtscp */
"rdtscp\n"              /* start timestamp -> EDX:EAX                    */
"mov %%eax, %%esi\n"    /* save the low 32 bits of the start timestamp   */
"mov (%1), %%eax\n"     /* the timed memory access                       */
"rdtscp\n"              /* end timestamp -> EDX:EAX                      */
"sub %%esi, %%eax\n"    /* elapsed ticks = end - start (low 32 bits)     */
: "=&a" (rv) : "r" (v) : "ecx", "edx", "esi");
return rv;
}
So the steps are:
Allocate a long range of memory (with mmap()).
clflush() every line (with the for loop).
Walk over random lines (with steps between 3 and 23 lines) and measure the response time.
The results:
[results plot: measured access latency (ticks) per sample]
Please help me understand the results better.
Why does the response time plunge after a small number of samples?
Notes:
The MSR register 0x1a4 value is 0xF (hardware prefetchers disabled), but the behavior is the same with 0x0.
I've chosen random steps to avoid the "stride" prefetcher.
Is there any other hardware (or software) prefetcher that could be responsible for those results?
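One way to take every prefetcher out of the picture, whatever prefetchers the core has, is a dependent pointer chase over a randomly permuted chain, so the address of load N+1 is only known once load N completes. A minimal sketch of the idea (hypothetical code, not the experiment above; assumes 64-byte cache lines):

#include <stdint.h>
#include <stdlib.h>

#define LINE 64  /* cache-line size in bytes (assumption) */

/* Build a random cyclic pointer chain with one pointer per cache line,
 * then walk it. Each load's address depends on the previous load's
 * result, so neither the stride prefetcher nor any other prefetcher can
 * run ahead of the chain. `lines` must cover n cache lines. */
static void *chase(void **lines, size_t n, size_t steps)
{
    const size_t stride = LINE / sizeof(void *);
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t k = 0; k < n; k++)                  /* link the cycle */
        lines[order[k] * stride] = &lines[order[(k + 1) % n] * stride];
    free(order);

    void **p = (void **)lines[0];                   /* start anywhere in the cycle */
    for (size_t s = 0; s < steps; s++)              /* walk the chain; time this loop */
        p = (void **)*p;
    return p;                                       /* defeat dead-code elimination */
}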

Popcount of SSE vectors for binary correlation?

I have this simple binary correlation method. It beats table-lookup and HAKMEM bit-twiddling methods by 3-4x and is 25% better than GCC's __builtin_popcount (which I think maps to a popcnt instruction when SSE4 is enabled).
Here is the much simplified code:
int correlation(uint64_t *v1, uint64_t *v2, int size64) {
__m128i* a = reinterpret_cast<__m128i*>(v1);
__m128i* b = reinterpret_cast<__m128i*>(v2);
int count = 0;
for (int j = 0; j < size64 / 2; ++j, ++a, ++b) {
union { __m128i s; uint64_t b[2]; } x;
x.s = _mm_xor_si128(*a, *b);
count += _mm_popcnt_u64(x.b[0]) +_mm_popcnt_u64(x.b[1]);
}
return count;
}
I tried unrolling the loop, but I think GCC already does this automatically, so I ended up with the same performance. Do you think performance could be further improved without making the code too complicated? Assume v1 and v2 are of the same size and the size is even.
I am happy with its current performance but I was just curious to see if it could be further improved.
Thanks.
Edit: I fixed an error in the union, and it turned out this error was what made this version faster than __builtin_popcount. Anyway, I modified the code again; it is again slightly faster than the builtin now (15%), but I don't think it is worth investing more time in this. Thanks for all the comments and suggestions.
for (int j = 0; j < size64 / 4; ++j, a+=2, b+=2) {
__m128i x0 = _mm_xor_si128(_mm_load_si128(a), _mm_load_si128(b));
count += _mm_popcnt_u64(_mm_extract_epi64(x0, 0))
+_mm_popcnt_u64(_mm_extract_epi64(x0, 1));
__m128i x1 = _mm_xor_si128(_mm_load_si128(a + 1), _mm_load_si128(b + 1));
count += _mm_popcnt_u64(_mm_extract_epi64(x1, 0))
+_mm_popcnt_u64(_mm_extract_epi64(x1, 1));
}
Second Edit: it turned out that the builtin is the fastest, sigh, especially with the -funroll-loops and -fprefetch-loop-arrays flags. Something like this:
for (int j = 0; j < size64; ++j) {
count += __builtin_popcountll(a[j] ^ b[j]);
}
Third Edit:
This is an interesting SSSE3 parallel 4-bit lookup algorithm. The idea is from Wojciech Muła; the implementation is from Marat Dukhan's answer. Thanks to @Apriori for reminding me of this algorithm. Below is the heart of the algorithm. It is very clever: it counts bits per byte by using an SSE register as a 16-way lookup table, with the lower nibbles as indices selecting which table cells are used, and then sums the counts.
static inline __m128i hamming128(__m128i a, __m128i b) {
static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
const __m128i x = _mm_xor_si128(a, b);
const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(x, popcount_mask));
const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(x, 4), popcount_mask));
return _mm_add_epi8(pcnt0, pcnt1);
}
In my tests this version is on par with the hardware popcount: slightly faster on smaller inputs, slightly slower on larger ones. I think this should really shine if implemented with AVX, but I don't have time for that; if anyone is up to it, I would love to hear their results.
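One detail the snippet leaves out is the final reduction: hamming128() returns 16 per-byte counts that still have to be summed (and can only be accumulated a few dozen times before the bytes overflow). A common way to do the horizontal sum, shown here as an assumed companion rather than the original answer's code, is _mm_sad_epu8 against zero:

#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_sad_epu8, _mm_setzero_si128, _mm_cvtsi128_si64 */
#include <smmintrin.h>  /* SSE4.1: _mm_extract_epi64 */

/* Horizontal sum of the 16 per-byte counts produced by hamming128().
 * _mm_sad_epu8 against zero adds each group of 8 bytes into a 64-bit
 * lane; the two lanes are then added in scalar code. */
static inline uint64_t sum_bytes(__m128i byte_counts)
{
    const __m128i sums = _mm_sad_epu8(byte_counts, _mm_setzero_si128());
    return (uint64_t)_mm_cvtsi128_si64(sums)
         + (uint64_t)_mm_extract_epi64(sums, 1);
}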
The problem is that popcnt (which is what __builtin_popcount compiles to on Intel CPUs) operates on the integer registers. This causes the compiler to issue instructions to move data between the SSE and integer registers. I'm not surprised that the non-SSE version is faster, since the ability to move data between the vector and integer registers is quite limited/slow.
uint64_t count_set_bits(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum = 0;
for(size_t i = 0; i < count; i++) {
sum += popcnt(a[i] ^ b[i]);
}
return sum;
}
This runs at approx. 2.36 clocks per loop iteration on small data sets (that fit in cache). I think it runs slowly because of the 'long' dependency chain on sum, which restricts the CPU's ability to handle more things out of order. We can improve it by manually pipelining the loop:
uint64_t count_set_bits_2(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum = 0, sum2 = 0;
for(size_t i = 0; i < count; i+=2) {
sum += popcnt(a[i ] ^ b[i ]);
sum2 += popcnt(a[i+1] ^ b[i+1]);
}
return sum + sum2;
}
This runs at 1.75 clocks per item. My CPU is a Sandy Bridge model (i7-2820QM fixed @ 2.4 GHz).
How about four-way pipelining? That's 1.65 clocks per item. What about 8-way? 1.57 clocks per item. We can derive that the runtime per item is (1.5n + 0.5) / n, where n is the number of pipelines (independent accumulators) in our loop. I should note that for some reason 8-way pipelining performs worse than the others when the dataset grows; I have no idea why. The generated code looks okay.
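For concreteness, the four-way version is just the same loop with four independent accumulators; a sketch, assuming count is a multiple of 4 and the same popcnt() wrapper as above:

uint64_t count_set_bits_4(const uint64_t *a, const uint64_t *b, size_t count)
{
    /* Four independent accumulators shorten the dependency chain on any
     * single sum, letting more popcnt/add pairs overlap out of order. */
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < count; i += 4) {
        s0 += popcnt(a[i    ] ^ b[i    ]);
        s1 += popcnt(a[i + 1] ^ b[i + 1]);
        s2 += popcnt(a[i + 2] ^ b[i + 2]);
        s3 += popcnt(a[i + 3] ^ b[i + 3]);
    }
    return (s0 + s1) + (s2 + s3);
}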
Now if you look carefully there is one xor, one add, one popcnt, and one mov instruction per item. There is also one lea instruction per loop (and one branch and decrement, which I'm ignoring because they're pretty much free).
$LL3@count_set_:
; Line 50
mov rcx, QWORD PTR [r10+rax-8]
lea rax, QWORD PTR [rax+32]
xor rcx, QWORD PTR [rax-40]
popcnt rcx, rcx
add r9, rcx
; Line 51
mov rcx, QWORD PTR [r10+rax-32]
xor rcx, QWORD PTR [rax-32]
popcnt rcx, rcx
add r11, rcx
; Line 52
mov rcx, QWORD PTR [r10+rax-24]
xor rcx, QWORD PTR [rax-24]
popcnt rcx, rcx
add rbx, rcx
; Line 53
mov rcx, QWORD PTR [r10+rax-16]
xor rcx, QWORD PTR [rax-16]
popcnt rcx, rcx
add rdi, rcx
dec rdx
jne SHORT $LL3@count_set_
You can check with Agner Fog's optimization manual that an lea is half a clock cycle in throughput and the mov/xor/popcnt/add combo is apparently 1.5 clock cycles, although I don't fully understand why exactly.
Unfortunately, I think we're stuck here. The PEXTRQ instruction is what's usually used to move data from the vector registers to the integer registers, and we can fit this instruction and one popcnt instruction neatly into one clock cycle. Add one integer add instruction and our pipeline is at minimum 1.33 cycles long, and we still need to fit a vector load and xor in there somewhere... If Intel offered instructions to move multiple registers between the vector and integer registers at once, it would be a different story.
I don't have an AVX2 CPU at hand (xor on 256-bit vector registers is an AVX2 feature), but my vectorized-load implementation performs quite poorly with small data sizes and reached a minimum of 1.97 clock cycles per item.
For reference, these are my benchmarks:
"pipe 2", "pipe 4" and "pipe 8" are the 2-, 4- and 8-way pipelined versions of the code shown above. The poor showing of "sse load" appears to be a manifestation of the lzcnt/tzcnt/popcnt false-dependency bug, which gcc avoided by using the same register for input and output. "sse load 2" follows below:
uint64_t count_set_bits_4sse_load(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum1 = 0, sum2 = 0;
for(size_t i = 0; i < count; i+=4) {
__m128i tmp = _mm_xor_si128(
_mm_load_si128(reinterpret_cast<const __m128i*>(a + i)),
_mm_load_si128(reinterpret_cast<const __m128i*>(b + i)));
sum1 += popcnt(_mm_extract_epi64(tmp, 0));
sum2 += popcnt(_mm_extract_epi64(tmp, 1));
tmp = _mm_xor_si128(
_mm_load_si128(reinterpret_cast<const __m128i*>(a + i+2)),
_mm_load_si128(reinterpret_cast<const __m128i*>(b + i+2)));
sum1 += popcnt(_mm_extract_epi64(tmp, 0));
sum2 += popcnt(_mm_extract_epi64(tmp, 1));
}
return sum1 + sum2;
}
Have a look here. There is an SSSE3 version that beats the popcnt instruction by a lot. I'm not sure but you may be able to extend it to AVX as well.

Inline ASM: Use of MMX returns NaN seconds on timer

Problem
I am trying to find out whether MMX or XMM registers are faster for copying the elements of one array to another (I know about memcpy(), but I need this function for a very specific purpose).
My source code is below. The relevant function is copyarray(). I can use either MMX or XMM registers, with movq or movsd respectively, and the result is correct. However, when I use MMX registers, any timer I use (either clock() or QueryPerformanceCounter) to time the operations returns NaN.
Compiled with: gcc -std=c99 -O2 -m32 -msse3 -mincoming-stack-boundary=2 -mfpmath=sse,387 -masm=intel copyasm.c -o copyasm.exe
This is a very strange bug, and I cannot figure out why using MMX registers would cause a timer to return NaN seconds, while using XMM registers in exactly the same code returns a valid time value.
EDIT
Results using xmm registers:
Elapsed time: 0.000000 seconds, Gigabytes copied per second: inf GB
Residual = 0.000000
0.937437 0.330424 0.883267 0.118717 0.962493 0.584826 0.344371 0.423719
0.937437 0.330424 0.883267 0.118717 0.962493 0.584826 0.344371 0.423719
Results using mmx register:
Elapsed time: nan seconds, Gigabytes copied per second: inf GB
Residual = 0.000000
0.000000 0.754173 0.615345 0.634724 0.611286 0.547655 0.729637 0.942381
0.935759 0.754173 0.615345 0.634724 0.611286 0.547655 0.729637 0.942381
Source Code
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <x86intrin.h>
#include <windows.h>
inline double
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) copyarray(
double* restrict dst,
const double* restrict src,
const int n)
{
// int i = n;
// do {
// *dst++ = *src++;
// i--;
// } while(i);
__asm__ __volatile__
(
"mov ecx, %[n] \n\t"
"mov edi, %[dst] \n\t"
"mov esi, %[src] \n\t"
"xor eax, eax \n\t"
"sub ecx,1 \n\t"
"L%=: \n\t"
"movq mm0, QWORD PTR [esi+ecx*8] \n\t"
"movq QWORD PTR [edi+ecx*8], mm0 \n\t"
"sub ecx, 1 \n\t"
"jge L%= \n\t"
: // no outputs
: // inputs
[dst] "m" (dst),
[src] "m" (src),
[n] "g" (n)
: // register clobber
"eax","ecx","edi","esi",
"mm0"
);
}
void printarray(double* restrict a, int n)
{
for(int i = 0; i < n; ++i) {
printf(" %f ", *(a++));
}
printf("\n");
}
double residual(const double* restrict dst,
const double* restrict src,
const int n)
{
double residual = 0.0;
for(int i = 0; i < n; ++i)
residual += *(dst++) - *(src++);
return(residual);
}
int main()
{
double *A = NULL;
double *B = NULL;
int n = 8;
double memops;
double time3;
clock_t time1;
// LARGE_INTEGER frequency, time1, time2;
// QueryPerformanceFrequency(&frequency);
int trials = 1 << 0;
A = _mm_malloc(n*sizeof(*A), 64);
B = _mm_malloc(n*sizeof(*B), 64);
srand(time(NULL));
for(int i = 0; i < n; ++i)
*(A+i) = (double) rand()/RAND_MAX;
// QueryPerformanceCounter(&time1);
time1 = clock();
for(int i = 0; i < trials; ++i)
copyarray(B,A,n);
// QueryPerformanceCounter(&time2);
// time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
time3 = (double) (clock() - time1)/CLOCKS_PER_SEC;
memops = (double) trials*n/time3*sizeof(*A)/1.0e9;
printf("Elapsed time: %f seconds, Gigabytes copied per second: %f GB\n",time3, memops);
printf("Residual = %f\n",residual(B,A,n));
printarray(A,n);
printarray(B,n);
_mm_free(A);
_mm_free(B);
}
You have to be careful when mixing MMX with floating point - use SSE instead if possible. If you must use MMX then read the section titled "MMX - State Management" on this page - note the requirement for the emms instruction after any MMX instructions before you next perform any floating point operations.
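For context on why the timer goes NaN: the MMX registers alias the x87 floating-point register stack, so leaving MMX mode without emms typically makes later x87 operations (such as converting the clock() delta to seconds) produce NaN. A sketch of the asker's inline asm with the missing emms added; only the trailing emms line is new (Intel syntax, matching the -masm=intel flag used above):

__asm__ __volatile__
(
    "mov ecx, %[n]                    \n\t"
    "mov edi, %[dst]                  \n\t"
    "mov esi, %[src]                  \n\t"
    "xor eax, eax                     \n\t"
    "sub ecx, 1                       \n\t"
    "L%=:                             \n\t"
    "movq mm0, QWORD PTR [esi+ecx*8]  \n\t"
    "movq QWORD PTR [edi+ecx*8], mm0  \n\t"
    "sub ecx, 1                       \n\t"
    "jge L%=                          \n\t"
    "emms                             \n\t"  /* leave MMX mode: restore the x87 state for later FP code */
    : // no outputs
    : [dst] "m" (dst), [src] "m" (src), [n] "g" (n)
    : "eax", "ecx", "edi", "esi", "mm0"
);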

Trouble measuring the elapsed time of a CUDA program and CUDA kernels

I currently have three methods of measuring the elapsed time: two using CUDA events and the other recording start and end UNIX time. The ones using CUDA events measure two things: one measures the entire outer loop time, and the other sums all kernel execution times.
Here's the code:
int64 x1, x2;
cudaEvent_t start;
cudaEvent_t end;
cudaEvent_t s1, s2;
float timeValue;
#define timer_s cudaEventRecord(start, 0);
#define timer_e cudaEventRecord(end, 0); cudaEventSynchronize(end); cudaEventElapsedTime( &timeValue, start, end ); printf("time: %f ms \n", timeValue);
cudaEventCreate( &start );
cudaEventCreate( &end );
cudaEventCreate( &s1 );
cudaEventCreate( &s2 );
cudaEventRecord(s1, 0);
x1 = GetTimeMs64();
for(int r = 0 ; r < 2 ; r++)
{
timer_s
kernel1<<<1, x>>>(gl_devdata_ptr);
cudaThreadSynchronize();
timer_e
sum += timeValue;
for(int j = 0 ; j < 5; j++)
{
timer_s
kernel2<<<1,x>>>(gl_devdata_ptr);
cudaThreadSynchronize();
timer_e
sum += timeValue;
timer_s
kernel3<<<1,x>>>(gl_devdata_ptr);
cudaThreadSynchronize();
timer_e
sum += timeValue;
}
timer_s
kernel4<<<y, x>>> (gl_devdata_ptr);
cudaThreadSynchronize();
timer_e
sum += timeValue;
}
x2 = GetTimeMs64();
cudaEventRecord(s2, 0);
cudaEventSynchronize(s2);
cudaEventElapsedTime( &timeValue, s1, s2 );
printf("elapsed cuda : %f ms \n", timeValue);
printf("elapsed sum : %f ms \n", sum);
printf("elapsed win : %d ms \n", x2-x1);
GetTimeMs64 is something I found here on Stack Overflow:
int64 GetTimeMs64()
{
/* Windows */
FILETIME ft;
LARGE_INTEGER li;
uint64 ret;
/* Get the amount of 100 nano seconds intervals elapsed since January 1, 1601 (UTC) and copy it
* to a LARGE_INTEGER structure. */
GetSystemTimeAsFileTime(&ft);
li.LowPart = ft.dwLowDateTime;
li.HighPart = ft.dwHighDateTime;
ret = li.QuadPart;
ret -= 116444736000000000LL; /* Convert from file time to UNIX epoch time. */
ret /= 10000; /* From 100 nano seconds (10^-7) to 1 millisecond (10^-3) intervals */
return ret;
}
Those aren't the real variable names nor the right kernel names; I just removed some to make the code smaller.
So the problem is, every measure gives me a really different total time.
Some examples I just ran:
elapsed cuda : 21.076832
elapsed sum : 4.177984
elapsed win : 27
So why is there such a huge difference? The sum of all kernel calls is around 4 ms, so where are the other 18 ms? CPU time?
cudaThreadSynchronize is a very high-overhead operation, as it has to wait for all work on the GPU to complete.
You should get the correct result if you structure your code as follows:
int64 x1, x2;
cudaEvent_t start;
cudaEvent_t end;
const int k_maxEvents = 5 + (2 * 2) + (2 * 5 * 2);
cudaEvent_t events[k_maxEvents];
int eIdx = 0;
float timeValue;
for (int e = 0; e < k_maxEvents; ++e)
{
cudaEventCreate(&events[e]);
}
x1 = GetTimeMs64();
cudaEventRecord(events[eIdx++], 0);
for(int r = 0 ; r < 2 ; r++)
{
cudaEventRecord(events[eIdx++], 0);
kernel1<<<1, x>>>(gl_devdata_ptr);
for(int j = 0 ; j < 5; j++)
{
cudaEventRecord(events[eIdx++], 0);
kernel2<<<1,x>>>(gl_devdata_ptr);
cudaEventRecord(events[eIdx++], 0);
kernel3<<<1,x>>>(gl_devdata_ptr);
}
cudaEventRecord(events[eIdx++], 0);
kernel4<<<y, x>>> (gl_devdata_ptr);
}
cudaEventRecord(events[eIdx++], 0);
cudaDeviceSynchronize();
x2 = GetTimeMs64();
cudaEventElapsedTime( &timeValue, events[0], events[eIdx - 1] );
printf("elapsed cuda : %f ms \n", timeValue);
// TODO the time between each events is the time to execute each kernel.
// On WDDM a context switch may occur between any of the kernels leading
// to higher than expected results.
// printf("elapsed sum : %f ms \n", sum);
printf("elapsed win : %d ms \n", x2-x1);
On Windows an easier way to measure time is to use QueryPerformanceCounter and QueryPerformanceFrequency.
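A minimal sketch of that approach (Windows-only; qpc_seconds is a made-up helper name):

#include <windows.h>
#include <stdio.h>

/* Returns elapsed seconds between two QueryPerformanceCounter samples. */
static double qpc_seconds(LARGE_INTEGER t0, LARGE_INTEGER t1)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   /* counter ticks per second */
    return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

/* usage:
   LARGE_INTEGER t0, t1;
   QueryPerformanceCounter(&t0);
   ... launch kernels ...; cudaDeviceSynchronize();
   QueryPerformanceCounter(&t1);
   printf("elapsed win : %f ms\n", 1000.0 * qpc_seconds(t0, t1));
*/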
If you write the above example without the events as
#include "NvToolsExt.h"
nvtxRangePushA("CPU Time");
for(int r = 0 ; r < 2 ; r++)
{
kernel1<<<1, x>>>(gl_devdata_ptr);
for(int j = 0 ; j < 5; j++)
{
kernel2<<<1,x>>>(gl_devdata_ptr);
kernel3<<<1,x>>>(gl_devdata_ptr);
}
kernel4<<<y, x>>> (gl_devdata_ptr);
}
cudaDeviceSynchronize();
nvtxRangePop();
and run it under the Nsight Visual Studio Edition 1.5-2.2 CUDA Trace Activity or Visual Profiler 4.0+, all of the times will be available. The GPU times will be more accurate than what you can collect using the cudaEvent API. Using nvtxRangePush to measure the CPU time range is optional. This can also be accomplished by measuring from the first CUDA API call in the example to the end of cudaDeviceSynchronize.
