CUDA Matrix Addition Timings, By Row Vs. By Column - performance

I am currently learning CUDA, and am working through some exercises. One of them is to implement kernels that add matrices in 3 different ways: 1 thread per element, 1 thread per row, and 1 thread per column. The matrices are square, and are implemented as 1D vectors, that I simply index into with
A[N*row + col]
Intuitively, I expected the first option to be the slowest due to thread overhead, the second to be the fastest since a single thread would be working on adjacent data.
On the CPU, with dense matrices of 8000 x 8000 I get:
Adding on CPU - Adding down columns
Compute Time Taken: 2.21e+00 s
Adding on CPU - Adding across rows
Compute Time Taken: 2.52e-01 s
So about an order of magnitude speed up due to many more cache hits. On the GPU with the same matrices I get:
Adding one element per thread
Compute Time Taken: 7.42e-05 s
Adding one row per thread
Compute Time Taken: 2.52e-05 s
Adding one column per thread
Compute Time Taken: 1.57e-05 s
Which in non-intuitive to me. The 30-40% speed up for the last case is consistent above about 1000 x 1000 matrices. Note that these timings are only the kernel execution, and don't include the data transfer between host and device. Below are my two kernels for comparison.
__global__
void matAddKernel2(float* A, float* B, float* C, int N)
{
int row = threadIdx.x + blockDim.x * blockIdx.x;
if (row < N)
{
int j;
for (j = 0; j < N; j++)
{
C[N*row + j] = A[N*row + j] + B[N*row + j];
}
}
}
__global__
void matAddKernel3(float* A, float* B, float* C, int N)
{
int col = threadIdx.x + blockDim.x * blockIdx.x;
int j;
if (col < N)
{
for (j = 0; j < N; j++)
{
C[col + N*j] = A[col + N*j] + B[col + N*j];
}
}
}
My question is, why don't GPU threads seem to benefit from working on adjacent data, which would then help it to get more cache hits?

GPU threads do benefit from working on adjacent data, what you are missing is that GPU threads are not independent threads like CPU thread, they work in a group called warp. A warp groups together 32 threads and works in a similar fashion like a single CPU thread executing SIMD instructions with width 32.
So in reality the code that uses one thread per column is the most effective because adjacent threads inside a warp are accessing adjacent data locations from memory and that is the most effective way to access global memory.
You will find the details in the CUDA documentation.

Related

warp shuffling to reduction of arrays with any length

I am working on a Cuda kernel which performs vector dot product (A x B). I assumed that the length of each vector is multiple of 32 (32,64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A to the corresponding element of B (thread i ==>psum = A[i]xB[i]). After multiplication, I used the following functions which used warp shuffling technique to perform reduction and calculate the sum all multiplications.
__inline__ __device__
float warpReduceSum(float val) {
int warpSize =32;
for (int offset = warpSize/2; offset > 0; offset /= 2)
val += __shfl_down(val, offset);
return val;
}
__inline__ __device__
float blockReduceSum(float val) {
static __shared__ int shared[32]; // Shared mem for 32 partial sums
int lane = threadIdx.x % warpSize;
int wid = threadIdx.x / warpSize;
val = warpReduceSum(val); // Each warp performs partial reduction
if (lane==0)
shared[wid]=val; // Write reduced value to shared memory
__syncthreads(); // Wait for all partial reductions
//read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if (wid==0)
val = warpReduceSum(val); // Final reduce within first warp
return val;
}
I simply call blockReduceSum(psum) which psum is the multiplication of two elements by a thread.
This approach doesn't work when the length of the array is not multiple of 32, so my question is, can we change this code so that it also works for any length? or is it impossible because if the length of the array is not multiple of 32, some warps have elements belonging more than one array?
First of all, depending on the GPU you are using, performing dot product with just 1 block will probably not be very efficient (as long as you are not batching several dot products in 1 kernel, each done by a single block).
To answer your question: you can reuse the code you have written by just calling your kernel with the number of threads being the closest multiple of 32 higher than N (length of the array) and introducing if statement before calling to blockReduceSum that would like this:
__global__ void kernel(float * A, float * B, int N) {
float psum = 0;
if(threadIdx.x < N) //threadIDx.x because your are using single block, you will need to change it to more general id once you move to multiple blocks
psum = A[threadIdx.x] * B[threadIdx.x];
blockReduceSum(psum);
//The rest of computation
}
That way, threads that do not have array element associated with them, but that need to be there due to use of __shfl, will contribute 0 to the sum.

Optimize Cuda Kernel time execution

I'm a learning Cuda student, and I would like to optimize the execution time of my kernel function. As a result, I realized a short program computing the difference between two pictures. So I compared the execution time between a classic CPU execution in C, and a GPU execution in Cuda C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;
switch(computing_type)
{
case GPU:
HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));
float time;
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );
HANDLE_ERROR( cudaEventRecord(start, 0) );
for(int m = 0; m < nb_loops ; m++)
{
diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
}
HANDLE_ERROR( cudaEventRecord(stop, 0) );
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));
printf("Time to generate: %4.4f ms \n", time/nb_loops);
break;
case CPU:
clock_t begin = clock(), diff;
for (int z=0; z<nb_loops; z++)
{
// Apply the difference between 2 images
for (int i = 0; i < height; i++)
{
tmp = i*imgresult_pitch;
for (int j = 0; j < width; j++)
{
imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
}
}
}
diff = clock() - begin;
float msec = diff*1000/CLOCKS_PER_SEC;
msec = msec/nb_loops;
printf("Time taken %4.4f milliseconds", msec);
break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
int row = blockIdx.x;
int col = threadIdx.x;
int v = col + row*blockDim.x;
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
}
I obtained these execution time for each one
CPU: 1,3210ms
GPU: 0,3229ms
I wonder why GPU result is not as lower as it should be. I am a beginner in Cuda so please be comprehensive if there are some classic errors.
EDIT1:
Thank you for your feedback. I tried to delete the 'if' condition from the kernel but it didn't change deeply my program execution time.
However, after having install Cuda profiler, it told me that my threads weren't running concurrently. I don't understand why I have this kind of message, but it seems true because I only have a 5 or 6 times faster application with GPU than with CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently to all the other ones. If you have an idea of what I am doing wrong, it would be hepful...
Flow.
Here are two things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything already has a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time then that; this overhead is negligible" - but you would be ignoring the fact, that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ ​ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
Which subtracts each signed byte value in a from its corresponding one in b. If you can "live" with the result, rather than expecting a larger int variable, that could save you some of work - and go very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
I don't think you are measuring times correctly, memory copy is a time consuming step in GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_H as constants, you may consider doing so using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so block and number of threads per block work better as multiples of 8, but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free your allocated memory in GPU and destroying your time events created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, beside the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() while working with arrays due to padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are considered not optimal:
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have modify your code that calls the kernels.
And on a separate note:
GPU has throughput-oriented architecture, but not latency-oriented as CPU. It means CPU may be faster then CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
CUDA Profiler is a very handy tool that will tell you're not optimal in the code.

Which way to order a shared 2D/3D array for parallel reduction over 1 dimension in CUDA/OpenCL?

Overall goal
I have several reductions to make on a bipartite graph, represented by two dense arrays for vertices and a dense array specifying whether an edge is present b/w the two. Say, two arrays are a0[] and a1[], and all edges go like e[i0][i1] (that is, from elements in a0 to elements in a1).
There are ~100+100 vertices, and ~100*100 edges, so each thread is responsible for one edge.
Task 1 : max reduction
For each vertex in a0 I want to find the maximum of all vertices (in a1) connected to it, and then the same in reverse: having assigned the result to an array b0, for each vertex in a1, I want to find the maximum b0[i0] of the connected vertices.
To do this, I:
1) load into shared memory
#define DC_NUM_FROM_SHARED 16
#define DC_NUM_TO_SHARED 16
__global__ void max_reduce_down(
Value* value1
, Value* max_value_in_connected
, int r0_size, int r1_size
, bool** connected
)
{
int id_from;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
//load into shared memory
__shared__ Value value[DC_NUM_TO_SHARED][DC_NUM_FROM_SHARED]; //FROM is the inner (consecutive) dimension
if(within_bounds)
value[threadIdx.y][threadIdx.x] = connected[id_to][id_from]? value1[id_to] : 0;
else
value[threadIdx.y][threadIdx.x] = 0;
__syncthreads();
if(!within_bounds)
return;
2) reduce
for(int stride = DC_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
value[threadIdx.y][threadIdx.x] = max(value[threadIdx.y][threadIdx.x], dc[threadIdx.y + stride][threadIdx.x]);
__syncthreads();
}
3) write back
max_value_connected[id_from] = value[0][threadIdx.x];
Task 2 : best k
Similar problem, but reduction is only in for vertices in a0, I need to find the k best candidates are chosen from connected in a1 (k is ~5).
1) I initialize the shared array with zero elements except for the 1st place
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
__shared Value* values[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; //champion overlaps
__shared int* champs[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; // overlap champions
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
int i = threadIdx.y * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
if(within_bounds)
{
values[i] = connected[id_to][id_from] * values1[id_to];
champs[i] = connected[id_to][id_from] ? id_to : -1;
}
else
{
values[i] = 0;
champs[i] = -1;
}
for(int place = 1; place < CHAMP_COUNT; place++)
{
i = (place * CHAMPS_NUM_TO_SHARED + threadIdx.y) * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
values[i] = 0;
champs[i] = -1;
}
if(! within_bounds)
return;
__syncthreads();
2) reduce it
for(int stride = CHAMPS_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
merge_2_champs(values, champs, CHAMP_COUNT, id_from, id_to, id_to + stride);
__syncthreads();
}
3) write the results back
for(int place = 0; place < LOCAL_DESIRED_ACTIVITY; place++)
champs0[place][id_from] = champs[place * CHAMPS_NUM_TO_SHARED * CHAMPS_NUM_FROM_SHARED + threadIdx.x];
Issue
How do I order (transpose) the elements in the shared array, so that memory access uses the cache better?
Does it matter at this point, or there is much more I can gain from other optimizations?
Would it be better to transpose the edge matrix if I needed to optimize for Task 2? (as far as I understood, there is a symmetry in Task 1, so it doesn't matter).
P.S.
I have delayed unrolling loops and doing the first reduction iteration while loading, since I thought it is too complicated to do before I have explored simpler ways.
For Task 2, it would be nice to not load zero elements, since the array would never need to grow, and only start shrinking once log k steps have been made. This would make it k times more compact in shared memory! But I dread the resulting index math.
Syntax and Correctness
The unusual types are just typedef'ed ints/chars/etc - AFAIK, in GPUs, it makes sense to compactify those as much as possible. I have not run the code yet, no need to check for indexing errors.
Also, I am using CUDA, but I am interested in an OpenCL perspective as well, since I think the best solution should be the same, and I will be using OpenCL in the future anyway.
OK, I think I figured this out.
The two alternatives that I am considering are to have reductions work on the y dimension, and independent on the x dimension, or vice versa (x dimension being the contiguous one). In any case, the scheduler is able to assemble threads into warps along the x dimension, so some coherence is guaranteed. However, having coherence extend beyond a warp would be great. Also, due to the 2D/3D nature of the shared arrays, one would have to limit the dimensions to 16 or even 8.
To ensure coalescence within a warp, the scheduler has to assemble warps along the x dimension.
If reducing over x dimension, after each iteration, the number of active threads in a warp will halve. However, if reducing over y dimension, it is the number of active warps that will halve.
So, I need to reduce over y.
Unless the transpose (load) is the slowest, which is an abnormal case.
Coalesced buffer reads really matter; kernels can be 32x slower if you don't do them. It can be worth doing a re-arrangement pass if it means being able to do them (of course, the re-arrangement pass needs to be coalesced as well, but you can often leverage shared local memory to do this).

Need help explaining some CUDA performance results

We have been experimenting with different histogramming algorithms on a CUDA GPU. Most of the results I can explain, but we noticed some really weird features of which I have no clue what is causing them.
Kernels
The weird stuff happens in a data-parallel implementation. This means that the data is distributed over the threads. Each thread looks at a subset (ideally just 1) of the data, and adds its contribution to a histogram in global memory, which requires atomic operations.
__global__ void histogram1(float *data, uint *hist, uint n, float xMin, float binWidth, uin\
t nBins)
{
uint const nThreads = blockDim.x * gridDim.x;
uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
uint idx = tid;
while (idx < n)
{
float x = data[idx];
uint bin = (x - xMin) / binWidth;
atomicAdd(hist + bin, 1);
idx += nThreads;
}
}
As a first optimization, each block first constructs a partial histogram in shared memory before doing a reduction of partial histograms to obtain the final result in global memory. The code is pretty straightforward, and I believe that it's very similar to that used in Cuda By Example.
__global__ void histogram2(float *data, uint *hist, uint n,
float xMin, float binWidth, uint nBins)
{
extern __shared__ uint partialHist[]; // size = nBins * sizeof(uint)
uint const nThreads = blockDim.x * gridDim.x;
uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
// initialize shared memory to 0
uint idx = threadIdx.x;
while (idx < nBins)
{
partialHist[idx] = 0;
idx += blockDim.x;
}
__syncthreads();
// Calculate partial histogram (in shared mem)
idx = tid;
while (idx < n)
{
float x = data[idx];
uint bin = (x - xMin) / binWidth;
atomicAdd(partialHist + bin, 1);
idx += nThreads;
}
__syncthreads();
// Compute resulting total (global) histogram
idx = threadIdx.x;
while (idx < nBins)
{
atomicAdd(hist + idx, partialHist[idx]);
idx += blockDim.x;
}
}
Results
Speedup vs n
We benchmarked these two kernels to see how they behave as a function of n, which is the number of datapoints. The data was uniform randomly distributed. In the figure below, HIST_DP_1 is the unoptimized trivial version, whereas HIST_DP_2 is the one using shared memory to speed things up:
The timings have been taken relative to the CPU performance, and the weird stuff happens for very large datasets. The optimizing function, instead of flattening out like the unoptimized version, starts to improve again (relatively). We'd expect that for large datasets, the occupancy of our card will be near 100%, which would mean that from that point on the performance would scale linearly, like the CPU (and indeed the unoptimized blue curve).
The behavior could be due to the fact that the chance of having two threads performing an atomic operation on the same bin in shared/global memory going to zero for large data-sets, but in that case we would expect the drop to be in different places for different nBins. This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
Speedup vs nBins
To have a closer look at the behavior as a function of the number of bins, we fixed our dataset at 10^4 (10^5 in one case), and ran the algorithms for many different bin-numbers.
As a reference we also generated some non-random data. The red graph shows the results for perfectly sorted data, whereas the light-blue line corresponds to a dataset in which every value was identical (maximal congestion in the atomic operations). The question is obvious: what is the discontinuity doing there?
System Setup
NVidia Tesla M2075, driver 319.37
Cuda 5.5
Intel(R) Xeon(R) CPU E5-2603 0 # 1.80GHz
Thanks for your help!
EDIT: Reproduction Case
As requested: a compiling, runnable reproduction case. The code is quite long, which is why I didn't include it in the first place. The snippet is available on snipplr. To make your life even more easy, I'll include a little shell-script to run it for the same settings I used, and an Octave script to produce the plots.
Shell script
#!/bin/bash
runs=100
# format: [n] [nBins] [t_cpu] [t_gpu1] [t_gpu2]
for nBins in 100 1000 10000
do
for n in 10 50 100 200 500 1000 2000 5000 10000 50000 100000 500000 1000000 10000000 100000000
do
echo -n "$n $nBins "
./repro $n $nBins $runs
done
done
Octave script
T = load('repro.txt');
bins = unique(T(:,2));
t = cell(1, numel(bins));
for i = 1:numel(bins)
t{i} = T(T(:,2) == bins(i), :);
subplot(2, numel(bins), i);
loglog(t{i}(:,1), t{i}(:,3:5))
title(sprintf("nBins = %d", bins(i)));
legend("cpu", "gpu1", "gpu2");
subplot(2, numel(bins), i + numel(bins));
loglog(t{i}(:,1), t{i}(:,4)./t{i}(:,3), ...
t{i}(:,1), t{i}(:,5)./t{i}(:,3));
title("relative");
legend("gpu1/cpu", "gpu2/cpu");
end
Absolute Timings
Absolute timings show that it's not the CPU slowing down. Instead, the GPU is speeding up relatively:
Regarding question 1:
This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
This drop is due to the limit you've set on the maximum number of blocks (1<<14 == 16384). At n=10^7 gpuBench2 the limit has kicked in, and each thread starts processing multiple elements. At n=10^8 each thread works on 12 (sometimes 11) elements. If you remove this cap you can see that your performance continues to flatline.
Why is this faster? Multiple elements per thread allows for latency of the load from data to be hidden much better, especially in the case with 10000 bins where you are only able to fit one block on to each SM due to the high shared memory usage. In this case, every element in the block will reach the global load at around the same time, and none will be able to continue until it has completed its load. By having multiple elements we can pipeline these loads, getting many elements per thread for the latency of one.
(You don't see this in gupBench1 as it is not latency bound, but bandwidth bound to L2. You can see this very quickly if you import the output of nvprof into the visual profiler)
Regarding question 2:
The question is obvious: what is the discontinuity doing there?
I don't have a Fermi to hand, and I can't reproduce this on my Kepler, so I'd assume it's something that is Fermi specific. That's the danger of answering questions with two parts, I suppose!

Performance hit in CUDA program that calls kernel repeatedly within a for loop

I have a CUDA program that calls the kernel repeatedly within a for loop. The
code computes all rows of a matrix by using the values computed in the previous one
until the entire matrix is done. This is basically a dynamic programming algorithm.
The code below fills the (i,j) entry of many separate matrices in parallel with
the kernel.
for(i = 1; i <=xdim; i++){
for(j = 1; j <= ydim; j++){
start3time = clock();
assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z)
end3time = clock();
diff = static_cast<double>(end3time-start3time)/(CLOCKS_PER_SEC / 1000);
printf("Time for i=%d j=%d is %f\n", i, j, diff);
}
}
The kernel assign5 is straightforward
__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {
int id = threadIdx.x + blockIdx.x * blockDim.x;
char ch = database[j + id];
Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id])*dev_matrix[i][index[ch - 'A']];
}
}
My problem is that when I run this program the time for each i and j is 0 most of the
time but sometimes it is 10 milliseconds. So the output looks like
Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.
I don't understand why this is happening. I don't see a thread race condition. If I add
if(i % 20 == 0) cudaThreadSynchronize();
right after the first loop then the Time for i and j is mostly 0. But then the time
for sync is sometimes 10 or even 20. It seems like CUDA is performing many operations
at low cost and then charges a lot for later ones. Any help would be appreciated.
I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel call is non-blocking and is only added to the device's queue. If you're measuring time before and after your kernel call, then the difference has nothing to do with how long your kernel call takes (it would measure the time it takes to add the kernel call to the queue).
You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks and returns if all kernels in the queue have finished their work.
This is why
if(i % 20 == 0) cudaThreadSynchronize();
made spikes in your measurments.

Resources