I wrote a small program that extracts the edges of a digital image using the well-known Canny detector. I need to measure the exact execution time (in milliseconds) of the algorithm on the device (GPU), including the data-transfer stages. Here is the working program code in C++:
#include <iostream>
#include <sys/time.h>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>
using namespace cv;
using namespace std;
// Warm-up: the first CUDA runtime call initializes the context, so this
// keeps that start-up cost out of the timed section (host-only code).
void FirstRun(void)
{
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
}
int main(int argc, char** argv)
{
    clock_t time;
    if (argc != 2)
    {
        cout << "Wrong number of arguments!" << endl;
        return -1;
    }
    const char* filename = argv[1];
    Mat img = imread(filename, IMREAD_GRAYSCALE);
    if (!img.data)
    {
        cout << " --(!) Error reading images \n" << endl;
        return -2;
    }
    double low_tresh = 100.0;
    double high_tresh = 150.0;
    int apperture_size = 3;
    bool useL2gradient = false;
    int imageWidth = img.cols;
    int imageHeight = img.rows;
    cout << "Width of image: " << imageWidth << endl;
    cout << "Height of image: " << imageHeight << endl;
    cout << endl;
    FirstRun();
    // Canny algorithm
    cuda::GpuMat d_img(img);
    cuda::GpuMat d_edges;
    time = clock();
    Ptr<cuda::CannyEdgeDetector> canny = cuda::createCannyEdgeDetector(low_tresh, high_tresh, apperture_size, useL2gradient);
    canny->detect(d_img, d_edges);
    time = clock() - time;
    cout << "CannyCUDA time (ms): " << (float)time / CLOCKS_PER_SEC * 1000 << endl;
    return 0;
}
I get two different execution times (image 7741 x 8862).
System configuration:
1) CPU: Intel Core i7 9600K (3.6 GHz), 32 GB RAM;
2) GPU: Nvidia Geforce RTX 2080 Ti;
3) OpenCV ver. 4.0
Which time is correct, and am I measuring it properly? Thank you!
There are different times you can measure when dealing with CUDA.
Here are some approaches you might want to try:
Measure the total time used by CUDA: use time() to get an absolute time value before calling any CUDA functions and time() again after you have the result. The difference is the real (wall-clock) time that passed.
Measure only the time of the computation: CUDA has some start-up overhead, but if you are not interested in that, because you will be running your code many times without leaving the CUDA environment, you can measure it separately. Please read the CUDA C Programming Guide; it explains how to use events for timing (see the sketch below).
Use the profiler to get detailed information on which part of the program takes which part of the time: the kernel times are especially interesting, as they tell you how long your computations take. Be careful when looking at the API times. In your example, a lot of time is spent in cudaEventCreate(), since it is the first CUDA function in your program and therefore includes the start-up overhead. Also, cuda[...]Synchronize() doesn't actually take that long to call, but it includes the time spent waiting for synchronization.
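For illustration, here is a minimal sketch (my own, not from the original post) of event-based timing applied to the Canny example above; it assumes the same OpenCV CUDA build and times the upload, detection, and download together:

#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>

// Sketch: time the GPU work (upload + Canny + download) with CUDA events.
// Assumes `img` is a grayscale cv::Mat and the CUDA context is already warmed up.
float timeCannyGpu(const cv::Mat& img)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cv::Ptr<cv::cuda::CannyEdgeDetector> canny =
        cv::cuda::createCannyEdgeDetector(100.0, 150.0, 3, false);

    cv::Mat edges;
    cudaEventRecord(start, 0);          // events are recorded on the default stream
    cv::cuda::GpuMat d_img(img);        // host -> device transfer
    cv::cuda::GpuMat d_edges;
    canny->detect(d_img, d_edges);      // GPU Canny
    d_edges.download(edges);            // device -> host transfer
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until all recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

Because the question asks to include the data-transfer stages, the upload and download sit inside the timed region; host-side clock(), by contrast, measures CPU time and can be misleading for asynchronous GPU work.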
I have created two programs that find the determinants of two matrices, one using threads and the other without, and then recorded the time taken to complete the calculation. The threaded script appears to be slower than the one without threads, yet I cannot see anything that might create overhead issues. Any help is appreciated, thanks.
Threaded script:
#include <iostream>
#include <ctime>
#include <thread>
void determinant(int matrix[3][3]) {
    int a = matrix[0][0]*((matrix[1][1]*matrix[2][2])-(matrix[1][2]*matrix[2][1]));
    int b = matrix[0][1]*((matrix[1][0]*matrix[2][2])-(matrix[1][2]*matrix[2][0]));
    int c = matrix[0][2]*((matrix[1][0]*matrix[2][1])-(matrix[1][1]*matrix[2][0]));
    int determinant = a-b+c;
}
int main() {
    int matrix[3][3] = {
        {11453, 14515, 1399954},
        {13152, 11254, 11523},
        {11539994, 51821, 19515}
    };
    int matrix2[3][3] = {
        {16392, 16999942, 18682},
        {5669, 466999832, 1429},
        {96989, 10962, 63413}
    };
    const clock_t c_start = clock();
    std::thread mat_thread1(determinant, matrix);
    std::thread mat_thread2(determinant, matrix2);
    mat_thread1.join();
    mat_thread2.join();
    const clock_t c_end = clock();
    std::cout << "\nOperation takes: " << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << "ms of CPU time";
}
Script with no other thread than the main one:
#include <iostream>
#include <ctime>
#include <thread>
void determinant(int matrix[3][3]) {
    int a = matrix[0][0]*((matrix[1][1]*matrix[2][2])-(matrix[1][2]*matrix[2][1]));
    int b = matrix[0][1]*((matrix[1][0]*matrix[2][2])-(matrix[1][2]*matrix[2][0]));
    int c = matrix[0][2]*((matrix[1][0]*matrix[2][1])-(matrix[1][1]*matrix[2][0]));
    int determinant = a-b+c;
}
int main() {
    int matrix[3][3] = {
        {11453, 14515, 1399954},
        {13152, 11254, 11523},
        {11539994, 51821, 19515}
    };
    int matrix2[3][3] = {
        {16392, 16999942, 18682},
        {5669, 466999832, 1429},
        {96989, 10962, 63413}
    };
    const clock_t c_start = clock();
    determinant(matrix);
    determinant(matrix2);
    const clock_t c_end = clock();
    std::cout << "\nOperation takes: " << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << "ms of CPU time";
}
PS - the 1st script took 0.293ms on the last run and the second script took 0.002ms
Thanks again,
wndlbh
The difference seems to be the creation of two threads and the joins. I expect that the time to do this (create and join) is way more than the time to do 9 multiplications and 5 additions.
The start-up (and tear down) cost of a new thread is enormous, and in this case drowns the real work.
I seem to remember times between 1 ms and 1 s depending on your setup. Adding threads only starts to help once the time saved on the work is higher than the cost of creating the threads. In this case you would need thousands of calculations per thread to save that much.
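As an illustration (my own sketch, not part of the original answer), measuring wall-clock time with std::chrono and giving each thread a much larger batch of work makes the trade-off visible; note also that clock() on POSIX systems reports CPU time summed over all threads, which additionally penalizes the threaded version:

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical heavier workload: repeat the 3x3 determinant many times so the
// per-thread work outweighs the cost of creating and joining a thread.
void determinant_many(int (*m)[3], long reps, long long* sink) {
    long long acc = 0;
    for (long r = 0; r < reps; ++r) {
        long long a = m[0][0] * (1LL * m[1][1] * m[2][2] - 1LL * m[1][2] * m[2][1]);
        long long b = m[0][1] * (1LL * m[1][0] * m[2][2] - 1LL * m[1][2] * m[2][0]);
        long long c = m[0][2] * (1LL * m[1][0] * m[2][1] - 1LL * m[1][1] * m[2][0]);
        acc += a - b + c;
    }
    *sink = acc;  // keep the result so the compiler cannot drop the loop
}

int main() {
    int m1[3][3] = {{11453, 14515, 1399954}, {13152, 11254, 11523}, {11539994, 51821, 19515}};
    int m2[3][3] = {{16392, 16999942, 18682}, {5669, 466999832, 1429}, {96989, 10962, 63413}};
    const long reps = 10000000;
    long long s1 = 0, s2 = 0;

    auto t0 = std::chrono::steady_clock::now();   // wall clock, unlike clock()
    std::thread th1(determinant_many, m1, reps, &s1);
    std::thread th2(determinant_many, m2, reps, &s2);
    th1.join();
    th2.join();
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "threaded: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms (results " << s1 << ", " << s2 << ")\n";
}

With the original tiny workload the same measurement typically still shows the threaded version losing, which matches the explanation above.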
I'm new to CUDA. So please bear with questions with trivial solutions, if any.
I am trying to find the sum of 100M float elements of an array. In the following code you can see that I've used a reduction kernel and thrust. I assume the kernel stores the sum in g_odata[0]. Since all the elements of g_idata are the same, the result should be n*g_idata[1]. But as you can clearly see, the results are incorrect for both of them.
What am I getting wrong? How could I achieve my target?
Every reduction kernel I found is for an integer datatype, e.g. the highly recommended Optimizing Parallel Reduction in CUDA. Is there any specific reason for that?
Here is my code:
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
using namespace std;
__global__ void reduce(float *g_idata, float *g_odata) {
    __shared__ float sdata[256];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = g_idata[i];
    __syncthreads();
    for (int s=1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(g_odata, sdata[0]);
}
int main(void){
    unsigned int n = pow(10,8);
    float *g_idata, *g_odata;
    cudaMallocManaged(&g_idata, n*sizeof(float));
    cudaMallocManaged(&g_odata, n*sizeof(float));
    int blockSize = 32;
    int numBlocks = (n + blockSize - 1) / blockSize;
    for(int i=0;i<n;i++){g_idata[i]=6.1;g_odata[i]=0;}
    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata);
    cudaDeviceSynchronize();
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t" << (float)n*g_idata[1]-g_odata[0] << endl;
    g_odata[0] = thrust::reduce(thrust::device, g_idata, g_idata+n);
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t" << (float)n*g_idata[1]-g_odata[0] << endl;
    cudaFree(g_idata);
    cudaFree(g_odata);
}
Result:
6.0129e+08 6.1e+08 8.7097e+06
6.09986e+08 6.1e+08 13824
I am using CUDA 10. nvcc --version :
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
Details of my GPU DeviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 750"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 1999 MBytes (2096168960 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1110 MHz (1.11 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
Thanks in advance.
I think the reason you are confused about the results here is a lack of understanding of floating point arithmetic. This whitepaper covers the topic pretty well. As a simple concept to grasp, if I have numbers represented as float quantities, and I attempt to do this:
100000000 + 1
the result will be: 100000000 (write some code and try it yourself)
This isn't unique to GPUs, CPU code will behave the same way (try it).
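A minimal host-side check (my own example, plain C++):

#include <iostream>

int main() {
    float big = 100000000.0f;   // adjacent floats near 1e8 are 8 apart
    float sum = big + 1.0f;     // the +1 is below float's resolution here and is lost to rounding
    std::cout << (sum == big) << "\n";   // prints 1: the addition had no effect
}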
So for very large reductions, we get to the point (often) where we are adding very large numbers to much much smaller numbers, and the results aren't accurate from a "pure math" point of view.
That is fundamentally the problem here. In your CPU code, when you decide that the correct result should be 6.1*n, that kind of multiplication problem is not subject to the limits of adding large numbers to small ones that I just described, so you get an "accurate" result from that.
One of the ways to prove this or work around it, is to use double representation instead of float. This doesn't really completely eliminate the problem, but it pushes the resolution to the point where it can do a much better job of representing the range of numbers here.
The following code primarily has that change. You can change the typedef to compare the behavior between float and double.
There are a few other changes in the code. None of them are the cause of the discrepancy you witnessed.
$ cat t18.cu
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#define BLOCK_SIZE 32
typedef double ft;
using namespace std;
__device__ double my_atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
    return __longlong_as_double(old);
}
__device__ float my_atomicAdd(float* addr, float val){
    return atomicAdd(addr, val);
}
__global__ void reduce(ft *g_idata, ft *g_odata, int n) {
    __shared__ ft sdata[BLOCK_SIZE];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    for (int s=1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * threadIdx.x;
        if ((index + s) < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        my_atomicAdd(g_odata, sdata[0]);
}
int main(void){
    unsigned int n = pow(10,8);
    ft *g_idata, *g_odata;
    cudaMallocManaged(&g_idata, n*sizeof(ft));
    cudaMallocManaged(&g_odata, sizeof(ft));
    cout << "n = " << n << endl;
    int blockSize = BLOCK_SIZE;
    int numBlocks = (n + blockSize - 1) / blockSize;
    g_odata[0] = 0;
    for(int i=0;i<n;i++){g_idata[i]=6.1;}
    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata, n);
    cudaDeviceSynchronize();
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t" << (float)n*g_idata[1]-g_odata[0] << endl;
    g_odata[0] = thrust::reduce(thrust::device, g_idata, g_idata+n);
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t" << (float)n*g_idata[1]-g_odata[0] << endl;
    cudaFree(g_idata);
    cudaFree(g_odata);
}
$ nvcc -o t18 t18.cu
$ cuda-memcheck ./t18
========= CUDA-MEMCHECK
n = 100000000
6.1e+08 6.1e+08 0.00527966
6.1e+08 6.1e+08 5.13792e-05
========= ERROR SUMMARY: 0 errors
$
I am experimenting with the random library that is new in C++11. I wrote the following minimal program:
#include <iostream>
#include <random>
using namespace std;
int main() {
    default_random_engine eng;
    uniform_real_distribution<double> urd(0, 1);
    cout << "Uniform [0, 1): " << urd(eng);
}
When I run this repeatedly it gives the same output each time:
>a
Uniform [0, 1): 0.131538
>a
Uniform [0, 1): 0.131538
>a
Uniform [0, 1): 0.131538
I would like to have the program set the seed differently each time it is called, so that a different random number is generated each time. I am aware that random provides a facility called seed_seq, but I find the explanation of it (at cplusplus.com) totally obscure:
http://www.cplusplus.com/reference/random/seed_seq/
I'd appreciate advice on how to have the program generate a new seed each time it is called; the simpler the better.
My platform(s):
Windows 7: TDM-GCC compiler
The point of having a seed_seq is to increase the entropy of the generated sequence. If you have a random_device on your system, initializing with multiple numbers from that random device may arguably do that. On a system where random_device is itself backed by a pseudo-random number generator, I don't think there is any increase in randomness, i.e. in the entropy of the generated sequence.
Building on your approach:
If your system does provide a random device then you can use it like this:
std::random_device r;
// std::seed_seq ssq{r()};
// and then passing it to the engine does the same
default_random_engine eng{r()};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
If your system does not have a random device then you can use time(0) as a seed to the random_engine
default_random_engine eng{static_cast<long unsigned int>(time(0))};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
If you have multiple sources of randomness you can actually combine them (e.g. with two sources):
std::seed_seq seed{ r1(), r2() };
default_random_engine eng{seed};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
where r1, r2 are different random devices, e.g. a thermal-noise or quantum source.
Of course you can mix and match:
std::seed_seq seed{ r1(), static_cast<long unsigned int>(time(0)) };
default_random_engine eng{seed};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
Finally, I like to initialize with a one-liner:
auto rand = std::bind(std::uniform_real_distribution<double>{0,1},
std::default_random_engine{std::random_device()()});
std::cout << "Uniform [0,1): " << rand();
If you are worried about time(0) having only second precision, you can overcome this with the high_resolution_clock, either by requesting the time since the epoch (as suggested by bames23 below):
static_cast<long unsigned int>(std::chrono::high_resolution_clock::now().time_since_epoch().count())
or maybe by just playing with CPU timing randomness:
long unsigned int getseed(int const K)
{
    typedef std::chrono::high_resolution_clock hiclock;
    auto gett = [](std::chrono::time_point<hiclock> t0)
    {
        auto tn = hiclock::now();
        return static_cast<long unsigned int>(std::chrono::duration_cast<std::chrono::microseconds>(tn-t0).count());
    };
    long unsigned int diffs[10];
    diffs[0] = gett(hiclock::now());
    for(int i=1; i!=10; i++)
    {
        auto last = hiclock::now();
        for(int k=K; k!=0; k--)
        {
            diffs[i] = gett(last);
        }
    }
    return *std::max_element(&diffs[1], &diffs[9]);
}
#include <iostream>
#include <random>
using namespace std;
int main() {
    std::random_device r;                                        // 1
    std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()};  // 2
    std::mt19937 eng(seed);                                      // 3
    uniform_real_distribution<double> urd(0, 1);
    cout << "Uniform [0, 1): " << urd(eng);
}
In order to get unpredictable results from a pseudo-random number generator
we need a source of unpredictable seed data. On 1 we create a
std::random_device for this purpose. On
2 we use a std::seed_seq to combine
several values produced by random_device into a form suitable for seeding a
pseudo-random number generator. The more unpredictable data that is fed into
the seed_seq, the less predictable the results of the seeded engine will
be. On 3 we create a random number engine using the seed_seq to seed the
engine's initial state.
A seed_seq can be used to initialize multiple random number engines;
seed_seq will produce the same seed data each time it is used.
Note: Not all implementations provide a source of non-deterministic data.
Check your implementation's documentation for std::random_device.
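As a rough sketch (my own, and keep in mind that std::random_device::entropy() is only an estimate and some implementations report 0 even for a usable device), you could check at runtime and fall back to a clock-based seed:

#include <chrono>
#include <iostream>
#include <random>

int main() {
    std::random_device rd;
    std::mt19937 eng;
    if (rd.entropy() > 0.0) {
        // The implementation claims a non-deterministic source.
        std::seed_seq seq{rd(), rd(), rd(), rd()};
        eng.seed(seq);
    } else {
        // Fallback: clock-based seed (weaker, but better than the fixed default)
        eng.seed(static_cast<std::mt19937::result_type>(
            std::chrono::high_resolution_clock::now().time_since_epoch().count()));
    }
    std::cout << std::uniform_real_distribution<double>(0, 1)(eng) << "\n";
}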
If your platform does not provide a non-deterministic random_device then some other sources can be used for seeding. The article Simple Portable C++ Seed Entropy suggests a number of alternative sources:
A high resolution clock such as std::chrono::high_resolution_clock (time() typically has a resolution of one second, which is generally too low)
Memory configuration which on modern OSs varies due to address space layout randomization (ASLR)
CPU counters or random number generators. C++ does not provide standardized access to these so I won't use them.
thread id
A simple counter (which only matters if you seed more than once)
For example:
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <thread>
#include <utility>
using namespace std;
// we only use the address of this function
static void seed_function() {}
int main() {
    // Variables used in seeding
    static long long seed_counter = 0;
    int var;
    void *x = std::malloc(sizeof(int));
    free(x);
    std::seed_seq seed{
        // Time
        static_cast<long long>(std::chrono::high_resolution_clock::now()
                                   .time_since_epoch()
                                   .count()),
        // ASLR
        static_cast<long long>(reinterpret_cast<intptr_t>(&seed_counter)),
        static_cast<long long>(reinterpret_cast<intptr_t>(&var)),
        static_cast<long long>(reinterpret_cast<intptr_t>(x)),
        static_cast<long long>(reinterpret_cast<intptr_t>(&seed_function)),
        static_cast<long long>(reinterpret_cast<intptr_t>(&_Exit)),
        // Thread id
        static_cast<long long>(
            std::hash<std::thread::id>()(std::this_thread::get_id())),
        // counter
        ++seed_counter};
    std::mt19937 eng(seed);
    uniform_real_distribution<double> urd(0, 1);
    cout << "Uniform [0, 1): " << urd(eng);
}
What libraries or functions should be used for an objective comparison of CPU and GPU performance? What caveats should one be warned about for the sake of an accurate evaluation?
I am using an Ubuntu platform with a device of compute capability 2.1 and the CUDA 5 toolkit.
I'm using the following:
CPU - returns microseconds between tic and toc with a resolution of 2 microseconds:
#include <sys/time.h>
#include <time.h>
struct timespec init;
struct timespec after;
void tic() { clock_gettime(CLOCK_MONOTONIC,&init); }
double toc() {
    clock_gettime(CLOCK_MONOTONIC, &after);
    double us = (after.tv_sec - init.tv_sec) * 1000000.0f;
    return us + (after.tv_nsec - init.tv_nsec) / 1000.0f;
}
GPU
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// Instructions
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cout << setprecision (10) << "GPU Time [ms] " << time << endl;
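Putting the two together, here is a hedged sketch (my own, with a trivial placeholder kernel) that times the same GPU work both ways; the main caveat is that kernel launches are asynchronous, so the CPU timer is only meaningful if you synchronize before stopping it, while cudaEventElapsedTime measures the GPU timeline directly:

#include <cuda_runtime.h>
#include <iostream>
#include <time.h>

// Trivial placeholder kernel (for illustration only).
__global__ void scaleKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

static struct timespec init_ts, after_ts;
static void tic() { clock_gettime(CLOCK_MONOTONIC, &init_ts); }
static double toc() {
    clock_gettime(CLOCK_MONOTONIC, &after_ts);
    return (after_ts.tv_sec - init_ts.tv_sec) * 1e6 +
           (after_ts.tv_nsec - init_ts.tv_nsec) / 1e3;   // microseconds
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    tic();                                              // CPU wall-clock timer
    cudaEventRecord(start, 0);
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // asynchronous launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                         // CPU waits for the GPU here
    double cpu_us = toc();                              // valid only because we synchronized

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    std::cout << "GPU event time [ms]: " << gpu_ms
              << "  CPU-measured [us]: " << cpu_us << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}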
EDIT
For a more complete answer, please see Timing CUDA operations.
I can't measure the CPU processing time in Cygwin. Why is that?
Do I need a special command? Counting CPU clock ticks is done with the clock()
function after including time.h.
It works when I build it in Visual Studio, but I just can't get the timing to work under Cygwin.
Why is that?
Here is the code:
#include <iostream>
#include <cstdlib>
#include <time.h>
using namespace std;
int main()
{
    clock_t t1, t2;
    int x = 0;
    int num;
    cout << "0 to get out of program, else, number of iterations" << endl;
    cin >> num;
    if (num == 0)
        return 0;   // exit the program
    t1 = clock();
    while (x != num)
    {
        cout << "Number " << x << " is" << endl;
        if (x % 2 == 0)
            cout << "Even" << endl;
        else
            cout << "Odd" << endl;
        x = x + 1;
    }
    t2 = clock();
    float diff((float)t2 - (float)t1);
    cout << diff << endl;
    float seconds = diff / CLOCKS_PER_SEC;
    cout << seconds << endl;
    system("pause");
    return 0;
}
Sorry for the bad english.
Looks like the clock() function is defined differently for Windows and POSIX (and hence Cygwin). MSDN says that the Windows clock() returns "the elapsed wall-clock time since the start of the process", whereas the POSIX version returns "the implementation's best approximation to the processor time used by the process". In your example, the process will be spending almost its entire time waiting for output to the terminal to complete, which doesn't count towards the processing time.
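If the goal is elapsed wall-clock time that behaves the same on Windows and Cygwin, a sketch using std::chrono::steady_clock (my own suggestion, not from the original answer) sidesteps the clock() discrepancy:

#include <chrono>
#include <iostream>

int main() {
    auto t1 = std::chrono::steady_clock::now();

    // ... the loop printing Even/Odd would go here ...
    for (int x = 0; x < 100000; ++x) {
        std::cout << "Number " << x << (x % 2 == 0 ? " is Even" : " is Odd") << '\n';
    }

    auto t2 = std::chrono::steady_clock::now();
    std::chrono::duration<double> seconds = t2 - t1;   // elapsed wall-clock time
    std::cout << seconds.count() << " s elapsed\n";
}

steady_clock measures elapsed real time on both platforms, so the time spent waiting on terminal output is included consistently, unlike the CPU-time behaviour of POSIX clock().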