Does anybody know why allocating a vector on the device takes so long on the first run when the program is compiled in Debug mode? In my particular case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC 2010), the first run of the Debug build takes over 40 seconds; subsequent runs (with no recompilation) take about 10 times less (device vector allocation in the Release build takes just over 1 second).
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdio>    // printf
#include <cstdlib>   // rand
#include <ctime>     // clock
#include <iostream>  // std::cin

int main(void) {
    clock_t t;

    t = clock();
    thrust::host_vector<int> h_vec(100);
    clock_t dt = clock() - t;
    printf("allocation on host - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    dt = clock() - t;
    printf("initialization on host - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    thrust::device_vector<int> d_vec(100); // First run of the Debug build takes over 40 seconds...
    dt = clock() - t;
    printf("allocation on device - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    d_vec[0] = h_vec[0];
    dt = clock() - t;
    printf("copy one to device - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    d_vec = h_vec;
    dt = clock() - t;
    printf("copy all to device - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    thrust::sort(d_vec.begin(), d_vec.end());
    dt = clock() - t;
    printf("sort on device - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    dt = clock() - t;
    printf("copy to host - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    t = clock();
    for (int i = 0; i < 10; i++)
        printf("%d\n", h_vec[i]);
    dt = clock() - t;
    printf("output - %f sec.\n", (float)dt / CLOCKS_PER_SEC);

    std::cin.ignore();
    return 0;
}
Most of the time you are measuring for the first vector instantiation isn't the cost of the vector allocation and initialisation; it is the overhead associated with the CUDA runtime and driver. I would guess that if you changed your code to something like this:
int main(void) {
    clock_t t;
    ....
    cudaFree(0); // This forces context establishment and lazy runtime overheads

    t = clock();
    thrust::device_vector<int> d_vec(100); // First run of the Debug build takes over 40 seconds...
    dt = clock() - t;
    printf("allocation on device - %f sec.\n", (float)dt / CLOCKS_PER_SEC);
    .....
You should see that the time you measure for the vector allocation becomes the same between the first and second runs, even though the wall-clock time to run the whole program still shows a big difference.
I don't have a good explanation for why there is such a large difference in startup time between the first and second runs, but if I were to hazard a guess, it is that some driver-level JIT recompilation is being performed on the first run, and the driver caches the compiled code for subsequent runs. One thing to check is that you are compiling code for the correct architecture for your GPU; that would eliminate driver recompilation as a source of the time difference.
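One quick way to check the device's compute capability against the project's code generation settings is a small query program (a sketch only, not part of the original answer; error handling kept minimal):
#include <cstdio>
#include <cuda_runtime.h>
int main(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)   // properties of device 0
    {
        // Compare major.minor against the -arch / -gencode settings used to build the project.
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    }
    return 0;
}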
The nvprof utility can provide you with an API trace and timings. You might want to run it and see where in the API call sequence the difference in time is arising. It isn't beyond the realms of possibility that you are seeing the effects of some sort of driver bug, but without more information it is impossible to say.
It looks like in my case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC 2010) the problem is solved by changing the project's CUDA C/C++ / Code Generation option from compute_10,sm_10 to compute_20,sm_20, which targets the newer GPU architecture. So I've got my happiness for today )
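For reference, the command-line equivalent of that project setting would be something along these lines (an assumed invocation with placeholder file names, using nvcc's documented code-generation flag syntax):
$nvcc -gencode arch=compute_20,code=sm_20 kernel.cu -o app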
Related
I wrote a small program that extracts the edges of a digital image (the well-known Canny detector). I need to measure the exact execution time (in milliseconds) of the algorithm on the device (GPU), including the data-transfer stages. I attach the working code (C++ with OpenCV):
#include <iostream>
#include <ctime>          // clock(), clock_t
#include <sys/time.h>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>

using namespace cv;
using namespace std;

__device__ __host__
void FirstRun(void)   // warms up the CUDA context before the measured section
{
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
}

int main(int argc, char** argv)
{
    clock_t time;
    if (argc != 2)
    {
        cout << "Wrong number of arguments!" << endl;
        return -1;
    }

    const char* filename = argv[1];
    Mat img = imread(filename, IMREAD_GRAYSCALE);
    if (!img.data)
    {
        cout << " --(!) Error reading images \n" << endl;
        return -2;
    }

    double low_tresh = 100.0;
    double high_tresh = 150.0;
    int apperture_size = 3;
    bool useL2gradient = false;

    int imageWidth = img.cols;
    int imageHeight = img.rows;
    cout << "Width of image: " << imageWidth << endl;
    cout << "Height of image: " << imageHeight << endl;
    cout << endl;

    FirstRun();

    // Canny algorithm
    cuda::GpuMat d_img(img);
    cuda::GpuMat d_edges;
    time = clock();
    Ptr<cuda::CannyEdgeDetector> canny = cuda::createCannyEdgeDetector(low_tresh, high_tresh, apperture_size, useL2gradient);
    canny->detect(d_img, d_edges);
    time = clock() - time;
    cout << "CannyCUDA time (ms): " << (float)time / CLOCKS_PER_SEC * 1000 << endl;
    return 0;
}
I get two different execution times (image 7741 x 8862).
System configuration:
1) CPU: Intel Core i7 9600K (3.6 GHz), 32 GB RAM;
2) GPU: Nvidia Geforce RTX 2080 Ti;
3) OpenCV ver. 4.0
Which time is right, and am I measuring it correctly? Thank you!
There are different times you can measure when dealing with CUDA.
Here are some solutions you might want to try:
Measure the total time used by CUDA: use time() to get an absolute time value before calling any CUDA functions and time() again after you have the result. The difference is the real (wall-clock) time that passed.
Measure only the time of the calculation: CUDA has some start-up overhead, but if you are not interested in that, because you will be running your code many times without exiting the CUDA environment, you can measure it separately. Please read the CUDA C Programming Guide; it explains the use of events for timing (a minimal event-timing sketch follows this list).
Use the profiler to get detailed information on which parts of the program take which parts of the time: the kernel times are especially interesting, as they tell you how long your computations take. Be careful when looking at the API times. In your example, a lot of time is spent in cudaEventCreate(), as it is the first CUDA function in your program, so it includes the start-up overhead. Also, cuda[...]Synchronize() doesn't actually take that long to call, but it includes the time spent waiting for the synchronization.
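As a rough illustration of the event-based approach from the second point above, here is a minimal sketch (error checking omitted; the GPU work to be timed, e.g. the canny->detect call from the question, is left as a placeholder):
#include <cstdio>
#include <cuda_runtime.h>
int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                  // mark the start on the default stream
    // ... issue the GPU work to be timed here, e.g. canny->detect(d_img, d_edges) ...
    cudaEventRecord(stop);                   // mark the end on the default stream
    cudaEventSynchronize(stop);              // wait until the recorded work has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("GPU time: %f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}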
I just started learning OpenMP with C++, and I wrote a very simple program to check whether I can get some speedup from parallelizing it:
#include <iostream>
#include <ctime>
#include "omp.h"

int main() {
    const unsigned int N = 1000000000;
    clock_t start_time = clock();

    #pragma omp parallel for
    for (unsigned int i = 0; i < N; i++) {
        int x = 1 + 1;
    }

    clock_t end_time = clock();
    std::cout << "total_time: " << double(end_time - start_time) / CLOCKS_PER_SEC << " seconds." << std::endl;
}
The program takes 2.2 seconds without the parallel #pragma and 2.8 seconds with the parallel #pragma and 4 threads. What mistake did I make in the program? My compiler is clang++ 6.0, and the computer is a MacBook Pro with a 2.6 GHz i5 CPU running macOS 10.13.6.
EDIT:
I realized I used the wrong function for measuring execution time. Instead of clock() from <ctime>, I should use high_resolution_clock from the <chrono> library. In that case, I get 80 seconds for 1 thread, 47 seconds for 2 threads, and 35 seconds for 3 threads. Shouldn't the speedup be better than what I get here, since the program is embarrassingly parallel?
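For reference, a minimal wall-clock sketch along the lines described in the edit (not the asker's exact program: the loop body is replaced by a reduction so the compiler cannot remove it, and it is built with -fopenmp):
#include <chrono>
#include <iostream>
int main() {
    const long long N = 1000000000LL;   // same iteration count as in the question
    long long sum = 0;
    auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < N; i++) {
        sum += 1;                       // trivial work, shared out across the threads
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;   // wall-clock seconds
    std::cout << "sum = " << sum << ", wall time: " << elapsed.count() << " s\n";
}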
As with anything in parallel programming, there is a startup cost to creating new threads. For simple programs, the overhead of creating and managing threads is often large enough that it actually slows the program down compared to running it in a single thread.
In other words, you didn't make a mistake - this is an inherent part of using threads.
This question already has answers here:
C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
(3 answers)
Closed 4 years ago.
I'm using Ubuntu 16.04 as a Windows subsystem (WSL) with gcc 5.4 and bash.
My Windows version is Windows 10 Home, and I have 16 GB of RAM.
My CPU is an i7-7700HQ.
I am studying computer programming at my university. These days I am interested in parallel programming, so I've written a lot of code, but I have run into some problems.
int i;
int sum = 0;
clock_t S, E;

S = clock();
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < M; i++) {
    sum++;
}
E = clock();

printf("\n%d :: time : %lf\n", sum, (double)(E - S) / 1000);
return 0;
If I compile this code with the commands "gcc -openmp test.c -o test" and "time ./test", it shows
100000000 :: time : 203.125000
real 0m0.311s
user 0m0.203s
sys 0m0.016s.
However,
int i;
int sum = 0;
clock_t S, E;

S = clock();
for (i = 0; i < M; i++) {
    sum++;
}
E = clock();

printf("\n%d :: time : %lf\n", sum, (double)(E - S) / 1000);
return 0;
if I compile this code with the commands "gcc -openmp test2.c -o test2" and "time ./test2", it shows
100000000 :: time : 171.875000
real 0m0.295s
user 0m0.172s
sys 0m0.016s.
If I compile these codes again and again, they sometimes take the same time, but the OpenMP version is never faster.
I edited both codes with vim.
I also tried compiling with the command "gcc -fopenmp test.c" and running "time ./penmp". That takes much more time than with the command "gcc -openmp test.c".
And if I compile the same codes with Visual Studio 2017 Community, OpenMP is much faster.
Please let me know how I can reduce the run time with OpenMP.
You should use omp_get_wtime() instead to measure the wall-clock time:
double dif;

double start = omp_get_wtime();  // start the timer
// beginning of computation
..
// end of computation
double end = omp_get_wtime();    // end the timer

dif = end - start;               // stores the difference in dif
printf("the time of dif is %f\n", dif);
I am trying to understand an OpenMP example from here. You can see the code below.
In order to measure the speedup, the difference between the serial and OMP versions, I use gettimeofday from sys/time.h; is this approach right?
The program runs on a 4-core machine. I set export OMP_NUM_THREADS="4", but I cannot see a substantial speedup; usually I get 1.2 - 1.7x. What problems am I facing in this parallelization?
Which debugging/performance tools could I use to see where the performance is lost?
Code (for compilation I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe):
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define CHUNKSIZE 1000000
#define N 100000000

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];
    unsigned long elapsed;
    unsigned long elapsed_serial;
    unsigned long elapsed_omp;
    struct timeval start;
    struct timeval stop;

    chunk = CHUNKSIZE;

    // ================= SERIAL start =======================
    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    gettimeofday(&start, NULL);
    for (i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
        //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
    }
    gettimeofday(&stop, NULL);

    elapsed  = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_serial = elapsed;
    printf(" \n Time SEQ= %lu microsecs\n", elapsed_serial);
    // ================= SERIAL end =======================

    // ================= OMP start =======================
    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    gettimeofday(&start, NULL);
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        //printf("Thread %d starting...\n",tid);

        #pragma omp for schedule(static,chunk)
        for (i = 0; i < N; i++)
        {
            c[i] = a[i] + b[i];
            //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
        }
    } /* end of parallel section */
    gettimeofday(&stop, NULL);

    elapsed  = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_omp = elapsed;
    printf(" \n Time OMP= %lu microsecs\n", elapsed_omp);
    // ================= OMP end =======================

    printf(" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp));
}
There's nothing really wrong with the code as posted, but your speedup is going to be limited by the fact that the main loop, c[i] = a[i] + b[i], does very little work: the time required for the computation (a single addition) is dominated by memory access time (two loads and one store), and there is more contention for memory bandwidth as more threads act on the arrays.
We can test this by making the work inside the loop more compute-intensive:
c[i] = exp(sin(a[i])) + exp(cos(b[i]));
And then we get
$ ./apb
Time SEQ= 17678571 microsecs
Number of threads = 4
Time OMP= 4703485 microsecs
speedup= 3.758611
which is obviously a lot closer to the 4x speedup one would expect.
Update: Oh, and to the other questions: gettimeofday() is probably fine for timing, and since you're using xlc, is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), Scalasca for general OpenMP issues, and OpenSpeedShop is pretty useful too.
I'm encountering a very strange problem: my 9800GT doesn't seem to calculate at all.
I've tried all the hello-world examples I've found on the internet; here's one of them:
this program creates a 100-element array on the host, sends it to the device, squares each value, copies the array back to the host, and prints the results.
#include "stdafx.h"
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
}
so the output is expected to be:
1 1.000
2 4.000
3 9.000
4 16.000
..
I swear back in 2009 it worked perfectly (Vista 32-bit, deviceemu).
Now I get this output:
1 1.000
2 2.000
3 3.000
4 4.000
so my card doesn't seem to do anything. What could the problem be?
Configuration is:
win7x64
visual studio 2010 32bit
cuda toolkit 3.2 64bit
compilation settings: CUDA 3.2 toolkit, 32-bit target platform; deviceemu or not doesn't matter, the results are the same.
I also tried it in my VMware XP (32-bit) VM with Visual Studio 2008. The result is the same.
Please help me; I barely got the program to compile, and now I need it to work.
You can also view my project with everything it needs in my post on the NVIDIA forums (2.7 KB).
Thanks, Ilya
Your code produces the intended results on my Linux system so I would suggest checking the error codes returned by cudaMalloc and cudaMemcpy to ensure there are no silent driver/runtime errors. For example
cudaError_t error = cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
printf("error status: %s\n", cudaGetErrorString(error));
should print
error status: no error
if the call is successful.
Also, I believe device emulation was deprecated in CUDA 3.0 and removed entirely in CUDA 3.1. I don't know if that's related to your problem though.
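A common convenience pattern for that kind of checking (a sketch only, not part of the original answer) is to wrap every runtime call in a macro:
#include <stdio.h>
#include <cuda_runtime.h>
/* Report any CUDA runtime error together with the file and line it came from. */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
        }                                                             \
    } while (0)
/* Usage in the question's code, for example:
   CUDA_CHECK(cudaMalloc((void **)&a_d, size));
   CUDA_CHECK(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice)); */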
To compile several files you'd just do something like this:
$nvcc -c foo.cu
$nvcc -c bar.cu
$nvcc -o foobar foo.o bar.o
Alternatively, you can do the linking in the last step with g++, like so:
$g++ -o foobar foo.o bar.o -L/usr/local/cuda/lib64 -lcudart