What libraries or functions need to be used for an objective comparison of CPU and GPU performance? What caveat should be warned for the sake of an accurate evaluation?
I'm using an Ubuntu platform with a device of compute capability 2.1 and the CUDA 5 toolkit.
I'm using the following:
CPU - returns microseconds between tic and toc with a resolution of 2 microseconds
#include <sys/time.h>
#include <time.h>
struct timespec init;
struct timespec after;
void tic() { clock_gettime(CLOCK_MONOTONIC,&init); }
double toc() {
clock_gettime(CLOCK_MONOTONIC,&after);
double us = (after.tv_sec-init.tv_sec)*1000000.0f;
return us+(after.tv_nsec- init.tv_nsec)/1000.0f;
}
GPU
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// Instructions
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cout << setprecision (10) << "GPU Time [ms] " << time << endl;
EDIT
For a more complete answer, please see Timing CUDA operations.
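For illustration, here is a minimal, self-contained sketch that combines a host timer with a CUDA kernel launch. The kernel scaleKernel and the vector size are placeholders chosen for this example; the essential caveat for an accurate evaluation is the cudaDeviceSynchronize() before stopping the host timer, since kernel launches are asynchronous, plus a warm-up launch so that context-creation overhead is not attributed to the kernel.
#include <cstdio>
#include <time.h>
#include <cuda_runtime.h>
static struct timespec t_init, t_after;
static void tic() { clock_gettime(CLOCK_MONOTONIC, &t_init); }
static double toc_us() {
    clock_gettime(CLOCK_MONOTONIC, &t_after);
    return (t_after.tv_sec - t_init.tv_sec) * 1e6 +
           (t_after.tv_nsec - t_init.tv_nsec) / 1e3;
}
// Placeholder workload: doubles every element of a float array.
__global__ void scaleKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
int main() {
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc((void**)&d_data, N * sizeof(float));
    scaleKernel<<<(N + 255) / 256, 256>>>(d_data, N);  // warm-up launch
    cudaDeviceSynchronize();                           // exclude start-up overhead
    tic();
    scaleKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();                           // essential: wait for the GPU before toc()
    printf("host timer: %.1f us\n", toc_us());
    cudaFree(d_data);
    return 0;
}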
Related
I wrote a small program that extracts the edges of a digital image (the well-known Canny detector). I need to measure the exact execution time (in milliseconds) of the algorithm on the device (GPU), including the data-transfer stages. I attach the working program code in C++:
#include <iostream>
#include <sys/time.h>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>
using namespace cv;
using namespace std;
__device__ __host__
void FirstRun (void)
{
cudaSetDevice(0);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
}
int main( int argc, char** argv )
{
clock_t time;
if (argc != 2)
{
cout << "Wrong number of arguments!" << endl;
return -1;
}
const char* filename = argv[1];
Mat img = imread(filename, IMREAD_GRAYSCALE);
if( !img.data )
{
cout << " --(!) Error reading images \n" << endl;
return -2;
}
double low_tresh = 100.0;
double high_tresh = 150.0;
int apperture_size = 3;
bool useL2gradient = false;
int imageWidth = img.cols;
int imageHeight = img.rows;
cout << "Width of image: " << imageWidth << endl;
cout << "Height of image: " << imageHeight << endl;
cout << endl;
FirstRun();
// Canny algorithm
cuda::GpuMat d_img(img);
cuda::GpuMat d_edges;
time = clock();
Ptr<cuda::CannyEdgeDetector> canny = cuda::createCannyEdgeDetector(low_tresh, high_tresh, apperture_size, useL2gradient);
canny->detect(d_img, d_edges);
time = clock() - time;
cout << "CannyCUDA time (ms): " << (float)time / CLOCKS_PER_SEC * 1000 << endl;
return 0;
}
I get two different run times (image 7741 x 8862).
System configuration:
1) CPU: Intel Core i7 9600K (3.6 GHz), 32 GB RAM;
2) GPU: Nvidia Geforce RTX 2080 Ti;
3) OpenCV ver. 4.0
Which time is correct, and am I measuring it properly? Thank you!
There are different times you can measure when dealing with CUDA.
Here are some approaches you might want to try:
Measure the total time used by CUDA: use time() (or another wall-clock timer) before calling any CUDA functions and again after you have the result. The difference is the real time that passed.
Measure only the time of the calculation: CUDA has some start-up overhead, but if you are not interested in that, because you will be running your code many times without leaving the CUDA environment, you can measure it separately. Please read the CUDA C Programming Guide; it explains the use of events for timing (a sketch of both approaches applied to your code follows below).
Use the profiler to get detailed information on which parts of the program take which parts of the time: the kernel times are especially interesting, as they tell you how long your computations take. Be careful when looking at the API times. In your example, a lot of time is spent in cudaEventCreate(), as it is the first CUDA function in your program, so it includes the start-up overhead. Also, cuda[...]Synchronize() doesn't actually take that long to call, but it includes the time spent waiting for synchronization.
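As a rough illustration of the first two points, here is a minimal sketch (not production code): d_img, d_edges and canny are the objects from the program above, a warm-up call absorbs the start-up overhead, and an additional #include <chrono> is assumed. The wall-clock timing includes everything the host sees, while the event timing measures only the GPU work.
canny->detect(d_img, d_edges);                     // warm-up call, not timed
auto t0 = std::chrono::steady_clock::now();        // 1) wall-clock (host) time
canny->detect(d_img, d_edges);
cudaDeviceSynchronize();                           // kernels launch asynchronously
auto t1 = std::chrono::steady_clock::now();
double wall_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
cudaEvent_t start, stop;                           // 2) GPU-only time via events
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
canny->detect(d_img, d_edges);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpu_ms = 0.0f;
cudaEventElapsedTime(&gpu_ms, start, stop);
cout << "Wall-clock time (ms): " << wall_ms << endl;
cout << "GPU event time (ms): " << gpu_ms << endl;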
I'd like to measure the time a bit of code within my kernel takes. I've followed this question along with its comments so that my kernel looks something like this:
__global__ void kernel(..., long long int *runtime)
{
long long int start = 0;
long long int stop = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));
/* Some code here */
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));
runtime[threadIdx.x] = stop - start;
...
}
The answer says to do a conversion as follows:
The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
For which I do:
for(long i = 0; i < size; i++)
{
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
}
Where 1.62 is the GPU Max Clock rate of my device. But the time I get in milliseconds does not look right because it suggests that each thread took minutes to complete. This cannot be correct as execution finishes in less than a second of wall-clock time. Is the conversion formula incorrect or am I making a mistake somewhere? Thanks.
The correct conversion in your case is not GHz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
^^^^
but hertz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1620000000.0f)*1000.0);
^^^^^^^^^^^^^
In the dimensional analysis:
clock cycles / (clock cycles / second) = seconds
the first term is the clock cycle measurement. The second term is the frequency of the GPU (in hertz, not GHz), the third term is the desired measurement (seconds). You can convert to milliseconds by multiplying seconds by 1000.
Here's a worked example that shows a device-independent way to do it (so you don't have to hard-code the clock frequency):
$ cat t1306.cu
#include <stdio.h>
const long long delay_time = 1000000000;
const int nthr = 1;
const int nTPB = 256;
__global__ void kernel(long long *clocks){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
long long start=clock64();
while (clock64() < start+delay_time);
if (idx < nthr) clocks[idx] = clock64()-start;
}
int main(){
int peak_clk = 1;
int device = 0;
long long *clock_data;
long long *host_data;
host_data = (long long *)malloc(nthr*sizeof(long long));
cudaError_t err = cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, device);
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
err = cudaMalloc(&clock_data, nthr*sizeof(long long));
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
kernel<<<(nthr+nTPB-1)/nTPB, nTPB>>>(clock_data);
err = cudaMemcpy(host_data, clock_data, nthr*sizeof(long long), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
printf("delay clock cycles: %ld, measured clock cycles: %ld, peak clock rate: %dkHz, elapsed time: %fms\n", delay_time, host_data[0], peak_clk, host_data[0]/(float)peak_clk);
return 0;
}
$ nvcc -arch=sm_35 -o t1306 t1306.cu
$ ./t1306
delay clock cycles: 1000000000, measured clock cycles: 1000000210, peak clock rate: 732000kHz, elapsed time: 1366.120483ms
$
This uses cudaDeviceGetAttribute to get the clock rate, which is returned in kHz, making it easy to compute milliseconds in this case.
In my experience, the above method generally works well on datacenter GPUs that run at the reported clock rate (this may be affected by settings you make in nvidia-smi). Other GPUs, such as GeForce GPUs, may be running at (unpredictable) boost clocks that will make this method inaccurate.
Also, more recently, CUDA has the ability to preempt activity on the GPU. This can come about in a variety of circumstances, such as debugging, CUDA dynamic parallelism, and other situations. If preemption occurs for whatever reason, attempting to measure anything based on clock64() is generally not reliable.
clock64 returns a value in graphics clock cycles. The graphics clock is dynamic, so I would not recommend using a constant to try to convert to seconds. If you want to convert to wall time, the better option is to use globaltimer, which is a 64-bit clock register accessible as:
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
The unit is in nanoseconds.
The default resolution is 32 ns, with an update every microsecond. The NVIDIA performance tools force the update to every 32 ns (31.25 MHz). This clock is used by CUPTI for start timestamps when capturing concurrent kernel traces.
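For comparison with the %%clock64 snippet in the question, a minimal sketch of a kernel reading %%globaltimer might look like this (the runtime_ns buffer is assumed to hold one slot per thread of a single block, as in the question):
__global__ void kernel_globaltimer(long long int *runtime_ns)
{
long long int start = 0;
long long int stop = 0;
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
/* Some code here */
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(stop));
runtime_ns[threadIdx.x] = stop - start; // already in nanoseconds, no clock-rate conversion needed
}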
In the quest for optimal matrix-matrix multiplication using eigen3 (and hopefully profiting from SIMD support) I wrote the following test:
#include <iostream>
#include <Eigen/Dense>
#include <ctime>
using namespace Eigen;
using namespace std;
const int test_size= 13;
const int test_size_16b= test_size+1;
typedef Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> TestMatrix_dyn16b_t;
typedef Matrix<double, Dynamic, Dynamic> TestMatrix_dynalloc_t;
typedef Matrix<double, test_size, test_size> TestMatrix_t;
typedef Matrix<double, test_size_16b, test_size_16b> TestMatrix_fix16b_t;
template<typename TestMatrix_t> EIGEN_DONT_INLINE void test(const char * msg, int m_size= test_size, int n= 10000) {
double s= 0.0;
clock_t elapsed= 0;
TestMatrix_t m3;
for(int i= 0; i<n; i++) {
TestMatrix_t m1 = TestMatrix_t::Random(m_size, m_size);
TestMatrix_t m2= TestMatrix_t::Random(m_size, m_size);
clock_t begin = clock();
m3.noalias()= m1*m2;
clock_t end = clock();
elapsed+= end - begin;
// make sure m3 is not optimized away
s+= m3(1, 1);
}
double elapsed_secs = double(elapsed) / CLOCKS_PER_SEC;
cout << "Elapsed time " << msg << ": " << elapsed_secs << " size " << m3.cols() << ", " << m3.rows() << endl;
}
int main() {
#ifdef EIGEN_VECTORIZE
cout << "EIGEN_VECTORIZE on " << endl;
#endif
test<TestMatrix_t> ("normal ");
test<TestMatrix_dyn16b_t> ("dyn 16b ");
test<TestMatrix_dynalloc_t>("dyn alloc");
test<TestMatrix_fix16b_t> ("fix 16b ", test_size_16b);
}
I compiled it with g++ -msse3 -O2 -DEIGEN_DONT_PARALLELIZE test.cpp and ran it on an Athlon II X2 255. The results rather surprised me:
EIGEN_VECTORIZE on
Elapsed time normal : 0.019193 size 13, 13
Elapsed time dyn 16b : 0.025226 size 13, 13
Elapsed time dyn alloc: 0.018648 size 13, 13
Elapsed time fix 16b : 0.018221 size 14, 14
Similar results are attained with other odd numbers for test_size. What confuses me is this:
From reading the Eigen Vectorization FAQ I would have thought that a 13x13 matrix is not a multiple of 16 bytes in size and thus would not profit from SIMD optimization. I expected the computation time to be much worse, but it isn't.
From reading about Optional template parameters I would have thought that dynamic matrices with a fixed upper bound known at compile time would behave much like dynamically allocated matrices and thus would have a similar computation speed. But they don't. That's actually what surprises me the most and what triggered my initial quest: I wanted to know whether it is better to use a dynamic matrix with a fixed upper bound that is a multiple of 16 bytes than a fixed-size matrix whose size is not a multiple of 16 bytes.
Finally, interesting but not so surprising: a matrix whose fixed size is a multiple of 16 is no slower than a matrix whose column and row lengths are one less. SIMD simply does the extra column and row for free.
Not my original question, but also interesting: when the test is compiled without SSE2 support, and thus without vectorization, the relative computation times are roughly proportional. The dynamically sized fixed-memory matrix is again the slowest.
To put my question short: why is Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> so much slower? Can you confirm my observations and maybe even explain them?
The FAQ was obsolete. Since Eigen version 3.3, unaligned vectors and matrices are vectorized.
As for why Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> was slower: that was simply an issue in the compile-time selection of the preferred matrix-product implementation. The fix will be part of Eigen 3.3.1.
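As a side note, if you want to guard against running the benchmark with an older Eigen, a small compile-time check is possible; this sketch assumes the EIGEN_VERSION_AT_LEAST macro defined by Eigen's own headers:
#include <Eigen/Core>
// Fail the build if the Eigen in use predates the product-selection fix.
#if !EIGEN_VERSION_AT_LEAST(3, 3, 1)
#error "This benchmark expects Eigen 3.3.1 or newer"
#endif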
I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!
Here are the specifics of my test system:
OS: Windows 10
CPU: Core i7-4700MQ @ 2.40 GHz
GPU: GeForce GT 750M (compute capability 3.0)
CUDA SDK: v7.5
The code below performs calculations equivalent to my production code, on the CPU and on the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650 ms; GPU approx. 7 s).
I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the visual profiler; I've tried integer data rather than doubles, but whatever I do, the GPU version is always significantly slower than the CPU equivalent.
So why is the GPU version so much slower and what changes, that I've not mentioned above, could I try to improve its performance?
Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30
And here's the contents of source.cu:
#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>
void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
for (size_t i = 0; i < arrayLength; i++)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
array[i] = array[i] * adjustmentFactor;
}
}
__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayLength)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
array[idx] = array[idx] * adjustmentFactor;
}
}
void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
double *dev_row, *dev_curve;
cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);
cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
cudaFree(dev_curve);
cudaFree(dev_row);
}
void FillArray(int length, double row[])
{
for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}
int main(void)
{
const int arrayLength = 10000;
double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];
FillArray(arrayLength, curve1);
FillArray(arrayLength, curve2);
///////////////////////////////////// CPU Version ////////////////////////////////////////
LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForCPU);
AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
}
QueryPerformanceCounter(&EndingTime);
ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMilliseconds.QuadPart *= 1000;
ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;
///////////////////////////////////// GPU Version ////////////////////////////////////////
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForGPU);
AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
And here is an example of the output of CPUSpeedTest.exe
Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76
What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.
The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.
The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very, very simple calculation (that is supremely fast irrespective of the type of processor it's running on) and then copying data back from the device to the host.
As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused. In the code above, this takes the GPU to CPU runtime down from 10:1 to 4:1.
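One way this restructuring might look (a sketch under the assumption that arrayLength and curveLength have fixed upper bounds, reusing the names from the code above; error checking omitted):
struct GpuBuffers
{
double *dev_row;
double *dev_curve;
};
// Allocate once and copy the (constant) curve once, before the timing loop.
void InitGpuBuffers(GpuBuffers &b, const double curve[], int arrayLength, int curveLength)
{
cudaMalloc((void**)&b.dev_row, sizeof(double) * arrayLength);
cudaMalloc((void**)&b.dev_curve, sizeof(double) * curveLength);
cudaMemcpy(b.dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
}
// Per iteration: copy only the array, launch, copy the result back.
void AdjustArrayOnGPUReusingBuffers(GpuBuffers &b, double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, int curveLength)
{
cudaMemcpy(b.dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
CudaKernel<<<(arrayLength + 255) / 256, 256>>>(factor1, factor2, factor3, denominator, b.dev_row, arrayLength, b.dev_curve, curveLength);
cudaMemcpy(array, b.dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
}
// Release the buffers once, after the timing loop.
void FreeGpuBuffers(GpuBuffers &b)
{
cudaFree(b.dev_curve);
cudaFree(b.dev_row);
}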
The remaining performance disparity comes down to the fact that the CPU is able to perform the required calculations, in serial, millions of times in a very short time span because of their simplicity. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment to an array element. Something this simple must be performed millions of times before the benefits of doing so in parallel outweigh the necessary time penalty of transferring the data to the GPU and back. On my test system, a million array elements is the break-even point, where GPU and CPU perform in (approximately) the same amount of time.
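If you want to find the break-even point on your own machine, one possible sketch is to sweep the array size and time one CPU pass against one full GPU round trip. It reuses FillArray, AdjustArrayOnCPU and CudaKernel from the code above, but scales the grid with n (the hard-coded <<<100, 1000>>> launch only covers 100,000 elements); the sizes and the single-shot timing are deliberately coarse.
#include <chrono>
#include <vector>
void FindBreakEven()
{
for (int n = 1 << 14; n <= 1 << 24; n <<= 2)
{
std::vector<double> data(n), curve(n);
FillArray(n, data.data());
FillArray(n, curve.data());
auto t0 = std::chrono::steady_clock::now();
AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, data.data(), n, curve.data(), n);
auto t1 = std::chrono::steady_clock::now();
double *dev_row, *dev_curve; // the GPU timing includes allocation and both transfers
cudaMalloc((void**)&dev_row, sizeof(double) * n);
cudaMalloc((void**)&dev_curve, sizeof(double) * n);
cudaMemcpy(dev_row, data.data(), sizeof(double) * n, cudaMemcpyHostToDevice);
cudaMemcpy(dev_curve, curve.data(), sizeof(double) * n, cudaMemcpyHostToDevice);
CudaKernel<<<(n + 255) / 256, 256>>>(1.0, 1.0, 1.0, 1.0, dev_row, n, dev_curve, n);
cudaMemcpy(data.data(), dev_row, sizeof(double) * n, cudaMemcpyDeviceToHost);
cudaFree(dev_curve);
cudaFree(dev_row);
auto t2 = std::chrono::steady_clock::now();
std::cout << n << " elements: CPU "
          << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms, GPU "
          << std::chrono::duration<double, std::milli>(t2 - t1).count() << " ms" << std::endl;
}
}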
I am exercising the random library, new to C++11. I wrote the following minimal program:
#include <iostream>
#include <random>
using namespace std;
int main() {
default_random_engine eng;
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
}
When I run this repeatedly it gives the same output each time:
>a
Uniform [0, 1): 0.131538
>a
Uniform [0, 1): 0.131538
>a
Uniform [0, 1): 0.131538
I would like to have the program set the seed differently each time it is called, so that a different random number is generated each time. I am aware that random provides a facility called seed_seq, but I find the explanation of it (at cplusplus.com) totally obscure:
http://www.cplusplus.com/reference/random/seed_seq/
I'd appreciate advice on how to have a program generate a new seed each time it is called: The simpler the better.
My platform(s):
Windows 7 : TDM-GCC compiler
The point of having a seed_seq is to increase the entropy of the generated sequence. If you have a random_device on your system, initializing with multiple numbers from that random device may arguably do that. On a system where the random device is itself implemented with a pseudo-random number generator, I don't think there is any increase in randomness, i.e. in the entropy of the generated sequence.
Building on your approach:
If your system does provide a random device then you can use it like this:
std::random_device r;
// std::seed_seq ssq{r()};
// and then passing it to the engine does the same
default_random_engine eng{r()};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
If your system does not have a random device, then you can use time(0) as a seed for the random engine:
default_random_engine eng{static_cast<long unsigned int>(time(0))};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
If you have multiple sources of randomness you can actually do this (e.g. with two sources):
std::seed_seq seed{ r1(), r2() };
default_random_engine eng{seed};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
where r1, r2 are different random devices, e.g. a thermal-noise or quantum source.
Of course you could mix and match:
std::seed_seq seed{ static_cast<long unsigned int>(r1()), static_cast<long unsigned int>(time(0)) };
default_random_engine eng{seed};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
Finally, I like to initialize with a one-liner:
auto rand = std::bind(std::uniform_real_distribution<double>{0,1},
std::default_random_engine{std::random_device()()});
std::cout << "Uniform [0,1): " << rand();
If you worry about time(0) having only second precision, you can overcome this with the high_resolution_clock, either by requesting the time since the epoch, as suggested by bames23 below:
static_cast<long unsigned int>(std::chrono::high_resolution_clock::now().time_since_epoch().count())
or maybe just play with CPU timing randomness:
#include <algorithm> // std::max_element
#include <chrono>
long unsigned int getseed(int const K)
{
typedef std::chrono::high_resolution_clock hiclock;
auto gett= [](std::chrono::time_point<hiclock> t0)
{
auto tn = hiclock::now();
return static_cast<long unsigned int>(std::chrono::duration_cast<std::chrono::microseconds>(tn-t0).count());
};
long unsigned int diffs[10];
diffs[0] = gett(hiclock::now());
for(int i=1; i!=10; i++)
{
auto last = hiclock::now();
for(int k=K; k!=0; k--)
{
diffs[i]= gett(last);
}
}
return *std::max_element(&diffs[1],&diffs[9]);
}
#include <iostream>
#include <random>
using namespace std;
int main() {
std::random_device r; // 1
std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()}; // 2
std::mt19937 eng(seed); // 3
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
}
In order to get unpredictable results from a pseudo-random number generator we need a source of unpredictable seed data. At (1) we create a std::random_device for this purpose. At (2) we use a std::seed_seq to combine several values produced by the random_device into a form suitable for seeding a pseudo-random number generator. The more unpredictable data that is fed into the seed_seq, the less predictable the results of the seeded engine will be. At (3) we create a random number engine using the seed_seq to seed the engine's initial state.
A seed_seq can be used to initialize multiple random number engines;
seed_seq will produce the same seed data each time it is used.
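For example (a small sketch, not from the original answer): seeding two engines from the same seed_seq gives them identical initial states, so they produce identical outputs.
#include <cassert>
#include <random>
int main() {
std::seed_seq seq{1, 2, 3, 4, 5};
std::mt19937 eng_a(seq); // seed_seq::generate() fills the engine state
std::mt19937 eng_b(seq); // the same seed data is produced again
assert(eng_a() == eng_b()); // identical streams from here on
return 0;
}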
Note: Not all implementations provide a source of non-deterministic data.
Check your implementation's documentation for std::random_device.
If your platform does not provide a non-deterministic random_device then some other sources can be used for seeding. The article Simple Portable C++ Seed Entropy suggests a number of alternative sources:
A high resolution clock such as std::chrono::high_resolution_clock (time() typically has a resolution of one second, which is generally too low)
Memory configuration which on modern OSs varies due to address space layout randomization (ASLR)
CPU counters or random number generators. C++ does not provide standardized access to these, so I won't use them.
thread id
A simple counter (which only matters if you seed more than once)
For example:
#include <chrono>
#include <cstdlib> // std::malloc, std::free
#include <iostream>
#include <random>
#include <thread>
#include <utility>
using namespace std;
// we only use the address of this function
static void seed_function() {}
int main() {
// Variables used in seeding
static long long seed_counter = 0;
int var;
void *x = std::malloc(sizeof(int));
free(x);
std::seed_seq seed{
// Time
static_cast<long long>(std::chrono::high_resolution_clock::now()
.time_since_epoch()
.count()),
// ASLR
static_cast<long long>(reinterpret_cast<intptr_t>(&seed_counter)),
static_cast<long long>(reinterpret_cast<intptr_t>(&var)),
static_cast<long long>(reinterpret_cast<intptr_t>(x)),
static_cast<long long>(reinterpret_cast<intptr_t>(&seed_function)),
static_cast<long long>(reinterpret_cast<intptr_t>(&_Exit)),
// Thread id
static_cast<long long>(
std::hash<std::thread::id>()(std::this_thread::get_id())),
// counter
++seed_counter};
std::mt19937 eng(seed);
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
}