My CUDA program uses only float, int, short and char types in its computation. None of the input or output arrays have members of type double. And none of the kernels create any double type inside them during computation.
This program was compiled with CUDA SDK 5.5 in Release mode using Nsight Eclipse. A typical compile line looks like this:
nvcc -O3 -gencode arch=compute_35,code=sm_35 -M -o "src/foo.d" "../src/foo.cu"
I am running this program on a GTX Titan on Linux. To my surprise, I noticed that this program runs 10% faster when I enable the full-speed FP64 mode on Titan, which can be done by enabling the CUDA Double Precision option in the NVIDIA X Server Settings program.
While I am happy about this free speed bonus, I would like to understand why a float-only CUDA program gets faster in FP64 mode.
My guess is that when you enable the full-speed FP64 mode on Titan, additional compute units start participating in the computation: the FP64 units can also be used for FP32 work. But enabling the full set of FP64 units also lowers the clock, so the computation gets only about 10% faster.
Where does the 10% come from?
When Titan runs in 1/24 FP64 mode, it runs at 837 MHz; in 1/3 FP64 mode it runs at 725 MHz. The factor (1 + 1/3)/(1 + 1/24) is the relative amount of FP32-capable throughput in each mode (FP32 units plus the FP64 units' contribution), and 725/837 accounts for the lower clock, so (1 + 1/3)/(1 + 1/24) * 725/837 = 1.109, i.e. roughly 10%.
References: http://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1/4
I found confirmation of my guess:
"What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32."
Reference: http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
That statement is about GK104, while Titan uses GK110, but both belong to the same Kepler architecture, so I think GK110 has the same capability.
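If you want to check the clock story on your own card, here is a minimal sketch (my own illustration, not part of the original answer) that prints the SM count and the nominal clock reported by the CUDA runtime; for the live clocks under each FP64 setting, nvidia-smi -q -d CLOCK is more informative.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // device 0: the Titan in this setup
    printf("%s: %d SMs, nominal clock %.0f MHz\n",
           prop.name, prop.multiProcessorCount,
           prop.clockRate / 1000.0);          // clockRate is reported in kHz
    return 0;
}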
Titan cards have their FP64 throughput "capped" by default; this was done mainly for power-efficiency and clock-speed reasons.
NVIDIA deliberately chose not to enable full-rate FP64 by default and instead lets you control the behavior by setting FP64 to full speed (1/3 FP32) or reduced speed (1/24 FP32).
References: http://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1/4
Related
I have been trying to set up CUDA computing under Julia for my RTX 2070 GPU and, so far, I did not get any errors related to failed CUDA initialization when executing CUDA-parallelized code. However, the parallelized computations seem surprisingly slow, so I launched Pkg.test("CUDA") from Julia to get some more insight into why that could be. Here is a screenshot of some of the results:
[Screenshot: Julia CUDA test results.] The GPU allocation appears to be entirely negligible compared to the CPU allocation.
This is also reflected in the GPU vs. CPU usage: running nvidia-smi shows 0% volatile GPU utilization, while the resource monitor showed the CPU consistently at 80% or more throughout the test.
Further, the CUDA utilization graph in the Task Manager merely shows spikes in CUDA utilization rather than continuous usage. [Screenshot: CUDA utilization in Task Manager.]
Any suggestions for why this could be the case? I have gone through the verification of proper CUDA package and driver installation several times now, and I am unsure what to do next.
As the comments note, the tests in CUDA.jl/test are designed to exercise the compilation pipeline, not to put the GPU under any significant load. Just to complete the picture, if you do want to load the GPU, you might try modifying an example from https://cuda.juliagpu.org/stable/tutorials/introduction/, for example along the lines of:
using CUDA

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0
for i = 1:100000
    y_d .+= sqrt.(x_d)
end
I have an OpenCL program in a sequential version and a parallel version of the same algorithm. The measured execution time is 133000 milliseconds for the sequential version and 17 milliseconds of kernel time for the parallel version. When I calculate the speedup as 133000/17, I get about 7823. Is that much speedup possible?
Such a speedup might happen, but it seems quite big; to me, a speedup of 7823 looks suspicious, though not entirely impossible. A factor of around 100x would seem more reasonable. Expensive graphics cards can deliver several teraflops, while a single CPU core delivers only gigaflops, and some programs even run slower on a GPGPU than on the CPU.
When benchmarking your CPU code, be sure to enable optimizations in your compiler (with GCC, compile with at least gcc -O2). Without optimization (gcc -O0), CPU performance is poor; a 3x difference between binaries built with gcc -O0 and gcc -O2 is common.
By the way, cache behavior matters a lot for CPU performance. If you wrote your numerical CPU code without taking that into account, it may be quite slow in the unfortunate case where it has bad locality of reference, as the sketch below illustrates.
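To make the locality point concrete, here is a generic CPU-side sketch of my own (not taken from the question): the same summation over a hypothetical matrix is fast or slow depending only on the traversal order.

const int N = 4096;
static float a[N][N];                  // hypothetical matrix, roughly 64 MB

// Good locality: the inner loop walks contiguous memory (row-major in C/C++).
float sumRowMajor() {
    float s = 0.0f;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

// Poor locality: the inner loop strides by N floats and keeps missing the cache.
float sumColMajor() {
    float s = 0.0f;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}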
If the kernel function has a problem and was never actually executed, the timing results will be meaningless.
When we talk about a parallel CUDA program on the GPU having a speedup over a similar sequential one on the CPU, should the sequential one be compiled with compiler optimizations (e.g. gcc -O2)?
I have parallelized a program on the GPU. It has a speedup of 18 compared with its CPU implementation when no compiler optimization is used. But when I add the -O2 option to the nvcc compiler, the speedup drops to 8.
Of course the optimizer should be used for both the GPU and the CPU program when comparing performance.
If your focus is on GPU vs. CPU hardware, the comparison should not be distorted by the quality of the software; we usually assume each version performs as well as it can on its own hardware.
I use two kernels; let's call them A and B.
I run the CUDA profiler and this is what it returned:
The first kernel has 44% overhead while the second has 20%.
However, if I decide to find out the actual execution time by following this logic:
#include <sys/time.h>

struct timeval tim;
gettimeofday(&tim, NULL);
double before = tim.tv_sec + tim.tv_usec / 1000000.0;
runKernel<<<...>>>(...);
gettimeofday(&tim, NULL);
double after = tim.tv_sec + tim.tv_usec / 1000000.0;
totalTime = totalTime + after - before;
The totalTime will be very small, somewhere around 0.0001 seconds.
I'm new to CUDA and I don't understand exactly what's going on. Should I try and make the kernels more efficient or are they already efficient?
Kernel launches are asynchronous from the point of view of the CPU. If you time your kernel the way you do, without any synchronization (i.e. without calling cudaDeviceSynchronize()), your timings do not mean anything, because the computation is still in progress on the GPU when you read the clock the second time.
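As a rough sketch of how the measurement could be fixed (keeping the kernel launch elided exactly as in the question), either synchronize before stopping the host timer or use CUDA events, which record timestamps on the GPU itself:

// Option 1: host timer, but wait for the GPU before stopping it
struct timeval tim;
gettimeofday(&tim, NULL);
double before = tim.tv_sec + tim.tv_usec / 1000000.0;
runKernel<<<...>>>(...);
cudaDeviceSynchronize();                 // wait until the kernel has actually finished
gettimeofday(&tim, NULL);
double after = tim.tv_sec + tim.tv_usec / 1000000.0;
totalTime = totalTime + after - before;

// Option 2: CUDA events measure elapsed time on the GPU
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
runKernel<<<...>>>(...);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // block until the stop event has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);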
You can trust NVIDIA's profilers when it comes to timing your kernels (nvprof / nvvp). The NVIDIA Visual Profiler can also analyze your program and give advice on what may be wrong with your kernels: uncoalesced memory accesses, an inefficient number of threads/blocks, etc. You also need to compile your code in Release mode with optimization flags (e.g. -O3) to get meaningful timings.
Concerning kernel optimization, you need to find your bottleneck (e.g. the 44% kernel), analyze it, and apply the usual optimization techniques:
Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
Coalesce all memory accesses (see the sketch after this list)
Prefer shared memory access to global memory access
Avoid code execution branching within a single warp as this serializes the threads
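To make the coalescing point concrete, here is a minimal sketch of my own (the kernels and array layout are hypothetical): consecutive threads should touch consecutive addresses.

// Coalesced: consecutive threads read consecutive floats, so a warp's
// 32 loads are served by a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read elements 'stride' apart, so each warp
// touches many separate memory segments and wastes bandwidth.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}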
You can also take advantage of instruction-level parallelism.
It is, however, hard to know when you cannot optimize your kernels any further. Saying that the execution time of your kernels is small does not mean much: small compared to what? Are you trying to do real-time computation? Is scalability an issue? These are some of the questions you need to answer before trying to optimize your kernels.
On a side note, you should also use error checking extensively, and rely on cuda-memcheck/cuda-gdb to debug your code.
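For the error-checking part, a common pattern (a sketch, not the only way to do it) is to wrap every runtime call in a macro and check for launch errors explicitly:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage around a (hypothetical) kernel launch:
// myKernel<<<blocks, threads>>>(args);
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during execution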
I have a CUDA program with huge memory accesses that are effectively random and thus NOT coalesced at all. When I benchmark this program for different kernel launch parameters and always choose the number of blocks as a multiple of 7 (from 7 up to, say, 980) and threadsPerBlock as a multiple of the warp size (from 32 up to, say, 1024), there is NO difference in the runtime of the program. How can this be explained?
Thanks a lot!
The influence of thread block size is minimal. It's the last optimization I would try, and only if occupancy is egregiously bad; Fermi-class hardware shows virtually the same performance whenever occupancy is above roughly 50%. If your kernel is really bad, you won't notice any difference at all.
Also, you can run the CUDA Visual Profiler on your MATLAB code; with GPU coding, profile everything.
Follow these steps in the session setup:
In Launch, specify your MATLAB executable.
In Working directory, select the directory of your MATLAB script.
In Arguments, pass: -nojvm -nosplash -r name_of_matlab_script (without the .m extension).
That said, from personal experience, see if you can use texture memory to do some caching. Even if the memory accesses are not coalesced, you may nevertheless get some cache hits from memory locality.
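If you want to try that, here is a minimal sketch of my own (the array name, its size, and the gather kernel are hypothetical) that routes scattered reads through a 1D texture object; this uses the texture-object API available on Kepler and newer, while Fermi-class cards would need the older texture reference API instead.

// Hypothetical gather kernel: reads go through the texture cache,
// which can help when the indices are random.
__global__ void gather(cudaTextureObject_t tex, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, idx[i]);   // cached read despite random indices
}

// Wrap a plain device array of n floats in a 1D texture object.
cudaTextureObject_t makeTexture(float *d_in, size_t n) {
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}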