When I started using CUDA acceleration for ONNX inference, I ran into a problem: if other processes on the machine are using the GPU heavily, my program's inference time roughly doubles (from 30 ms to 60 ms).
Is there any way to increase my program's share of the GPU's compute?
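There is no per-process "share of compute" knob in CUDA; if you control the machine, one option is to put the GPU into exclusive compute mode (nvidia-smi -c EXCLUSIVE_PROCESS) so only one compute process can use it at a time. Otherwise, a useful first step is to quantify the slowdown yourself. Below is a minimal latency-measurement sketch assuming the onnxruntime-gpu package; the model path and input shape are placeholders, and the numbers only mean something if you run it once with the GPU idle and once while the competing workload is active.

import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and input shape -- substitute your own.
MODEL_PATH = "model.onnx"
INPUT_SHAPE = (1, 3, 224, 224)

# Request the CUDA execution provider, falling back to CPU if it is unavailable.
session = ort.InferenceSession(
    MODEL_PATH, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
input_name = session.get_inputs()[0].name
dummy = np.random.rand(*INPUT_SHAPE).astype(np.float32)

# Warm up once so CUDA initialization is not counted in the timing.
session.run(None, {input_name: dummy})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: dummy})
elapsed = time.perf_counter() - start
print("mean latency: %.1f ms over %d runs" % (1000 * elapsed / runs, runs))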
Related
I have been trying to set up CUDA computing under Julia for my RTX 2070 GPU and, so far, I did not get any errors related to failed CUDA initialization when executing CUDA-parallelized code. However, the parallelized computations seem surprisingly slow, so I launched Pkg.test("CUDA") from Julia to get some more insight into why that could be. Here is a screenshot of some of the results:
Julia CUDA test. The GPU allocation appears to be entirely negligible as compared to the CPU.
This is also reflected in the CUDA vs. CPU usage — running nvidia-smi shows 0% volatile GPU-util, while the CPU in the resource monitor was consistently at 80% and more usage throughout the test.
Further, the CUDA utilization graph in the task manager merely shows spikes in CUDA utilization rather than continuous usage: Screenshot of CUDA utilization in task manager.
Any suggestions for why this could be the case? I have gone through the verification of a proper CUDA package and driver installation several times now, and I am unsure what to do next.
As the comments note, the tests in CUDA.jl/test are designed to exercise the compilation pipeline, not really to put the GPU under any significant load. Just to complete the picture, if you do want to try loading the GPU, you might modify an example from https://cuda.juliagpu.org/stable/tutorials/introduction/, for example along the lines of
using CUDA

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0
for i = 1:100000           # repeat the broadcast many times to keep the GPU busy
    y_d .+= sqrt.(x_d)
end
I have created a neural network classifier with 2 hidden layers of sizes [50, 25].
The model is training much faster on CPU than GPU.
My questions are:
Is this expected? I can see that the architecture is small, but not so small that it should be faster on the CPU :/
How should I debug this?
I tried increasing the batch size, expecting that beyond some batch size the GPU would overtake the CPU, but I don't see that happening.
My code is in TensorFlow 1.4.
Given the size of the network (very small), I'm inclined to think this is a DMA issue: copying data from the CPU to the GPU is expensive, perhaps expensive enough to cancel out the GPU's advantage at the larger matrix multiplications.
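One way to check that hypothesis is to time the same operation on the CPU and the GPU at a tiny and a large problem size, feeding the inputs from host memory so the GPU case pays the host-to-device copy each step. This is only a rough sketch assuming TensorFlow 1.x with GPU support; the matrix sizes are arbitrary.

import time
import numpy as np
import tensorflow as tf

def time_matmul(device, n, runs=50):
    # Feed the matrices from host memory so the GPU case pays the
    # host-to-device copy on every run, just like feeding training batches.
    tf.reset_default_graph()
    with tf.device(device):
        a = tf.placeholder(tf.float32, (n, n))
        b = tf.placeholder(tf.float32, (n, n))
        c = tf.matmul(a, b)
    x = np.random.rand(n, n).astype(np.float32)
    y = np.random.rand(n, n).astype(np.float32)
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(c, {a: x, b: y})          # warm-up run, not timed
        start = time.time()
        for _ in range(runs):
            sess.run(c, {a: x, b: y})
        return (time.time() - start) / runs

for n in (64, 4096):                       # tiny vs. large problem size
    print("n=%d  cpu=%.4fs  gpu=%.4fs"
          % (n, time_matmul("/cpu:0", n), time_matmul("/gpu:0", n)))

If the GPU only wins at the large size, the per-step overhead (launches plus transfers) is dominating your small network.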
I have both a serial program and a parallel program that uses the GPU.
The serial program takes 112.9 seconds to finish.
The parallel program with GPU takes 3.16 second to finish.
Thus, I have the speedup of 35.73.
Can I measure the efficiency of the program using the formula Speedup / NumberOfThreads?
The number of threads is 1024.
The efficiency is the ratio of the time on the CPU to the time on the GPU. You might want to try a multicore CPU implementation as well and compare it with the GPU implementation.
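For reference, the textbook definitions applied to the figures in the question work out as follows. Whether each of the 1024 GPU threads should count as a processing element is a modelling choice (GPU threads are not independent cores), so treat the resulting efficiency figure with caution.

# Classical parallel-performance metrics, using the figures from the question.
t_serial = 112.9      # seconds, serial program
t_parallel = 3.16     # seconds, GPU program
n_threads = 1024      # assumed to count as processing elements

speedup = t_serial / t_parallel          # ~35.73
efficiency = speedup / n_threads         # ~0.035, i.e. about 3.5% per thread
print("speedup = %.2f, efficiency = %.3f" % (speedup, efficiency))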
I need a GPGPU benchmark which will load the GPU so that I can measure parameters like temperature rise, amount of battery drain, etc. Basically I want to alert the user when the GPU is using a lot more power than in normal use. Hence I need to decide on threshold values of GPU temperature, clock frequency, and battery drain rate above which the GPU is considered to be using more power than normal. I have tried several graphics benchmarks, but most of them don't use GPU resources to the fullest.
Please provide a link to such a GPGPU benchmark.
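If a packaged benchmark doesn't saturate the GPU, one alternative is to generate sustained compute load yourself with large matrix multiplications while logging temperature and power through NVML. This is a rough sketch assuming an NVIDIA GPU and the cupy and pynvml packages; battery drain would have to be read from the OS separately.

import time
import cupy
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Two large matrices kept on the GPU; repeated matmuls keep the ALUs busy.
n = 4096
a = cupy.random.rand(n, n).astype(cupy.float32)
b = cupy.random.rand(n, n).astype(cupy.float32)

end = time.time() + 60                     # run the load for about 60 seconds
while time.time() < end:
    for _ in range(20):
        c = a @ b                          # result discarded; only the load matters
    cupy.cuda.Device(0).synchronize()      # wait for the queued work to finish
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # milliwatts -> watts
    print("temp=%d C  power=%.1f W" % (temp, power))

pynvml.nvmlShutdown()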
I have been asked to measure how "efficiently" my code uses the GPU, i.e. what % of peak performance the algorithms are achieving. I am not sure how to do this comparison. Till now I have basically put timers in my code and measured the execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler but couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your achieved bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by the run time (I use CUDA events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
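As a concrete illustration of that arithmetic (a sketch only: the element count and kernel time are placeholders to replace with your own measurements, and 177.4 GB/s is the GTX 480 peak quoted above):

# Achieved bandwidth = bytes read + bytes written, divided by kernel run time.
n = 1 << 24                      # number of float32 elements processed (placeholder)
bytes_moved = n * 4 * 2          # each element read once and written once
elapsed_s = 0.0045               # kernel time measured with CUDA events (placeholder)

achieved_gbps = bytes_moved / elapsed_s / 1e9
peak_gbps = 177.4                # GTX 480 peak memory bandwidth
print("achieved %.1f GB/s = %.0f%% of peak"
      % (achieved_gbps, 100 * achieved_gbps / peak_gbps))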
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
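For the compute-bound case left open above, the analogous back-of-the-envelope comparison divides the kernel's floating-point operation count by its run time and compares that to the card's theoretical peak. The FLOP count and time below are placeholders, and the quoted peak of roughly 1345 GFLOP/s single precision is the commonly cited GTX 480 figure (480 cores x ~1.4 GHz x 2 for FMA); substitute your own card's numbers.

# Achieved arithmetic throughput vs. theoretical peak (single precision).
flop_count = 2 * (1 << 30)       # floating-point ops executed by the kernel (placeholder)
elapsed_s = 0.0125               # kernel time from CUDA events (placeholder)

achieved_gflops = flop_count / elapsed_s / 1e9
peak_gflops = 1345.0             # ~ GTX 480 peak: 480 cores * 1.401 GHz * 2 (FMA)
print("achieved %.1f GFLOP/s = %.0f%% of peak"
      % (achieved_gflops, 100 * achieved_gflops / peak_gflops))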
Generally "efficiently" would probably be a measure of how much memory and GPU cycles (average, min, max) of your program is using. Then the efficiency measure would be avg(mem)/total memory for the time period and so on with AVG(GPU cycles)/Max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU-intensive programs of your choice. That's how I'd do it, but I've never tried, so good luck!
As for bottlenecks and "optimal" performance, these are probably NP-complete problems that no one can help you with. Get out the old profiler and debugger and start working your way through your code.
I can't help with the profiler and micro-optimisation, but there is a CUDA occupancy calculator at http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls which tries to estimate how your CUDA code uses the hardware resources, based on these values (a rough sketch of the same calculation follows the list):
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
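Here is a rough sketch of how the spreadsheet combines those inputs. The per-SM limits are assumptions (the figures below are what I'd expect for a Turing-class card such as the RTX 2070 mentioned above), and the real calculator also accounts for allocation granularity, so treat this only as an approximation.

import math

def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              # Per-SM limits: assumed values, check your card's compute capability.
              max_warps_per_sm=32, regs_per_sm=65536,
              smem_per_sm=65536, max_blocks_per_sm=16, warp_size=32):
    warps_per_block = math.ceil(threads_per_block / warp_size)

    # How many blocks fit on one SM, limited by each resource in turn.
    limit_warps = max_warps_per_sm // warps_per_block
    limit_regs = regs_per_sm // max(1, regs_per_thread * threads_per_block)
    limit_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm

    blocks = min(limit_warps, limit_regs, limit_smem, max_blocks_per_sm)
    return blocks * warps_per_block / max_warps_per_sm

# Example: 256 threads/block, 32 registers/thread, 4 KB shared memory/block.
print("occupancy = %.0f%%" % (100 * occupancy(256, 32, 4096)))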