CUDA testing in Julia - very low GPU utilization - windows

I have been trying to set up CUDA computing under Julia for my RTX 2070 GPU and, so far, I did not get any errors related to failed CUDA initialization when executing CUDA-parallelized code. However, the parallelized computations seem surprisingly slow, so I launched Pkg.test("CUDA") from Julia to get some more insight into why that could be. Here is a screenshot of some of the results:
Screenshot: Julia CUDA test results. The GPU allocation appears to be entirely negligible compared to the CPU.
This is also reflected in the GPU vs. CPU usage: running nvidia-smi shows 0% volatile GPU-util, while the CPU in the resource monitor was consistently at 80% usage or more throughout the test.
Further, the CUDA utilization graph in the task manager merely shows spikes in CUDA utilization rather than continuous usage: Screenshot of CUDA utilization in task manager.
Any suggestions as to why this could be the case? I have gone through the verification of proper CUDA package and driver installation several times now, and I am unsure what to do next.

As the comments note, the tests in CUDA.jl/test are designed to exercise the compilation pipeline, not to put the GPU under any significant load. Just to complete the picture, if you do want to load the GPU, you can modify an example from https://cuda.juliagpu.org/stable/tutorials/introduction/, for example along the lines of
using CUDA

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU, filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU, filled with 2.0

for i = 1:100000
    y_d .+= sqrt.(x_d)     # each broadcast launches a kernel on the GPU
end

Related

Type of CPU load to perform constant relaxed work

I'm trying to figure out how to put a certain type of load on the CPU that makes it work constantly but with moderate stress.
The only approach I know for loading a CPU below its maximum possible performance is to alternate between giving it something to do and sleeping for some time. E.g., to achieve 20% CPU usage, do some computation that takes about 0.2 seconds and then sleep for 0.8 seconds; the reported CPU usage will then be roughly 20%.
However, this essentially means the CPU will be jumping between maximum performance and idle all the time.
I wrote a small Python program that spawns a process for each CPU core, sets its affinity so each process runs on a designated core, and gives it some absolutely meaningless load:
def actual_load_cycle():
    x = list(range(10000))
    del x
repeating calls to this procedure in a loop and then sleeping for some time so that the working time is N% of the total time:
while 1:
    timer.mark_time(timer_marker)
    for i in range(coef):
        actual_load_cycle()
    elapsed = timer.get_time_since(timer_marker)
    # Now sleep so that the elapsed (working) time is CPU_LOAD_TARGET% of the whole cycle.
    time_to_sleep = elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET)
    sleep(time_to_sleep)
It works well, keeping the load within 7% of the desired CPU_LOAD_TARGET value - I don't need a precise amount of load.
But it drives the CPU temperature very high: with CPU_LOAD_TARGET=35 (real CPU usage reported by the system is around 40%), the CPU temps go up to 80 degrees.
Even with a minimal target like 5%, the temps still spike, just not as much - up to 72-73 degrees.
I believe the reason is that during the working fraction of each cycle the CPU works as hard as it can, and it doesn't cool down fast enough while sleeping afterwards.
When I'm running a game like Uncharted 4, however, the CPU usage as measured by MSI Afterburner is 42-47%, yet the temperatures stay under 65 degrees.
How can I achieve similar results? How can I program a load that keeps CPU usage high while the work itself remains relaxed, as games seem to do?
Thanks!
The heat dissipation of a CPU mainly depends on its power consumption, which in turn strongly depends on the workload - more precisely, on the instructions being executed and the number of active cores. Modern processors are very complex, so it is very hard to predict the power consumption of a given workload, especially when the executed code is Python running in the CPython interpreter.
Many factors can impact the power consumption of a modern processor. The most important one is frequency scaling: mainstream x86-64 processors adapt the frequency of a core based on the kind of computation being done (e.g. use of wide SIMD floating-point vectors like the ZMM registers of AVX-512 vs. scalar 64-bit integers), the number of active cores (the more active cores, the lower the frequency), the current temperature of the core, the fraction of time spent executing instructions vs. sleeping, etc. On modern processors, the memory hierarchy can draw a significant amount of power, so operations involving the memory controller, and more generally the RAM, can consume more power than those operating on in-core registers. In fact, depending on the instructions actually executed, the processor needs to enable/disable parts of its integrated circuit (e.g. SIMD units, integrated GPU, etc.), and not all of them can be enabled at the same time due to TDP constraints (see Dark silicon). Floating-point SIMD instructions tend to consume more energy than integer SIMD instructions. Something even weirder: the consumption can depend on the input data, since transistors may switch state more frequently with some data (researchers found this while running matrix-multiplication kernels on different kinds of platforms with different kinds of input). The power is adapted automatically by the processor, since it would be insane (if even possible) for engineers to consider all possible cases and all possible dynamic workloads.
One of the cheapest x86 instructions is NOP, which basically means "do nothing". That being said, the processor can run at its highest turbo frequency while executing a loop of NOPs, resulting in a pretty high power consumption. In fact, some processors can run NOPs in parallel on multiple execution units of a given core, keeping all the available ALUs busy. Funny point: running dependent instructions with a high latency may actually reduce the power consumption of the target processor.
The MWAIT/MONITOR instructions provide hints that allow the processor to enter an implementation-dependent optimized state. This includes lower power consumption, possibly due to a lower frequency (e.g. no turbo) and the use of sleep states. Basically, your processor can sleep for a very short time so as to reduce its power consumption, and can then run at a high frequency for a longer time thanks to the lower power draw / heat dissipation beforehand. The behaviour is similar to humans: the deeper the sleep, the faster the processor can run afterwards, but also the longer it takes to (completely) wake up. The bad news is that such instructions require very high privileges AFAIK, so you basically cannot use them from user-land code. AFAIK there are instructions to do this in user-land, like UMWAIT and UMONITOR, but they are not implemented yet except perhaps on very recent processors. For more information, please read this post.
In practice, the default CPython interpreter consumes a lot of power because it makes a lot of memory accesses (including indirections and atomic operations), executes a lot of branches that need to be predicted by the processor (which has special power-hungry units for that), and performs a lot of dynamic jumps within a large code base. The kind of pure-Python code you write does not reflect the actual instructions executed by the processor, since most of the time is spent in the interpreter itself. Thus, I think you need to use a lower-level language like C or C++ to better control the kind of workload being executed. Alternatively, you can use a JIT compiler like Numba so as to have better control while still writing Python code (though not pure-Python anymore). Still, keep in mind that a JIT can generate unwanted instructions that result in unexpectedly higher power consumption, or it can optimize away trivial code like a sum from 1 to N (simplified to just the expression N*(N+1)/2).
Here is an example of code:
import numba as nb

def test(n):
    s = 1
    for i in range(1, n):
        s += i
        s *= i
        s &= 0xFF
    return s

pythonTest = test
numbaTest = nb.njit('(int64,)')(test)  # compile the function

pythonTest(1_000_000_000)  # takes about 108 seconds
numbaTest(1_000_000_000)   # takes about 1 second
In this code, the Python function takes 108 times longer to execute than the Numba function on my machine (i5-9600KF processor), so one might expect roughly 108 times more energy to execute the Python version. However, in practice it is even worse: the pure-Python function causes the target core to draw a much higher power (not just more energy) than the equivalent compiled Numba implementation. This can be clearly seen on the temperature monitor:
Base temperature when nothing is running: 39°C
Temperature during the execution of pythonTest: 55°C
Temperature during the execution of numbaTest: 46°C
Note that my processor was running at 4.4-4.5 GHz in all cases (because the performance governor was selected). The temperature is read after 30 seconds in each case and is stable (thanks to the cooling system). The functions are run in a while(True) loop during the benchmark.
Note that games often use multiple cores and do a lot of synchronization (at least to wait for the rendering to complete). As a result, the target processor may run at a slightly lower turbo frequency (due to TDP constraints) and stays cooler thanks to the many small sleeps (which save energy).
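To illustrate that last point, here is a minimal, hypothetical sketch (not from the original answer) of a finer-grained duty cycle: instead of working flat out for 0.2 s and sleeping for 0.8 s, it interleaves many short bursts of work with short sleeps. The work function and the 10 ms slice length are arbitrary choices, and OS timer granularity may round very short sleeps up.

import time

CPU_LOAD_TARGET = 35   # desired CPU usage, in percent
SLICE = 0.010          # length of one work+sleep slice, in seconds

def small_work(deadline):
    # Meaningless work, similar in spirit to actual_load_cycle(), until the deadline.
    while time.perf_counter() < deadline:
        x = list(range(1000))
        del x

while True:
    start = time.perf_counter()
    # Work for CPU_LOAD_TARGET% of the slice, then sleep for the remainder.
    small_work(start + SLICE * CPU_LOAD_TARGET / 100)
    remaining = start + SLICE - time.perf_counter()
    if remaining > 0:
        time.sleep(remaining)

Whether this actually keeps temperatures closer to the "game-like" behaviour depends on the hardware and cooling, so treat it as a starting point for experimentation rather than a guaranteed fix.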

TensorFlow - Low GPU usage on Titan X

For a while, I have been noticing that TensorFlow (v0.8) does not seem to fully use the computation power of my Titan X. For several CNNs that I have been running, the GPU usage does not seem to exceed ~30%. Typically the GPU utilization is even lower, more like 15%. One particular example of a CNN that shows this behavior is the CNN from DeepMind's Atari paper with Q-learning (see link below for the code).
When I see other people in our lab running CNNs written in Theano or Torch, the GPU usage is typically 80%+. This makes me wonder: why are the CNNs that I write in TensorFlow so 'slow', and what can I do to make more efficient use of the GPU processing power? Generally, I am interested in ways to profile the GPU operations and discover where the bottlenecks are. Any recommendations on how to do this are very welcome, since this does not really seem possible with TensorFlow at the moment.
Things I did to find out more about the cause of this problem:
Analyzing TensorFlow's device placement, everything seems to be on /gpu:0, so that looks OK.
Using cProfile, I have optimized the batch generation and other preprocessing steps. The preprocessing is performed on a single thread, but the actual optimization steps performed by TensorFlow take much longer (see average runtimes below). One obvious idea to increase the speed is to use TF's queue runners, but since the batch preparation is already 20x faster than the optimization, I wonder whether this would make a big difference.
Avg. Time Batch Preparation: 0.001 seconds
Avg. Time Train Operation: 0.021 seconds
Avg. Time Total per Batch: 0.022 seconds (45.18 batches/second)
Run on multiple machines to rule out hardware issues.
Upgraded to the latest versions of cuDNN v5 (RC) and CUDA Toolkit 7.5, and reinstalled TensorFlow from source about a week ago.
An example of the Q-learning CNN for which this 'problem' occurs can be found here: https://github.com/tomrunia/DeepReinforcementLearning-Atari/blob/master/qnetwork.py
Example of NVIDIA SMI displaying the low GPU utilization: NVIDIA-SMI
With more recent versions of TensorFlow (I am using TensorFlow 1.4), we can obtain runtime statistics and visualize them in TensorBoard.
These statistics include the compute time and memory usage of each node in the computation graph.
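For example, something along the following lines works with the TF 1.x session API (a minimal sketch, not from the original answer; sess, train_op, loss, feed_dict and step are placeholders for the usual training objects, and the log directory is arbitrary):

import tensorflow as tf

# Assumed to exist already (not defined here):
#   sess      - an active tf.Session
#   train_op  - the training operation
#   loss      - a loss tensor
#   feed_dict - the feed dictionary for this batch
#   step      - the current step number (an int)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

writer = tf.summary.FileWriter('/tmp/tf_logs', sess.graph)

# Run one training step while collecting per-node compute time and memory usage.
_, loss_value = sess.run([train_op, loss],
                         feed_dict=feed_dict,
                         options=run_options,
                         run_metadata=run_metadata)

# Attach the collected statistics to this step so TensorBoard can display them.
writer.add_run_metadata(run_metadata, 'step%d' % step)

In TensorBoard's Graph tab you can then select that tagged run and color the graph by compute time or memory, which helps show which ops dominate and whether the GPU sits idle between steps.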

Why does CUDA float program get faster in full speed FP64 mode?

My CUDA program uses only float, int, short and char types in its computations. None of the input or output arrays have members of type double, and none of the kernels create any double values internally during computation.
This program has been compiled using CUDA SDK 5.5 in Release mode using NSight Eclipse. A typical compile line looks like this:
nvcc -O3 -gencode arch=compute_35,code=sm_35 -M -o "src/foo.d" "../src/foo.cu"
I am running this program on a GTX Titan on Linux. To my surprise, I noticed that it runs 10% faster when I enable the full-speed FP64 mode on the Titan. This can be done by enabling the CUDA Double Precision option in the NVIDIA X Server Settings program.
While I am happy about this free speed bonus, I would like to understand why a float-only CUDA program could get faster in FP64 mode.
My guess is that when you enable full-speed FP64 mode on the Titan, more compute units start participating in the computation: the FP64 units can also be used for FP32 work. But enabling the large number of FP64 units also lowers the clock, so the computation only gets about 10% faster.
Where does the 10% come from?
When the Titan runs in 1/24 FP64 mode, it runs at 837 MHz. When it runs in 1/3 FP64 mode, it runs at 725 MHz. So (1 + 1/3) / (1 + 1/24) * 725/837 = 1.109.
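Spelling that arithmetic out (assuming the FP32-capable throughput scales as 1 plus the FP64:FP32 rate), the extra throughput from the FP64 units is largely cancelled by the lower clock:

\[
\frac{1 + \tfrac{1}{3}}{1 + \tfrac{1}{24}} \times \frac{725\ \text{MHz}}{837\ \text{MHz}} = 1.28 \times 0.866 \approx 1.11,
\]

i.e. roughly the 10% speedup observed.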
References: http://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1/4
I found confirmation of my guess:
"What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32."
Reference http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
This information is for the GK104, while the Titan has the GK110, but they share the same architecture, so I think the GK110 also has this capability.
Titan cards have the FP64 implementation "capped" by default; this was done mainly for power-efficiency and clock-speed reasons.
NVIDIA deliberately chose not to enable full-speed FP64 by default and instead lets you control the behavior by setting FP64 to full speed (1/3 FP32) or reduced speed (1/24 FP32).
References: http://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1/4

Cuda profiler says that my two kernels are expensive, however their execution time seems to be small

I use two kernels; let's call them A and B.
I ran the CUDA profiler and this is what it returned:
The first kernel has 44% overhead while the second has 20%.
However, if I decide to find out the actual execution time by following this logic:
timeval tim;
gettimeofday(&tim, NULL);
double before = tim.tv_sec + (tim.tv_usec / 1000000.0);

runKernel<<<...>>>(...);

gettimeofday(&tim, NULL);
double after = tim.tv_sec + (tim.tv_usec / 1000000.0);
totalTime = totalTime + after - before;
The totalTime will be very small, somewhere around 0.0001 seconds.
I'm new to CUDA and I don't understand exactly what's going on. Should I try to make the kernels more efficient, or are they already efficient?
Kernel calls are asynchronous from the point of view of the CPU (see this answer). If you time your kernel the way you do, without any synchronization (i.e. without calling cudaDeviceSynchronize()), your timings will not mean anything, since the computation is still in progress on the GPU.
You can trust NVIDIA's profilers when it comes to timing your kernels (nvprof / nvvp). The NVIDIA Visual Profiler can also analyze your program and provide advice on what may be wrong with your kernels: uncoalesced memory accesses, an inefficient number of threads/blocks, etc. You also need to compile your code in release mode with optimization flags (e.g. -O3) to get relevant timings.
Concerning kernel optimization, you need to find your bottlenecks (e.g. your 44% kernel), analyze them, and apply the usual optimization techniques:
Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
Coalesce all memory accesses
Prefer shared memory access to global memory access
Avoid code execution branching within a single warp as this serializes the threads
You can also use instruction-level parallelism (you should read these slides).
It is, however, hard to know when you cannot optimize your kernels any further. Saying that the execution time of your kernels is small does not mean much: small compared to what? Are you trying to do real-time computation? Is scalability an issue? These are some of the questions you need to answer before trying to optimize your kernels.
On a side note, you should also use error checking extensively and rely on cuda-memcheck / cuda-gdb to debug your code.

Do modern GPUs (e.g. Fermi/Evergreen) support out-of-order execution?

I am writing an OpenCL kernel which involves a few barriers in a loop. I have tested the kernel on a CPU (8-core FX-8150) and the results show these barriers reduce running speed by a factor of 50-100x (I further verified this by re-implementing the kernel in Java using multi-threading + CyclicBarrier). I suspect the reason is that a barrier essentially stops the CPU from taking advantage of out-of-order execution, so I am a little worried whether I would observe the same magnitude of slowdown on a GPU. I checked a few official documents and googled around a bit, but there is little information available on this topic.
Current state-of-the-art GPUs are in-order pipelined processors. GPUs fill the pipeline effectively by interleaving instructions from different warps (wavefronts). In comparison, CPUs use out-of-order speculative execution to fill the pipeline. There are different functional units, such as ALUs and SFUs, which have separate pipelines. But note that an instruction dependency stalls the warp. For more information on instruction-dependency resolution on GPUs, refer to this NVIDIA patent.
According to the whitepaper "NVIDIA's Next Generation CUDA Compute and Graphics Architecture, Code-Named 'Fermi'", the NVIDIA GigaThread Engine has the following capabilities (page 5):
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution :)
Dual overlapped memory transfer engines
Evergreen has SIMD capabilities and has a chance to outperform some Fermi cards, but I don't know about its out-of-order behaviour. There is also the "local atomic add" advantage of the HD 7000 series over the GTX 600 series (nearly 10x faster).
