I benchmarked Eigen's SGEMM operation using one thread and using 8 threads, and found that performance peaked at 512x512 and then dropped when exceeding that size. Is there a specific reason for this, perhaps something to do with the complexity of the larger matrices? I looked at the matrix-matrix benchmarks on the Eigen website but didn't see anything similar.
At 512x512 the parallel version was about 4x faster, but at 4096x4096 it was barely 2x faster. I am using OpenMP for parallelism, and for the single-threaded baseline I set the number of threads to one.
Your results suggest that this algorithm is primarily memory-bandwidth bound at larger matrix sizes. A 4K x 4K matrix (assuming float, about 64 MB) exceeds the cache size of any CPU available to mere mortals, while a 512x512 matrix (1 MB) will comfortably fit into the L3 cache of most modern CPUs.
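One way to check this yourself is to time the same multiply over a range of sizes with 1 thread and 8 threads. A minimal sketch, assuming Eigen 3 built with OpenMP enabled (the sizes and timing approach are illustrative only, not the original benchmark):

```cpp
// Hedged sketch: times SGEMM at several sizes with 1 vs 8 threads.
// Assumes Eigen 3 compiled with OpenMP support (-fopenmp).
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

static double time_gemm(int n, int threads) {
    Eigen::setNbThreads(threads);              // controls Eigen's OpenMP pool
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf C(n, n);

    auto t0 = std::chrono::steady_clock::now();
    C.noalias() = A * B;                       // single-precision GEMM
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    for (int n : {512, 1024, 2048, 4096}) {
        double t1 = time_gemm(n, 1);
        double t8 = time_gemm(n, 8);
        std::cout << n << "x" << n << ": speedup " << t1 / t8 << "\n";
    }
}
```

If the multiply is bandwidth bound, the measured speedup should start to fall off roughly where the working set no longer fits in the L3 cache.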
I ran a few tests on matrix multiplication using several BLAS implementations including Eigen. I've posted the results here. You might find it useful.
In my multi-GPU TensorFlow (1.13) training, some sparse-related operations consume a considerable amount of time. In the timeline I found that these sparse operations can only run on the CPU, because they have no GPU kernel support, which results in frequent memory copies.
For example, in my timeline trace, SparseFillEmptyRows and SparseSegmentSum take up most of the CPU time and cause a large number of memory copies (DtoH and HtoD). If these two ops could be moved to the GPU, I think there could be a big performance improvement.
I want to know the reason behind this. Is it simply that no one has implemented GPU kernels for these ops, or do sparse operations perform poorly on the GPU?
The NVIDIA CUDA documentation for cuFFT says "These batched transforms have higher performance than single transforms".
(Source: http://docs.nvidia.com/cuda/cufft/index.html)
But it does not give anything quantitative. Is there any information about how large the speedup would be, compared to doing single transforms one at a time inside a for loop?
The speedup will depend on the size of the transforms, the number of batches, and the target hardware (and also the CUDA Toolkit version). With a large batch of small transforms you will see more of a speedup than otherwise. Part of the gain comes from avoiding kernel launch overhead, so for transform sizes large enough that the launch overhead is small compared to the kernel execution time, you won't see as much benefit. I believe that for very small transforms the library can pack several batches together and use more (memory-)efficient device functions.
I'm asking around to see if there are any white papers or other published reports. So far I haven't found any.
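In the meantime, here is a hedged sketch of the two approaches being compared (NX and BATCH are made-up sizes, and error checking and timing are omitted); the only difference is whether the loop over transforms lives on the host or inside cuFFT's batched plan:

```cpp
// Hedged sketch: a loop of single 1D FFTs vs. one batched plan.
// Assumes cuFFT from the CUDA Toolkit; link with -lcufft.
#include <cufft.h>
#include <cuda_runtime.h>

#define NX    256      // length of each transform (made-up value)
#define BATCH 1024     // number of transforms (made-up value)

void fft_loop(cufftComplex *d_data) {
    // One plan executed once per transform: pays launch overhead BATCH times.
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);
    for (int i = 0; i < BATCH; ++i)
        cufftExecC2C(plan, d_data + i * NX, d_data + i * NX, CUFFT_FORWARD);
    cufftDestroy(plan);
}

void fft_batched(cufftComplex *d_data) {
    // One batched plan: cuFFT schedules all BATCH transforms itself.
    cufftHandle plan;
    int n[1] = { NX };
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, NX,     // input layout: contiguous transforms
                  NULL, 1, NX,     // output layout: same
                  CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```

Timing both versions on your own hardware is the only reliable way to quantify the difference.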
I have created a neural network classifier with 2 hidden layers of [50, 25] units.
The model is training much faster on CPU than GPU.
My questions are :
Is this expected? I see that the architecture is small, but it doesn't seem so small that it should be faster on the CPU.
How should I debug this?
I tried increasing the batch size, expecting that beyond some batch_size the GPU would overtake the CPU, but I don't see that happening.
My code is in Tensorflow 1.4.
Given the (very small) size of the network, I'm inclined to think this is a DMA issue: copying data from the CPU to the GPU is expensive, maybe expensive enough that it cancels out the advantage the GPU has at doing larger matrix multiplications.
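One way to sanity-check this outside of TensorFlow is to measure how long just moving one batch to the GPU takes. A minimal CUDA sketch with made-up batch/feature sizes (the shapes are placeholders, not taken from the question):

```cpp
// Hedged sketch: times a single host-to-device copy with CUDA events.
// The batch/feature sizes are hypothetical placeholders.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t batch = 256, features = 784;          // hypothetical shapes
    const size_t bytes = batch * features * sizeof(float);

    std::vector<float> h_batch(batch * features, 1.0f);
    float *d_batch = nullptr;
    cudaMalloc(&d_batch, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_batch, h_batch.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("H2D copy of %zu bytes: %.3f ms\n", bytes, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_batch);
    return 0;
}
```

If that time is comparable to what the CPU needs for a whole training step of such a tiny network, the transfer cost alone can explain the result.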
I'm coding a raytracer using (py)CUDA and I'm getting a really low speedup: for example, for a 1000x1000 image, the GPU-parallelized code is only about 4 times faster than the sequential code executed on the CPU.
For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to generate an image with a width of W pixels and a height of H pixels, the setup is:
Grid: W blocks x H blocks.
Block: 5 threads.
The most expensive computation is solving the equations, which I do with a custom Runge-Kutta 4-5 algorithm.
The code is quite long and hard to explain in such a short question, but you can see it on GitHub. The CUDA kernel is here and the Runge-Kutta solver is here. The CPU version, with a sequential implementation of the exact same solver, can be found in the same repo.
The equations involve quite a lot of computation, and I wonder whether the CPU's optimized implementations of functions like sin, cos and sqrt are partly responsible for the low speedup.
My machine specs are:
GPU: GeForce GTX 780
CPU: Intel Core i7 930 @ 2.80 GHz
My questions are:
Is it normal to get a speedup of 3x or 4x in a GPU-parallelized raytracer against a sequential code?
Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?
Am I missing something important?
I understand the question can be too specific, but if you need more information, just say it, I'll be glad to provide it.
Is it normal to get a speedup of 3x or 4x in a GPU-parallelized raytracer against a sequential code?
How long is a piece of string? There is no answer to this question.
Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?
Yes, as noted in comments, you are using a completely inappropriate block size which is wasting approximately 85% of the potential computational capacity of your GPU.
Am I missing something important?
Yes, the answer to this question. Setting correct execution parameters is about 50% of the practical performance tuning requirements in CUDA, and you should be able to obtain noticeable performance improvements just by selecting a sane block size. Beyond this, careful profiling should be your next port of call.
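As a rough illustration of what a saner configuration could look like (this is not the poster's kernel; the kernel body and names are placeholders), a common pattern is one thread per ray, with blocks sized as a multiple of the 32-thread warp:

```cpp
// Hedged sketch of a more conventional launch configuration: one thread per
// ray/pixel and a block that is a multiple of the warp size (32), instead of
// 5 threads per block. The kernel body is a placeholder, not the real solver.
#include <cuda_runtime.h>

__global__ void trace(float *image, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    // ... integrate the ray for pixel (x, y) with the RK4-5 solver here ...
    image[y * W + x] = 0.0f;   // placeholder result
}

void launch(float *d_image, int W, int H) {
    dim3 block(16, 16);        // 256 threads: 8 full warps per block
    dim3 grid((W + block.x - 1) / block.x,
              (H + block.y - 1) / block.y);
    trace<<<grid, block>>>(d_image, W, H);
}
```

With only 5 threads per block, each 32-thread warp runs 27 idle lanes, which is where the roughly 85% figure above comes from.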
[This answer was assembled from comments and added as a community wiki entry to get this (very broad) question off the unanswered list, in the absence of enough close votes to close it.]
I would like to upload two images to GPU memory, and I'm interested in how fast I can do this.
In fact, will it be faster to compare two bitmaps in RAM with the CPU, or to upload them to the GPU and use GPU parallelism to do the comparison?
If you run the CUDA device bandwidth sample (bandwidthTest), you'll get a benchmark for the upload speed.
Assuming DDR3 triple-channel 1600 MHz RAM, you'll get something like 38 GB/s of memory bandwidth.
Take a typical midrange card like a GTX 460 and you'll get something like 84 GB/s of memory bandwidth. Note that you'll have to make a hop across the bus, which is about 8 GB/s theoretical and ~5.5 GB/s in practice for a PCI-E 2.0 x16 link.
Note that kotlinski's answer isn't quite correct: you can do the comparison in parallel and then do a parallel reduction (see the sketch after this answer), in which case the higher on-device GPU bandwidth can win out eventually.
I think the answer is likely to be: a loss if you upload to the GPU and do the comparison only once; a possible gain if the comparison is made multiple times (with the images kept and modified on the GPU, for example).
Edit:
The multiple-comparisons case refers to modifying the images in GPU memory in situ: each modification would merit another comparison (caching doesn't cut it), while not incurring the penalty of another copy across the bus.
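To make the compare-then-reduce idea concrete, here is a minimal sketch (the byte-wise comparison and image layout are assumptions, and a tuned version would use a block-level reduction rather than a single global atomic):

```cpp
// Hedged sketch of "compare in parallel, then reduce": each thread compares
// one byte pair and mismatches are accumulated with an atomic add.
#include <cuda_runtime.h>

__global__ void countDiffs(const unsigned char *a, const unsigned char *b,
                           int n, unsigned int *diffs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && a[i] != b[i])
        atomicAdd(diffs, 1u);       // naive reduction; fine for a sketch
}

unsigned int compareOnGpu(const unsigned char *h_a, const unsigned char *h_b,
                          int n) {
    unsigned char *d_a, *d_b;
    unsigned int *d_diffs, h_diffs = 0;
    cudaMalloc(&d_a, n);
    cudaMalloc(&d_b, n);
    cudaMalloc(&d_diffs, sizeof(unsigned int));

    // These uploads are exactly the PCIe hop discussed above.
    cudaMemcpy(d_a, h_a, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n, cudaMemcpyHostToDevice);
    cudaMemset(d_diffs, 0, sizeof(unsigned int));

    int block = 256;
    countDiffs<<<(n + block - 1) / block, block>>>(d_a, d_b, n, d_diffs);
    cudaMemcpy(&h_diffs, d_diffs, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_diffs);
    return h_diffs;
}
```

The two host-to-device copies are why the one-shot comparison tends to lose; if the images already live on the GPU, only the kernel and a 4-byte copy back remain.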
Since memory access is the bottleneck here, it is extremely likely that it is faster to just do it on the CPU. Making it run in parallel is not likely to gain you anything; memory access is essentially a serial operation.
The answer to this question is highly debatable and depends entirely on your system's configuration, so you'll have to do the benchmarks yourself. Factors that could influence your situation:
Speed of your RAM
Speed of the GPU Bus
Whether or not you have shared RAM between GPU & CPU
However, I do think that in the general case (e.g. with bus speeds on the order of GB/s) it's faster to upload the images to the GPU and do the difference comparison there.