Tensorflow Neural Network faster on CPU than GPU - performance

I have created a neural network classifier with 2 hidden layers of sizes [50, 25].
The model trains much faster on the CPU than on the GPU.
My questions are:
Is this expected? I realize the architecture is small, but is it really so small that the CPU should win?
How should I debug this?
I tried increasing the batch size, expecting that beyond some batch_size the GPU would overtake the CPU, but I don't see that happening.
My code is in TensorFlow 1.4.

Given the size of the network (very small), I'm inclined to think this is a data-transfer (DMA) issue: copying data from the CPU to the GPU is expensive, perhaps expensive enough to outweigh the GPU's advantage at large matrix multiplications.
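
One way to check this is to pin the same graph to each device and time a fixed number of training steps. Below is a minimal TF 1.x sketch, not the asker's code: the [50, 25] layer sizes come from the question, while the batch size, feature count, class count, optimizer and step count are assumptions chosen for illustration. With batches this small, the per-step feed and host-to-device copy overhead can easily dominate.

    # Hypothetical TF 1.x sketch: time the same small network on CPU vs GPU.
    import time
    import numpy as np
    import tensorflow as tf

    def build_and_time(device, batch_size=256, n_features=100, n_steps=500):
        tf.reset_default_graph()
        with tf.device(device):
            x = tf.placeholder(tf.float32, [None, n_features])
            y = tf.placeholder(tf.int32, [None])
            h1 = tf.layers.dense(x, 50, activation=tf.nn.relu)
            h2 = tf.layers.dense(h1, 25, activation=tf.nn.relu)
            logits = tf.layers.dense(h2, 10)
            loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
            train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

        data_x = np.random.rand(batch_size, n_features).astype(np.float32)
        data_y = np.random.randint(0, 10, size=batch_size)

        with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
            sess.run(tf.global_variables_initializer())
            sess.run(train_op, {x: data_x, y: data_y})  # warm-up step
            start = time.time()
            for _ in range(n_steps):
                sess.run(train_op, {x: data_x, y: data_y})
            return time.time() - start

    print('CPU:', build_and_time('/cpu:0'))
    print('GPU:', build_and_time('/device:GPU:0'))

If the GPU time is dominated by per-step overhead rather than by the matrix multiplications themselves, increasing the batch size alone may not close the gap until the batches become quite large.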

Related

Why are there so few GPU kernels for sparse operations in TensorFlow?

In my multi-GPU TensorFlow (1.13) training, some sparse-related operations consume a considerable amount of time. In the timeline I found that these sparse operations can only run on the CPU, since they have no GPU kernel support, and this results in frequent memory copies.
For example, SparseFillEmptyRows and SparseSegmentSum take up most of the CPU time in the timeline and cause a large number of memory copies (DtoH and HtoD). If these two ops could be moved to the GPU, I think there would be a big performance improvement.
I want to know the reason behind this. Is it simply that nobody has implemented them? Or do sparse operations perform poorly on the GPU?
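
One way to confirm which ops fall back to the CPU is to enable device-placement logging. A small TF 1.x sketch (the tensors are made up; only the placement log matters here):

    # Hypothetical sketch: log_device_placement shows where each op actually runs.
    import tensorflow as tf

    data = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    indices = tf.constant([0, 1, 2])
    segment_ids = tf.constant([0, 0, 1])

    # As observed in the question, SparseSegmentSum has no GPU kernel,
    # so it ends up on /cpu:0 even inside a GPU device scope.
    with tf.device('/device:GPU:0'):
        out = tf.sparse_segment_sum(data, indices, segment_ids)

    config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(out))

The placement log (or the timeline, as you already used) makes it easy to see every op that sits on the CPU and therefore forces DtoH/HtoD copies around it.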

Is physics simulation really faster on GPU?

From what I have observed, Havok does a significantly better job at rigid-body simulation than PhysX, especially the new Havok Physics 2013.
I'm not very familiar with how state-of-the-art physics engines work, and by testing alone I cannot get very accurate results.
For example, PhysX still seems to cripple CPU performance on purpose. My results show that once the number of simultaneously interacting rigid bodies exceeds a certain amount (somewhere between 1024 and 8096 boxes), its performance drops along a very unnatural, steep curve and only levels off once it matches the performance of Bullet, whereas many other engines I've tested scale relatively linearly with scene complexity.
This gets worse when I try to measure real-world scenarios such as games, game engines, or even CG production.
Undoubtedly, the GPU handles particle physics far better than the CPU, so I want to restrict this discussion to rigid-body and soft-body (including cloth) simulations.
So, is physics simulation really faster on GPU? If it is, by how much?
In general, the architecture of a GPU is designed to handle massively data-parallel tasks: it has a large number of stream-processor cores with very wide SIMD instruction sets. So if the task can be decomposed into similarly structured, independent kernels, a GPU will be much faster (sometimes by orders of magnitude). CPUs also have multiple cores and SIMD instructions, but fewer of them and not as wide. So it really depends on the specific properties and constraints of the workload, and whether or not it can take advantage of this extra parallel hardware.
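
As a toy illustration of what "similarly structured, independent" work looks like (a hypothetical sketch, not code from any engine):

    # Every body gets exactly the same update, with no dependence on the
    # other bodies -- the kind of uniform, data-parallel kernel that wide
    # SIMD units and GPUs handle well.
    import numpy as np

    n_bodies = 100_000
    dt = 1.0 / 60.0
    pos = np.random.rand(n_bodies, 3).astype(np.float32)
    vel = np.random.rand(n_bodies, 3).astype(np.float32)
    gravity = np.array([0.0, -9.81, 0.0], dtype=np.float32)

    # One uniform kernel applied to all bodies at once.
    vel += gravity * dt
    pos += vel * dt

Rigid-body contact resolution is the awkward part: the work per body is irregular and coupled to its neighbours, which is one reason rigid-body and soft-body speedups on the GPU are much less clear-cut than for particles.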

Eigen parallel performance drops when matrix exceeds 512x512

I benchmarked Eigen's SGEMM operation using one thread and using 8 threads, and I found that performance peaked at 512x512 but then dropped when exceeding that size. I was wondering if there is a specific reason for this, perhaps something to do with the complexity of the larger matrices? I looked at the benchmark for matrix-matrix operations on the Eigen website but didn't see anything similar.
At 512x512 I got roughly a 4x speedup in parallel, but at 4096x4096 barely 2x. I am using OpenMP for parallelism, and to bring it down to one thread I set num_of_threads to two.
Your results suggest that this algorithm is primarily memory-bandwidth bound at large matrix sizes. A 4Kx4K matrix (of floats?) exceeds the cache size of any CPU available to mere mortals, while 512x512 will fit comfortably into the L3 cache of most modern CPUs.
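
A quick back-of-the-envelope check of that cache argument (the 8 MiB L3 figure is an assumption for a typical desktop CPU; the element size assumes single-precision floats, with the A, B and C matrices all resident):

    # Rough working-set sizes for C = A * B with single-precision matrices.
    def footprint_mib(n, bytes_per_elem=4, num_matrices=3):
        return n * n * bytes_per_elem * num_matrices / 2**20

    l3_mib = 8  # assumed L3 cache size
    for n in (512, 4096):
        fits = "fits in" if footprint_mib(n) <= l3_mib else "exceeds"
        print(f"{n}x{n}: {footprint_mib(n):.1f} MiB ({fits} an {l3_mib} MiB L3)")
    # 512x512: 3.0 MiB (fits in an 8 MiB L3)
    # 4096x4096: 192.0 MiB (exceeds an 8 MiB L3)

Once the working set spills out of the last-level cache, all threads end up competing for the same DRAM bandwidth, which caps the parallel speedup.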
I ran a few tests on matrix multiplication using several BLAS implementations including Eigen. I've posted the results here. You might find it useful.

What's the speed of texture upload?

I would like to upload two images to GPU memory, and I'm interested in how fast I can do this.
In fact, will it be faster to compare two bitmaps in RAM with the CPU, or to upload them to the GPU and use GPU parallelism for the comparison?
If you run the CUDA device bandwidth sample, you'll get a benchmark for the upload speed.
Assuming DDR3 triple-channel 1600 MHz RAM, you'll get something like 38 GB/s of memory bandwidth.
Take a typical midrange card like a GTX 460 and you'll get something like 84 GB/s of memory bandwidth. Note that you'll have to make a hop across the bus, which is something like 8 GB/s theoretical, ~5.5 GB/s in practice, for a PCI-E 2.0 x16 link.
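
To put those figures together, a rough back-of-the-envelope sketch for a one-off comparison (the image resolution and 4-byte pixel size are assumptions):

    # Rough estimate using the bandwidth figures above.
    image_bytes = 1920 * 1080 * 4        # one RGBA bitmap, ~8.3 MB (assumed size)
    pcie_gbs = 5.5                       # practical PCI-E 2.0 x16, from above
    ram_gbs = 38.0                       # host RAM bandwidth, from above

    upload_ms = 2 * image_bytes / (pcie_gbs * 1e9) * 1e3   # copy both images over the bus
    cpu_cmp_ms = 2 * image_bytes / (ram_gbs * 1e9) * 1e3   # read both images once from RAM

    print(f"upload to GPU : {upload_ms:.2f} ms")    # ~3.0 ms
    print(f"compare on CPU: {cpu_cmp_ms:.2f} ms")   # ~0.4 ms

For a single comparison, the bus transfer alone already costs several times more than simply streaming both bitmaps through the CPU.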
Note that kotlinski's answer isn't quite correct: you can do the comparison in parallel and then do a parallel reduction, in which case the larger GPU device bandwidth can eventually win out.
I think the answer is likely to be: a loss if you upload to the GPU and compare only once; a possible gain if the comparison is made multiple times (the images are kept and modified on the GPU, for example).
Edit:
The "multiple times" comparison refers to the case where you modify the images in place in GPU memory. That would warrant another comparison (caching doesn't cut it), without incurring the penalty of another copy across the bus.
Since memory access is the bottleneck here, it is extremely likely that it is faster to just do it on the CPU. Running it in parallel is not likely to gain you anything; memory access is essentially a serial operation.
The answer to this question is highly debatable and depends entirely on your system's configuration. This means that you'll have to do the benchmarks yourself. Factors that could influence your situation:
Speed of your RAM
Speed of the GPU Bus
Whether or not you have shared RAM between GPU & CPU
However, I do think that in the general case (e.g. with bus speeds on the order of GB/s) it's faster to upload the images to the GPU and do the difference comparison there.

Optimum performance of GPU

I have been asked to measure how "efficiently" my code uses the GPU, i.e. what percentage of peak performance my algorithms achieve. I am not sure how to do this comparison. Until now I have basically put timers in my code and measured execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler, but I couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480's bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the right "efficiency" to measure. If your program is memory bound, then you need to compare your achieved bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by the run time (I use CUDA events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
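
As a concrete sketch of that calculation (the byte counts and elapsed time are made-up example numbers, and the 177.4 GB/s peak is the GTX 480 figure quoted above; in real code the elapsed time would come from CUDA event timing):

    # Effective bandwidth and efficiency relative to the card's peak.
    def bandwidth_efficiency(bytes_read, bytes_written, elapsed_s, peak_gbs):
        effective_gbs = (bytes_read + bytes_written) / elapsed_s / 1e9
        return effective_gbs, effective_gbs / peak_gbs * 100.0

    # e.g. a kernel that reads and writes a 64M-element float array in 3.5 ms
    n_bytes = 64 * 2**20 * 4
    gbs, pct = bandwidth_efficiency(n_bytes, n_bytes, 3.5e-3, 177.4)
    print(f"effective bandwidth: {gbs:.1f} GB/s ({pct:.0f}% of peak)")  # ~153 GB/s, ~86%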
I don't know much about determining efficiency if you are instead ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is somewhere in between memory bound and ALU bound.
Anyone...?
Generally, "efficiency" would probably be a measure of how much memory and how many GPU cycles (average, min, max) your program is using. The efficiency measure would then be avg(mem) / total memory for the time period, and likewise avg(GPU cycles) / max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume make pretty efficient use of the GPU), or you could measure against some GPU-intensive programs of your choice. That's how I'd do it, but I've never tried, so good luck!
As for bottlenecks and "optimal" performance: these are hard, open-ended problems that nobody can solve for you in general. Get out the old profiler and debugger and start working your way through your code.
I can't help with the profiler and micro-optimisation, but there is a CUDA Occupancy Calculator spreadsheet, http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls , which tries to estimate how your CUDA code uses the hardware resources, based on these values (a simplified sketch of the calculation follows the list):
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
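
A heavily simplified sketch of the idea behind the calculator (the per-SM limits are example values roughly matching a Fermi-era card, and the real spreadsheet also accounts for allocation granularity and per-architecture rules, so treat this only as an illustration):

    # How many blocks fit on one SM given each limiting resource, and the
    # resulting occupancy (active warps / maximum warps).
    def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                           max_threads_per_sm=1536, regs_per_sm=32768,
                           smem_per_sm=49152, max_blocks_per_sm=8, warp_size=32):
        by_threads = max_threads_per_sm // threads_per_block
        by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
        by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
        blocks = min(by_threads, by_regs, by_smem, max_blocks_per_sm)
        active_warps = blocks * threads_per_block // warp_size
        max_warps = max_threads_per_sm // warp_size
        return blocks, active_warps / max_warps

    blocks, occ = estimate_occupancy(threads_per_block=256, regs_per_thread=32,
                                     smem_per_block=4096)
    print(f"{blocks} blocks/SM, ~{occ:.0%} occupancy")  # 4 blocks/SM, ~67%

The point of the exercise is to see which of the three inputs above is the limiting resource, and whether reducing register or shared-memory usage would let more blocks run concurrently.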

Resources