Is there a way of parallelizing the tf.nn.conv2d operation in TensorFlow on a GPU when the batch size equals 1?
I want to feed only one image to my CNN and have the computation parallelized on the GPU. As far as I understand, TensorFlow does not parallelize efficiently for a batch size of 1; in practice the process is faster on the CPU.
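For illustration, a minimal sketch of the kind of computation I mean (TF 1.x style; the image size and filter shape here are just placeholders), with the convolution explicitly placed on the GPU:
import numpy as np
import tensorflow as tf

# One image, batch size 1; spatial size and filter shape are placeholders.
image = tf.placeholder(tf.float32, shape=[1, 224, 224, 3])
weights = tf.Variable(tf.random_normal([3, 3, 3, 64]))

with tf.device('/GPU:0'):  # ask TF to place the convolution on the GPU
    conv = tf.nn.conv2d(image, weights, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(conv, feed_dict={image: np.zeros((1, 224, 224, 3), np.float32)})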
In my multi-GPU TensorFlow (1.13) training, some sparse-related operations consume a considerable amount of time. In the timeline, I found that these sparse operations can only be performed on the CPU because they have no GPU kernel support, which results in frequent memory copies.
For example, SparseFillEmptyRows and SparseSegmentSum take up most of the CPU time and cause a large number of memory copies (DtoH and HtoD). If these two ops could be moved to the GPU, I think there would be a big performance improvement.
I want to know the reason behind this. Is it simply that no one has implemented GPU kernels for them, or do sparse operations perform poorly on the GPU?
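For reference, one way to see which ops fall back to the CPU is to enable device placement logging when creating the session. A minimal TF 1.x sketch (the sparse op here is arbitrary, just to trigger the logging):
import tensorflow as tf

# Log where every op is placed; ops without a GPU kernel are reported on /CPU:0.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sp = tf.SparseTensor(indices=[[0, 0], [1, 1]], values=[1.0, 2.0], dense_shape=[2, 2])
    total = tf.sparse_reduce_sum(sp)  # a simple sparse op, just to see its placement
    print(sess.run(total))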
I want to accelerate my algorithm because I need to run it on hundreds of images. I first tried unvectorized GPU code, running the same code on the GPU (I have an NVIDIA GeForce GT 650M with 2 GB on my PC), but it was much slower than the CPU version. After some searching, I am convinced I should move to vectorized GPU code using batch processing (pagefun, bsxfun), but I have tried hard to solve this problem without success. Can someone help me with this code?
Q = 100;
for i = 3:n-2
    for j = 3:m-2
        A(i,j) = 0;
        for c = 1:Q
            if B(i,j,c) ~= 0
                A(i,j) = A(i,j) + (-(B(i,j,c)).*log(B(i,j,c)));
            end
        end
    end
end
Another question: why does MATLAB use just 20% of my CPU? How can I take advantage of my CPU to accelerate my processing?
Is MATLAB a single-threaded application?
Thanks in advance
The vectorized version is this:
BB = B(3:(n-2),3:(m-2),:);
cutoff = 10^(-6);
logBB = log(BB);
logBB(BB<cutoff) = 0; % remove divergent terms
A = -sum(BB.*logBB,3);
This should already run much faster even on a CPU. If you have a GPU, all you need to do is have the input array
BB = gpuArray(BB);
stored on the GPU, and then collect the results
A = gather(A);
back to the CPU.
You will need to buy the Parallel Computing Toolbox (and use parfor).
This is a well-known limitation of MATLAB, where some of the underlying functions do not parallelize across multiple cores (not threads). A quick ballpark is to look at how much CPU MATLAB uses and multiply that by the number of cores you have in your PC (this should get you to somewhere around 100%).
If you want to utilize your computer's GPU, the Parallel Computing Toolbox is the only way to do so.
From MathWorks:
This really depends on what you are doing. For some code MATLAB can only utilize a single core of a single processor, for other code, MATLAB will automatically utilize all available cores (and maybe processors). It really depends on the underlying functions. Some things cannot be easily parallelized. Sometimes you can help MATLAB with things like parfor loops. Other times you might need something like MPI. Still other times there really is nothing you can do.
I benchmarked Eigen's SGEMM operation using one thread and using 8 threads, and what I found was that performance peaked at 512x512 but then dropped when exceeding that size. I was wondering if there is any specific reason for this, perhaps something to do with the complexity of the larger matrices? I looked at the matrix-matrix benchmark on the Eigen website but didn't see anything similar.
At 512x512 I got about 4x faster in parallel, but at 4096x4096 I got barely 2x faster. I am using OpenMP for parallelism, and to take it down to one thread I set num_of_threads to two.
Your results suggest that this algorithm is primarily memory-bandwidth bound at large matrix sizes. A 4Kx4K matrix (of floats?) exceeds the cache size of any CPU available to mere mortals, while 512x512 will comfortably fit into the L3 cache of most modern CPUs.
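As a rough back-of-the-envelope check (assuming single-precision floats and counting the A, B and C operands of the GEMM):
# Rough working-set sizes for C = A * B with 4-byte (single-precision) floats.
for n in (512, 4096):
    mib = 3 * n * n * 4 / 2**20  # A, B and C together
    print("%d x %d: %.0f MiB" % (n, n, mib))
# 512 x 512   ->   3 MiB: fits in a typical L3 cache
# 4096 x 4096 -> 192 MiB: far bigger than any L3 cache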
I ran a few tests on matrix multiplication using several BLAS implementations including Eigen. I've posted the results here. You might find it useful.
I am working on some signal processing code in SciPy, and am now trying to use a numerical optimizer to tune it. Unfortunately, as these things go, it is turning out to be quite a slow process.
The operations I must perform for this optimization are the following:
Load a large 1-d data file (~ 120000 points)
Run optimizer, which:
Executes a signal-processing operation that does not modify the original data and produces 120,000 new data points.
Examines the difference between the original signal and the new signal using various operations,
one of which is an FFT-based convolution.
Generates a single "error" value to summarise the result; this is what should be minimized.
Looks at the error and re-runs the operation with different parameters.
The signal processing and error functions take under 3 seconds, but unfortunately doing it 50,000 times takes much longer. I am experimenting with various more efficient optimisation algorithms, but no matter what it's going to take thousands of iterations.
I have parallelised a couple of the optimisers I'm trying using CPU threads, which wasn't too difficult since the optimiser can easily perform several scheduled runs at once on separate threads using ThreadPool.map.
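A minimal sketch of that pattern (the signal-processing step below is a stand-in for the real one, and the parameter values are made up):
from multiprocessing.pool import ThreadPool
import numpy as np

signal = np.random.rand(120000)            # stand-in for the loaded data file

def error_for(params):
    # Stand-in for the real processing + comparison step; it returns one scalar
    # error for one parameter set.
    scale, width = params
    processed = np.convolve(signal, np.ones(int(width)) / width, mode='same') * scale
    return float(np.sum((processed - signal) ** 2))

candidate_params = [(0.9, 5), (1.0, 5), (0.9, 7), (1.0, 7)]   # made-up values

pool = ThreadPool(processes=4)
errors = pool.map(error_for, candidate_params)  # several scheduled runs at once
print(errors)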
But this is only about a 2x speed-up on my laptop, or maybe 8x on a multicore computer. My question is: is this an application for which I could make use of GPU processing? I have already translated some parts of the code to C, and I could imagine using OpenCL to create a function from an array of parameters to an array of error values, and running this hundreds of times at once. Even if it performs the sequential processing part slowly, getting all the results in one shot would be amazing.
However, my guess is that the memory requirements (loading up a large file and producing a temporary one of equal size to generate every data point) would make it difficult to run the whole algorithm in an OpenCL kernel. I don't have much experience with GPU processing or writing CUDA/OpenCL code, so I don't want to set about learning the ins and outs if there is no hope of making it work.
Any advice?
Do you need to produce all 120,000 new points before analysing the difference? Could you calculate each new point, then decide for that point whether you are converging?
How big are the points? A $50 graphics card today has 1 GB of memory, which should be plenty for 120K points. I'm not as familiar with OpenCL as with CUDA, but there may also be limits on how much of this is texture memory vs. general memory, etc.
Edit: I'm more familiar with CUDA than OpenCL, but this probably applies to both.
The memory on GPUs is a bit more complex but very flexible. You have texture memory that can be read by the GPU kernel and has some very clever cache features to make access to values in 2D and 3D arrays very fast. There is OpenGL memory that you can write to for display, and there is a limited (16-64 KB?) cache/shared memory per block of threads.
Although transfers from main memory to the GPU are relatively slow (a few GB/s), the internal memory bus on the graphics card is about 20x as fast as this.
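To put that in perspective for the 120,000-point data set, a rough estimate (assuming 8-byte doubles and a typical, made-up figure of ~4 GB/s host-to-device bandwidth):
# Rough cost of copying the whole data set to the GPU, assuming 8-byte doubles
# and a ~4 GB/s host-to-device transfer rate (a typical, made-up figure).
points = 120000
size_mb = points * 8 / 1e6             # ~0.96 MB for the full signal
transfer_ms = size_mb / 4000.0 * 1000  # ~0.24 ms per copy
print("%.2f MB, %.2f ms per copy" % (size_mb, transfer_ms))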
I'm working with someone who has some MATLAB code that they want to be sped up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth spending the time to convert all of the MATLAB code into CUDA code to run on a machine with multiple GPUs, but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation happens on the GPU.
The key difference between the two techniques is that parfor computations happen on the CPUs of the cluster nodes, with direct access to main memory. CPU cores typically have a high clock rate, but there are typically fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs where each PC has one or more NVIDIA Tesla boards and you use both parfor loops and gpuArrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than two cores in your computer's CPU.
If you write it in CUDA, it is guaranteed to run in parallel at the chip level, versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.