Accelerating an image processing algorithm on GPU, parallel processing MATLAB - performance

I want to accelerate my algorithm because I need to run it on hundreds of images. I first tried unvectorized GPU code, running the same code on the GPU (an NVIDIA GeForce GT 650M with 2 GB on my PC), but it was much slower than the CPU version. After some searching I am convinced I should switch to vectorized GPU code using batch processing (pagefun, bsxfun), but I have not managed to get it working. Can someone help me with this code:
Q = 100;
for i = 3:n-2
    for j = 3:m-2
        A(i,j) = 0;
        for c = 1:Q
            if B(i,j,c) ~= 0
                A(i,j) = A(i,j) + (-(B(i,j,c)) .* log(B(i,j,c)));
            end
        end
    end
end
Another question: why does MATLAB use just 20% of my CPU? How can I take advantage of my full CPU to accelerate my processing?
Is MATLAB a single-threaded application?
Thanks in advance

The vectorized version is this:
BB = B(3:(n-2),3:(m-2),:);
cutoff = 10^(-6);
logBB = log(BB);
logBB(BB<cutoff) = 0; % remove divergent terms
A = -sum(BB.*logBB,3);
This should already run much faster, even on a CPU. If you have a GPU, all you need to do is store the input array on the GPU,
BB = gpuArray(BB);
and then collect the result
A = gather(A);
back to the CPU.
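Putting the pieces together, a minimal end-to-end sketch might look like the following (assuming B, n and m are defined as in the question; the 1e-6 cutoff is the same heuristic used above):
BB = gpuArray(B(3:(n-2), 3:(m-2), :));  % move the cropped block onto the GPU
cutoff = 1e-6;
logBB = log(BB);
logBB(BB < cutoff) = 0;                 % remove divergent terms, as above
A = -sum(BB .* logBB, 3);               % elementwise work and the sum run on the GPU
A = gather(A);                          % copy the result back to the CPU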

You will need to buy the Parallel Computing Toolbox (and use parfor).
This is a well-known limitation of MATLAB: some of the underlying functions do not parallelize across multiple cores (not threads). A quick ballpark is to look at how much CPU MATLAB uses and multiply that by the number of cores in your PC (this should get you somewhere around 100%).
If you want to utilize your computer's GPU, the Parallel Computing Toolbox is the only way to do so.
From MathWorks:
This really depends on what you are doing. For some code MATLAB can only utilize a single core of a single processor, for other code, MATLAB will automatically utilize all available cores (and maybe processors). It really depends on the underlying functions. Some things cannot be easily parallelized. Sometimes you can help MATLAB with things like parfor loops. Other times you might need something like MPI. Still other times there really is nothing you can do.
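For the "hundreds of images" part of the question, a parfor loop over the files is often the simplest first step. A rough sketch, where imageFiles and processOneImage are hypothetical placeholders for your own file list and per-image computation:
imageFiles = dir('*.png');                 % hypothetical list of input images
results = cell(numel(imageFiles), 1);
parfor k = 1:numel(imageFiles)             % starts a worker pool automatically if none is open
    B = double(imread(imageFiles(k).name));
    results{k} = processOneImage(B);       % placeholder for the per-image work
end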

Related

Use theano in a macbook pro without NVIDIA card

I have:
MacBook Pro (Retina, 13-inch, Mid 2014)
OS X Yosemite
Intel Iris 1536MB
I heard that I cannot use Theano with the GPU, only with the CPU, but I want to know whether the programming will be the same and Theano will internally decide to use the CPU or the GPU, or whether I have to program differently for each one.
Thanks a lot
For the most part a Theano program that runs well on a CPU will also run well on a GPU, with no changes to the code required. There are, however, a few things to bear in mind.
Not all operations have GPU versions so it's possible to create a computation that contains a component that cannot run on the GPU at all. When you run one of these computations on a GPU it will silently fall back to the CPU, so the program will run without failing and the result will be correct, but the computation is not running as efficiently as it might and will be slower because it has to copy data backwards and forwards between main and GPU memory.
How you access your data can affect GPU performance quite a lot but has little impact on CPU performance. Main memory tends to be larger than GPU memory so it's often the case that your entire dataset can fit in main memory but not in GPU memory. There are techniques that can be used to avoid this problem but you need to bear in mind the GPU limitations in advance.
If you stick to conventional neural network techniques, and follow the patterns used in the Theano sample/tutorial code, then it will probably run fine on the GPU since this is the primary use-case for Theano.
Yes, effectively Theano detects whether you have a GPU or not and decides whether to create variables on the CPU or the GPU. The only difficulty is that the configuration of the variables changes depending on whether the model was created with or without a GPU: if you create a model and its variables with the GPU configuration and save them in a *.pickle file (for example), and then load them on another PC without a GPU (or vice versa), the saved model and its variables will not work.

(Un)Limit CPU usage - Windows - Mathematica

I'm programming in Mathematica 8.
When I run my programme, the Win8 Task Manager shows that the CPU usage sits at 35% as soon as it starts to run, and my memory usage also increases to 44%. Does Win8 limit the amount of CPU usage that a given programme may have? I need my computer to use more of its resources so the programme runs faster.
Any help would be appreciated.
What's happening here is a common misconception about how processors approach problems involving heavy computation.
Although you may indeed have a powerful 4-core machine, and a program that is capable of using all 4 cores (which Mathematica definitely is!), unless the code is written in a parallel fashion you will only be able to use 1 core at a time for the calculations. As Mysticial mentioned in the comment, not all code is parallelizable; in fact, I'd say a great many problems are not inherently parallelizable.
Check here for some good examples of problems that split up well in a parallel fashion. Your memory usage will simply increase with the size of the data being stored temporarily (e.g. storing a 69x69 matrix takes up less memory (RAM) than a 4000x4000 one; being parallel has little to do with this, and more with the problem itself).
Anyway, tl;dr: your computer is acting normally. To use 100% of that 4-core machine, check out this Mathematica reference guide to parallel operations.

parallel processing multiple evaluations of a sequential task on a large data set -- a task for GPU computing?

I am working on some signal processing code in SciPy, and am now trying to use a numerical optimizer to tune it. Unfortunately, as these things go, it is turning out to be quite a slow process.
The operations I must perform for this optimization are the following:
Load a large 1-d data file (~ 120000 points)
Run optimizer, which:
Executes a signal processing operation, does not modify original data, produces 120000 new data points.
Examines difference between original signal and new signal using various operations,
One of which includes FFT-based convolution
Generates a single "error" value to summarise the result -- this is what should be minimized
Looks at error and re-runs operation with different parameters
The signal processing and error functions take under 3 seconds, but unfortunately doing it 50,000 times takes much longer. I am experimenting with various more efficient optimisation algorithms, but no matter what it's going to take thousands of iterations.
I have parallelised a couple of the optimisers I'm trying using CPU threads, which wasn't too difficult since the optimiser can easily perform several scheduled runs at once on separate threads using ThreadPool.map.
But this is only about a 2x speed-up on my laptop, or maybe 8x on a multicore computer. My question is, is this an application for which I could make use of GPU processing? I have already translated some parts of the code to C, and I could imagine using OpenCL to create a function from an array of parameters to an array of error values, and running this hundreds of times at once. -- Even if it performs the sequential processing part slowly, getting all the results in one shot would be amazing.
However, my guess is that the memory requirements (loading up a large file and producing a temporary one of equal size to generate every data point) would make it difficult to run the whole algorithm in an OpenCL kernel. I don't have much experience with GPU processing and writing CUDA/OpenCL code, so I don't want to set about learning the ins and outs if there is no hope in making it work.
Any advice?
Do you need to produce all 120,000 new points before analysing the difference? Could you calculate the new point, then decide for that point if you are converging?
How big are the points? A $50 graphics card today has 1 GB of memory - that should be plenty for 120K points. I'm not as familiar with OpenCL as with CUDA, but there may also be limits on how much of this is texture memory vs general memory, etc.
Edit: I'm more familiar with CUDA than OpenCL, but this probably applies to both.
The memory on GPUs is a bit more complex but very flexible: you have texture memory that can be read by the GPU kernel and has some very clever cache features to make access to values in 2D and 3D arrays very fast. There is OpenGL memory that you can write to for display, and there is a limited (16-64k?) cache per thread.
Although transfers from main memory to the GPU are relatively slow (a few GB/s), the internal memory bus on the graphics card is around 20x as fast as this.

MATLAB Parallel Computing Toolbox - Parallelization vs GPU?

I'm working with someone who has some MATLAB code that they want sped up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth spending the time to convert all of the MATLAB code to CUDA code to run on a machine with multiple GPUs... but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuarray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Underneath the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuarrays and the computation happens on the GPU.
The key differences between the two techniques are that parfor computations happen on the CPUs of nodes of the cluster with direct access to main memory. CPU cores typically have a high clock rate, but there are typically fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs and each PC has one or more Nvidia Tesla boards and you use both parfor loops and gpuarrays. However, I haven't had occasion to try this yet.
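As a hedged illustration of that hybrid idea (a sketch only, not something I have run on a real multi-GPU cluster), each parfor worker can push its own piece of work through gpuArray and gather only the result; tasks and the matrix product here are stand-ins for your real data and computation:
tasks = arrayfun(@(k) rand(2000), 1:8, 'UniformOutput', false);  % example inputs
results = cell(size(tasks));
parfor k = 1:numel(tasks)
    X = gpuArray(tasks{k});      % move this task's data onto the worker's GPU
    Y = X * X.';                 % stand-in for the real GPU-enabled computation
    results{k} = gather(Y);      % bring only the result back to host memory
end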
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason for this is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU, you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than 2 cores in your computer's CPU.
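If I read the simulation point correctly, a minimal sketch of that pattern would generate the random data directly in GPU memory and copy back only a small result (the 5000x5000 size is just an illustrative choice):
X = gpuArray.rand(5000);      % random matrix created directly in GPU memory
S = sum(X(:).^2);             % all the arithmetic stays on the GPU
result = gather(S);           % only a single scalar crosses the bus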
If you write it in CUDA it is guaranteed to run in parallel at the chip-level versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.

How to get best performance of 8 core system using INTEL fortran

Please let me know how to set the Intel Fortran compiler options to get the best performance out of an 8-core system, for both IA32 and x64. I want to execute a Fortran program and take advantage of all the CPU time available on the 8-core system. Right now the program is only using 13% of the CPU time.
You can learn about autovectorization and guided auto-parallelization features of Intel FORTRAN in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf.
If you are doing linear algebra, solvers, FFTs, you might get best results if you map your problem into calls into the Intel Math Kernel Libraries: http://software.intel.com/en-us/articles/intel-mkl/
which are already multithreaded and vectorized and cache optimized.
If you are doing media / signal processing you might map your problem into calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
In my specific application, a computational network model containing several loops running throughout 20k iterations, each iteration going through a number of nested if's, just enabling /Q2-level optimization in the compiler was sufficient to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I noticed that raising the optimization setting to the highest level (/Q3) did do what you were asking (running all CPU cores at about full load), but the computing time was NOT reduced at all.
Therefore, if one has a small problem and several cases to test and processing capacity is the only bottleneck, it could be a good idea to open more than one Fortran solution and run those cases simultaneously.
