I have implemented Cholesky factorization for solving large linear systems on the GPU using the ATI Stream SDK. Now I want to exploit the computational power of more GPUs and run this code on multiple GPUs.
Currently I have one machine with one GPU installed, and Cholesky factorization runs properly on it.
I want to do this for N machines, each with one GPU installed. So please suggest how I should proceed.
First, you have to be aware that this approach will introduce three levels of latency for any communication between nodes:
GPU memory on machine 1 to main memory on machine 1
Main memory on machine 1 to main memory on machine 2
Main memory on machine 2 to GPU memory on machine 2
A good first step will be to do some back-of-the-envelope calculations to determine whether the speed-up you gain by splitting the problem between multiple machines will outweigh the latency you introduce.
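As a very rough model (ignoring any overlap of communication and computation): if T1 is the time for the factorization on a single GPU and Tcomm is the total time spent pushing data through those three hops, then

TN ≈ T1 / N + Tcomm

so splitting across N machines only pays off while Tcomm stays well below T1 * (1 - 1/N). Plugging in your measured PCIe and network bandwidths for your matrix sizes should tell you quickly whether that holds.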
Once you're sure the approach is the one you want to follow, then it's pretty much up to you to implement this correctly. Note that, currently, NVIDIA's CUDA or OpenCL libraries will be better choices for you because they allow you to access the GPU for computation without having it coupled with an X session. Once ATI's OpenCL implementation supports the GPU, then this should also be a viable option.
Since you already have a working GPU implementation, here are the basic steps you must follow:
Determine how to update your factorization algorithm to support processing by separate nodes
Set up the data exchange between N computers (I notice you have opted for MPI for this)
Set up the scatter operation that will divide the input problem amongst the computational nodes (a minimal sketch of this scatter/gather plumbing follows the list)
Set up the data exchange between a machine and its GPU
Set up the gather operation that will collect the results from the nodes back onto one node
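To make the MPI side of this concrete, here is a minimal sketch of the scatter / per-node GPU compute / gather plumbing. It assumes one GPU per node and a simple block-row split, and cholesky_panel_gpu is a hypothetical placeholder for your existing GPU factorization code; a real distributed Cholesky would also have to broadcast each finished panel so the other nodes can update their trailing blocks (which is essentially what ScaLAPACK does with a block-cyclic layout).

// sketch: MPI scatter / GPU compute / gather skeleton, one GPU per node
#include <mpi.h>
#include <vector>

void cholesky_panel_gpu(double* block, int rows, int n)
{
    // placeholder: call your existing ATI Stream / OpenCL factorization code here
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 4096;                   // matrix order (example value)
    const int rows_per_node = n / nprocs; // assume it divides evenly for simplicity

    std::vector<double> A;                // full matrix lives only on the root node
    if (rank == 0) A.resize(static_cast<size_t>(n) * n);   // ...fill with your input...

    // scatter one block of rows to every node
    std::vector<double> my_block(static_cast<size_t>(rows_per_node) * n);
    MPI_Scatter(rank == 0 ? A.data() : nullptr, rows_per_node * n, MPI_DOUBLE,
                my_block.data(), rows_per_node * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // each node moves its block to its GPU and factorizes it there
    cholesky_panel_gpu(my_block.data(), rows_per_node, n);

    // gather the factorized blocks back onto the root node
    MPI_Gather(my_block.data(), rows_per_node * n, MPI_DOUBLE,
               rank == 0 ? A.data() : nullptr, rows_per_node * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}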
It's a very specialised question. Suggest you check the Stream developer resources and the Stream Developer Forums.
I showed this Q to a colleague of mine who knows about these things.
He suggested you use ScaLAPACK.
If I use the function:
net=feedforwardnet([60 60])
net2=train(net,x,t)
It takes about 20 minutes to train. (I have done this on multiple computers with the same specs, and the average time is always around 20 minutes.)
If I use the function:
parpool % starts a local parallel pool connected to 2 workers
net2=train(net,x,t,'useParallel','yes')
It takes around 40 minutes to complete training. I have two cores, so this is counterintuitive: it should be twice as fast, not twice as slow. I am using the same starting network, and the same training inputs and targets.
Also, when I open the task manager during NN training, it shows that both cores are working at 100%, even when parpool and useParallel are turned off.
This page of the Mathworks website says that "Parallel Computing Toolbox™ allows Neural Network Toolbox™ to simulate and train networks faster and on larger datasets than can fit on one PC. Parallel training is currently supported for backpropagation training only, not for self-organizing maps."
I am using 2000 training examples in the data set. There are 32 inputs and 3 outputs so this is definitely a large data set. The parallel pool is also definitely turned off when I just use the net2=train(net,x,t) function.
I have tested the use of parpool with other functions (ones containing parfor loops), and the calculation is usually twice as fast. It just seems to be the neural network training that goes slower.
Is there any reason for this?
I am using an Intel Core 2 Duo E8400 CPU @ 3 GHz, and MATLAB R2013b. I am also using a computer on a network (inside a university). I'm not sure if this makes a difference.
More info on the university computer network: I am using multiple computers on the network at the same time. I have not connected them together to do distributed computing; each one is just doing its own thing, using parallel computing on its own two cores. However, I am not sure if the computers are still interfering with each other in some way, because they are logged on with the same user. I load the training input and target data into the MATLAB workspace on each computer using:
load('H:\18-03-14\x.mat')
load('H:\18-03-14\net.mat')
load('H:\18-03-14\t.mat')
where H: is the network drive. I am not sure whether, once these are in the MATLAB workspace, they are still somehow connected and interfere with each other across different computers. Do they?
There are two types of parallelism going on: multicore and multithread. MATLAB implements basic matrix operations with multicore support, so even without the parallel computing toolbox, all your cores are being utilized.
The Parallel Computing Toolbox additionally allows for multiple threads. Multiple threads have two advantages. The first is that they allow calculations to be threaded across multiple PCs using MATLAB Distributed Computing Server for a reliable linear speed up. The second is less obvious, which is that multiple threads on a single PC can improve on already single-threaded multicore calculations, even though that might seem counter intuitive. However, there is extra overhead for using parallel threads so this is not always the case. More cores and large problems are more likely to see a speed up.
Your problem is not a large one. A large problem would have tens of thousands of examples or more, whereas yours has only 2000.
Also, two hidden layers with 60 neurons each is almost certainly a far larger network than you need. Your problem has 2000 samples * 3 outputs = 6000 constraints. Your network has 32*60 + 60*60 + 60*3 = 5700 weights and 60 + 60 + 3 = 123 biases, for a total of 5823 adjustable variables, which is almost as many as the number of constraints. I would suggest far fewer weights and probably only a single hidden layer. This will train much faster, and is also likely to generalize better.
So maybe start with feedforwardnet(100) and then increase that if the desired accuracy is not found.
You can see the benefits of multiple threads on a larger problem using this Neural Network Toolbox example dataset with 68,308 examples:
[x,t] = vinyl_dataset;                    % 68,308-example dataset
net = feedforwardnet(140,'trainscg');
rng(0), tic, net2 = train(net,x,t); toc   % serial training
parpool                                   % open a local pool of workers
rng(0), tic, net2 = train(net,x,t,'useParallel','yes'); toc   % parallel training
The 100% CPU load without the Parallel Computing Toolbox shows that the function train, or the relevant functions it calls, are implemented using multithreading. In these cases, the Parallel Computing Toolbox only adds a useless overhead of inter-process communication.
I'm asking on behalf of a friend working in numerical astrophysics.
Basically what he's doing is simulating a cloud of gas. There are a finite number of cells and the timestep is defined such that gas cannot cross more than one cell each step. Each cell has properties like density and temperature. Each timestep, these (and position) need to be calculated. I believe it's mainly position that's the issue, as that is affected primarily by the gravitational interactions among the cells, all of which affect each other.
At the moment he's running this on a cluster of ~150 nodes but I wondered, if it's parallelizable like this, could it be run faster on a few GPUs with CUDA? At the moment it takes him a couple of days to finish a simulation. As GPUs generally have ~500 cores, it seemed like they could provide a boost.
Maybe I'm totally wrong.
Yes, this sounds like a decent application for a GPU. GPU processing is most effective when it's running the same function on a large data set. If you've already got it running in parallel on a cluster, I'd say write it and test it on a single graphics card, see whether that's an improvement over a single cluster node, and then scale accordingly.
The task you describe is a good fit for the GPU. GPUs have been used successfully to dramatically improve performance in areas such as particle, aerodynamic, and fluid simulations.
Without knowing more details about the simulation it's impossible to say for sure whether it would gain a performance boost. Broadly speaking, algorithms that are compute bound (that is, many arithmetic operations per memory transaction) tend to benefit most from offloading to the GPU.
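A quick way to gauge this is the arithmetic intensity of the inner loop, roughly the number of floating-point operations per byte of data moved. A direct gravity sum does on the order of N interactions per cell for only a handful of loads, so its intensity is high and the GPU's cores stay busy; a purely bandwidth-limited stencil update benefits far less. (This is only a rule of thumb, not a guarantee for your friend's particular code.)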
For astrophysics simulations specifically, the following link may be of use: http://www.astrogpu.org/
I need to write an application that hashes words from a dictionary to make WPA pre-shared-keys. This is my thesis for a "Networking Security" course. The application needs to be parallel for increased performance. I have some experience with MPI from my IT studies but I would like to tie it up with CUDA. The idea is to use MPI to distribute the load evenly to the nodes of the cluster and then utilize CUDA to run the individual chunks in parallel inside the GPUs of the nodes.
Distributing the load with MPI is something I can easily do and have done in the past. Also computing with CUDA is something I can learn. There is also a project (pyrit) that does more or less what I need to do (actually a lot more) and I can get ideas from there.
I would like some advice on how to make the connection between MPI and CUDA. If there is somebody who has built anything like this, I would greatly appreciate their advice and suggestions. Also, if you happen to know of any resources on the topic, please point me to them.
Sorry for the lengthy intro but I thought it was necessary to give some background.
This question is largely open-ended, so it's hard to give a definitive answer. This one is just a summary of the comments made by High Performance Mark, Jonathan Dursi, and me. I do not claim authorship and have therefore made this answer a community wiki.
MPI and CUDA are orthogonal. The former is an IPC middleware and is used to communicate between processes (possibly residing on separate nodes) while the latter provides highly data-parallel shared-memory computing to each process that uses it. You can break the task into many small subtasks and use MPI to distribute them to worker processes running on the network. The master/worker approach is suitable for this kind of application, especially if words in the dictionary vary greatly in their length and variance in processing time is to be expected. Provided with all the necessary input values, worker processes can then use CUDA to perform the necessary computations in parallel and then return results back using MPI. MPI also provides the mechanisms necessary to launch and control multinode jobs.
Although MPI and CUDA could be used separately, modern MPI implementations provide some mechanisms that blur the boundary between the two. It could be direct support for device pointers in MPI communication operations, which transparently call CUDA functions to copy memory when necessary, or it could even be support for RDMA to/from device memory without an intermediate copy to main memory. The former simplifies your code, while the latter can save varying amounts of time, depending on how your algorithm is structured. The latter also requires fairly new CUDA hardware and drivers as well as newer networking equipment (e.g. a newer InfiniBand HCA).
MPI libraries that support direct GPU memory operations include MVAPICH2 and the trunk SVN version of Open MPI.
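As an illustration of the difference, here is a hedged sketch of the send side on a worker, assuming a CUDA-aware build of MVAPICH2 or Open MPI; d_hashes is assumed to point to device memory that your CUDA hashing kernel has already filled. With a CUDA-aware MPI you can hand the device pointer straight to MPI_Send; otherwise you stage the data through host memory yourself.

// sketch: returning device-resident results to the master over MPI
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

void send_results(unsigned char* d_hashes, int nbytes, int master_rank, bool cuda_aware_mpi)
{
    if (cuda_aware_mpi) {
        // CUDA-aware MPI accepts the device pointer directly and does the
        // device-to-host copy (or RDMA) internally
        MPI_Send(d_hashes, nbytes, MPI_UNSIGNED_CHAR, master_rank, 0, MPI_COMM_WORLD);
    } else {
        // plain MPI: copy to host memory explicitly, then send
        std::vector<unsigned char> h_hashes(nbytes);
        cudaMemcpy(h_hashes.data(), d_hashes, nbytes, cudaMemcpyDeviceToHost);
        MPI_Send(h_hashes.data(), nbytes, MPI_UNSIGNED_CHAR, master_rank, 0, MPI_COMM_WORLD);
    }
}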
I know that some machine learning algorithms, like random forests, should by nature be implemented in parallel. While doing homework I found that there are these three parallel programming frameworks, so I am interested in knowing what the major differences between these three types of parallelism are.
In particular, if someone can point me to a study comparing the differences between them, that would be perfect!
Please list the pros and cons of each kind of parallelism. Thanks!
MPI is a message-passing paradigm of parallelism. Here, you have a root machine which spawns programs on all the machines in its MPI world. All the processes in the system are independent, so the only way for them to communicate is through messages over the network. Network bandwidth and throughput are among the most crucial factors in an MPI implementation's performance. Idea: if there is just one MPI process per machine and you have many cores on it, you can use OpenMP's shared-memory paradigm to solve subsets of your problem on one machine.
CUDA is a SIMT paradigm of parallelism. It uses state-of-the-art GPU architecture to provide parallelism. A GPU contains blocks of cores working on the same instruction in lock-step fashion (this is similar to the SIMD model). Hence, if all the threads in your system do a lot of the same work, you can use CUDA. But the amount of shared memory and global memory in a GPU is limited, so you should not rely on just one GPU to solve a huge problem.
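To make the lock-step idea concrete, here is a minimal CUDA kernel sketch: thousands of threads all execute the same instruction stream, each on its own element of the data.

// minimal CUDA kernel: same code, different data element per thread
__global__ void scale(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread picks its own index
    if (i < n)
        x[i] = a * x[i];
}

// launched from the host roughly as (d_x is a device pointer):
//   scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);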
Hadoop is used for solving large problems on commodity hardware using the MapReduce paradigm. Hence, you do not have to worry about distributing data or managing corner cases. Hadoop also provides a file system, HDFS, for storing data on the compute nodes.
Hadoop, MPI and CUDA are completely orthogonal to each other. Hence, it may not be fair to compare them.
That said, you can always use CUDA + MPI to solve a problem using a cluster of GPUs. You still need a CPU core on each node to perform the communication part of the problem.
I'm working with someone who has some MATLAB code that they want sped up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Server, allowing me to run this across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth spending the time to convert all of the MATLAB code to CUDA to run it on a machine with multiple GPUs, but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation happens on the GPU.
The key difference between the two techniques is that parfor computations happen on the CPUs of the cluster's nodes, with direct access to main memory. CPU cores typically have a high clock rate, but a CPU cluster usually has fewer of them than there are GPU cores. Individually, GPU cores are slower than a typical CPU core, and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them available. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs, each PC has one or more Nvidia Tesla boards, and you use both parfor loops and gpuArrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason for this is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than two CPU cores in your computer.
If you write it in CUDA, it is guaranteed to run in parallel at the chip level, versus relying on MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.