I am conducting a global maximization for a rather daunting high-dimensional problem. To make my life easier, I ported my building-block program to OpenMP, and it runs smoothly.
The main program actually consists of 4 building-block programs, each working under a different setting. My real task is to feed the main program a long list of parameter combinations. My preliminary idea for tackling this is to divide the list into 10 smaller parts and work on them in parallel.
Suppose the computing capacity I have is a high-performance cluster on which each node has 8 cores (16 hardware threads). My question is: is it correct that I can simply use the usual MPI routines like MPI_INIT and its pals to extend my program from OpenMP to a hybrid with MPI? And is it enough to specify the following in my PBS script:
#!/bin/bash -l
#PBS -l nodes=40:ppn=8
...
export OMP_NUM_THREADS=16
...
Or do I need to think more carefully and use an alternative routine like MPI_INIT_THREAD to get my work done?
=============[edited June 24, 2014]
Here is the PBS file I finally figured out for my multi-threaded MPI program (no MPI communication overlaps with OpenMP regions). My program works this way: one multi-threaded MPI process is executed per node, and each process spreads its workload across all the hardware threads physically associated with that node. In addition, since I am also using Intel MKL and Intel MPI, I made the corresponding adjustments in the PBS script below.
#!/bin/bash -l
#PBS -l walltime=01:00:00,nodes=32:ppn=8,pmem=2000mb
export OMP_NUM_THREADS=8
cd $PBS_O_WORKDIR
mpirun -perhost 1 -np 32 -hostfile "$PBS_NODEFILE" \
    -env I_MPI_PIN_DOMAIN omp \
    -env KMP_AFFINITY compact ./main
Also, be sure to add -mt_mpi to the compiler flags so that the thread-safe Intel MPI library is linked, which is needed for multi-threaded Intel MKL to work correctly with Intel MPI.
It's true that you don't need to do anything special on the MPI side as long as you never call MPI functions inside a parallel section. If you are going to do that, you need to use MPI_INIT_THREAD and request the level of thread safety that you require.
In reality, you should probably use MPI_INIT_THREAD anyway. If you are not going to make MPI calls from multiple threads concurrently, you can get by with MPI_THREAD_FUNNELED; otherwise, you probably need MPI_THREAD_MULTIPLE.
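For reference, here is a minimal sketch in C of what that initialization looks like (this assumes an MPI library plus a compiler with OpenMP support; the program itself is mine, not from the question):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request FUNNELED: only the main thread will ever make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Pure computation only: worker threads never touch MPI,
           which is all that FUNNELED requires. */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* MPI calls happen here, on the main thread, outside the parallel region. */
    MPI_Finalize();
    return 0;
}

Compile with something like mpicc -fopenmp (or mpiicc -qopenmp -mt_mpi for the Intel toolchain) and launch one rank per node as in the PBS script above.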
I am using a cluster to train a recurrent neural network developed using PyTorch. PyTorch threads automatically, which allows it to use all the cores of a machine in parallel without having to program for it explicitly. This is great!
Now, when I try to use several nodes at the same time with a script like this one:
#$ -S /bin/bash
#$ -N comparison_of_architecture
#$ -pe mvapich2-rostam 32
#$ -tc 4
#$ -o /scratch04.local/cnelias/Deep-Jazz/logs/out_comparison_training.txt
#$ -e /scratch04.local/cnelias/Deep-Jazz/logs/err_comparison_training.txt
#$ -t 1
#$ -cwd
I see that 4 nodes are being used but only one is actually doing work, so "only" 32 cores are in use.
I have no knowledge of parallel programming, and I don't understand a thing in the tutorial provided on PyTorch's website; I am afraid this is completely out of my depth.
Are you aware of a simple way to let a PyTorch program run on several machines without having to explicitly program the exchange of messages and computation between these machines?
PS: I unfortunately don't have a GPU, and neither does the cluster I am using; otherwise I would have tried it.
tl;dr There is no easy solution.
There are two ways to parallelize the training of a deep learning model. The most commonly used is data parallelism (as opposed to model parallelism). In that case, you have a copy of the model on each device, run the model and back-propagation on each device independently, and get the weight gradients. Now the tricky part begins: you need to collect all the gradients in a single place, sum them (differentiation is linear with respect to summation), and do the optimizer step. The optimizer computes the weight updates, and you need to tell each copy of your model how to update its weights.
PyTorch can somehow do this for multiple GPUs on a single machine, but as far as I know, there is no ready-made solution to do this on multiple machines.
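To make the gradient-summing step concrete, here is a hedged sketch in plain C/MPI of what a data-parallel framework has to do under the hood each step (the model size, learning rate, and function names are placeholders of mine; this is not PyTorch's actual API):

#include <stdlib.h>
#include <mpi.h>

#define N_WEIGHTS 1024          /* hypothetical model size */

/* One data-parallel step: every rank holds its own local gradient,
   the allreduce sums them across ranks, and every rank applies the
   same averaged update so the model copies stay in sync. */
static void data_parallel_step(double *weights, const double *local_grad,
                               int n, double lr, int world_size)
{
    double *summed = malloc(n * sizeof(double));

    /* Sum the per-rank gradients (differentiation is linear, so the sum
       is the gradient of the combined batch). */
    MPI_Allreduce(local_grad, summed, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    for (int i = 0; i < n; ++i)
        weights[i] -= lr * (summed[i] / world_size);   /* averaged SGD step */

    free(summed);
}

int main(int argc, char **argv)
{
    double weights[N_WEIGHTS] = {0}, grad[N_WEIGHTS];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pretend each rank computed a gradient on its own shard of the batch. */
    for (int i = 0; i < N_WEIGHTS; ++i)
        grad[i] = 0.001 * (rank + 1);

    data_parallel_step(weights, grad, N_WEIGHTS, 0.01, size);

    MPI_Finalize();
    return 0;
}

The forward and backward passes stay local to each machine; the allreduce is the only communication needed per training step, which is why distributed frameworks focus on making exactly that operation fast.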
I am running 60 MPI processes and MKL_THREAD_NUM is set to 4 to get me to the full 240 hardware threads on the Xeon Phi. My code is running but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi linux kernel?
You can set MKL_NUM_THREADS to 4 if you like. However, using every single thread does not necessarily give the best performance. In some cases, the MKL library knows things about the algorithm that mean fewer threads is better, and in those cases the library routines can choose to use fewer threads.
You should only use 60 MPI ranks if you have 61 cores. If you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core", and remember to leave one core free for the OS and system-level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads for each MPI process to reside on the same core, giving you better cache behavior. If you do this, you can also use micsmc in GUI mode on the host processor to continuously monitor the activity on all the cores. With one MPI process per core, you can see how much of the time all the threads on a core are being used.
Set MKL_NUM_THREADS to 4. You can use the environment variable or a runtime call. This value will be respected, so there is nothing to check.
The Linux kernel on KNC is not stripped down, so I don't know why you think that's a limitation. You should not need any system calls for this anyway, though.
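If you do want to confirm it from inside the program, here is a small hedged sketch using MKL's service functions (both are declared in mkl.h; the value 4 is just the one from the question):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Equivalent to exporting MKL_NUM_THREADS=4 before the run. */
    mkl_set_num_threads(4);

    /* Upper bound MKL will use; as noted above, MKL may still decide
       to use fewer threads for small problem sizes. */
    printf("MKL max threads: %d\n", mkl_get_max_threads());
    return 0;
}

Link it the same way you normally link MKL for the coprocessor and run it under one MPI rank to see the per-rank setting.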
What happens if I run an MPI program that requests 3 processes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
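For concreteness, here is a minimal sketch of that scenario: three ranks passing a token around a ring, launched oversubscribed on a 2-CPU machine (the program is mine, not from the question):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        fprintf(stderr, "run with at least 2 ranks, e.g. mpiexec -np 3\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Pass a token around the ring. A rank waiting in MPI_Recv is blocked,
       so the OS can schedule whichever of the three processes is ready
       on the two available CPUs. */
    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        token += rank;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    printf("rank %d done, token = %d\n", rank, token);
    MPI_Finalize();
    return 0;
}

Launched with mpiexec -np 3 ./ring on a 2-CPU machine, this completes normally; whether a blocked rank sleeps or busy-polls depends on the MPI implementation, which is what the next answer elaborates on.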
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try this with any command-line executable you have sitting around, for example mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. For this reason, OpenMPI, for instance, has optimized for the usual case: it makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see whether a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and it can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case; even there, though, I think it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPI implementations or even versions, do a quick search to see if there are any parameters that need to be tweaked for this case.
I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We have access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie at both MPI and OpenMP. I just wonder which is the easiest to learn and to debug, even if the performance is not the best.
I also wonder which is the most suitable for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
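In context, that one pragma goes right in front of the main loop. Here is a hedged, self-contained sketch in C (process_dataset stands in for whatever your real per-data-set computation is):

#include <stdio.h>
#include <omp.h>

/* Stand-in for the real per-data-set computation. */
static double process_dataset(int i)
{
    double x = 0.0;
    for (int k = 0; k < 1000000; ++k)
        x += (double)(i + k) * 1e-9;
    return x;
}

int main(void)
{
    enum { N_DATASETS = 100 };          /* placeholder count */
    double results[N_DATASETS];

    /* The iterations must be independent of one another for this to be valid;
       OpenMP then shares them among the available threads. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N_DATASETS; ++i)
        results[i] = process_dataset(i);

    printf("first result: %f (up to %d threads)\n",
           results[0], omp_get_max_threads());
    return 0;
}

Compile with your compiler's OpenMP flag (e.g. -fopenmp for gcc) and control the thread count with OMP_NUM_THREADS.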
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means the OpenMP parallel version will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program could run on several nodes at once, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category, where, provided you've got more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speed-up over using a single node.
For example, if your cluster uses Condor as the scheduler, you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this for Condor which may be more sensible, and there are similar mechanisms for Torque, SGE, etc.)
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across the cores or CPUs within each node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser grained, the speed-up you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
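To illustrate the combination, here is a hedged sketch of the hybrid pattern for the "main loop over data sets" case: MPI ranks split the list of data sets among the nodes, and OpenMP splits each rank's share across its cores (process_dataset and the counts are placeholders, not code from the question):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Stand-in for the real per-data-set computation. */
static double process_dataset(int i)
{
    double x = 0.0;
    for (int k = 0; k < 1000000; ++k)
        x += (double)(i + k) * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    const int n_datasets = 1000;        /* placeholder */
    int rank, size, provided;
    double local_sum = 0.0, global_sum = 0.0;

    /* FUNNELED is enough: only the main thread calls MPI, and only
       outside the OpenMP parallel region. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "need MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI level: each rank takes every size-th data set.
       OpenMP level: that share is split across the rank's cores. */
    #pragma omp parallel for reduction(+:local_sum) schedule(dynamic)
    for (int i = rank; i < n_datasets; i += size)
        local_sum += process_dataset(i);

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("combined result: %f\n", global_sum);

    MPI_Finalize();
    return 0;
}

The usual way to launch such a program is one MPI rank per node with OMP_NUM_THREADS set to the number of cores per node, much like the PBS script earlier on this page.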
In a test I'm building, my goal is to create a parser. So I've built a proof of concept that reads all the messages from a file and, after pushing all of them into memory, spawns one process to parse each message. Up to that point everything is fine, and I've got some nice results. But I can see that the Erlang VM is not using all my processor power (I have a quad core); in fact it is using about 25% of my processor when running my test. I've made a counter-test in C++ that uses four threads and obviously uses 100%, thus producing a better result (I've respected the same queue model Erlang uses).
So I'm wondering what could be "slowing down" my Erlang test. I know it's not a serialization issue, as I'm spawning one process per message. One thing I've thought is that maybe my messages are too small (about 10k each), and so spawning that many processes is not helping achieve great performance.
Some facts about the test:
106k messages
On erlang (25% processor power used) - 204 msecs
On my C++ test (100% processor power used) - 80 msecs
Yes, the difference isn't that great, but if there is more power available there is certainly more room for improvement, right?
Ah, I've done some profiling and wasn't able to find another way to optimize, since there are few function calls and most of them are string-to-object conversion.
Update:
Woooow! Following Hassan Syed's idea, I've managed to achieve 35 msecs against 80 from C++! This is awesome!
It seems your Erlang VM is using only one core.
Try starting it like this:
erl -smp enable +S 4
The -smp enable flag tells Erlang to start the runtime system with SMP support enabled.
With +S 4 you start 4 Erlang schedulers (1 for each core)
You can see if you have SMP enabled when you start the shell:
Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)
1>
[smp:2:2] tells you it is running with SMP enabled, with 2 schedulers and 2 schedulers online.
If you have one source file and you spawn one process per "expression", you really do not understand when to parallelise. It costs FAR more to spawn a process and have it parse a single expression than to have one process parse the entire file. A suitable strategy would be one process per file rather than one process per expression.
Another strategy would be to split the file into two, three or x chunks and process those chunks. This of course assumes the source isn't linearly dependent, and each chunk's processing time needs to exceed the time to create and spawn a process (usually by far, because time wasted in process X is time taken away from the rest of the machine).
-- Discussion C++ vs Erlang and your findings --
Erlang has a user-space kernel that emulates many of the primitives of the OS kernel, especially the scheduler and blocking primitives. This means there is some overhead compared with the same strategy implemented in a raw procedural language such as C++. You must tune your task partitioning to each point in the implementation space (CPU/memory/OS/programming language) according to its properties.
You should bind the schedulers to the CPU cores:
erlang:system_flag(scheduler_bind_type, processor_spread).