I am using a cluster to train a recurrent neural network developed using PyTorch. PyTorch automatically threads, which allows it to use all the cores of a machine in parallel without having to program for it explicitly. This is great!
Now, when I try to use several nodes at the same time with a script like this one:
#$ -S /bin/bash
#$ -N comparison_of_architecture
#$ -pe mvapich2-rostam 32
#$ -tc 4
#$ -o /scratch04.local/cnelias/Deep-Jazz/logs/out_comparison_training.txt
#$ -e /scratch04.local/cnelias/Deep-Jazz/logs/err_comparison_training.txt
#$ -t 1
#$ -cwd
I see that 4 nodes are being used but only one is actually doing work, so "only" 32 cores are in use.
I have no knowledge of parallel programming and I don't understand a thing in the tutorial provided on PyTorch's website; I am afraid this is completely out of my scope.
Are you aware of a simple way to let a PyTorch program run on several machines, without having to explicitly program the exchange of messages and computation between these machines?
PS: I unfortunately don't have a GPU, and neither does the cluster I am using; otherwise I would have tried it.
tl;dr There is no easy solution.
There are two ways to parallelize the training of a deep learning model. The most commonly used is data parallelism (as opposed to model parallelism). In that case, you have a copy of the model on each device, run the model and back-propagation on each device independently, and get the weight gradients. Now the tricky part begins. You need to collect all the gradients in a single place, sum them (differentiation is linear w.r.t. summation), and do the optimizer step. The optimizer computes the weight updates, and you need to tell each copy of your model how to update its weights.
PyTorch can somehow do this for multiple GPUs on a single machine, but as far as I know, there is no ready-made solution to do this on multiple machines.
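To make the gradient-collection step above concrete, here is a minimal sketch of how it could be wired up by hand with torch.distributed and its CPU-capable gloo backend. The toy model, the data_loader, and the environment-variable rendezvous are placeholders for illustration, and you would still have to launch and coordinate one process per machine yourself, so this is not a hands-off solution.
# Minimal sketch of data-parallel gradient averaging on CPU with torch.distributed.
# Assumes one process per machine has been launched with MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE set in the environment; model and data_loader are placeholders.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")   # rank/world size read from the environment

model = torch.nn.Linear(10, 1)            # placeholder; your RNN would go here
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in data_loader:       # each process iterates over its own shard
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # Sum the gradients from every copy of the model, then average them,
    # so every process computes exactly the same optimizer step.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()
    optimizer.step()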
Due to the difficulty of compiling VW on a RHEL machine, I am opting to use a compiled version of VW provided by Ariel Faigon (thank you!) here. I'm calling VW from Python, so I am planning on using Python's subprocess module (I couldn't get the Python package to compile either). I am wondering if there would be any downsides to this approach. Would I see any performance lag?
Thank you so much for your help!
Feeding a live vowpal wabbit process via Python's subprocess is OK (fast), as long as you don't start a new process per example and you avoid excessive context switches. In my experience, in this setup you can expect a throughput of ~500k features per second on typical dual-core hardware. This is not as fast as the ~5M features/sec (10x faster) that vw typically processes when not interacting with any other software (reading from a file/cache), but it is good enough for most practical purposes. Note that the bottleneck in this setting would most likely be the processing done by the additional process, not vowpal wabbit itself.
It is recommended to feed vowpal wabbit in batches (N examples at a time instead of one at a time), both on input (feeding vw) and on output (reading vw responses). If you're using subprocess.Popen to connect to the process, make sure to pass a large bufsize; otherwise, by default, the Popen iterator would be line-buffered (one example at a time), which might result in a per-example context switch between the producer of the examples and the consumer (vowpal wabbit).
Assuming your vw command line is in vw_cmd, it would be something like:
import subprocess

# stdin is piped so examples can be written to the running vw process
vw_proc = subprocess.Popen(vw_cmd,
                           stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                           bufsize=1048576)
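As a hedged sketch of the batching advice above (make_examples() and the batch size are hypothetical, and the examples are assumed to already be vw-formatted lines):
# Feed vw in batches rather than one example at a time to avoid
# a context switch per example.
BATCH_SIZE = 1000                      # arbitrary; anything much larger than 1 helps

batch = []
for example in make_examples():        # hypothetical generator of vw-format lines
    batch.append(example)
    if len(batch) >= BATCH_SIZE:
        vw_proc.stdin.write(("\n".join(batch) + "\n").encode())
        vw_proc.stdin.flush()
        batch = []
if batch:                              # don't forget the final partial batch
    vw_proc.stdin.write(("\n".join(batch) + "\n").encode())
    vw_proc.stdin.flush()
If vw is run so that its predictions go to standard output (e.g. with -p /dev/stdout, plus --quiet so progress lines don't interleave), the responses can then be read back in similarly sized chunks from vw_proc.stdout.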
Generally, slowness can come from:
Too many context switches (generating and processing one example at a time)
Too much processing outside vw (e.g. generating the examples in the first place, feature transformation)
Startup overhead (e.g. reading the model) per example.
So avoiding all the above pitfalls should give you the fastest throughput possible under the circumstances of having to interact with additional processes.
I am conducting a global maximization for a rather daunting high-dimensional problem. To make my life easier, my building-block program has been ported to OpenMP and runs smoothly.
The main program actually consists of 4 building-block programs, each working under a different setting, and my real task is to feed the main program a long list of parameter combinations. My preliminary plan for tackling this is to split the list into 10 smaller parts and process them in parallel.
Suppose the computing capacity I have is a high-performance cluster on which each node has 8 cores (16 hardware threads). My question is: can I simply use the usual MPI routines such as MPI_INIT and friends to extend my program from OpenMP to a hybrid with MPI? And is it enough to specify the following in my PBS script:
#!/bin/bash -l
#PBS -l nodes=40:ppn=8
...
export OMP_NUM_THREADS=16
...
Or do I need to dig deeper and use an alternative routine such as MPI_INIT_THREAD to get this done?
=============[edited June 24, 2014]
Here is the PBS file I finally figured out for my multi-threaded MPI program (without overlapping communication across OpenMP and MPI). My program works like this: one multi-threaded MPI process is executed per node, and each process spreads its workload across all the hardware threads physically associated with that node. In addition, since I am also using Intel MKL and Intel MPI, I made the corresponding adjustments in the PBS script below.
#!/bin/bash -l
#PBS -l walltime=01:00:00,nodes=32:ppn=8,pmem=2000mb
export OMP_NUM_THREADS=8
cd $PBS_O_WORKDIR
mpirun -perhost 1 -np 32 -hostfile "$PBS_NODEFILE" \
    -env I_MPI_PIN_DOMAIN omp \
    -env KMP_AFFINITY compact ./main
Also, be sure to add -mt_mpi to the compiler flags to correctly enable support for Intel MKL.
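For reference, with the Intel compiler wrappers a build line of that shape might look roughly like the following (a sketch only; flag spellings such as -qopenmp and -mkl vary between compiler versions):
mpiicc -qopenmp -mt_mpi -mkl -o main main.c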
It's true that it's not required to do anything special here in terms of MPI as long as you're never calling MPI functions in a parallel section. If you are going to do that, you need to use MPI_INIT_THREAD and provide the level of thread safety that you require.
In reality, you should probably do this anyway. If you're not going to make MPI calls from multiple threads at once, you can get by with MPI_THREAD_FUNNELED; otherwise, you probably need MPI_THREAD_MULTIPLE.
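To make the call concrete, here is a minimal sketch using the mpi4py Python bindings (an illustration only; in C, MPI_Init_thread additionally returns the provided level through an output argument):
# Request MPI_THREAD_FUNNELED at initialization: only the main thread
# will make MPI calls, while other threads do the OpenMP-style work.
import mpi4py
mpi4py.rc.initialize = False           # don't auto-initialize MPI on import

from mpi4py import MPI

provided = MPI.Init_thread(MPI.THREAD_FUNNELED)
assert provided >= MPI.THREAD_FUNNELED, "MPI library lacks the requested thread support"

# ... threaded computation here; MPI calls only from the main thread ...

MPI.Finalize()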
Is there a way to spread xz compression efforts across multiple CPUs? I realize that this doesn't appear possible with xz itself, but are there other utilities that implement the same compression algorithm and would allow more efficient processor utilization? I will be running this in scripts and utility apps on systems with 16+ processors, and it would be useful to use at least 4-8 processors to potentially speed up compression.
Multiprocessor (multithreading) compression support was added to xz in version 5.2, in December 2014.
To enable the functionality, add the -T option together with the number of worker threads to spawn, or use -T0 to spawn as many worker threads as there are CPUs reported by the OS:
xz -T0 big.tar
xz -T4 bigish.tar
The default single threaded operation is equivalent to -T1.
I have found that running it with a couple of threads fewer than the total number of hyper-threads on my CPU† provides a good balance of responsiveness and compression speed.
† So -T10 on my 6 core, 12 thread workstation.
As scai and Dzenly said in the comments:
If you want to use this in combination with tar, just run export XZ_DEFAULTS="-T 0" beforehand.
Or use something like XZ_OPT="-2 -T0".
I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We got access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie at both MPI and OpenMP. I just wonder which is the easiest to learn and to debug, even if the performance is not the best.
I also wonder which is the most suitable for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
OpenMP is only useful for shared-memory programming, which means that (unless your cluster is running something like Kerrighed) the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program can run on several nodes at once, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category, where, provided you have more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and an almost 100x speed-up over using a single node.
For example, if your cluster uses Condor as the scheduler, you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other, possibly more sensible, ways to do this with Condor, and there are similar mechanisms for Torque, SGE, etc.)
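As an illustration, a Condor submit description along those lines might look roughly like the sketch below; the executable name, file-name pattern, and the count of 100 data sets are hypothetical.
# Hypothetical submit description: one vanilla-universe job per data set,
# with the $(Process) macro (0..99) varying only the arguments line.
universe   = vanilla
executable = my_program
arguments  = dataset_$(Process).dat
output     = out_$(Process).txt
error      = err_$(Process).txt
log        = jobs.log
queue 100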
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across the cores or multiple CPUs of each node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser-grained, the speed-up you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
In Make, this flag exists:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there are other jobs running and the load average is at least load (a floating-point number). With no argument, removes a previous load limit.
Do you have a good strategy for what value to use for the load limit? It seems to differ a lot between my machines.
Acceptable load depends on the number of CPU cores. If there is one core, then a load average of more than 1 is overload. If there are four cores, then a load average of more than 4 is overload.
People often just specify the number of cores with the -j switch.
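For example, on Linux you can let the shell fill in the core count (nproc is part of GNU coreutils):
make -j"$(nproc)"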
See some empirical numbers here: https://stackoverflow.com/a/17749621/412080
I recommend against using the -l option.
In principle, -l seems superior to -j. -j says, start this many jobs. -l says, make sure this many jobs are running. Often, those are almost the same thing, but when you have I/O-bound jobs or other oddities, then -l should be better.
That said, the concept of load average is a bit dubious. It is necessarily a sampling of what goes on on the system. So if you run make -j -l N (for some N) and you have a well-written makefile, then make will immediately start a large number of jobs and run out of file descriptors or memory before even the first sample of the system load can be taken. Also, the accounting of the load average differs across operating systems, and some obscure ones don't have it at all.
In practice, you'll be as well off using -j and will have fewer headaches. To get more performance out of the build, tune your makefiles, play with compiler options, and use ccache or similar.
(I suspect the original reason for the -l option stems from a time when multiple processors were rare and I/O was really slow.)