Let's say I want to build a program that compiles OCaml source code in parallel. Given the current state of OCaml today, how do I parallelize parts of my program?
Should I just spawn new processes with the Unix module?
In the case of parallel compilation, does this have any overhead/performance impact?
To compile the files you have to run the OCaml compiler, one invocation per file. So yes, you have to start new processes, and the Unix module has the necessary functionality for that.
As for overhead or performance impact, consider this:
1) You need to start one process per file you compile. Whether you do that sequentially or in parallel, the number of processes started remains the same.
2) Compiling a file takes a long time compared to the bookkeeping needed to start each compile, even when compilations run in parallel.
So are you worried about the overhead of starting and tracking multiple compiler processes in parallel? Don't be. That's less than 0.1% of your time. On the other hand, utilizing 2 CPU cores by running 2 compilers will basically double your speed.
Or are you worried about multiple compilers running in parallel? You need twice the RAM, and on most modern CPUs caches are shared, so cache performance will suffer to some extent too. But unless you are working on some embedded system that is absolutely memory starved, that won't be a problem. So again, the benefit of using multiple cores far outweighs the drawbacks.
Just follow the simple rule for running things in parallel: run one job per CPU core, with maybe one extra to smooth out IO waits. More jobs will just have them fighting for CPU time without any benefit.
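To make that concrete, here is a minimal sketch of such a driver in OCaml, using Unix.create_process and Unix.wait from the Unix module. It is only an illustration: the "ocamlopt -c" command line, the file names and the job count are placeholders, and a real OCaml build would also have to respect dependency order between modules, which this ignores.

(* Minimal sketch: run one compiler process per source file,
   with at most max_jobs running at any time.
   Build with e.g.: ocamlfind ocamlopt -package unix -linkpkg driver.ml *)

let spawn file =
  (* start a compiler for [file]; returns immediately with the child's pid *)
  Unix.create_process "ocamlopt" [| "ocamlopt"; "-c"; file |]
    Unix.stdin Unix.stdout Unix.stderr

let reap () =
  (* block until any child exits and report failures *)
  let pid, status = Unix.wait () in
  match status with
  | Unix.WEXITED 0 -> ()
  | _ -> Printf.eprintf "compile (pid %d) failed\n" pid

let run_parallel ~max_jobs files =
  let running = ref 0 in
  List.iter
    (fun file ->
       if !running >= max_jobs then begin reap (); decr running end;
       ignore (spawn file);
       incr running)
    files;
  (* reap whatever children are still running *)
  while !running > 0 do
    reap ();
    decr running
  done

let () = run_parallel ~max_jobs:4 [ "a.ml"; "b.ml"; "c.ml" ]

The pattern is the classic one: keep a count of live children, block in Unix.wait whenever the limit is reached, and only then start the next compile.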
I have a Fortran code that I am using to calculate some quantities related to the work that I do. The code itself involves several nested loops, and requires very little disk I/O. Whenever the code is modified, I run it against a suite of several input files (just to make sure it's working properly).
To make a long story short, the most recent update has increased the run time of the program by about a factor of four, and running through the input files serially on one CPU takes about 45 minutes in total (a long time to wait just to see whether anything was broken). Consequently, I'd like to run the input files in parallel across the 4 CPUs on the system. I've been attempting to implement the parallelism via a bash script.
The interesting thing I have noted is that, when only one instance of the program is running on the machine, it takes about three and a half minutes to crank through one of the input files. When four instances of the program are running, it takes more like eleven and a half minutes to crank through one input file (bringing my total run time down from about 45 minutes to 36 minutes - an improvement, yes, but not quite what I had hoped for).
I've tried implementing the parallelism using GNU parallel, xargs, wait, and even just starting four instances of the program in the background from the command line. Regardless of how the instances are started, I see the same slowdown. Consequently, I'm pretty sure this isn't an artifact of the shell scripting, but something going on within the program itself.
I have tried rebuilding the program with debugging symbols turned off, and also using static linking. Neither of these had any noticeable impact. I'm currently building the program with the following options:
$ gfortran -Wall -g -O3 -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow,denormal -fbounds-check -finit-real=nan -finit-integer=nan -o [program name] {sources}
Any help or guidance would be much appreciated!
On modern CPUs you cannot expect a linear speedup. There are several reasons:
Hyperthreading: GNU/Linux will see a hyperthread as a core even though it is not a real core. It is more like 30% of a core.
Shared caches: If your cores share the same cache and a single instance of your program uses the full shared cache, then you will get more cache misses if you run more instances.
Memory bandwidth: A case similar to the shared cache is the shared memory bandwidth. If a single thread uses the full memory bandwidth, then running more jobs in parallel may congest the bandwidth. This can partly be solved by running on a NUMA system where each CPU has some RAM that is "closer" than other RAM.
Turbo mode: Many CPUs can run a single thread at a higher clock rate than multiple threads. This is due to heat.
All of these will exhibit the same symptom: running a single thread will be faster than each of the multiple threads, but the total throughput of the multiple threads will be bigger than that of the single thread.
Though I must admit your case sounds extreme: With 4 cores I would have expected a speedup of at least 2.
How to identify the reason
Hyperthreading: Use taskset (example below) to select which cores to run on. If you use 2 of the 4 cores, is there any difference between using cores #1+2 and #1+3?
Turbo mode: Use cpufreq-set (example below) to force a low frequency. Is the speed now the same whether you run 1 or 2 jobs in parallel?
Shared cache: Not sure how to do this, but if it is somehow possible to disable the cache, then comparing 1 job to 2 jobs run at the same low frequency should give an indication.
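For example (command sketches, not part of the original answer): taskset -c 0,2 ./myprog pins the run to cores 0 and 2 only, and cpufreq-set -c 0 -u 1.6GHz caps core 0 at 1.6 GHz, which also keeps turbo mode out of the picture; here ./myprog and the exact frequency are placeholders.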
I'm a big fan of speeding up my builds using "make -j8" (replacing 8 with whatever my current computer's number of cores is, of course), and compiling N files in parallel is usually very effective at reducing compile times... unless some of the compilation processes are sufficiently memory-intensive that the computer runs out of RAM, in which case all the various compile processes start swapping each other out, and everything slows to a crawl -- thus defeating the purpose of doing a parallel compile in the first place.
Now, the obvious solution to this problem is "buy more RAM" -- but since I'm too cheap to do that, it occurs to me that it ought to be possible to have an implementation of 'make' (or equivalent) that watches the system's available RAM, and when RAM gets down to near zero and the system starts swapping, make would automatically step in and send a SIGSTOP to one or more of the compile processes it had spawned. That would allow the stopped processes to get fully swapped out, so that the other processes could finish their compile without further swapping; then, when the other processes exit and more RAM becomes available, the 'make' process would send a SIGCONT to the paused processes, allowing them to resume their own processing. That way most swapping would be avoided, and I could safely compile on all cores.
Is anyone aware of a program that implements this logic? Or conversely, is there some good reason why such a program wouldn't/couldn't work?
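To make the idea concrete, here is a minimal sketch of that stop/continue loop, written in OCaml with the Unix module like the example further up; it is not an existing tool. It assumes Linux, where available memory can be read from the MemAvailable line of /proc/meminfo, and it assumes the surrounding build driver keeps lists of the pids it has running and paused; the thresholds are made up.

let mem_available_kib () =
  (* parse the MemAvailable line of /proc/meminfo (Linux-specific) *)
  let ic = open_in "/proc/meminfo" in
  let rec scan () =
    match input_line ic with
    | exception End_of_file -> close_in ic; 0
    | line ->
      (try Scanf.sscanf line "MemAvailable: %d kB" (fun k -> close_in ic; k)
       with Scanf.Scan_failure _ | End_of_file -> scan ())
  in
  scan ()

(* Pause one job when memory gets scarce, resume the paused ones once it
   recovers; meant to be called periodically from the job-control loop. *)
let throttle ~low_kib ~high_kib ~running ~paused =
  let avail = mem_available_kib () in
  if avail < low_kib then begin
    match !running with
    | pid :: rest ->
      Unix.kill pid Sys.sigstop;   (* let this child get swapped out *)
      running := rest;
      paused := pid :: !paused
    | [] -> ()
  end
  else if avail > high_kib then begin
    List.iter (fun pid -> Unix.kill pid Sys.sigcont) !paused;   (* resume *)
    running := !paused @ !running;
    paused := []
  end

A real version would call throttle every second or so from the same loop that waits for compile jobs to finish.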
For GNU Make, there's the -l option:
-l [load], --load-average[=load]
    Specifies that no new jobs (commands) should be started if there are other jobs running and the load average is at least load (a floating-point number). With no argument, removes a previous load limit.
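For example, make -j4 -l 3.5 will run up to four jobs in parallel but will hold off starting a new one while the load average is at or above 3.5.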
I don't think there's a standard option that keys off available memory specifically, though.
I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We got access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie with both MPI and OpenMP. I just wonder which one is the easiest to learn and to debug, even if the performance is not the best.
I also wonder which is the most suitable for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program can run on several nodes at once, passing messages between them as and when needed.
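As a rough note on tooling (not part of the original answer): OpenMP support is usually just a compiler flag, e.g. gcc -fopenmp, whereas MPI programs are normally compiled with a wrapper such as mpicc and launched with something like mpirun -np 100 ./prog.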
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category, where, provided you've got more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and an almost 100x speed-up over just using a single node.
For example, if your cluster is using Condor as the scheduler, then you could submit 1 job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this with Condor which may be more sensible, and there are also similar mechanisms for Torque, SGE, etc.)
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes, and OpenMP to handle parallelism across the cores or multiple CPUs within each node. I would say OpenMP is a lot easier than messing with pthreads. But since it is coarser grained, the speed-up you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
So I'm running Perl 5.10, compiled with threading support (usethreads=define, useithreads=define), on a Core 2 Duo MacBook Pro. I've got a simple script to read 4 gzipped files containing around 750000 lines each. I'm using Compress::Zlib to do the uncompressing and reading of the files. I've got 2 implementations, the only difference between them being that one includes use threads. Other than that, both scripts run the same subroutine to do the reading. Hence, in pseudocode, the non-threading program does this:
read_gzipped(file1);
read_gzipped(file2);
read_gzipped(file3);
read_gzipped(file4);
The threaded version goes like this:
my $thr0 = threads->new(\&read_gzipped, 'file1');
my $thr1 = threads->new(\&read_gzipped, 'file2');
my $thr2 = threads->new(\&read_gzipped, 'file3');
my $thr3 = threads->new(\&read_gzipped, 'file4');
$thr0->join();
$thr1->join();
$thr2->join();
$thr3->join();
Now the threaded version is actually running almost 2 times slower than the non-threaded script. This obviously was not the result I was hoping for. Can anyone explain what I'm doing wrong here?
You're using threads to try and speed up something that's IO-bound, not CPU-bound. That just introduces more IO contention, which slows down the script.
My guess is that the bottleneck for gzip operations is disk access. If you have four threads competing for disk access on a platter hard disk, that slows things down considerably. The disk head has to move between the different files in rapid succession. If you just process one file at a time, the head can stay near that file, and the disk cache will be more effective.
ithreads work well if you're dealing with something that is mostly not CPU-bound. Decompression is CPU-bound.
You can easily alleviate the problem by using the Parallel::ForkManager module.
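Roughly, the module's documented pattern (not code from this thread) is: create a manager capped at a number of worker processes with Parallel::ForkManager->new(4), wrap each unit of work between $pm->start and $pm->finish, and call $pm->wait_all_children at the end; each unit then runs in its own forked process instead of an interpreter thread.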
Generally - threads in Perl are not really good.
I'm not prepared to assume that you're I/O bound without seeing the output of top while this is running. Like depesz, I tend to assume that compression/decompression operations (which are math-heavy) are more likely to be CPU-bound.
When you're dealing with a CPU-bound operation, using more threads/processes than you have processors will almost never[1] improve matters - if the CPU utilization is already at 100%, more threads/processes won't magically increase its capacity - and will most likely make things worse by adding in more context-switching overhead.
[1] I've heard it suggested that heavy compilations, such as building a new kernel, benefit from telling make to use twice as many processes as the machine has processors and my personal experience has been that this seems to be accurate. The explanation I've heard for it is that this allows each CPU to be kept busy compiling in one process while the other process is waiting for data to be fetched from main memory. If you view compiling as a CPU-bound process, this is an exception to the normal rule. If you view it as an I/O bound case (where the I/O is between the CPU and main memory rather than disk/network/user I/O), it is not.
What is the difference between parallel processing and multi-core processing?
Parallel and multi-core processing both refer to the same thing: the ability to execute code at the same time (on more than one core, CPU, or machine). So in this sense multi-core is just one means of doing parallel processing.
On the other hand, concurrency (which is probably what you mean by parallel processing) refers to having multiple units of execution (threads or processes) that are interleaved. This can happen on a single-core CPU, on many cores/CPUs, or even on many machines (clusters).
Summing up: multi-core is a subset of parallel processing, and concurrency can occur with or without parallelism. The field that studies this is distributed systems or distributed computing.
Parallel processing just refers to a program running more than 1 part simultaneously, usually with the different parts communicating in some way. This might be on multiple cores, multiple threads on one core (which is really simulated parallel processing), multiple CPUs, or even multiple machines.
Multicore processing is usually a subset of parallel processing.
Multicore processing means code working on more than one "core" of a single CPU chip. A core is like a little processor within a processor. So making code work for multicore processing is nearly always about the parallelization aspect (though it would also include removing any core-specific assumptions, which you shouldn't normally have anyway).
As far as algorithm design goes, if it is correct from a parallel-processing point of view, it will be correct for multicore.
However, if you need to optimise your code to run as fast as possible "in parallel", then whether you target multicore, multi-CPU, multi-machine, or vectorised execution will make a big difference.
Parallel processing can be done inside a single core with multiple threads.
Multi-core processing means distributing those threads to make use of the multiple cores in a CPU.