I have an application that's written in Fortran and there is one particular subroutine call that takes a long time to execute. I was wondering if it's possible to distribute the tasks for computation over multiple nodes.
The current serial flow of the code is as follows:
D = some computations that give me D, held in memory
call subroutine
    <within the subroutine>
    iterate from 1 .. n
    {
        independent operations on D
    }
I wish to distribute the iterations over n/4 machines. Can someone please guide me with this? Do let me know if something's not very clear!
Depending on the underlying implementation, coarrays (F2008) may allow processing to be distributed over multiple nodes. Partitioning the iteration space across the images is relatively straightforward; communication of the results back to one image (or to all images) is where some complexity might arise. Some introductory material on coarrays can be found here.
Again depending on the underlying implementation, DO CONCURRENT (F2008) may allow parallel processing of iterations (though unlikely to be across nodes). The restrictions on what can be done in the scope of a DO CONCURRENT construct mean that iterations can be executed in any order; appropriately capable compilers may then transform that further into concurrent execution.
When one has existing code and wants to parallelize incrementally (or just one routine), shared-memory approaches are the "quick hit". Especially when it is known that the iterations are independent, I'd first recommend looking at compiler flags for auto-parallelization, language constructs such as DO CONCURRENT (thanks to @IanH for reminding me of that), and OpenMP compiler directives.
As my extended comment is about distributed memory, however, I'll come to that.
I'll assume you don't have access to some advanced process-spawning setup on all of your potential machines. That is, you'll have processes running on various machines, each being charged for the time regardless of what work is being done. Then the workflow looks like
Serial outer loop:
    calculate D
    distribute D to the parallel environment
    inner parallel loop on subsets of D
    gather D on the master
If the processors/processes in the parallel environment are doing nothing else - or you're being charged regardless - then this is the same to you as
Outer loop:
    all processes calculate D
    each process works on its subset of D
    synchronize D
The communication side, whether MPI or coarrays (which I'd recommend in this case; again see @IanH's answer, where image synchronization and the like can be as limited as a few loops with [..] references), here lies entirely in the synchronization step.
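For illustration, a minimal MPI version of that second workflow (shown in C++ for brevity; the same calls are available from Fortran via the use mpi module, and compute_d / process_element are hypothetical stand-ins for your computation):

#include <mpi.h>
#include <vector>

// hypothetical stand-ins for the real computation
void compute_d(std::vector<double>& d) {
    for (std::size_t i = 0; i < d.size(); ++i) d[i] = double(i);
}
double process_element(double x) { return x * x; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 20;        // iteration count; assumes nprocs divides n
    const int chunk = n / nprocs;
    std::vector<double> d(n);

    compute_d(d);                 // every process computes D redundantly

    // each process works only on its own slice of D
    for (int i = rank * chunk; i < (rank + 1) * chunk; ++i)
        d[i] = process_element(d[i]);

    // synchronize: every process receives every other process's slice
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  d.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}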
As an endnote: multi-machine coarray support is very limited. ifort, as I understand it, requires a licence beyond the basic one; g95 has some support; the Cray compiler may well support it. That's a separate question, however. MPI would be well supported.
I know that the usual way to make a big math computation faster is multiprocessing / parallel processing: we split the job into, for example, 4 parts and let 4 CPU cores run in parallel (parallelization). This is possible, for example, in Python with the multiprocessing module: on a 4-core CPU, it allows using 100% of the computer's processing power instead of only 25% for a single-process job.
But let's say we want to speed up a computation job that is not easily splittable.
Example: we are given a number generator function generate(n) that takes the previously-generated number as input, and "it is said to have 10^20 as period". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1 .. 10^20
    a = generate(a)
    check if a == 17
Instead of having a computer's 4 CPU cores (3.3 GHz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 GHz (4 × 3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEngineering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
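To illustrate that last idea, here is a hedged sketch (the LCG body used as a stand-in for generate() is my assumption, not the OP's function): the two chains share no data, so an out-of-order core can have both multiply-add dependency chains in flight at once.

#include <cstdint>

// stand-in for generate(): one LCG step (an assumption for illustration)
static inline std::uint64_t generate(std::uint64_t a) {
    return a * 6364136223846793005ULL + 1442695040888963407ULL;
}

// advance two independent chains in one thread; the core overlaps their latencies
std::uint64_t run_two_chains(std::uint64_t seed1, std::uint64_t seed2,
                             std::uint64_t steps) {
    std::uint64_t a = seed1, b = seed2;
    for (std::uint64_t i = 0; i < steps; ++i) {
        a = generate(a);   // chain 1: each step depends on the previous a
        b = generate(b);   // chain 2: independent of a, so it executes in parallel
    }
    return a ^ b;          // combine so neither chain is optimized away
}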
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
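For contrast, here is a minimal sketch of the reassociation trick in the separable array-sum case above, written with an OpenMP reduction (my choice of mechanism for illustration): each thread accumulates a partial sum over its chunk, and the partial sums are combined at the end.

#include <vector>

// parallel partial sums, combined at the end; valid only where
// reassociating the additions is acceptable (an assumption here)
double parallel_sum(const std::vector<double>& v) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)  // compile with -fopenmp
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        total += v[i];   // each thread accumulates its own partial sum
    return total;        // OpenMP combines the per-thread partial sums
}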
There is a scheme, studied mostly in academia, called "thread decomposition". It aims to do more or less what you ask about: given single-threaded code, it tries to break it down into multiple threads in order to divide the work over a multicore system. This process can be done by a compiler (although it requires figuring out all possible side effects at compile time, which is very hard), by a JIT runtime, or through hardware binary translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal, as it can hardly match true manual parallelization done by a person who understands the code. It also doesn't simply scale performance with the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/
I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We got access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie at both MPI and OpenMP. I just wonder which is the easiest one to learn and to debug, even if the performance is not the best.
I also wonder what is the most adequate for my main loop application.
Thanks
If your program is just one big loop using OpenMP can be as simple as writing:
#pragma omp parallel for
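In context, that might look like the following minimal sketch (n_data_sets and process_data_set are placeholders for your loop bound and body):

#pragma omp parallel for
for (int i = 0; i < n_data_sets; ++i)
    process_data_set(i);   // iterations must be independent: no shared writes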
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started. The advantage is though that your program could run on several nodes at one time, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category. Provided you've got more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speed-up over using a single node.
For example, if your cluster is using Condor as the scheduler, then you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description, as sketched below. (There are other ways to do this with Condor which may be more sensible, and there are also similar facilities in Torque, SGE, etc.)
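A hypothetical Condor submit description along those lines (the executable and file names are placeholders):

# hypothetical submit description: one job per data set
# (executable and file names are placeholders)
universe   = vanilla
executable = crunch
# $(Process) runs from 0 to 99, giving each job its own data file
arguments  = dataset.$(Process).dat
output     = out.$(Process)
error      = err.$(Process)
log        = crunch.log
queue 100

Submitting this with condor_submit queues 100 jobs that differ only in their argument.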
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across the cores or CPUs within a node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser-grained, the speed-up you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
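A minimal sketch of that hybrid pattern, compiled with something like mpicxx -fopenmp (n_data_sets, items_in_set and process_item are placeholders standing in for the real application):

#include <mpi.h>

// placeholders for the real application
const int n_data_sets = 1000;
int items_in_set(int set) { (void)set; return 10000; }
void process_item(int set, int item) { (void)set; (void)item; /* work here */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // MPI: each rank takes every size-th data set (static round-robin)
    for (int i = rank; i < n_data_sets; i += size) {
        // OpenMP: cores within the node share the items of one data set
        #pragma omp parallel for
        for (int j = 0; j < items_in_set(i); ++j)
            process_item(i, j);
    }

    MPI_Finalize();
    return 0;
}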
I am impressed with Intel Threading Building Blocks. I like how I write task code rather than thread code, and I like how it works under the hood, with my limited understanding (tasks are in a pool, there won't be 100 threads on 4 cores, a task isn't guaranteed to run because it isn't on its own thread and may be far back in the pool, but it may be run with another related task, so you can't do bad things like typical thread-unsafe code).
I wanted to know more about writing tasks. I like the 'Task-based Multithreading - How to Program for 100 cores' video here: http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently the second-to-last link; warning: it isn't 'great'). My favourite part is 'solving the maze is better done in parallel', around the 48-minute mark (you can click the link on the left side; that part is really all you need to watch, if any).
However, I'd like to see more code examples and some API documentation on how to write tasks. Does anyone have a good resource? I have no idea how a class or pieces of code may look after being pushed onto a pool, how weird the code may look when you need to make a copy of everything, or how much of everything gets pushed onto a pool.
Java has a parallel task framework similar to Thread Building Blocks - it's called the Fork-Join framework. It's available for use with the current Java SE 6 and to be included in the upcoming Java SE 7.
There are resources available for getting started with the framework, in addition to the javadoc class documentation. The jsr166 page mentions that
"There is also a wiki containing additional documentation, notes, advice, examples, and so on for these classes."
The fork-join examples, such as matrix multiplication, are a good place to start.
I used the fork-join framework in solving some of Intel's 2009 threading challenges. The framework is lightweight and low-overhead; mine was the only Java entry for the Knight's Tour problem, and it outperformed other entries in the competition. The Java sources and writeup are available from the challenge site for download.
EDIT:
I have no idea how a class or pieces
of code may look after pushing it onto
a pool [...]
You can make your own task by subclassing one of the ForkJoinTask subclasses, such as RecursiveTask. Here's how to compute the Fibonacci sequence in parallel. (Taken from the RecursiveTask javadocs - comments are mine.)
// declare a new task that itself spawns subtasks;
// the task returns an Integer result
class Fibonacci extends RecursiveTask<Integer> {
    final int n; // the n'th number in the Fibonacci sequence to compute

    Fibonacci(int n) { this.n = n; } // constructor

    protected Integer compute() { // this method is the main work of the task
        if (n <= 1) // 1 or 0: base case to end the recursion
            return n;
        Fibonacci f1 = new Fibonacci(n - 1); // create a new task to compute n-1
        f1.fork(); // schedule it to run asynchronously
        Fibonacci f2 = new Fibonacci(n - 2); // create a new task to compute n-2
        return f2.invoke() + f1.join(); // wait for both tasks to complete.
        // f2 is run as part of this task, f1 runs asynchronously.
        // (You could create two separate tasks and wait for them both, but running
        // f2 as part of this task is a little more efficient.)
    }
}
You then run this task and get the result:
// default parallelism is number of cores
ForkJoinPool pool = new ForkJoinPool();
Fibonacci f = new Fibonacci(10); // n = 100 would overflow int and spawn far too many subtasks
int result = pool.invoke(f);
This is a trivial example to keep things simple. In practice, performance would not be so good, since the work executed by the task is trivial compared to the overhead of the task framework. As a rule of thumb, a task should perform some significant computation - enough to make the framework overhead insignificant, yet not so much that you end up with one core at the end of the problem running one large task. Splitting large tasks into smaller ones ensures that one core isn't left doing lots of work while other cores are idle - using smaller tasks keeps more cores busy, but not so small that the task does no real work.
[...] or how weird code may look when
you need to make a copy of everything
and how much of everything is pushed
onto a pool.
Only the tasks themselves are pushed into a pool. Ideally you don't want to be copying anything: to avoid interference and the need for locking, which would slow down your program, your tasks should ideally work with independent data. Read-only data can be shared amongst all tasks and doesn't need to be copied. If threads need to co-operate in building some large data structure, it's best that they build the pieces separately and then combine them at the end. The combining can be done as a separate task, or each task can add its piece of the puzzle to the overall solution. This often does require some form of locking, but it's not a significant performance issue if the work of the task is much greater than the work of updating the solution. My Knight's Tour solution takes this approach to update a common repository of tours on the board.
Working with tasks and concurrency is quite a paradigm shift from regular single-threaded programming. There are often several designs possible to solve a given problem, but only some of these will be suitable for a threaded solution. It can take a few attempts to get a feel for how to recast familiar problems in a multi-threaded way. The best way to learn is to look at the examples and then try it for yourself. Always profile, and measure the effects of varying the number of threads. You can explicitly set the number of threads (cores) to use in the pool constructor. When tasks are broken up evenly, you can expect near-linear speedup as the number of threads increases.
Playing with "frameworks" which claim to be solving unsolvable (optimal task scheduling is NP hard) is not going to help you at all - reading books and than articles on concurrent algorithms will. So called "tasks" are nothing more that a fancy name for defining separability of the problem (parts that can be computed independently of each other). Class of separable problems is very small - and they are already covered in old books.
For problems which are not separable, you have to plan phases and data barriers between phases to exchange data. Optimal orchestration of data barriers for simultaneous data exchange is not just NP-hard but impossible to solve in a general way in principle - you'd need to examine the history of all possible interleavings, which is like the power set of an already exponential set (like going from N to R in math). The reason I mention this is to make it clear that no software can ever do it for you, that how to do it is intrinsically dependent on the actual algorithm, and that it makes or breaks whether parallelization is feasible at all (even if it's theoretically possible).
When you enter high parallelism you can't even maintain a queue; you don't even have a memory bus any more - imagine 100 CPUs trying to sync up on just a single shared int or trying to do memory bus arbitration. You have to pre-plan and pre-configure everything that's going to run, and essentially prove correctness on a whiteboard. Intel's Threading Building Blocks are a small kid in that world; they are for a small number of cores which can still share a memory bus. Running separable problems is a no-brainer which you can do without any "framework".
So you are back to having to read as many different parallel algorithms as you can. It normally takes 1-3 years to research an approximately optimal data-barrier layout for one problem. It becomes a layout problem when you go for, say, 16+ cores on a single chip, since only first neighbours can exchange data efficiently (during one data-barrier cycle). So you'll actually learn much more by looking at CUDA, and at papers and results for IBM's experimental 30-core CPU, than from Intel's sales pitch or some Java toy.
Beware of demo problems for which the amount of resources wasted (number of cores and memory) is much bigger than the speedup they achieve. If it takes 4 cores and 4x the RAM to solve something 2x faster, the solution is not scalable for parallelization.
All of the concurrent programs I've seen or heard details of (admittedly a small set) at some point use hardware synchronization features, generally some form of compare-and-swap. The question is: are there any concurrent programs in the wild where the threads interact throughout their lives and get away without any synchronization?
Example of what I'm thinking of include:
A program that amounts to a single thread running a yes/no test on a large set of cases, and a big pile of threads tagging cases based on maybe/no tests. This doesn't need synchronization because dirty data will only affect performance rather than correctness.
A program that has many threads updating a data structure where any state that is valid now will always be valid, so dirty reads or writes don't invalidate anything. An example of this is (I think) path compression in the union-find algorithm.
If you can break the work up into completely independent chunks, then yes, there are concurrent algorithms whose only synchronisation point is the one at the end of the work where all threads join. Parallel speedup is then a matter of being able to break the work into tasks whose sizes are as similar as possible.
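A minimal sketch of that structure (the per-element doubling is a placeholder for real work): each thread owns a disjoint slice of the data, and the join at the end is the only synchronisation point.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// placeholder per-element work on a private slice [begin, end)
void process_chunk(std::vector<double>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= 2.0;
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = data.size() / n;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back(process_chunk, std::ref(data), t * chunk,
                             t + 1 == n ? data.size() : (t + 1) * chunk);

    for (auto& w : workers)
        w.join();   // the single synchronisation point
    return 0;
}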
Some indirect methods for solving systems of linear equations, like Successive over-relaxation ( http://en.wikipedia.org/wiki/Successive_over-relaxation ), don't really need the iterations to be synchronized.
I think it's a bit of a trick question because, e.g., if you program in C, malloc() must be thread-safe and uses hardware synchronization, and in Java the garbage collector requires hardware synchronization anyway. All Java programs require the GC, and hardly any C program makes it without malloc() (or C++ program without the new operator).
There is a whole class of algorithms which are sometimes referred to as "embarallel" (contraction of "embarrassingly parallel"). Many image processing algorithms fall into this class, where each pixel may be processed independently (which makes implementation with e.g. SIMD or GPGPU very straightforward).
Well, without any synchronization at all (even at the end of the algorithm) you obviously can't do anything useful because you can't even transfer the results of concurrent computations to the main thread: suppose that they were on remote machines without any communication channels to the main machine.
The simplest example is inside java.lang.String, which is immutable and lazily caches its hash code. This cache is written without synchronization because (a) it's cheaper, (b) the value is recomputable, and (c) the JVM guarantees no tearing. The tolerance of data races in purely functional contexts allows tricks like this to be used safely without explicit synchronization.
I agree with Mitch's answer. I would like to add that the ray tracing algorithm can work without synchronization until the point where all threads join.
I wrote a C program which reads a dataset from a file and then applies a data-mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program as a multithreaded one with Pthreads, and as a newbie to parallel programming I have a question about the number of worker threads that has been puzzling me:
What is the best practice for finding the number of worker threads when you do parallel programming, and how do you determine it? Do you try different numbers of threads, compare the results and then decide, or is there a procedure for finding the optimum number of threads? Of course, I'm investigating this question from a performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
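As a first cut, you can ask the OS for the number of online cores at runtime; a small POSIX sketch (the fallback value of 4 is an arbitrary assumption):

#include <unistd.h>

/* returns the number of online cores, or a fallback guess */
long default_worker_count(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN); /* -1 if unsupported */
    return cores > 0 ? cores : 4;               /* arbitrary fallback */
}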
In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!