Optimum number of threads for a highly parallelizable problem


I parallelized a simulation engine into 12 threads to run it on a cluster of 12 nodes (each node running one thread). Since 12 systems are rarely all available at once, I also tuned it for 6 threads (to run on 6 nodes), 4 threads (4 nodes), 3 threads (3 nodes), and 2 threads (2 nodes). I have noticed that the more nodes/threads I use, the greater the speedup. But obviously, the more nodes I use, the more expensive (in terms of cost and power) the execution becomes.
I want to publish these results in a journal, so I want to know whether there are any laws/theorems that will help me decide the optimum number of nodes on which to run this program?
Thanks,
Akshey

How have you parallelised your program, and what is inside each of your nodes?
For instance, on one of my clusters I have several hundred nodes, each containing 4 dual-core Xeons. If I were to run an OpenMP program on this cluster I would place a single execution on one node and start up no more than 8 threads, one for each processor core. My clusters are managed by Grid Engine and used for batch jobs, so there is no contention while a job is running. In general there is no point in asking for more than one node on which to run an OpenMP job, since the shared-memory approach doesn't work on distributed-memory hardware. And there's not much to be gained by asking for fewer than 8 threads on an 8-core node, since I have enough hardware available not to have to share it.
If you have used a distributed-memory programming approach, such as MPI, then you are probably working with a number of processes (rather than threads) and may well be executing these processes on cores on different nodes, and be paying the costs in terms of communications traffic.
As #Blank has already pointed out the most efficient way to run a program, if by efficiency one means 'minimising total cpu-hours', is to run the program on 1 core. Only. However, for jobs of mine which can take, say, a week on 256 cores, waiting 128 weeks for one core to finish its work is not appealing.
If you are not already familiar with the following terms, Google around for them or head for Wikipedia:
Amdahl's Law
Gustafson's Law
weak scaling
strong scaling
parallel speedup
parallel efficiency
scalability.
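To make the first two terms concrete, here is a minimal Python sketch of Amdahl's Law (fixed problem size) and Gustafson's Law (problem size grows with the number of workers). The function names and the 0.95 parallel fraction are assumptions for illustration, not measurements from the question.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction: float, workers: int) -> float:
    """Gustafson's Law: scaled speedup when the problem grows with the workers."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers


# 0.95 is an assumed parallel fraction; replace it with one fitted to your timings.
for workers in (2, 3, 4, 6, 12):
    print(workers,
          round(amdahl_speedup(0.95, workers), 2),
          round(gustafson_speedup(0.95, workers), 2))
```

Under Amdahl's Law the speedup can never exceed 1 / (serial fraction), however many nodes you add, which is one way to argue about where adding nodes stops paying off.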

"if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?"
There are no such general laws, because every problem has slightly different characteristics.
You can make a mathematical model of the performance of your problem on different numbers of nodes, knowing how much computational work has to be done, how much communication has to be done, and how long each takes. (The communication times can be estimated from the amount of communication and typical latency/bandwidth numbers for your nodes' type of interconnect.) This can guide you towards good choices.
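A very rough sketch of such a model in Python follows. It assumes the work divides perfectly across nodes, that every node exchanges a fixed number of equally sized messages, and that communication is not overlapped with computation. All parameter names and the sample numbers are hypothetical.

```python
def predicted_time(nodes, compute_seconds, messages_per_node, bytes_per_message,
                   latency_seconds, bandwidth_bytes_per_second):
    """Estimate run time as (perfectly divided compute) + (per-node communication)."""
    compute = compute_seconds / nodes
    if nodes == 1:
        return compute  # a single node has no inter-node traffic
    # Classic latency/bandwidth cost of one message.
    per_message = latency_seconds + bytes_per_message / bandwidth_bytes_per_second
    return compute + messages_per_node * per_message


# Hypothetical inputs: 1 hour of serial work, 1000 messages of 1 MB per node,
# 10 microseconds latency, 1 GB/s bandwidth.
for nodes in (1, 2, 3, 4, 6, 12):
    print(nodes, round(predicted_time(nodes, 3600.0, 1000, 1_000_000, 1e-5, 1e9), 2))
```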
These models can be valuable for understanding what is going on, but to actually determine the right number of nodes to run on for your code for some given problem size, there's really no substitute for running a scaling test - running the problem on various numbers of nodes and actually seeing how it performs. The numbers you want to see are:
Time to completion as a function of number of processors: T(P)
Speedup as a function of number of processors: S(P) = T(1)/T(P)
Parallel efficiency: E(P) = S(P)/P
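These three quantities are easy to tabulate from measured timings. A small sketch, where the function name `scaling_table` and the sample timings are made up for illustration:

```python
def scaling_table(timings):
    """timings maps processor count P to measured T(P); it must contain P = 1.

    Returns rows of (P, T(P), S(P), E(P)) sorted by P.
    """
    t1 = timings[1]
    rows = []
    for p in sorted(timings):
        speedup = t1 / timings[p]       # S(P) = T(1) / T(P)
        efficiency = speedup / p        # E(P) = S(P) / P
        rows.append((p, timings[p], speedup, efficiency))
    return rows


# Hypothetical timings in seconds, not results from the question.
sample = {1: 1200.0, 2: 630.0, 3: 440.0, 4: 345.0, 6: 250.0, 12: 160.0}
for p, t, s, e in scaling_table(sample):
    print(f"P={p:2d}  T={t:7.1f}  S={s:5.2f}  E={e:4.2f}")
```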
How do you choose the "right" number of nodes? It depends on how many jobs you have to run, and what's an acceptable use of computational resources.
So for instance, in plotting your timing results you might find that you have a minimum time to completion T(P) at some number of processors -- say, 32. So that might seem like the "best" choice. But when you look at the efficiency numbers, it might become clear that the efficiency started dropping precipitously long before that, and that you only got (say) a 20% decrease in run time over running at 16 processors; that is, for 2x the computational resources, you only got a 1.25x increase in speed. That's usually going to be a bad trade, and you'd prefer to run on fewer processors, particularly if you have a lot of these simulations to run. (If you have 2 simulations to run, for instance, you could get them done in 1.25 time units instead of 2 time units, where one time unit is the run time on 32 processors, by running the two simulations simultaneously on 16 processors each rather than one at a time on 32 processors.)
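The arithmetic in that parenthetical can be checked with a few lines of Python. The helper `batch_time` is hypothetical; it assumes identical jobs that run in "waves" on a fixed pool of processors, with one time unit equal to the run time of one job on 32 processors.

```python
import math


def batch_time(n_jobs, total_processors, processors_per_job, time_per_job):
    """Wall-clock time to finish n_jobs identical jobs on a fixed processor pool."""
    concurrent = total_processors // processors_per_job
    if concurrent == 0:
        raise ValueError("a job needs more processors than the pool has")
    waves = math.ceil(n_jobs / concurrent)  # jobs run in back-to-back waves
    return waves * time_per_job


# One job takes 1.0 units on 32 processors and 1.25 units on 16 processors.
print(batch_time(2, 32, 32, 1.0))    # two jobs, one at a time on all 32
print(batch_time(2, 32, 16, 1.25))   # two jobs side by side on 16 each
```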
On the other hand, sometimes you only have a couple runs to do and time really is of the essence, even if you're using resources somewhat inefficiently. Financial modelling can be like this -- they need the predictions for tomorrow's markets now, and they have the money to throw at computational resources even if they're not used 100% efficiently.
Some of these concepts are discussed in the "Introduction to Parallel Performance" section of most parallel programming tutorials; here's ours: https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance

Increasing the number of nodes leads to diminishing returns. Two nodes is not twice as fast as one node; four nodes even less so than two. As such, the optimal number of nodes is always one; it is with a single node that you get most work done per node.

Related

Would communication and inter-connection have any impact on a computation bound application on multinode?

I have a computation-bound application, and I have executed it on multiple nodes (4 nodes, 8 nodes). I'm wondering whether communication between the nodes could have any effect on the run time. If so, how is that possible? As far as I have found, a computation-bound application depends only on the computing capability of the system.
Also, can I treat the number of CPUs in my system as its computing capability?
Any help would be appreciated.
Updated:
In order to see whether the application is memory-bound or compute-bound, I ran it on 1 node using different numbers of cores. For that application (NPB-LU), the run time decreased linearly as the number of cores increased. So I concluded that this application could be compute-bound (I didn't have another way to figure it out).
Then I predicted the run time of the application with a model that considers the latency (in my case, the message time) at different connection levels, such as inter-socket and inter-node. The predicted times differ between the latency levels, even though the application seemed to be compute-bound.
n:grid size, p:number of cores, m(total Mops/s), f(Mop/s/core)

Imagine you have a horse that drinks water at, let's say, 1 liter per minute.
To give water to the horse you have a well you can draw from. Imagine you can pump up to 1.5 liters per minute.
In this situation your water consumption is horse-bound.
Then it turns out that you have two horses, each drinking 1 liter per minute. Now your water consumption is no longer horse-bound but well-bound.
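The same reasoning as a tiny Python sketch: the delivered rate is the smaller of total demand and the supply limit. Read "horses" as cores and "well" as a shared resource such as memory or network bandwidth. The function name and labels are made up for this illustration.

```python
def delivered_rate(consumers, demand_per_consumer, supply_limit):
    """Return the achieved rate and which side limits it."""
    demand = consumers * demand_per_consumer
    rate = min(demand, supply_limit)
    bound = "consumer-bound" if demand <= supply_limit else "supply-bound"
    return rate, bound


print(delivered_rate(1, 1.0, 1.5))  # one horse: the horse is the limit
print(delivered_rate(2, 1.0, 1.5))  # two horses: the well is the limit
```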
Your application's behavior can change depending on the environment. To determine what is happening in your application, I recommend profiling it. There are many tools to choose from, such as gprof, perf, PAPI and others, that let you observe your application's behaviour.
Then you can determine experimentally very interesting metrics such as instructions per clock cycle, which can give you a better understanding of the behaviour of your app.

How long does it take to process the file If I have only one worker node?

Let's say I have data made up of 25 blocks and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? Then what about 15 nodes? Will the time change if we change the replication factor to 3?
I really need a help.

First of all, I would advise reading some scientific papers on the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time is very strongly related to the amount of data you want to process (which makes sense). On our cluster, on average it takes around 7-8 seconds for a Mapper to read a block of 128 MBytes. There are several factors you need to consider in order to predict the overall execution time:
How much data the Mapper produces, which will more or less determine the time Hadoop requires for shuffling
What Reducer is doing? Does it do some iterative processing? (might be slow!)
What is the configuration of the resources? (how many Mappers and Reducers are allowed to run on the same machine)
Finally are there other jobs running simultaneously? (this might be slowing down the jobs significantly, since your Reducer slots can be occupied waiting for data instead of doing useful things).
So already for one machine you can see how complex it is to predict job execution time. During my study I was able to conclude that on average one machine is capable of processing 20-50 MBytes/second (the rate is calculated as total input size / total job running time). The processing rate includes the staging time (when your application is starting and uploading required files to the cluster, for example). The processing rate is different for different use cases and is greatly influenced by the input size and, more importantly, the amount of data produced by Mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
When you start scaling your experiments, you will on average see improved performance, but once again, from my study I could conclude that the improvement is not linear. You would need to fit a model for your own infrastructure, with the relevant variables, to approximate the job execution time.
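As a baseline to compare measurements against, the idealized map phase for the numbers in the question can be sketched as follows. This is only a lower-bound estimate: it assumes one map slot per node (a hypothetical setting), identical blocks, and no startup, shuffle, or reduce time. It has no replication term at all; replication mainly affects how often a mapper can read its block locally, which this sketch does not capture.

```python
import math


def map_phase_minutes(blocks, minutes_per_block, nodes, map_slots_per_node=1):
    """Idealized map-phase time: blocks are processed in waves over all slots."""
    slots = nodes * map_slots_per_node
    waves = math.ceil(blocks / slots)
    return waves * minutes_per_block


print(map_phase_minutes(25, 5, nodes=1))    # 25 waves of 5 minutes
print(map_phase_minutes(25, 5, nodes=15))   # 2 waves of 5 minutes
```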
Just to give you an idea, I will share part of the results. The rate when executing a particular use case on 1 node was ~46 MBytes/second, for 2 nodes it was ~73 MBytes/second, and for 3 nodes it was ~85 MBytes/second (in my case the replication factor was equal to the number of nodes).
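Those three rates already show the non-linearity. A short sketch that turns them into a scaling efficiency relative to one node (the function name is made up; the rates are the ones quoted above):

```python
def scaling_efficiency(rates):
    """rates maps node count to measured MBytes/second; it must contain 1 node."""
    base = rates[1]
    # Speedup relative to one node, divided by the number of nodes used.
    return {nodes: (rate / base) / nodes for nodes, rate in rates.items()}


measured = {1: 46.0, 2: 73.0, 3: 85.0}
for nodes, eff in scaling_efficiency(measured).items():
    print(nodes, round(eff, 2))
```

With these numbers the efficiency falls to roughly 0.79 on 2 nodes and roughly 0.62 on 3 nodes.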
The problem is complex and requires time, patience, and some analytical skills to solve. Have fun!

Hybrid : OpenMPI + OpenMP on a cluster

I am numerically solving some ordinary differential equations.
The computation is conceptually very simple, but very long. There is a very long array (~2M cells), and for each cell I need to perform a numerical integration. This procedure has to be repeated 1000 times. Using OpenMP parallelism on one 24-core machine, it takes around a week (which is not acceptable).
I have a cluster of 20 such (24-core) machines and am thinking about a hybrid implementation. I want to use MPI to distribute work over these 20 nodes and use regular OpenMP parallelism within each node.
Basically, I need to split my very long array into 20 (nodes) x 24 (processors) working units.
Are there any suggestions for a better implementation, or better ideas? I've read a lot on this subject and I've got the impression that such a hybrid implementation does not necessarily bring a real speedup.
Maybe I should create a "pool of workers" and "feed" them with my array, or something else.
Any suggestion and useful links are welcome!

If your computation is as embarrassingly parallel as you indicate you should expect good speedup by spreading the load across all 20 of your machines. By good I mean close to 20 and by close to 20 I mean any number which you actually get which leaves you thinking that the effort has been worthwhile.
Your proposed hybrid solution is certainly feasible and you should get good speedup if you implement it.
One alternative to a hybrid MPI+OpenMP program would be a job script (written in your favourite scripting language) which simply splits your large array into 20 pieces and starts 20 jobs, one on each machine running an instance of your program. When they've all finished have another script ready to recombine the results. This would avoid having to write any MPI code at all.
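The splitting step of such a job script can be sketched in Python as below. The helper `split_ranges` is hypothetical; each (start, stop) pair would be handed to one instance of your program, for example as command-line arguments.

```python
def split_ranges(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, near-equal [start, stop) ranges."""
    base, extra = divmod(n_items, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts get one additional item each.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# ~2M cells over 20 machines: 100000 cells per machine.
for machine, (start, stop) in enumerate(split_ranges(2_000_000, 20)):
    print(machine, start, stop)
```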
If your computer has an installation of Grid Engine you can probably write a job submission script to submit your work as an array job and let Grid Engine take care of parcelling the work out to the individual machines/tasks. I expect that other job management systems have similar facilities but I'm not familiar with them.
Another alternative would be an all-MPI code, that is drop the OpenMP altogether and modify your code to use whatever processors it finds available when you run it. Again, if your program requires little or no inter-process communication you should get good speedup.
Using MPI on a shared memory computer is sometimes a better (in performance terms) approach than OpenMP, sometimes worse. Trouble is, it's difficult to be certain about which approach is better for a particular program on a particular architecture with RAM and cache and interconnects and buses and all the other variables to consider.
One factor I've ignored, largely because you've provided no data to consider, is the load-balancing of your program. If you split your very large dataset into 20 equal-sized pieces, do you end up with 20 equal-duration jobs? If not, and if you have an idea how job time varies with inputs, you might do something more sophisticated in splitting the job up than simply chopping your dataset into those 20 equal pieces. You might, for instance, chop it into 2000 equal pieces and serve them one at a time to the machinery for execution. In this case what you gain in load-balancing might be at risk of being lost to the time costs of job management. You pays yer money and you takes yer choice.
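This trade-off can be explored with a small simulation before writing any MPI code. The sketch below compares one contiguous piece per worker against handing out small chunks to whichever worker is free first, with an assumed fixed management overhead per chunk. Both functions and all numbers are illustrative, not a model of any particular scheduler.

```python
import heapq


def static_makespan(durations, workers):
    """One contiguous, equal-count piece per worker; return the slowest worker's time."""
    base, extra = divmod(len(durations), workers)
    worst, start = 0.0, 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        worst = max(worst, sum(durations[start:start + size]))
        start += size
    return worst


def dynamic_makespan(durations, workers, chunk, overhead=0.0):
    """Serve chunks one at a time to the worker that becomes free first."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for start in range(0, len(durations), chunk):
        t = heapq.heappop(free_at)  # earliest available worker
        work = sum(durations[start:start + chunk])
        heapq.heappush(free_at, t + overhead + work)
    return max(free_at)


# Unbalanced workload: the expensive cells are all at the end of the array.
cells = [1.0] * 6 + [5.0] * 2
print(static_makespan(cells, 2))
print(dynamic_makespan(cells, 2, chunk=1))
print(dynamic_makespan(cells, 2, chunk=1, overhead=1.0))
```

In this toy case dynamic dispatch cuts the finish time from 12 to 8, and a per-chunk overhead of 1.0 brings it back to 12, which is the risk described above.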
From your problem statement I wouldn't be making a decision about which solution to go for on the basis of expected performance, because I'd expect any of the approaches to get into the same ballpark performance-wise, but on the time to develop a working solution.

Variation of the job scheduling prob

I'm doing some administration work for an aviation transport company. They build aircraft containers and such here. One of the things they want me to code is an order optimization script that the guys on the floor can use to get the most out of the given material. To give a simple overview: say we order a certain number of beams that are 10 meters per unit. We need beam chunks of 5x 6m, 10x 3.5m, 4x 3m, which are obtained by cutting the 10m beams into smaller parts. What is the minimum number of 10m beams we need to order?
There are some parallels with the multiprocessor job scheduling problem (one beam is a processor, each chunk a job), although that focuses on minimizing the time required to perform all jobs instead of minimizing the number of processors needed to perform all jobs within a pre-set time. The multiprocessor job scheduling problem is NP-complete, but I wonder whether my variation of the problem is too. Does anybody know similar problems and methods for solving them?

This problem is exactly the cutting stock problem: http://en.wikipedia.org/wiki/Cutting_stock_problem (more generally, the bin packing problem: http://en.wikipedia.org/wiki/Bin_packing_problem). You can use any old ILP solver. I like http://lpsolve.sourceforge.net/5.5/; it's quite friendly to use.
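If you want a quick answer before setting up a solver, the first-fit-decreasing heuristic for bin packing is a few lines of Python. Note that this is a heuristic and not the ILP formulation suggested above, so in general it only gives an upper bound on the number of beams. Lengths are in centimetres to keep the arithmetic in integers; the function name is made up.

```python
import math


def first_fit_decreasing(stock_length, demands):
    """demands maps piece length to quantity; returns one list of pieces per beam."""
    pieces = sorted(
        (length for length, count in demands.items() for _ in range(count)),
        reverse=True,
    )
    beams = []  # each entry is [remaining_length, [pieces cut from this beam]]
    for piece in pieces:
        if piece > stock_length:
            raise ValueError(f"piece of {piece} does not fit in a beam")
        for beam in beams:
            if beam[0] >= piece:      # first beam with enough room
                beam[0] -= piece
                beam[1].append(piece)
                break
        else:
            beams.append([stock_length - piece, [piece]])
    return [cuts for _, cuts in beams]


demands = {600: 5, 350: 10, 300: 4}
plan = first_fit_decreasing(1000, demands)
lower_bound = math.ceil(sum(l * c for l, c in demands.items()) / 1000)
print(len(plan), lower_bound)
for cuts in plan:
    print(cuts)
```

For the example in the question the heuristic uses 8 beams, and 77m of material cannot fit in fewer than 8 beams of 10m, so 8 is optimal here. That will not hold for every instance.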

Prioritizing Erlang nodes

Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN, while others may be connected using a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, latency induced) and b) nodes with differing computational power (or even memory constraints for that matter)?
In other words, how do I prioritize local nodes that have lots of computational power over those that have a high latency and may be less powerful? And how would I ideally prioritize high-performance remote nodes with high transmission latencies so that they specifically handle processes with a relatively high computation/transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking in terms of benchmarking each node in a cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be calculated, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node completes a given task).
Probably, something like that would have to be done repeatedly, on the one hand in order to get representative data (that is, averaging data) and on the other hand it might possibly even be useful at runtime in order to be able to dynamically adjust to changing runtime conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines)
This would be meant to hopefully optimize internal job dispatch so that specific nodes handle specific jobs.

We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
1. The overhead in simply making a remote call over a local (internal) call
2. The network latency to the node (as a function of the request/result payload)
3. The performance of the remote node
4. The compute power needed to execute the function
5. Whether batching of calls provides any performance improvement if there was a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others)
For 2. we actively measured it using probe messages to measure round trip time, and we collated information from actual calls made
For 3. we measured it on the node and had them broadcast that information (this changed depending on the load currently active on the node)
For 4 and 5. we worked it out empirically for the given batch
Then the caller solved to get the minimum solution for a batch of calls (in our case pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
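A much simplified sketch of that last step, in Python rather than Erlang: estimate each job's finish time on each node from measured round-trip time and node speed, then greedily give the largest jobs to whichever node would finish them first. This greedy rule is a stand-in for illustration, not the solver we actually used, and every name and number below is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    round_trip_seconds: float   # measured with probe messages
    ops_per_second: float       # reported by the node itself
    busy_until: float = 0.0     # when already-assigned work will finish


def estimated_finish(node, job_ops):
    """Queue wait + network round trip + compute time on that node."""
    return node.busy_until + node.round_trip_seconds + job_ops / node.ops_per_second


def dispatch(jobs, nodes):
    """jobs maps job name to estimated operations; returns job name -> node name."""
    assignment = {}
    # Largest jobs first, so they get the first pick of the fast nodes.
    for job_name, job_ops in sorted(jobs.items(), key=lambda item: item[1], reverse=True):
        best = min(nodes, key=lambda node: estimated_finish(node, job_ops))
        best.busy_until = estimated_finish(best, job_ops)
        assignment[job_name] = best.name
    return assignment


nodes = [Node("lan", 0.001, 1e9), Node("wan", 0.2, 4e9)]
print(dispatch({"big": 4e9, "small": 1e6}, nodes))
```

With these made-up numbers the compute-heavy job goes to the fast but distant node and the small job stays on the LAN, which is the behaviour the question asks for.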
We got much better utilization of our calculation "grid" using this technique but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment so we had a lot more control. Adding in an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity with possible diminishing returns...

The problem you are talking about has been tackled in many different ways in the context of Grid computing (e.g, see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e. is there unexpected external load etc.?]).
Implementing an adaptive job dispatcher will usually also require adjusting the frequency with which you probe the available resources (otherwise the overhead due to probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).
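The feedback part does not have to start out sophisticated. As a deliberately simple stand-in for the statistical or reinforcement-learning models mentioned above, an exponentially weighted moving average can keep a per-node estimate of latency or execution time up to date. The class name and the smoothing weight are assumptions for this sketch.

```python
class RunningEstimate:
    """Exponentially weighted moving average of an observed quantity."""

    def __init__(self, initial, weight=0.2):
        self.value = initial
        self.weight = weight  # higher weight reacts faster to recent observations

    def update(self, observed):
        # Move the estimate a fraction of the way towards the new observation.
        self.value += self.weight * (observed - self.value)
        return self.value


latency = RunningEstimate(initial=0.050)
for sample in (0.048, 0.120, 0.052):   # hypothetical round-trip times in seconds
    print(round(latency.update(sample), 4))
```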
