I have a function that calculates a fitness value, say func(). In my implementation, I use MPI for parallelization.
There are 3 machines in the MPI cluster, connected via LAN. The machines share files via NFS, but memory is not shared among them.
The main while loop runs 500 times.
Inside this while loop, I use MPI to parallelize the 9 func() calls. That is, func() is called 9 times inside the main while loop, and I parallelized that so that each of the 3 nodes executes 3 of the func() calls and returns its results to the master node.
(Diagrams omitted: one showing the MPI workflow, and one showing what happens inside each node.)
This continues for all 500 iterations of the while loop (meaning, in each iteration the 9 func() calls are parallelized again).
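A minimal sketch of this pattern in C with MPI (func() here is just a placeholder for the actual fitness function, and the task-to-rank assignment is an assumption):

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder for the poster's fitness function. */
    double func(int task_id) {
        return (double)(task_id * task_id);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* assumed to be 3 */

        for (int iter = 0; iter < 500; iter++) {
            double local[3], all[9];
            /* Each rank evaluates 3 of the 9 fitness calls. */
            for (int i = 0; i < 3; i++)
                local[i] = func(rank * 3 + i);
            /* The master (rank 0) gathers all 9 results. */
            MPI_Gather(local, 3, MPI_DOUBLE, all, 3, MPI_DOUBLE,
                       0, MPI_COMM_WORLD);
            if (rank == 0) {
                /* ... use the 9 results to drive the next iteration ... */
            }
        }
        MPI_Finalize();
        return 0;
    }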
Is this strategy called parallel computing or distributed computing?
By the usual definitions, parallel computing means executing multiple tasks in parallel, while distributed computing means distributing a single task across multiple nodes that share a common goal. By that reading, I feel it's parallel computing.
But since I am executing on different machines, should I consider it distributed computing?
Please clarify.
If you use distributed computing to solve a single problem, then it is also parallel computing: you are using multiple computers (or processors) to solve a single problem, which satisfies the simple definition of parallel computing.
Parallel computing uses two or more processors (cores, computers) in combination to solve a single problem.
But not all parallel computing is distributed. You can perform parallel tasks to solve a problem using shared memory (with programming models like OpenMP), where you only use a single computer.
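For illustration, a minimal shared-memory sketch in C with OpenMP; everything runs on one machine in a single address space:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        /* All threads share the same memory; the reduction avoids races. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;
        printf("harmonic sum: %f\n", sum);
        return 0;
    }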
Personal opinion: you can use MPI to solve a problem on a single computer, and it remains parallel computing. By the broad definition of distributed computing there should be multiple computers for it to count as distributed, even though MPI gives each task its own memory space and uses message passing.
A distributed computer system consists of multiple software components that are on multiple computers, but run as a single system.
In your case it is both distributed and parallel. As Gilles Gouaillardet pointed out in the comments:
Your program is MPI, so it is both parallel (several tasks collaborate to achieve a goal) and distributed (each task has its own memory space and communicates with other tasks via messages, i.e., no shared memory).
Related
I’m a beginner in the field of graph matching and parallel computing. I read a paper about an efficient parallel matching algorithm. It explained the importance of locality, but I don't know what locality represents here, or what good and bad locality are.
Our distributed memory parallelization (using MPI) on p processing elements (PEs or MPI processes) assigns nodes to PEs and stores all edges incident to a node locally. This can be done in a load balanced way if no node has degree exceeding m/p. The second pass of the basic algorithm from Section 2 has to exchange information on candidate edges that cross a PE boundary. In the worst case, this can involve all edges handled by a PE, i.e., we can expect better performance if we manage to keep most edges locally. In our experiments, one PE owns nodes whose numbers are a consecutive range of the input numbers. Thus, depending on how much locality the input numbering contains we have a highly local or a highly non-local situation.
Generally speaking, locality in distributed models is the extent to which a global solution to a computational problem can be obtained from locally available data.
Good locality is when most nodes can construct solutions from local data, since they require less communication to fetch whatever is missing. Bad locality is when a node spends more time than desirable fetching remote data rather than computing a solution from local data.
Think of a simple distributed computer system which comprises a collection of computers each somewhat like a desktop PC, in as much as each one has a CPU and some RAM. (These are the nodes mentioned in the question.) They are assembled into a distributed system by plugging them all into the same network.
Each CPU has memory-bus access (very fast) to data stored in its local RAM. The same CPU's access to data in the RAM on another computer in the system will run across the network (much slower) and may require co-operation with the CPU on that other computer.
Locality is a property of the data used in the algorithm: local data is on the same computer as the CPU, non-local data is elsewhere in the distributed system. I trust it is clear that parallel computations proceed more quickly the more each CPU works only with local data. So the designers of parallel programs for distributed systems pay great attention to the placement of data, often seeking to minimise the number and sizes of data exchanges between processing elements. A common pattern is sketched below.
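As a hedged illustration in C with MPI (the block size and update rule are invented for the sketch), each rank computes on its own local block and exchanges only the two boundary values with its neighbours:

    #include <mpi.h>
    #include <string.h>

    #define N 1000  /* local block size per rank (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double u[N + 2] = {0};   /* local data plus one ghost cell per side */
        double unew[N + 2];

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int step = 0; step < 100; step++) {
            /* Exchange only the boundary values, not the whole block. */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* The bulk of the work touches only local memory. */
            for (int i = 1; i <= N; i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            memcpy(&u[1], &unew[1], N * sizeof(double));
        }
        MPI_Finalize();
        return 0;
    }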
A complication, unnecessary for understanding the key issues: on real distributed systems many of the individual CPUs are multi-core, and in some designs multiple multi-core CPUs share the same enclosure and have approximately memory-bus-speed access to all the RAM in that enclosure, which makes the node itself a shared-memory computer. But that's just detail, and a topic for another answer.
TL;DR: Is there any way to get SGE to round-robin between servers when scheduling jobs, instead of allocating all jobs to the same server whenever it can?
Details:
I have a large compute process that consists of many smaller jobs. I'm using SGE to distribute the work across multiple servers in a cluster.
The process requires a varying number of tasks at different points in time (technically, it is a DAG of jobs). Sometimes the number of parallel jobs is very large (~1 per CPU in the cluster), sometimes it is much smaller (~1 per server). The DAG is dynamic and not uniform, so it isn't easy to tell how many parallel jobs there are or will be at any given point.
The jobs use a lot of CPU but also do a non-trivial amount of I/O (especially at job startup and shutdown). They access a shared NFS server connected to all the compute servers. Each compute server has a narrower connection (10Gb/s), but the NFS server has several wide connections (40Gb/s) into the communication switch. I'm not sure what the bandwidth of the switch backbone is, but it is a monster, so it should be high.
For optimal performance, jobs should be scheduled across different servers when possible. That is, if I have 20 servers, each with 20 processors, submitting 20 jobs should run one job on each. Submitting 40 jobs should run 2 on each, etc. Submitting 400 jobs would saturate the whole cluster.
However, SGE is perversely intent on minimizing my I/O performance. Submitting 20 jobs schedules all of them on a single server, so they all fight over a single measly 10Gb/s network connection while 19 other machines with a combined 190Gb/s of bandwidth sit idle.
I can force SGE to execute each job on a different server in several ways (using resources, using special queues, using my parallel environment and specifying '-t 1-', etc.). However, this means I can only run one job per server, period. When the DAG opens up and spawns many jobs, the jobs will stall waiting for a completely free server while 19 of the 20 processors on each machine stay idle.
What I need is a way to tell SGE to assign each job to the next server that has an available slot, in round-robin order. Even better would be assigning each job to the least-loaded server (maximal number of unused slots, maximal fraction of unused slots, minimal number of used slots, etc.), but a dead-simple round-robin would do the trick.
This seems like a much more sensible strategy in general, compared to SGE's policy of running each job on the same server as the previous job, which is just about the worst possible strategy for my case.
I looked over SGE's configuration options but I couldn't find any way to modify the scheduling strategy. That said, SGE's documentation isn't exactly easy to navigate, so I could have easily missed something.
Does anyone know of any way to get SGE to change its scheduling strategy to round-robin or least-loaded or anything along these lines?
Thanks!
Simply change allocation_rule to $round_robin for the SGE parallel environment (sge_pe file):
allocation_rule
The allocation rule is interpreted by the scheduler thread and helps the scheduler decide how to distribute parallel processes among the available machines. If, for instance, a parallel environment is built for shared memory applications only, all parallel processes have to be assigned to a single machine, no matter how many suitable machines are available. If, however, the parallel environment follows the distributed memory paradigm, an even distribution of processes among machines may be favorable.
The current version of the scheduler only understands the following allocation rules:
<int>: An integer number fixing the number of processes per host. If the number is 1, all processes have to reside on different hosts. If the special denominator $pe_slots is used, the full range of processes as specified with the qsub(1) -pe switch has to be allocated on a single host (no matter which value belonging to the range is finally chosen for the job to be allocated).
$fill_up: Starting from the best suitable host/queue, all available slots are allocated. Further hosts and queues are "filled up" as long as a job still requires slots for parallel tasks.
$round_robin: From all suitable hosts a single slot is allocated until all tasks requested by the parallel job are dispatched. If more tasks are requested than suitable hosts are found, allocation starts again from the first host. The allocation scheme walks through suitable hosts in a best-suitable-first order.
Source: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_pe.html
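For example, a PE definition along these lines spreads single-slot tasks across hosts (the PE name and slot count are illustrative, and the exact field list varies by SGE version):

    pe_name            round_robin
    slots              400
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $round_robin
    control_slaves     FALSE
    job_is_first_task  TRUE

Register it with qconf -Ap (or modify an existing PE with qconf -mp), add it to the queue's pe_list, and submit jobs with something like qsub -pe round_robin 1 job.sh. Note that allocation_rule only affects jobs submitted through a parallel environment (-pe).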
Assuming that we have four 16-core nodes (node1, node2, node3, node4), how can I run a big parallelized program on node1, node2 and node3 at the same time? Or even using 16 cores in total, allocated as 7 cores on node1 + 8 cores on node2 + 1 core on node3 (the other cores being occupied)?
Is MPI the common way? Does OpenMP alone suffice? I haven't learned MPI, but I have used OpenMP within a single node.
You can use a combination of both OpenMP and MPI if required.
While MPI can utilize every core on each node, and is optimized to exploit locality of reference when it finds that other tasks are on the same machine, the code base needs to change a lot if it has already been developed.
Incrementally parallelizing your code with OpenMP is recommended, so you might want to orchestrate a hybrid where each MPI task utilizes its node's cores using OpenMP. So:
number of nodes = number of MPI tasks
number of cores per machine = number of OpenMP threads per machine
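A hedged sketch of that hybrid layout in C (the loop body and work split are illustrative; launcher flags vary by MPI implementation):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        /* Request thread support, since OpenMP threads run inside each MPI task. */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each MPI task (one per node) takes a strided share of the work;
           OpenMP spreads that share over the node's cores. */
        double local_sum = 0.0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = rank; i < 10000000; i += size)
            local_sum += 1.0 / (i + 1);

        double global_sum;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }

Launched with something like mpirun -np 4 --map-by node ./a.out and OMP_NUM_THREADS=16, each node runs one MPI task with 16 threads (the mapping flag shown is Open MPI's; MPICH spells it differently).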
I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.
For example, one common rule of thumb I see can be summarized as:
Big data, non-iterative, fault tolerant => MapReduce
Speed, small data, iterative, non-Mapper-Reducer type => MPI
But then I also see an implementation of MapReduce on MPI (MR-MPI) that does not provide fault tolerance yet seems more efficient than Hadoop MapReduce on some benchmarks, and that handles big data using out-of-core memory.
Conversely, there are also MPI implementations (MPICH2-YARN) on the new-generation Hadoop YARN, with its distributed file system (HDFS).
Besides, there seem to be provisions within MPI (scatter-gather, checkpoint-restart, ULFM and other fault-tolerance work) that mimic several features of the MapReduce paradigm; see the sketch below.
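For instance, a map step followed by a reduce can be written directly in MPI with MPI_Scatter and MPI_Reduce (a minimal sketch; the data and the per-record map are invented for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* "Map": the root scatters one record of input to each rank. */
        int data[64], chunk;               /* assumes size <= 64 */
        if (rank == 0)
            for (int i = 0; i < size; i++)
                data[i] = i + 1;
        MPI_Scatter(data, 1, MPI_INT, &chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);

        int mapped = chunk * chunk;        /* illustrative per-record map */

        /* "Reduce": combine all partial results on the root. */
        int reduced;
        MPI_Reduce(&mapped, &reduced, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("reduced result: %d\n", reduced);
        MPI_Finalize();
        return 0;
    }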
And how do Mahout, Mesos and Spark fit into all this?
What criteria can be used when deciding between (or a combination of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?
There might be good technical criteria for this decision, but I haven't seen anything published on them. There seems to be a cultural divide where it's understood that MapReduce gets used for sifting through data in corporate environments while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out:
Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be in a cluster of "white box" commodity systems connected via ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027 which suggests that use of Infiniband in a MapReduce environment is a relatively new thing.
So why would you want to run on a system that's highly optimized for Infiniband? It's significantly more expensive than ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).
If your application is sensitive to the kinds of network effects that Infiniband addresses and that are already baked into many MPI libraries, maybe MPI would be useful for you. If your app is relatively insensitive to network performance and spends most of its time on computations that don't require communication between processes, maybe MapReduce is a better choice.
If you have the opportunity to run benchmarks, you can extrapolate on whichever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100Mbit/s or Infiniband QDR to DDR, for example, then draw a line through the results and see whether the purchase of a faster interconnect optimized by MPI would get you where you want to go.
The link you posted about FEM being done on MapReduce (Link) actually uses MPI; it states as much right in the abstract. The authors combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.
Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic-complexity point of view and from a measurement point of view.
I have something in mind, but I don't know the typical solution that could help me achieve it.
I need a distributed environment where not only memory is shared but processing is also shared; that is, ALL the shared processors work as one big processor computing the code I wrote.
Could this be achieved, given that I have limited knowledge of data grids and Hadoop?
A data grid platform (where, as I understand it, only memory is shared)? Or Hadoop (where the code is shipped to the nodes, but each node processes it separately from the others, on its own subset of the data in HDFS)?
I need a solution that shares not only memory or code (as Hadoop does) but also the processing power of all the machines, as one single big processor with one single big memory.
Do you expect that you can just spawn a thread, have it executed somewhere, and have the middleware miraculously balance the load across nodes, moving threads from one node to another? I think you won't find this directly. The tagged frameworks don't have transparent shared memory either, for good reasons.
When using multiple nodes, you usually need them for processing power, and hiding everything and pretending you're on a single machine tends to cause unnecessary communication, slowing things down.
Instead, you can always design your app using the distribution API provided by those frameworks. For example in Infinispan, look for the Map-Reduce or Distributed Executors API.
I need to have a distributed environment where not only memory is shared but processing is also shared; that is, ALL the shared processors work as one big processor computing the code I wrote.
You do not benefit from processing on a single machine; an application scales when the processing is spread across multiple machines. If you want to see the effect of one big processor, you can virtualize a big physical machine into multiple virtual nodes (using technologies like VMware).
But distributed processing across multiple VM nodes on multiple physical machines in a big cluster is best for distributed applications. Hadoop or Spark is the best fit for these types of applications, depending on whether you need batch processing (Hadoop) or real-time processing (Spark).