What are some scenarios for which MPI is a better fit than MapReduce? - parallel-processing

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.
In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or PageRank fit the model quite well. On a distributed file system with locality of scheduling, the performance is apparently good. In comparison, MPI gives me explicit control over how nodes send messages to each other.
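To make the iterated fit concrete, here is a minimal single-machine sketch of one K-means iteration expressed as a map step and a reduce step (plain Python; the points and centroids are made-up toy data):

```python
from collections import defaultdict

# Toy 1-D data; a real job would read records from the distributed FS.
points = [0.5, 1.2, 9.8, 10.1, 0.9, 10.5]
centroids = [1.0, 10.0]

def mapper(p):
    # Map: emit (index of nearest centroid, point).
    idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    return idx, p

def reducer(assigned):
    # Reduce: average the points assigned to one centroid.
    return sum(assigned) / len(assigned)

groups = defaultdict(list)
for idx, p in map(mapper, points):
    groups[idx].append(p)
centroids = [reducer(ps) for _, ps in sorted(groups.items())]
print(centroids)  # one iteration; the driver re-submits until convergence
```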
Can anyone describe a cluster programming scenario where the more general MPI model is an obvious advantage over the simpler MapReduce model?

Almost any scientific code -- finite differences, finite elements, etc. That kind of leads to the circular answer that any distributed program which doesn't map easily onto MapReduce would be better implemented with the more general MPI model. Not sure that's much help to you; I'll downvote this answer right after I post it.

Although this question has been answered, I would like to add/reiterate one very important point.
MPI is best suited for problems that require a lot of interprocess communication.
When the data becomes large (petabytes, anyone?) and there is little interprocess communication, MPI becomes a pain: the processes will spend all their time shipping data around (bandwidth becomes the limiting factor) while your CPUs remain idle. Perhaps an even bigger problem is simply reading all that data.
This is the fundamental reason for having something like Hadoop: the data itself also has to be distributed - hence the Hadoop Distributed File System!
In short: MPI is good for task parallelism, and Hadoop is good for data parallelism.

The best answer that I could come up with is that MPI is better than MapReduce in two cases:
For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI you can build a query-response system where machines send messages to each other to route a query and generate the answer; see the sketch after this list.
For jobs where nodes need to communicate more than iterated MapReduce jobs support, but not so much that the communication overhead makes the computation impractical. I am not sure how often such cases occur in practice, though.
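As a rough illustration of the first case, here is a minimal query/response sketch using mpi4py (assuming mpi4py is available; the routing policy and the "answer" are placeholders):

```python
# Run with e.g.: mpiexec -n 4 python query_server.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Front end: route each incoming query to a worker, wait for the answer.
    for i, query in enumerate(["q1", "q2", "q3"]):
        worker = 1 + i % (size - 1)              # trivial round-robin routing
        comm.send(query, dest=worker, tag=1)
        answer = comm.recv(source=worker, tag=2)
        print(query, "->", answer)
    for w in range(1, size):                     # tell the workers to stop
        comm.send(None, dest=w, tag=1)
else:
    # Worker: answer queries until told to stop.
    while True:
        query = comm.recv(source=0, tag=1)
        if query is None:
            break
        comm.send(query.upper(), dest=0, tag=2)  # placeholder "answer"
```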

I expect that MPI beats MapReduce easily when the task iterates over a data set whose size is comparable to the processor cache, and when communication with other tasks is frequently required. Many scientific domain-decomposition parallelization approaches fit this pattern. If MapReduce forces sequential processing and communication, or the tearing down and restarting of processes between iterations, then the computational benefit of working with a cache-sized problem is lost.
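For a flavour of that pattern, here is a minimal 1-D domain-decomposition sketch with mpi4py (assumed installed): each rank updates a cache-sized chunk and exchanges halo cells with its neighbours every step. The chunk size and the update rule are placeholders:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 1024                        # assumed cache-sized chunk per rank
u = np.zeros(n_local + 2)             # +2 ghost (halo) cells
u[1:-1] = np.random.rand(n_local)

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Swap halo cells with the neighbours; Sendrecv avoids deadlock.
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # Jacobi-style smoothing of the interior, standing in for real physics.
    u[1:-1] = 0.5 * (u[:-2] + u[2:])
```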

MPI is better when the computation and data you are using have irregular behaviour that mostly translates into many message exchanges between objects, or when you need low-level hardware access, e.g. RDMA. Some answers here mention task latency or the memory consistency model; frameworks like Spark, and actor models like Akka, have shown that they can compete with MPI on those fronts. Finally, one should consider that MPI has the benefit of having been, for years, the main foundation for developing the libraries needed for scientific computation (these are the most important parts missing from the newer frameworks built on DAG/MapReduce models).
All in all, I think the benefits that MapReduce/DAG models bring to the table, like dynamic resource managers and fault-tolerant computation, will make them feasible for scientific computing groups.

Related

Which metrics to measure the efficiency of a MapReduce application?

I wrote a MapReduce application which runs on 6 computer nodes.
I am sure that my MapReduce algorithm (run on a cluster of computers) outperforms the sequential algorithm (run on a single computer), but I think this does not mean that my MapReduce algorithm is efficient enough, right?
I have searched around and found the speedup, scaleup, and sizeup metrics. Is it true that we normally consider these metrics when measuring the efficiency of a MapReduce application? Are there any other metrics we should consider?
Thanks a lot.
Before specifically addressing your question, let's revisit the map-reduce model and see what problem it really tries to solve. You can refer to this answer (by me; of course, you can refer to other good answers to the same problem) to get an idea of the map-reduce model.
So what does it really try to solve? It provides a generic model that can be applied to a vast range of problems that need to process massive amounts of data (usually GBs, or even petabytes). The real appeal of this model is that it can be parallelized easily, and execution can even be distributed easily among a number of nodes. This article (by me) has a detailed explanation of the whole model.
So, on to your question: you are asking about measuring the efficiency of a map-reduce program in terms of speed, memory efficiency, and scalability.
Speaking to the point, the efficiency of a map-reduce program always depends on how far it exploits the parallelism offered by the underlying computational power. This directly implies that a map-reduce program that runs well on one cluster may not be ideal for a different cluster. So we need a good understanding of our cluster if we hope to fine-tune our program precisely, though in practice it is rare that someone needs to tune it to that level.
Let's take your points one by one:
Speed-up:
It depends on how you split your input into different portions. This directly determines the amount of parallelism (the part under human control). So, as I mentioned above, the speed-up directly depends on how well your split logic is able to utilize your cluster.
Memory efficiency:
It mostly depends on how memory-efficient your mapper logic and reducer logic are.
Scalability:
This is mostly not a concern. The map-reduce model is already so highly scalable that one would rarely think about going the extra mile.
So, speaking as a whole, the efficiency of a map-reduce program is rarely a concern (even speed and memory). Practically speaking, the most valuable metric is the quality of its output, i.e. how good your analytic data are (for marketing, research, etc.).
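That said, if you do want to put numbers on the speedup/scaleup metrics the question mentions, the standard definitions are easy to compute. A minimal sketch (the timings below are made-up placeholders):

```python
def speedup(t_1node, t_nnodes):
    # Same problem size on 1 node vs. n nodes; ideal is n.
    return t_1node / t_nnodes

def efficiency(t_1node, t_nnodes, n):
    # Speedup per node; ideal is 1.0 (linear scaling).
    return speedup(t_1node, t_nnodes) / n

def scaleup(t_base, t_scaled):
    # Problem size and node count grown by the same factor; ideal is 1.0.
    return t_base / t_scaled

print(speedup(3600, 700))        # e.g. a 6-node run: ~5.1x
print(efficiency(3600, 700, 6))  # ~0.86, i.e. 86% parallel efficiency
print(scaleup(600, 660))         # ~0.91, close to ideal scaleup
```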

How to distribute processing to a cluster of heterogeneous compute nodes, taking relative performance and cost of communication into account?

Given a cluster of truly heterogeneous compute nodes, how is it possible to distribute processing to them while taking into account both their relative performance and the cost of passing messages between them?
(I know optimising this is NP-complete in general.)
Which concurrency platforms currently best support this?
You might rephrase/summarise the question as:
What algorithms make most efficient use of cpu, memory and communications resources for distributed computation in theory and what existing (open source) platforms come closest to realising this?
Obviously this depends somewhat on workload so understanding the trade-offs is critical.
Some Background
I find some people on S/O want to understand the background so they can provide a more specific answer, so I've included quite a bit below, but it's not necessary to the essence of the question.
A typical scenario I see is:
We have an application which runs on X nodes
each with Y cores. So we start with a homogeneous cluster.
Every so often the operations team buys one or more new servers.
The new servers are faster and may have more cores.
They are integrated into the cluster to make things run faster.
Some older servers may be re-purposed but the new cluster now contains machines with different performance characteristics.
The cluster is no-longer homogeneous but has more compute power overall.
I believe this scenario must be standard in big cloud data-centres as well.
It's how this kind of change in infrastructure can best be utilised that I'm really interested in.
In one application I work with, the work is divided into a number of relatively long tasks. Tasks are allocated to logical processors (we usually have one per core) as they become available. While there are tasks to perform, cores are generally not unoccupied, and for the most part those jobs can be classified as "embarrassingly scalable".
This particular application is currently C++ with a roll-your-own concurrency platform, using ssh and nfs for large tasks.
I'm considering the arguments for various alternative approaches.
Some parties prefer various hadoop map/reduce options. I'm wondering how they shape up versus more C++/machine-oriented approaches such as OpenMP and Cilk++. I'm more interested in the pros and cons than in the answer for that specific case.
The task model itself seems scalable and sensible independent of platform.
So, I'm assuming a model where you divide the work into tasks and a (probably distributed) scheduler tries to decide which processor to allocate each task to. I am open to alternatives.
There could be task queues for each node, possibly even for each processor, and idle processors should be able to steal work (e.g. from processors with long queues) - see the sketch below.
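To be explicit about the model I mean, here is a toy work-stealing sketch in Python, with threads standing in for nodes (the queue contents and the termination rule are simplified for brevity):

```python
import random
import threading
from collections import deque

N_WORKERS = 4
queues = [deque(range(i * 10, i * 10 + 10)) for i in range(N_WORKERS)]
locks = [threading.Lock() for _ in range(N_WORKERS)]

def worker(me):
    while True:
        task = None
        with locks[me]:
            if queues[me]:
                task = queues[me].popleft()       # own queue first
        if task is None:
            victim = random.randrange(N_WORKERS)  # otherwise try one steal
            with locks[victim]:
                if queues[victim]:
                    task = queues[victim].pop()   # steal from the far end
        if task is None:
            return  # crude termination; a real scheduler would retry/poll
        # ... process the task here ...

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```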
However, when I look at the various models of high performance and cloud cluster computing I don't see this discussed so much.
Michael Wong classifies parallelism, ignoring hadoop, into two main camps (starting around 14min in).
https://isocpp.org/blog/2016/01/the-landscape-of-parallelism-michael-wong-meetingcpp-2015
HPC and multi-threaded applications in industry
The HPC community seems to favour OpenMP on a cluster of identical nodes.
This may still be heterogeneous if each node supports CUDA or has FPGA support but each node tends to be identical.
If that's the case, do they upgrade their data centres in a big bang, or what?
(E.g. supercomputer 1 = 100 nodes of type x; supercomputer v2.0 is on a different site with 200 nodes of type y.)
OpenMP only supports a single physical computer by itself.
The HPC community gets around this either by using MPI (which I consider too low-level) or by creating a virtual machine from all the nodes, using a hypervisor like scaleMP or vNUMA (see for example - OpenMP program on different hosts).
(Anyone know of a good open-source hypervisor for doing this?)
I believe these are still considered the most powerful computing systems in the world.
I find that surprising, as I don't see what prevents the map/reduce people from creating an even bigger cluster far more easily - one that is much less efficient overall but wins on brute force due to the total number of cores utilised.
So which other concurrency platforms support truly heterogeneous nodes with widely varying characteristics and how do they deal with the performance mismatch (and similarly the distribution of data)?
I'm excluding MPI as an option because, while powerful, it is too low-level - you might as well say "use sockets". A framework built on MPI would be acceptable (does X10 work this way?).
From the user's perspective, the map/reduce approach seems to be to add enough nodes that it doesn't matter, and not to worry about using them at maximum efficiency. In reality those details are kept under the hood, in the implementations of the schedulers and distributed file systems.
How/where is the cost of computation and message passing taken into account?
Is there any way in OpenMP (or your favourite concurrency platform) to make effective use of the information that this node is N times as fast as that node, and that the data transfer rate to or from this node is on average X Mb/s?
In YARN you have Dominant Resource Fairness:
http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
This covers memory and cores using Linux Control Groups but it does not yet
cover disk and network I/O resources.
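For reference, the core of DRF (from the Ghodsi et al. paper linked above) is a loop that always grants the next task to the user with the lowest dominant share. A minimal sketch, using what I believe is the paper's running example for capacities and per-task demands:

```python
capacity = {"cpu": 9.0, "mem": 18.0}
demands = {"A": {"cpu": 1.0, "mem": 4.0},  # user A's per-task demand
           "B": {"cpu": 3.0, "mem": 1.0}}  # user B's per-task demand
alloc = {u: {r: 0.0 for r in capacity} for u in demands}

def dominant_share(user):
    # A user's dominant share: their largest fraction of any single resource.
    return max(alloc[user][r] / capacity[r] for r in capacity)

def used(resource):
    return sum(alloc[u][resource] for u in alloc)

while True:
    user = min(demands, key=dominant_share)  # lowest dominant share goes next
    d = demands[user]
    if any(used(r) + d[r] > capacity[r] for r in capacity):
        break  # no room (a fuller version would try the other users first)
    for r in capacity:
        alloc[user][r] += d[r]

print(alloc)  # DRF equalizes dominant shares: A gets 3 tasks, B gets 2
```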
Are there equivalent or better approaches in other concurrency platforms? How do they compare to DRF?
Which concurrency platforms handle this best and why?
Are there any popular ones that are likely to be evolutionary dead ends?
OpenMP keeps surprising me by actively thriving. Could something like Cilk++ be made to scale this way?
Apologies in advance for combining several PhD theses' worth of questions into one.
I'm basically looking for tips on what to look at for further reading, and advice on which platforms to investigate further (from the programmer's perspective).
A good summary of some platforms to investigate and/or links to papers or articles would suffice as a useful answer.

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark and machine-learning algorithms.
Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class logistic regression classifier.
It is expected that this kind of machine-learning processing will be expensive in terms of time and resources.
I am running the code now and everything seems to be OK, except that I have no idea whether my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for a certain type of processing and computing resources, the processing time should be in the approximate order of...
Is there any method that helps in judging whether my application is running slow or fast, or is it purely a matter of experience?
I had the same question, and I didn't find a real answer/tool/way to test how good my performance was just by looking "only inside" my application.
I mean, as far as I know, there's no tool like a speed test, the way there is for an internet connection :-)
The only way I found is to re-write my app (if possible) with another stack, in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found 2 main resources very interesting, even if they're quite old:
1) A sort of 4-point guide to keep in mind when coding:
Understanding the Performance of Spark Applications, Spark Summit 2013
2) A 2-episode article from the Cloudera blog on tuning your jobs:
episode1
episode2
Hope it helps
FF
Your question is pretty generic, so I would also highlight a few generic areas where you can look for performance optimizations:
Scheduling delays - Are there significant delays in scheduling the tasks? If yes, then analyze the reasons (maybe your cluster needs more resources, etc.).
Utilization of the cluster - Are your jobs utilizing the available cluster resources (like CPU and memory)? If not, again look for the reasons. Maybe creating more partitions helps with faster execution. Maybe significant time is taken by serialization, in which case you can switch to Kryo serialization.
JVM tuning - Consider analyzing GC logs and tune if you find anomalies.
Executor configuration - Analyze the memory and cores provided to your executors. They should be sufficient to hold the data processed by the task/job.
Your DAG and driver configuration - As with the executors, the driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality your tasks get.
All of the above is needed for a preliminary investigation, and in some cases it can also improve the performance of your jobs to an extent, but there can be complex issues whose solutions depend on the individual case. A sketch of a couple of these settings follows below.
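To make a couple of these knobs concrete, here is a minimal PySpark session setup showing the serializer and memory/core settings mentioned above (the sizing values are placeholders you would tune for your own cluster):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Kryo is usually faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Executor sizing: enough memory/cores for a task's working set (assumed).
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "4")
    # The driver needs room for results of actions like collect().
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)
```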
Please also see Spark Tuning Guide

What is the difference and how to choose between distributed queue and distributed computing platform?

There are many files that need to be processed by two computers in real time. I want to distribute them to the two computers, and these tasks need to be completed as soon as possible (meaning real-time processing). I am thinking about the plan below:
(1) a distributed queue like Gearman
(2) a distributed computing platform like Hadoop / Spark / Storm / S4 and so on
I have two questions:
(1) What are the advantages and disadvantages of (1) versus (2)?
(2) How do I choose within (2): Hadoop? Spark? Storm? S4? Or something else?
Thanks!
Maybe I have not described the question clearly. In most cases there are 1000-3000 files with the same format. These files are independent; you do not need to care about their order. The size of one file is maybe tens to hundreds of KB, and in the future both the number of files and the size of a single file will grow. I have written a program that can process a file, pick out the data, and then store the data in MongoDB. Now there are only two computers; I just want a solution that can process these files with the program quickly (as soon as possible) and is easy to extend and maintain.
A distributed queue is easy to use in my case but maybe hard to extend and maintain, while Hadoop/Spark is too "big" for two computers but easy to extend and maintain. Which is better? I am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other, or do you need some form of aggregation? i.e.: do you need some pieces of data to go together? Say, all transactions from a single user account.
Is your processing CPU bound? Memory bound? Filesystem bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on the answers to these questions (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then running batch computations with Spark (which should generally be better than plain Hadoop).
Maybe not, and you can use Spark Streaming to process the data as you receive it.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
etc. So the answer to "how do you choose?" is "by carefully looking at the constraints of your particular problem, estimating the load demands on your system, and picking the solution that best matches those". None of these solutions and frameworks dominates the others - that's why they are all alive and kicking. The choice is all in the trade-offs you are willing/able to make.
Hope it helps.
First of all, dannyhow is right - this is not what real-time processing is about. There is a great book, http://www.manning.com/marz/, which says a lot about the lambda architecture.
The two approaches you mentioned serve completely different purposes and are connected to the definition of the word "task". For example, Spark will take a whole job you give it and divide it into "tasks", but the outcome of one task is useless to you on its own; you still need to wait for the whole job to finish. You can create small jobs working on the same dataset and use Spark's caching to speed them up, as in the sketch below. But then you won't get much advantage from distribution (if they have to be run one after another).
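As a small illustration of that caching point, here is a PySpark sketch where one cached dataset feeds several small jobs (the path and the filter are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()
files = spark.read.text("hdfs:///incoming/*.txt").cache()  # keep in memory

n_total = files.count()                            # job 1 materializes the cache
n_errors = files.filter(files.value.contains("ERROR")).count()  # job 2 reuses it
print(n_total, n_errors)
```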
Are the files big? Are they somehow connected to each other? If yes, I'd go with Spark. If not, a distributed queue.

Best method of having a single process distributed across a cluster

I'm very new to cluster computing, and wanted to know more about the various software used for cluster computing, and which is best for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers scenario, where a single Manager is responsible for the creation of hundreds to thousands of jobs. Each job, while relatively large, must execute on a small frame-by-frame basis; i.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame will be very short, so the latency between the Manager and the worker machines must be very small, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high-performance networking such as InfiniBand.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.
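In case it helps as a starting point, here is a minimal master/worker frame-stepping sketch with mpi4py (assuming mpi4py; a C/C++ or Java MPI version would follow the same shape, and a real implementation would likely use broadcasts or non-blocking calls to cut latency further):

```python
# Run with e.g.: mpiexec -n 8 python frames.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
N_FRAMES = 100

if rank == 0:
    # Manager: step every worker forward one frame at a time.
    for frame in range(N_FRAMES):
        for w in range(1, size):
            comm.send(frame, dest=w, tag=1)   # "advance one frame"
        for w in range(1, size):
            comm.recv(source=w, tag=2)        # "...and report back to me"
    for w in range(1, size):
        comm.send(None, dest=w, tag=1)        # shut the workers down
else:
    # Worker: advance its job one frame per instruction.
    while True:
        frame = comm.recv(source=0, tag=1)
        if frame is None:
            break
        # ... advance this worker's job by one frame here ...
        comm.send(("done", rank, frame), dest=0, tag=2)
```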
