Given a cluster of truly heterogeneous compute nodes how is it possible to
distribute processing to them while taking into account both their relative performance
and cost of passing messages between them?
(I know that optimising this is NP-hard in general.)
Which concurrency platforms currently best support this?
You might rephrase/summarise the question as:
What algorithms make most efficient use of cpu, memory and communications resources for distributed computation in theory and what existing (open source) platforms come closest to realising this?
Obviously this depends somewhat on workload so understanding the trade-offs is critical.
Some Background
I find some on S/O want to understand the background so they can provide a more specific answer, so I've included quite a bit below, but it's not necessary to the essence of the question.
A typical scenario I see is:
We have an application which runs on X nodes
each with Y cores. So we start with a homogeneous cluster.
Every so often the operations team buys one or more new servers.
The new servers are faster and may have more cores.
They are integrated into the cluster to make things run faster.
Some older servers may be re-purposed but the new cluster now contains machines with different performance characteristics.
The cluster is no longer homogeneous but has more compute power overall.
I believe this scenario must be standard in big cloud data-centres as well.
It's how this kind of change in infrastructure can be best utilised that I'm really interested in.
In one application I work with, the work is divided into a number of relatively long tasks. Tasks are allocated to logical processors (we usually have one per core) as they become
available. While there are tasks to perform, cores are generally kept busy, and
for the most part those jobs can be classified as "embarrassingly scalable".
This particular application is currently C++ with a roll-your-own concurrency platform using ssh and NFS for large tasks.
I'm considering the arguments for various alternative approaches.
Some parties prefer various Hadoop map/reduce options. I'm wondering how they shape up against more C++/machine-oriented approaches such as OpenMP or Cilk++. I'm more interested in the pros and cons than in the answer for that specific case.
The task model itself seems scalable and sensible independent of platform.
So, I'm assuming a model where you divide work into tasks and a (probably distributed) scheduler tries to decide which processor to allocate each task to. I am open to alternatives.
There could be task queues for each node, possibly for each processor, and idle processors should be allowed to steal work (e.g. from processors with long queues).
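To make the per-node queue idea concrete, here is a toy simulation, not a real scheduler. It assumes tasks are dealt round-robin to the nodes up front, that each node has a fixed speed, and that stealing is free; `simulate`, `task_costs` and `speeds` are names invented for this sketch.

```python
import heapq
from collections import deque

def simulate(task_costs, speeds, steal=True):
    """Return the makespan when tasks are dealt round-robin to per-node queues.

    task_costs: work units per task; speeds: work units per second per node.
    With steal=True an idle node takes a task from the tail of the longest queue.
    """
    queues = [deque() for _ in speeds]
    for i, cost in enumerate(task_costs):
        queues[i % len(speeds)].append(cost)

    # Event heap of (time at which the node becomes free, node index).
    free_at = [(0.0, n) for n in range(len(speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    while free_at:
        now, node = heapq.heappop(free_at)
        if queues[node]:
            cost = queues[node].popleft()      # take from own queue first
        elif steal:
            victim = max(range(len(speeds)), key=lambda n: len(queues[n]))
            if not queues[victim]:
                continue                       # nothing left anywhere: node retires
            cost = queues[victim].pop()        # steal from the victim's tail
        else:
            continue                           # no stealing: node stays idle
        finish = now + cost / speeds[node]
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, node))
    return makespan

# Eight equal tasks on one fast node (4x) and one slow node (1x).
print(simulate([10] * 8, [4.0, 1.0], steal=False))
print(simulate([10] * 8, [4.0, 1.0], steal=True))
```

In this made-up case the fast node finishes its own queue early and then drains the slow node's backlog, so the mismatch in node speed is absorbed without the scheduler knowing the speeds in advance. A real system would also have to charge for moving the stolen task's data.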
However, when I look at the various models of high performance and cloud cluster computing I don't see this discussed so much.
Michael Wong classifies parallelism, ignoring hadoop, into two main camps (starting around 14min in).
HPC and multi-threaded applications in industry
The HPC community seems to favour OpenMP on a cluster of identical nodes.
This may still be heterogeneous if each node supports CUDA or has FPGA support but each node tends to be identical.
If that's the case do they upgrade their data centres in a big bang or what?
(E.g. supercomputer 1 = 100 nodes of type x. supercomputer v2.0 is on a different site with
200 nodes of type y).
OpenMP only supports a single physical computer by itself.
The HPC community gets around this either by using MPI (which I consider too low level) or by creating a virtual machine from all the nodes
using a hypervisor like ScaleMP or vNUMA (see for example - OpenMP program on different hosts).
(Does anyone know of a good open source hypervisor for doing this?)
I believe these are still considered the most powerful computing systems in the world.
I find that surprising, as I don't see what prevents the map/reduce people from creating an even bigger cluster more easily,
one that is much less efficient overall but wins on brute force due to the total number of cores utilised.
So which other concurrency platforms support truly heterogeneous nodes with widely varying characteristics and how do they deal with the performance mismatch (and similarly the distribution of data)?
I'm excluding MPI as an option because, while powerful, it is too low-level. You might as well say use sockets. A framework building on MPI would be acceptable (does X10 work this way?).
From the user's perspective, the map/reduce
approach seems to be to add enough nodes that it doesn't matter, and not to worry about using them at maximum efficiency.
Actually those details are kept under the hood in the implementation
of the schedulers and distributed file systems.
How/where is the cost of computation and message passing taken into account?
Is there any way in OpenMP (or your favourite concurrency platform)
to make effective use of the information that this node is N times as fast as that node and that the data transfer rate
to or from this node is on average X Mb/s?
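As an illustration of what "making use of that information" could look like, here is a minimal sketch of a static split. It assumes a deliberately simple cost model in which a node's time is its transfer time plus its compute time (no overlap), and all names (`split_work`, `speeds`, `bandwidths`, `bytes_per_unit`) are invented for the example.

```python
def split_work(total_units, speeds, bandwidths, bytes_per_unit):
    """Split work so that all nodes are predicted to finish together.

    speeds: work units per second for each node.
    bandwidths: bytes per second to each node.
    Assumes transfer and compute do not overlap (a pessimistic, simple model).
    """
    # Effective throughput of each node, including the time to ship its input.
    rates = [1.0 / (1.0 / s + bytes_per_unit / b)
             for s, b in zip(speeds, bandwidths)]
    total_rate = sum(rates)
    shares = [total_units * r / total_rate for r in rates]
    predicted_time = total_units / total_rate
    return shares, predicted_time

# Two equally fast nodes, but the second sits behind a much slower link.
shares, seconds = split_work(1000, speeds=[2.0, 2.0],
                             bandwidths=[1e6, 10.0], bytes_per_unit=5.0)
print(shares, seconds)
```

Under this model each node's share is proportional to 1 / (1/speed + bytes_per_unit/bandwidth), so a fast node behind a slow link can end up with less work than a slower local one. It only helps when the work is divisible and the speed and bandwidth figures are reasonably stable.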
In YARN you have Dominant Resource Fairness:
This covers memory and cores using Linux Control Groups but it does not yet
cover disk and network I/O resources.
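For readers who have not met DRF, the core idea can be sketched in a few lines. This is a simplified progressive-filling loop as the policy is usually described, not YARN's scheduler code; the function name, the resource names and the numbers are made up for illustration, and every per-task demand is assumed to be positive.

```python
def drf_allocate(capacity, demands):
    """Progressive-filling DRF: repeatedly give one task to the user with the
    lowest dominant share, until that user's next task no longer fits.

    capacity: {resource: total}; demands: {user: {resource: per-task need}}.
    Returns {user: number of tasks granted}.
    """
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # A user's dominant share is its largest fractional use of any resource.
        return max(tasks[user] * need / capacity[r]
                   for r, need in demands[user].items())

    while True:
        user = min(demands, key=dominant_share)
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in need):
            return tasks  # the cluster is full for the neediest user
        for r in need:
            used[r] += need[r]
        tasks[user] += 1

capacity = {"cpu": 9, "mem_gb": 18}
demands = {"A": {"cpu": 1, "mem_gb": 4},   # memory-heavy tasks
           "B": {"cpu": 3, "mem_gb": 1}}   # CPU-heavy tasks
print(drf_allocate(capacity, demands))
```

With these numbers A ends up with 3 tasks and B with 2, which gives both users the same dominant share (two thirds of memory for A, two thirds of the CPUs for B). Note that nothing here models node speed or network cost, which is the gap the question is asking about.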
Are there equivalent or better approaches in other concurrency platforms? How do they compare to DRF?
Which concurrency platforms handle this best and why?
Are there any popular ones that are likely to be evolutionary dead ends?
OpenMP keeps surprising me by actively thriving. Could something like Cilk++ be made to scale this way?
Apologies in advance for combining several PhD theses' worth of questions into one.
I'm basically looking for tips on what to look for in further reading
and advice on which platforms to investigate further (from the programmer's perspective).
A good summary of some platforms to investigate and/or links to papers or articles would suffice as a useful answer.


Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class Logistic Regression classifier.
It is expected that this kind of Machine Learning processing will be expensive in terms of time and resources.
I am now running the code and everything seems to be OK, except that I have no idea whether my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for a certain type of processing and computing resources, the processing time should be in the approximate order of...
Is there any method that helps in judging whether my application is running slow or fast, or is it purely a matter of experience?
I had the same question and I didn't find a real answer/tool/way to test how good my performance was by looking only "inside" my application.
I mean, as far as I know, there's no tool like the speed tests that exist for internet connections :-)
The only way I found is to rewrite my app (if possible) with another stack in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found two main resources very interesting, even if they are quite old:
1) A sort of 4-point guide to remember when coding:
Understanding the Performance of Spark Applications, Spark Summit 2013
2) A two-part article from the Cloudera blog on how best to tune your jobs:
Hope this helps.
Your question is pretty generic, so I would also highlight a few generic areas where you can look for performance optimizations:
Scheduling Delays - Are there significant delays in scheduling the tasks? If so, analyze the reasons (maybe your cluster needs more resources, etc.).
Utilization of Cluster - Are your jobs utilizing the available cluster resources (such as CPU and memory)? If not, again look for the reasons. Maybe creating more partitions would speed up execution. Maybe significant time is spent in serialization, in which case you could switch to Kryo serialization.
JVM Tuning - Consider analyzing the GC logs, and tune if you find anomalies.
Executor Configurations - Analyze the memory/cores provided to your executors. They should be sufficient to hold the data processed by the task/job.
your DAG and
Driver Configuration - As with executors, the driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality your tasks achieve.
All of the above are needed for a preliminary investigation, and in some cases they can also improve the performance of your jobs to an extent, but there may be complex issues whose solution depends on the specific case.
Please also see Spark Tuning Guide

Best method of having a single process distributed across a cluster

I'm very new to cluster computing, and wanted to know more about the various software used for cluster computing, and which is best for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers type scenario, where a single Manager is responsible for the creation of 100s to 1000s of jobs. Each job, while relatively large, must execute on a small frame-by-frame basis. I.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame will be very small, so latency between the Manager and the worker machines must be very small, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high-performance networking, such as InfiniBand.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.

How efficient are open-source computation platforms like Hadoop etc.?

How efficient are open-source distributed computation frameworks like Hadoop? By efficiency, I mean the CPU cycles that can be used for the "actual job" in tasks that are mostly pure computation. In other words, how many CPU cycles are used for overhead, or wasted because they go unused? I'm not looking for specific numbers, just a rough picture. E.g. can I expect to use 90% of the cluster's CPU power? 99%? 99.9%?
To be more specific, let's say I want to calculate PI, and I have an algorithm X. When I perform this on a single core in a tight loop, let's say I get some performance Y. If I do this calculation in a distributed fashion using e.g. Hadoop, how much performance degradation can I expect?
I understand this would depend on many factors, but what would be the rough magnitude? I'm thinking of a cluster with maybe 10 - 100 servers (80 - 800 CPU cores total), if that matters.
Technically, Hadoop has considerable overheads in several dimensions:
a) Per-task overhead, which can be estimated at 1 to 3 seconds.
b) HDFS data-reading overhead, due to passing data via sockets and CRC calculation. It is harder to estimate.
These overheads can be very significant if you have a lot of small tasks, and/or if your data processing is light.
At the same time, if you have big files (fewer tasks) and your data processing is heavy (let's say a few MB/s per core), then the Hadoop overhead can be neglected.
The bottom line: Hadoop overhead is variable and depends heavily on the nature of the processing you are doing.
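A quick back-of-the-envelope calculation shows how much the per-task overhead matters. The sketch below assumes a fixed 2-second overhead per task (the middle of the 1 to 3 second estimate above) and ignores the HDFS read overhead entirely; `cpu_efficiency` is just a name for this example.

```python
def cpu_efficiency(compute_seconds, overhead_seconds=2.0):
    """Fraction of a task's wall time spent on useful work."""
    return compute_seconds / (compute_seconds + overhead_seconds)

for compute in (1, 10, 60, 600):
    print(f"{compute:>4}s of work per task -> {cpu_efficiency(compute):.1%} useful")
```

Under that assumption a 1-second task spends only about a third of its time on useful work, while a 10-minute task loses well under 1%, which is why the answer to "90%? 99%? 99.9%?" depends almost entirely on task size.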
This question is too broad and vague to answer usefully. There are many different open-source platforms, varying very widely in their quality. Some early Beowulfs were notoriously wasteful, for example, whereas modern MPI2 is pretty lean.
Also, "efficiency" means different things in different domains. It might mean the amount of CPU overhead spent on constructing and passing messages relative to the work payload (in which case you're comparing MPI vs Map/Reduce), or it might mean the number of CPU cycles wasted by the interpreter/VM, if any (in which case you're comparing C++ vs Python).
It depends on the problem you are trying to solve, too. In some domains, you have lots of little messages flying back and forth, in which case the CPU cost of constructing them matters a lot (like high-frequency trading). In others, you have relatively few but large work-blocks, so the cost of packing the messages is small compared to the computational efficiency of the math inside the work block (like Folding@Home).
So in summary, this is an impossible question to answer generally, because there's no one answer. It depends on specifically what you're trying to do with the distributed platform, and what machinery it is running on.
MapR is one of the alternatives to Apache Hadoop, and Srivas (CTO and founder of MapR) has compared MapR with Apache Hadoop. The presentation and video below have metrics comparing MapR and Apache Hadoop. It looks like the hardware is not used efficiently in Apache Hadoop.
Apache Hadoop seems to be inefficient in some dimensions, but there is a lot of activity in the Apache Hadoop community around scalability/reliability/availability/efficiency. Next Generation MapReduce and HDFS scalability/availability are some of the things currently being worked on. These should be available in Hadoop version 0.23.
Until some time ago, the focus of the Hadoop community seemed to be on scalability, but it is now shifting towards efficiency as well.

What are some scenarios for which MPI is a better fit than MapReduce?

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.
In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or PageRank fit the model quite well. On a distributed file system with locality of scheduling, the performance is apparently good. In comparison, MPI gives me explicit control over how nodes send messages to each other.
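To show what "iterating the process" means for K-means, here is a single-process sketch that only mimics the map, shuffle and reduce phases; it is not Hadoop code. It uses 1-D points to stay short, and `kmeans_step` and `kmeans` are names invented for the example.

```python
from collections import defaultdict

def kmeans_step(points, centroids):
    """One MapReduce-style k-means iteration on 1-D points."""
    # Map: emit (index of nearest centroid, point) for every point.
    mapped = [(min(range(len(centroids)), key=lambda k: abs(p - centroids[k])), p)
              for p in points]
    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for key, p in mapped:
        groups[key].append(p)
    # Reduce: each new centroid is the mean of its group
    # (left unchanged if no point was assigned to it).
    return [sum(groups[k]) / len(groups[k]) if groups[k] else centroids[k]
            for k in range(len(centroids))]

def kmeans(points, centroids, max_iters=20):
    """Run map/shuffle/reduce rounds until the centroids stop moving."""
    for _ in range(max_iters):
        updated = kmeans_step(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids

print(kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0]))
```

Each round is one full job: only the small list of centroids has to be sent to every mapper, and only per-cluster partial results travel through the shuffle, which is why this kind of algorithm fits the model despite being iterative.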
Can anyone describe a cluster programming scenario where the more general MPI model is an obvious advantage over the simpler MapReduce model?
Almost any scientific code -- finite differences, finite elements, etc. Which kind of leads to the circular answer, that any distributed program which doesn't easily map to MapReduce would be better implemented with a more general MPI model. Not sure that's much help to you, I'll downvote this answer right after I post it.
Although this question has been answered, I would like to add/reiterate one very important point.
MPI is best suited for problems that require a lot of interprocess communication.
When Data becomes large (petabytes, anyone?), and there is little interprocess communication, MPI becomes a pain. This is so because the processes will spend all the time sending data to each other (bandwidth becomes a limiting factor) and your CPUs will remain idle. Perhaps an even bigger problem is reading all that data.
This is the fundamental reason behind having something like Hadoop. The Data also has to be distributed - Hadoop Distributed File System!
In short, MPI is good for task parallelism and Hadoop is good for data parallelism.
The best answer that I could come up with is that MPI is better than MapReduce in two cases:
For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI, you can build a query response system where machines send messages to each other to route the query and generate the answer.
For jobs where nodes need to communicate more than iterated MapReduce jobs support, but not so much that the communication overheads make the computation impractical. I am not sure how often such cases occur in practice, though.
I expect that MPI beats MapReduce easily when the task is iterating over a data set whose size is comparable with the processor cache, and when communication with other tasks is frequently required. Lots of scientific domain-decomposition parallelization approaches fit this pattern. If MapReduce requires sequential processing and communication, or ending of processes, then the computational performance benefit from dealing with a cache-sized problem is lost.
When the computation and data that you are using have irregular behavior that mostly translates into a lot of message passing between objects, or when you need low-level hardware access (e.g. RDMA), then MPI is better. Some answers here mention task latency or the memory consistency model, but frameworks like Spark and actor models like Akka have shown that they can compete with MPI. Finally, one should consider that MPI has the benefit of having been, for years, the main base for developing the libraries needed for scientific computation (these are the most important parts missing from the new frameworks that use DAG/MapReduce models).
All in all, I think the benefits that MapReduce/DAG models bring to the table, like dynamic resource managers and fault-tolerant computation, will make them feasible for scientific computing groups.

Prioritizing Erlang nodes

Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN, while others may be connected using a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, latency induced) and b) nodes with differing computational power (or even memory constraints for that matter)?
In other words, how do I prioritize local nodes that have lots of computational power over those that have a high latency and may be less powerful, or how would I ideally prioritize high-performance remote nodes with high transmission latencies so that they specifically handle those processes with a relatively high computation/transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking in terms of benchmarking each node in the cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be calculated, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node completes any task).
Probably, something like that would have to be done repeatedly, on the one hand in order to get representative data (that is, averaging data) and on the other hand it might possibly even be useful at runtime in order to be able to dynamically adjust to changing runtime conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines)
This would be meant to hopefully optimize internal job dispatch so that specific nodes handle specific jobs.
We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
1. The overhead in simply making a remote call over a local (internal) call
2. The network latency to the node (as a function of the request/result payload)
3. The performance of the remote node
4. The compute power needed to execute the function
5. Whether batching of calls provides any performance improvement if there was a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others).
For 2. we actively measured it using probe messages to measure round-trip time, and we collated information from actual calls made.
For 3. we measured it on the node and had the nodes broadcast that information (this changed depending on the load currently active on the node).
For 4. and 5. we worked it out empirically for the given batch.
Then the caller solved to get the minimum solution for a batch of calls (in our case pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
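The answer does not say how the minimum was computed, so the following is only a sketch of one common heuristic, written in Python rather than Erlang for brevity: greedily give each job to the node with the earliest predicted finish time. It assumes one round trip per job (no batching) and uses invented names (`dispatch`, `jobs`, `nodes`) and made-up numbers.

```python
def dispatch(jobs, nodes):
    """Assign each job to the node with the earliest predicted finish time.

    jobs: {job name: work units}
    nodes: {node name: (round_trip_seconds, work_units_per_second)}
    Returns ({node name: [job names]}, predicted makespan in seconds).
    """
    busy_until = {name: 0.0 for name in nodes}
    plan = {name: [] for name in nodes}
    # Placing the largest jobs first tends to balance the load better.
    for job, work in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        def finish(name):
            rtt, speed = nodes[name]
            return busy_until[name] + rtt + work / speed
        best = min(nodes, key=finish)
        busy_until[best] = finish(best)
        plan[best].append(job)
    return plan, max(busy_until.values())

nodes = {"local": (0.001, 1.0),    # low latency, slow
         "remote": (0.2, 4.0)}     # high latency, fast
jobs = {"big": 8.0, "tiny": 0.1}
print(dispatch(jobs, nodes))
```

With these figures the large job goes to the fast remote node, where the 0.2 s round trip is small next to the compute time, and the tiny job stays local, where latency would otherwise dominate. A greedy pass like this is not guaranteed to find the true minimum.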
We got much better utilization of our calculation "grid" using this technique but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment so we had a lot more control. Adding in an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity with possible diminishing returns...
The problem you are talking about has been tackled in many different ways in the context of Grid computing (e.g, see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e. is there unexpected external load etc.?]).
Implementing an adaptive job dispatcher will usually also require adjusting the frequency with which you probe the available resources (otherwise the overhead due to probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).
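As a much simpler stand-in for the statistical model and reinforcement learning suggested above, the feedback loop can be sketched with an exponentially weighted moving average, again in Python for brevity. The class name, its fields and the smoothing factor `alpha` are all assumptions made for this example.

```python
class NodeModel:
    """Running estimate of one node's speed and round-trip latency."""

    def __init__(self, speed, rtt, alpha=0.2):
        self.speed = speed      # work units per second (from the initial benchmark)
        self.rtt = rtt          # seconds (from the initial probe)
        self.alpha = alpha      # weight given to the newest observation

    def predict(self, work):
        """Predicted wall time for a job of the given size."""
        return self.rtt + work / self.speed

    def observe(self, work, compute_seconds, rtt_seconds):
        """Blend a finished job's measurements into the estimates."""
        a = self.alpha
        self.speed = (1 - a) * self.speed + a * (work / compute_seconds)
        self.rtt = (1 - a) * self.rtt + a * rtt_seconds

model = NodeModel(speed=10.0, rtt=0.1, alpha=0.5)
print(model.predict(20))                       # prediction from the benchmark alone
model.observe(work=20, compute_seconds=4.0, rtt_seconds=0.3)
print(model.predict(20))                       # prediction after one slow job
```

Because every real job updates the estimates, dedicated probe messages are only needed for nodes that have not been used recently, which keeps the probing overhead down.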
