Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN, while others may be connected using a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, latency induced) and b) nodes with differing computational power (or even memory constraints for that matter)?
In other words, how do I prioritize local nodes that have lots of computational power over those that have a high latency and may be less powerful? And how would I ideally prioritize high-performance remote nodes with high transmission latencies so that they specifically handle those processes with a relatively large computation-to-transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking in terms of benchmarking each node in a cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be calculated, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node completes a given task).
Probably, something like that would have to be done repeatedly, on the one hand in order to get representative data (that is, averaging data) and on the other hand it might possibly even be useful at runtime in order to be able to dynamically adjust to changing runtime conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines)
This would be meant to hopefully optimize internal job dispatch so that specific nodes handle specific jobs.
We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
1. The overhead in simply making a remote call over a local (internal) call
2. The network latency to the node (as a function of the request/result payload)
3. The performance of the remote node
4. The compute power needed to execute the function
5. Whether batching of calls provides any performance improvement if there was a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others)
For 2. we actively measured it using probe messages to measure round trip time, and we collated information from actual calls made
For 3. we measured it on the node and had the nodes broadcast that information (this changed depending on the load currently active on the node)
For 4 and 5. we worked it out empirically for the given batch
Then the caller solved to get the minimum solution for a batch of calls (in our case pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
We got much better utilization of our calculation "grid" using this technique but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment so we had a lot more control. Adding in an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity with possible diminishing returns...
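As a rough illustration of the cost model described above, here is a minimal sketch in Python. It is not the code we ran: it scores each job greedily on its own rather than solving for a whole batch, and the `Node` fields, `estimated_time` and `pick_node` are hypothetical names. It assumes you already have a measured round trip time, a measured transfer rate and a self-reported compute speed per node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # All fields are assumed to come from your own measurements.
    name: str
    rtt_s: float          # round trip time from probe messages, in seconds
    bytes_per_s: float    # observed transfer rate to/from the node
    work_per_s: float     # work units per second, as reported by the node


def estimated_time(node, payload_bytes, work_units):
    """Estimated wall-clock time: latency + transfer + compute."""
    transfer = payload_bytes / node.bytes_per_s
    compute = work_units / node.work_per_s
    return node.rtt_s + transfer + compute


def pick_node(nodes, payload_bytes, work_units):
    """Greedy choice: the node with the lowest estimated time for this job."""
    return min(nodes, key=lambda n: estimated_time(n, payload_bytes, work_units))


local = Node("local", rtt_s=0.001, bytes_per_s=1e8, work_per_s=100.0)
remote = Node("remote", rtt_s=0.15, bytes_per_s=1e6, work_per_s=1000.0)

# A small job stays local; a compute-heavy job is worth the WAN latency.
small = pick_node([local, remote], payload_bytes=1000, work_units=10)
heavy = pick_node([local, remote], payload_bytes=1000, work_units=10000)
```

With these made-up numbers the small job goes to the local node and the heavy one to the remote node, which is the behaviour the question asks for. A batch solver would additionally account for jobs queueing behind each other on the same node.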
The problem you are talking about has been tackled in many different ways in the context of Grid computing (e.g, see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e. is there unexpected external load etc.?]).
Implementing an adaptive job dispatcher usually also requires adjusting the frequency with which you probe the available resources (otherwise the overhead due to probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).
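To make the feedback idea a bit more concrete, here is a small Python sketch. Note that it swaps in a much simpler technique than reinforcement learning: an exponentially weighted moving average of observed execution times per node. The class name, the `alpha` smoothing factor and the node labels are assumptions for illustration only.

```python
class RuntimeEstimator:
    """Keeps a smoothed per-node estimate of execution time (hypothetical helper)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha      # weight given to the newest observation
        self.estimates = {}     # node name -> smoothed seconds per job

    def update(self, node, observed_seconds):
        """Blend a new measurement into the running estimate and return it."""
        old = self.estimates.get(node)
        if old is None:
            new = observed_seconds
        else:
            new = self.alpha * observed_seconds + (1 - self.alpha) * old
        self.estimates[node] = new
        return new

    def best_node(self):
        """Node with the lowest current estimate."""
        return min(self.estimates, key=self.estimates.get)


est = RuntimeEstimator(alpha=0.5)
est.update("lan-1", 2.0)
est.update("lan-1", 4.0)   # estimate moves halfway towards the new value
est.update("wan-1", 1.0)
```

A larger `alpha` reacts faster to changing conditions but is noisier, which ties in with the point about probing frequency: actual job timings are free feedback, so explicit probes are only needed for nodes that have not been used recently.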
There are different consensus algorithms used in permission-oriented blockchains, such as
PAXOS
RAFT
Byzantine General Model
Which of these consensus algorithms are synchronous, which are asynchronous, and why? Please explain in detail.
Thanks
I am not an expert on distributed systems, but I will try to answer your question.
In distributed systems, people use an underlying model that assumes some properties about time (“how long will it take for this message to arrive?”) and some properties about the types of faults (“how can nodes in the protocol do the wrong thing?”).
There are three main types of timing models usually used for distributed systems: the synchronous model, the asynchronous model and the partially synchronous model. Each of these models makes some guarantees about the length of time (“latency”) that can occur between the exchange of messages amongst nodes in a given round of the protocol execution. This categorization is important because in the distributed setting a single node cannot distinguish between a peer node that has failed and a peer node that is just taking a long time to respond.
In the synchronous model, there is some maximum value (“upper bound”) T on the time between when a node sends a message and when you can be certain that the receiving node hears the message. You also have an upper bound P on the relative difference in speed between nodes (so you can account for machines with slow processors).
In the asynchronous model, we remove both upper bounds T and P. Messages can take arbitrarily long to reach peers and each node can take an arbitrary amount of time to respond. When we say arbitrary, we include “infinity” meaning that it takes forever for some event to occur.
The partially synchronous model is a mix of the two: upper bounds exist for T and P but the protocol designer does not know them, and the task is designing mechanisms that still come to consensus in light of this fact. In practice, protocol implementers can achieve systems resembling this model given the realistic characteristics of modern networks/machines (messages usually get where they are going) and the use of tactics like timeouts to indicate when a node should retry sending a message.
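One common way to cope with an unknown bound T is to start with a guess and keep increasing the timeout on every retry until it is large enough. The following Python sketch only simulates that idea; `actual_delay` stands in for the unknown bound and the function name and parameters are made up for illustration.

```python
def attempts_until_success(actual_delay, initial_timeout=0.1, factor=2.0,
                           max_attempts=20):
    """Grow the timeout geometrically until it covers the (unknown) delay.

    Returns (attempt_number, timeout_used). This is a simulation: a real
    protocol would send the message and wait, not compare two numbers.
    """
    timeout = initial_timeout
    for attempt in range(1, max_attempts + 1):
        if actual_delay <= timeout:
            return attempt, timeout
        timeout *= factor   # assume the peer is slow, not dead, and wait longer
    raise TimeoutError("peer did not answer within the allowed attempts")


attempt, timeout = attempts_until_success(actual_delay=0.35)
```

Here the timeouts tried are 0.1, 0.2 and 0.4 seconds, so the third attempt succeeds. Because a finite bound exists in the partially synchronous model, this loop eventually finds a timeout that works; in the asynchronous model no such guarantee exists.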
Keeping the above in mind, both Paxos and Raft belong to the partially synchronous model.
The Byzantine Generals’ Problem is a classic problem faced by any distributed computer system network. The aim is to maintain the same state on all participating nodes in the presence of malicious nodes.
In distributed systems, there is a collection of hard problems that you constantly need to deal with.
1. Things fail. You can never count on anything being reliable. Even if you have perfectly bug-free software, and hardware that never breaks, you’ve still got to deal with the fact that network connections can break, or messages within a network can get lost, or that some bozo might sever your network connection with a bulldozer. (That really happened while I was at Google!)
2. Given (1), you can never rely on one copy of anything, because that copy might become unavailable due to a failure. So you need to keep multiple copies, and those copies need to be consistent – meaning that at any time, all of the copies agree about their contents.
3. There’s no way to maintain a single completely consistent view of time between multiple computers. Due to inconsistencies in individual machine performance, and variable network delays, variable storage latency, and several other factors, there’s no canonical way of saying that for two events X and Y, “X happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.
In short, everything can fail at any time; after failure, participants can recover and rejoin the system; and no part of the system acts in an actively adversarial way (Byzantine failures, by contrast, may be caused by malware, for example).
To solve this problem we have consensus algorithms, whose aim is to make all participants agree on the same state.
Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available.
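The "any majority" condition is easy to state in code. This is just the arithmetic, sketched in Python with hypothetical function names; it says nothing about how a particular algorithm actually reaches agreement.

```python
def majority(n):
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many servers may be down while a majority is still available."""
    return n - majority(n)


def can_make_progress(n, available):
    """True if enough servers are reachable to decide on a value."""
    return available >= majority(n)
```

For example, a cluster of 5 needs 3 servers and survives 2 failures, while a cluster of 4 also needs 3 and survives only 1. That is why such clusters are usually deployed with an odd number of servers.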
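Paxos and Raft are consensus algorithms that solve this problem for crash failures in distributed networks, public or private. They do not solve the Byzantine Generals’ Problem: tolerating actively malicious nodes requires a Byzantine fault tolerant protocol (PBFT is a well-known example).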
Given a cluster of truly heterogeneous compute nodes how is it possible to
distribute processing to them while taking into account both their relative performance
and cost of passing messages between them?
(I know optimising this is NP-complete in general)
Which concurrency platforms currently best support this?
You might rephrase/summarise the question as:
What algorithms make most efficient use of cpu, memory and communications resources for distributed computation in theory and what existing (open source) platforms come closest to realising this?
Obviously this depends somewhat on workload so understanding the trade-offs is critical.
Some Background
I find some on S/O want to understand the background so they can provide a more specific answer, so I've included quite a bit below, but it's not necessary to the essence of the question.
A typical scenario I see is:
We have an application which runs on X nodes
each with Y cores. So we start with a homogeneous cluster.
Every so often the operations team buys one or more new servers.
The new servers are faster and may have more cores.
They are integrated into the cluster to make things run faster.
Some older servers may be re-purposed but the new cluster now contains machines with different performance characteristics.
The cluster is no-longer homogeneous but has more compute power overall.
I believe this scenario must be standard in big cloud data-centres as well.
It's how this kind of change in infrastructure can be best utilised that I'm really interested in.
In one application I work with, the work is divided into a number of relatively long tasks. Tasks are allocated to logical processors (we usually have one per core) as they become available. While there are tasks to perform, cores are generally not idle, and for the most part those jobs can be classified as "embarrassingly scalable".
This particular application is currently C++ with a roll-your-own concurrency platform using ssh and nfs for large tasks.
I'm considering the arguments for various alternative approaches.
Some parties prefer various hadoop map/reduce options. I'm wondering how they shape up versus more C++/machine oriented approaches such as openMP, Cilk++. I'm more interested in the pros and cons than the answer for that specific case.
The task model itself seems scalable and sensible independent of platform.
So, I'm assuming a model where you divide work into tasks and a (probably distributed) scheduler tries to decide which processor to allocate each task to. I am open to alternatives.
There could be task queues for each node, possibly each processor and idle processors should allow work stealing (e.g. from processors with long queues).
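To illustrate what I mean by work stealing, here is a toy single-threaded simulation in Python. It assumes unit-time tasks and proceeds in rounds. The function name and the policy of stealing from the tail of the longest queue are just one possible choice, not a reference to any particular platform.

```python
from collections import deque


def simulate_work_stealing(initial_queues):
    """Run all tasks; idle workers steal from the tail of the longest queue.

    initial_queues: one list of tasks per worker.
    Returns, per worker, the tasks it ended up executing, in order.
    """
    queues = [deque(q) for q in initial_queues]
    done = [[] for _ in queues]
    while any(queues):                      # an empty deque is falsy
        for i, own in enumerate(queues):
            if own:
                done[i].append(own.popleft())      # take from own queue head
            else:
                victim = max(queues, key=len)      # most loaded worker
                if victim:
                    done[i].append(victim.pop())   # steal from its tail
    return done


# Worker 0 starts with everything, worker 1 with nothing.
executed = simulate_work_stealing([[1, 2, 3, 4, 5, 6], []])
```

In this run worker 0 executes tasks 1, 2, 3 and worker 1 steals 6, 5, 4. On heterogeneous nodes the faster machines would simply come back to steal more often, which gives some load balancing without knowing relative speeds in advance. It does not account for the cost of moving a task's data.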
However, when I look at the various models of high performance and cloud cluster computing I don't see this discussed so much.
Michael Wong classifies parallelism, ignoring hadoop, into two main camps (starting around 14min in).
https://isocpp.org/blog/2016/01/the-landscape-of-parallelism-michael-wong-meetingcpp-2015
HPC and multi-threaded applications in industry
The HPC community seems to favour openMP on a cluster of identical nodes.
This may still be heterogeneous if each node supports CUDA or has FPGA support but each node tends to be identical.
If that's the case do they upgrade their data centres in a big bang or what?
(E.g. supercomputer 1 = 100 nodes of type x. supercomputer v2.0 is on a different site with
200 nodes of type y).
OpenMP only supports a single physical computer by itself.
The HPC community gets around this either using MPI (which I consider too low level) or by creating a virtual machine from all the nodes
using a hypervisor like scaleMP or vNUMA (see for example - OpenMP program on different hosts).
(anyone know of a good open source hypervisor for doing this?)
I believe these are still considered the most powerful computing systems in the world.
I find that surprising as I don't see what prevents the map/reduce people creating an even bigger cluster more easily
that is much less efficient overall but wins on brute force due to the total number of cores utilised?
So which other concurrency platforms support truly heterogeneous nodes with widely varying characteristics and how do they deal with the performance mismatch (and similarly the distribution of data)?
I'm excluding MPI as an option as while powerful it is too low-level. You might as well say use sockets. A framework building on MPI would be acceptable (does X10 work this way?).
From the user's perspective the map/reduce
approach seems to be add enough nodes that it doesn't matter and not worry about using them at maximum efficiency.
Actually those details are kept under the hood in the implementation
of the schedulers and distributed file systems.
How/where is the cost of computation and message passing taken into account?
Is there any way in openMP (or your favourite concurrency platform)
to make effective use of information that this node is N times as fast as this node and the data transfer rate
to or from this node is on average X Mb/s?
In YARN you have Dominant Resource Fairness:
http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
This covers memory and cores using Linux Control Groups but it does not yet
cover disk and network I/O resources.
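For reference, the core of DRF as I understand it from the paper can be sketched in a few lines of Python. This is a simplified illustration, not YARN's implementation: it hands out one task at a time to the user with the lowest dominant share among those whose next task still fits. The function name and the dict layout are my own, and it assumes every task demands a positive amount of at least one resource.

```python
def drf_allocate(capacity, demands):
    """Simplified Dominant Resource Fairness.

    capacity: resource name -> total amount in the cluster
    demands:  user -> {resource name -> amount needed per task}
    Returns user -> number of tasks allocated.
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(tasks[user] * demands[user][r] / capacity[r] for r in capacity)

    def fits(user):
        return all(used[r] + demands[user][r] <= capacity[r] for r in capacity)

    while True:
        candidates = [u for u in demands if fits(u)]
        if not candidates:
            return tasks
        user = min(candidates, key=dominant_share)
        for r in capacity:
            used[r] += demands[user][r]
        tasks[user] += 1


# The example from the DRF paper: 9 CPUs and 18 GB of memory.
allocation = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
```

With these inputs user A gets 3 tasks and user B gets 2, so both end up with a dominant share of 2/3. Notice that nothing in here knows about node speed or network cost, which is exactly the gap I'm asking about.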
Are there equivalent or better approaches in other concurrency platforms? How do they compare to DRF?
Which concurrency platforms handle this best and why?
Are there any popular ones that are likely to be evolutionary dead ends?
OpenMP keeps surprising me by actively thriving. Could something like Cilk++ be made to scale this way?
Apologies in advance for combining several PhD theses' worth of questions into one.
I'm basically looking for tips on what to look for in further reading
and advice on which platforms to investigate further (from the programmer's perspective).
A good summary of some platforms to investigate and/or links to papers or articles would suffice as a useful answer.
Is it logical to say: "If average service time for a request is X and affordable waiting time for the requests is Y then maximum number of concurrent requests to serve would be Y / X" ?
I think what I'm asking is whether there are any hidden factors that I'm not taking into account.
If you're talking specifically about webservers, then no, your formula doesn't work, because webservers are designed to handle multiple, simultaneous requests, using forking or threading.
This turns the formula into something far harder to quantify - in my experience, web servers can handle LOTS (i.e. hundreds or thousands) of concurrent requests which consume little or no time, but tend to reduce that concurrency quite dramatically as the requests consume more time.
That means that "average service time" isn't massively useful - it can hide wide variations, and it's actually the outliers that affect you the most.
Broadly yes, but your service provider (webserver in your case) is capable of handling more than one request in parallel, so you should take that into account. I assume you measured end-to-end service time and haven't already averaged it by the number of parallel streams. One other thing you didn't and cannot realistically measure is the delay to/from your website.
What you are heading towards is the Erlang unit (not the language of the same name), which is used to describe how much load a system can take. Erlangs are dimensionless (it is just a number) and originated in old-school telephony (POTS), where they were used to describe how many wires were needed to handle X calls per time period with low blocking probability. Beyond Erlang is Engset, which is used more for high capacity systems, such as mobile systems.
It also gets used for expensive consultant reports into realtime computer systems and databases to describe the point at which performance degradation is likely to occur. Wikipedia has an article on this http://en.wikipedia.org/wiki/Erlang_(unit) and the book 'Fixed and mobile telecommunications, network systems and services' has a good chapter on performance analysis.
While aimed at telephone systems, just replace with word webserver and it behaves the same. A webserver is the same concept, load is offered that arrives at random intervals to a system with finite parallel capacity. In your case, you can probably calculate total load with load tools easier than parallel capacity and then back calculate the formulas. This is widely done to gain a level of confidence in overall system models.
Erlang/Engset formulas are really useful when you have a randomly arriving load over parallel streams (i.e. web requests) and a service time that can only be averaged or estimated (i.e. it varies in real life). You can then calculate the blocking probability, which is the probability a new request will need to wait while current requests are serviced, and how long it will wait. It also helps analyse whether you need to handle more requests in parallel, or make each one faster (number of lines and holding time in Erlang speak).
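If you want to play with the numbers, the two standard formulas are short enough to write out. This Python sketch uses the usual recursive form of Erlang B (blocked requests are dropped) and derives Erlang C (blocked requests wait in a queue) from it. The function names are mine; the offered load is the arrival rate multiplied by the average service time, in the same time unit.

```python
def erlang_b(offered_load, servers):
    """Probability that a request is blocked when blocked requests are lost."""
    b = 1.0
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


def erlang_c(offered_load, servers):
    """Probability that a request has to wait when blocked requests queue."""
    if offered_load >= servers:
        return 1.0   # overloaded: the queue grows without bound
    b = erlang_b(offered_load, servers)
    return servers * b / (servers - offered_load * (1.0 - b))


# Example: 50 requests/s at 0.2 s average service time is 10 erlangs.
load = 50 * 0.2
p_wait_12 = erlang_c(load, 12)
p_wait_20 = erlang_c(load, 20)
```

Comparing `p_wait_12` and `p_wait_20` shows how quickly the chance of waiting drops as you add parallel capacity. Both formulas assume random (Poisson) arrivals, so treat the results as a rough guide for real web traffic.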
You will probably look into queuing systems analysis next; as soon as requests block (queue), the models change slightly.
Many factors are not taken into account:
memory limits
data locking constraints such as people wanting to update the same data
application latency
caching mechanisms
different users will have different tasks on the site and put different loads
That said, one easy way to get a rough estimate is with apache ab tool (apache benchmark)
Example: fetch the homepage 1000 times, with 100 requests at a time:
ab -c 100 -n 1000 http://www.example.com/
I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large compared to the actual time each operation takes that simply creating a task for each operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split up the number of operations into nice chunks where each task operates on a chunk. But how can I determine the appropriate number of tasks/chunks?
Testing, and profiling. What makes sense, and what works well is application specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
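Here is a rough sketch of such an automatic sweep. It is in Python with a thread pool rather than GCD, purely to show the shape of the benchmark; `process_chunk` is a placeholder for your real per-chunk work, and all names are hypothetical. Be aware that Python threads do not run CPU-bound code in parallel, so the absolute numbers from this sketch mean little; the structure is what carries over.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Placeholder workload: replace with the real operation."""
    return [x * x for x in chunk]


def measure(items, workers, chunk_size):
    """Return throughput in items per second for one combination."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_chunk, chunked(items, chunk_size)))
    elapsed = max(time.perf_counter() - start, 1e-9)   # avoid division by zero
    return len(items) / elapsed


def sweep(items, worker_counts, chunk_sizes):
    """Try every combination and return (best_combination, all_results)."""
    results = {(w, c): measure(items, w, c)
               for w in worker_counts for c in chunk_sizes}
    best = max(results, key=results.get)
    return best, results


best, results = sweep(list(range(10000)), [1, 2, 4], [10, 100, 1000])
```

Run the sweep several times and on the target hardware, since a single timing is noisy and the best combination on one machine may not be the best on another.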
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time, while some take X*3 time), then you can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data will give you an idea of whether or not that task will take X time, or X*3 (or something in between) you can use that information before processing the tasks themselves to dynamically adjust the worker/task size to achieve the best throughput depending on current workload. This approach is taken with Amazon EC2 where customers will spin-up extra VMs when needed to handle higher load, and spin them back down when load drops, for example.
Whatever you choose, any unknown speed issue should almost always involve some kind of demo benchmarking, if the speed at which it runs is critical to the success of your application (sometimes the time to process is so small, that it's negligible).
Good luck!
I need to train a neural network with 2-4 hidden layers; I'm not sure yet about the structure of the actual net. I was thinking of training it using Hadoop map reduce (cluster of 12 PCs) or a GPU in order to get faster results. Which do you think would be better? Also, are there any available libraries that have these already implemented?
Thanks
I've been lucky to work in a lab which has dabbled in both of these methods for training networks, and while both are useful in very computationally expensive settings, the location of the computational bottleneck usually determines which method to use.
Training a network using a distributed system (e.g. HADOOP)
This is useful when your network is large enough that the matrix multiplications involved in training become unwieldy on a traditional PC. This problem is particularly prevalent when you have harsh time constraints (e.g. online training), as otherwise the hassle of a HADOOP implementation isn't worth it (just run the network overnight). If you're thinking about HADOOP because you want to fiddle with network parameters and not have to wait a day before fiddling some more (frequently the case in my lab), then simply run multiple instances of the network with different parameters on different machines. That way you can make use of your cluster without dealing with actual distributed computation.
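The "multiple instances with different parameters" approach needs almost no infrastructure. As a minimal sketch in Python, assuming hypothetical parameter names and machine names, you can enumerate the parameter combinations and deal them out round-robin to the machines in the cluster; how each machine actually launches its training runs is left out.

```python
from itertools import product


def assign_configs(grid, machines):
    """Deal every parameter combination out to the machines in turn.

    grid:     parameter name -> list of values to try
    machines: list of machine names
    Returns machine name -> list of parameter dicts to train with.
    """
    names = list(grid)
    configs = [dict(zip(names, values)) for values in product(*grid.values())]
    plan = {m: [] for m in machines}
    for i, config in enumerate(configs):
        plan[machines[i % len(machines)]].append(config)
    return plan


# Hypothetical search space for a net with 2-4 hidden layers on 4 machines.
plan = assign_configs(
    {"learning_rate": [0.1, 0.01], "hidden_layers": [2, 3, 4]},
    ["pc1", "pc2", "pc3", "pc4"],
)
```

Each machine then trains its own list independently and reports back its validation error, so no communication is needed during training.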
Example:
You're training a network to find the number of people in images. Instead of a predefined set of training examples (image-number of people pairs) you decide to have the program pull random images from Google. While the network is processing the image, you must view the image and provide feedback on how many people are actually in the image. Since this is image processing, your network size is probably on the scale of millions of units. And since you're providing the feedback in real time the speed of the network's computations matters. Thus, you should probably invest in a distributed implementation.
Training a network on a GPU
This is the right choice if the major computational bottleneck isn't the network size, but the size of the training set (though the networks are still generally quite large). Since GPUs are ideal for situations involving applying the same vector/matrix operation across a large number of data sets, they are mainly used when you can use batch training with a very large batch size.
Example:
You're training a network to answer questions posed in natural language. You have a huge database of question-answer pairs and don't mind the network only updating its weights every 10000 questions. With such a large batch size and presumably a rather large network as well, a GPU based implementation would be a good idea.