MPI Alltoallv or better individual Send and Recv? (Performance)

I have a number of processes (of the order of 100 to 1000), and each of them has to send some data to some (say about 10) of the other processes. (Typically, but not necessarily always, if A sends to B, B also sends to A.) Every process knows how much data it has to receive from which process.
So I could just use MPI_Alltoallv, with many or most of the message lengths zero.
However, I heard that for performance reasons it would be better to use several MPI_Send and MPI_Recv communications rather than the global MPI_Alltoallv.
What I do not understand: if a series of send and receive calls is more efficient than one Alltoallv call, why is Alltoallv not simply implemented as a series of sends and receives?
It would be much more convenient for me (and others?) to use just one global call. Also, I might have to worry about running into a deadlock situation with several Sends and Recvs (fixable by some odd-even strategy, or something more complex? or by using buffered sends/receives?).
Would you agree that MPI_Alltoallv is necessarily slower than the, say, 10 MPI_Send and MPI_Recv calls; and if yes, why and how much?

Usually the default advice with collectives is the opposite: use a collective operation when possible instead of coding your own. The more information the MPI library has about the communication pattern, the more opportunities it has to optimize internally.
Unless special hardware support is available, collective calls are in fact implemented internally in terms of sends and receives. But the actual communication pattern will probably not be just a series of sends and receives. For example, using a tree to broadcast a piece of data can be faster than having the same rank send it to a bunch of receivers. A lot of work goes into optimizing collective communications, and it is difficult to do better.
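To see why a tree helps, here is a toy sketch (in Go, matching the other snippets in this thread, rather than C/MPI; it only prints a send schedule and performs no communication) of the recursive-doubling pattern many broadcast implementations use: rank 0 reaches n ranks in ceil(log2 n) rounds instead of issuing n-1 sends itself.

package main

import "fmt"

// Toy schedule for a recursive-doubling broadcast among n ranks:
// in round r, every rank below 2^r that already holds the data
// sends it to rank + 2^r, so the set of informed ranks doubles
// each round.
func main() {
    n := 10
    has := make([]bool, n)
    has[0] = true // the root starts with the data
    for round, step := 0, 1; step < n; round, step = round+1, step*2 {
        for i := 0; i < step && i+step < n; i++ {
            if has[i] {
                fmt.Printf("round %d: rank %d -> rank %d\n", round, i, i+step)
                has[i+step] = true
            }
        }
    }
}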
Having said that, MPI_Alltoallv is somewhat different. It can be difficult to optimize for all irregular communication scenarios at the MPI level, so it is conceivable that some custom communication code can do better. For example, an implementation of MPI_Alltoallv might be synchronizing: it could require that all processes "check in", even if they have to send a 0-length message. I thought that such an implementation was unlikely, but here is one in the wild.
So the real answer is "it depends". If the library implementation of MPI_Alltoallv is a bad match for the task, custom communication code will win. But before going down that path, check if the MPI-3 neighbor collectives are a good fit for your problem.
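As for the deadlock worry in the question: for a symmetric pair, the odd-even ordering does break the cycle. A toy model (in Go rather than MPI, with unbuffered channels standing in for synchronous sends; the two "ranks" are invented for illustration):

package main

import "fmt"

// If both peers send first on a synchronous link, both block forever.
// Ordering by parity - even rank sends first, odd rank receives first -
// removes the cyclic wait.
func exchange(id int, out chan<- int, in <-chan int) {
    if id%2 == 0 {
        out <- id // even: send, then receive
        fmt.Printf("rank %d got %d\n", id, <-in)
    } else {
        fmt.Printf("rank %d got %d\n", id, <-in) // odd: receive, then send
        out <- id
    }
}

func main() {
    ab := make(chan int) // rank 0 -> rank 1
    ba := make(chan int) // rank 1 -> rank 0
    done := make(chan bool)
    go func() { exchange(0, ab, ba); done <- true }()
    go func() { exchange(1, ba, ab); done <- true }()
    <-done
    <-done
}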

Related

Golang main difference from CSP-Language by Hoare

Look at this statement, taken from "The examples from Tony Hoare's seminal 1978 paper":
Go's design was strongly influenced by Hoare's paper. Although Go differs significantly from the example language used in the paper, the examples still translate rather easily. The biggest difference apart from syntax is that Go models the conduits of concurrent communication explicitly as channels, while the processes of Hoare's language send messages directly to each other, similar to Erlang. Hoare hints at this possibility in section 7.3, but with the limitation that "each port is connected to exactly one other port in another process", in which case it would be a mostly syntactic difference.
I'm confused.
Processes in Hoare's language communicate directly with each other. Goroutines also communicate directly with each other, but using channels.
So what impact does the limitation have in Go? What is the real difference?
The answer requires a fuller understanding of Hoare's work on CSP. The progression of his work can be summarised in three stages:
Based on Dijkstra's semaphores, Hoare developed monitors. These are as used in Java, except that Java's implementation contains a mistake (see Welch's article Wot No Chickens). It's unfortunate that Java ignored Hoare's later work.
CSP grew out of this. Initially, CSP required direct exchange from process A to process B. This rendezvous approach is used by Ada and Erlang.
CSP was completed by 1985, when his book was first published. This final version of CSP includes channels as used in Go. Along with Hoare's team at Oxford, David May concurrently developed Occam, a language deliberately intended to blend CSP into a practical programming language. CSP and Occam influenced each other (for example in The Laws of Occam Programming). For years, Occam was only available on the Transputer processor, which had its architecture tailored to suit CSP. More recently, Occam has developed to target other processors and has also absorbed the Pi calculus, along with other general synchronisation primitives.
So, to answer the original question, it is probably helpful to compare Go with both CSP and Occam.
Channels: CSP, Go and Occam all have the same semantics for channels. In addition, Go makes it easy to add buffering into channels (Occam does not).
Choices: CSP defines both the internal and external choice. However, both Go and Occam have a single kind of selection: select in Go and ALT in Occam. The fact that there are two kinds of CSP choice proved to be less important in practical languages.
Occam's ALT allows condition guards, but Go's select does not (there is a workaround: channel aliases can be set to nil to imitate the same behaviour).
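A minimal sketch of that workaround (the channel names are invented): a receive from a nil channel blocks forever, so nilling the variable disables its case, much like a guard that has become false.

package main

import "fmt"

func main() {
    a := make(chan int)
    b := make(chan int)
    go func() { a <- 1; close(a) }()
    go func() { b <- 2; close(b) }()

    // Loop until both "guards" are false, i.e. both channels disabled.
    for a != nil || b != nil {
        select {
        case v, ok := <-a:
            if !ok {
                a = nil // a is exhausted: disable this case
                continue
            }
            fmt.Println("from a:", v)
        case v, ok := <-b:
            if !ok {
                b = nil // b is exhausted: disable this case
                continue
            }
            fmt.Println("from b:", v)
        }
    }
}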
Mobility: Go allows channel ends to be sent (along with other data) via channels. This creates a dynamically-changing topology and goes beyond what is possible in CSP, but Milner's Pi calculus was developed (out of his CCS) to describe such networks.
Processes: A goroutine is a forked process; it terminates when it wants to and it doesn't have a parent. This is less like CSP / Occam, in which processes are compositional.
An example will help here: firstly Occam (n.b. indentation matters)
SEQ
  PAR
    processA()
    processB()
  processC()
and secondly Go
go processA()
go processB()
processC()
In the Occam case, processC doesn't start until both processA and processB have terminated. In Go, processA and processB fork very quickly, then processC runs straightaway.
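To recover the Occam behaviour in Go you have to code the join explicitly; a common sketch uses sync.WaitGroup (one way among several):

package main

import "sync"

func processA() {}
func processB() {}
func processC() {}

// processC only starts after processA and processB have both
// terminated, mimicking Occam's SEQ(PAR(A, B), C) composition.
func main() {
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); processA() }()
    go func() { defer wg.Done(); processB() }()
    wg.Wait() // end of the "PAR"
    processC()
}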
Shared data: CSP is not really concerned with data directly. But it is interesting to note there is an important difference between Go and Occam concerning shared data. When multiple goroutines share a common set of data variables, race conditions are possible; Go's excellent race detector helps to eliminate problems. But Occam takes a different stance: shared mutable data is prevented at compilation time.
Aliases: related to the above, Go allows many pointers to refer to each data item. Such aliases are disallowed in Occam, so reducing the effort needed to detect race conditions.
The latter two points are less about Hoare's CSP and more about May's Occam. But they are relevant because they directly concern safe concurrent coding.
That's exactly the point: in the example language used in Hoare's initial paper (and also in Erlang), process A talks directly to process B, while in Go, goroutine A talks to channel C and goroutine B listens to channel C. I.e. in Go the channels are explicit while in Hoare's language and Erlang, they are implicit.
See this article for more info.
Recently, I've been working quite intensively with Go's channels, and have been working with concurrency and parallelism for many years, although I could never profess to know everything about this.
I think what you're asking is what's the subtle difference between sending a message to a channel and sending directly to each other? If I understand you, the quick answer is simple.
Sending to a channel gives the opportunity for parallelism / concurrency on both sides of the channel. Beautiful, and scalable.
We live in a concurrent world. Sending a long continuous stream of messages from A to B (asynchronously) means that B will need to process the messages at pretty much the same pace as A sends them, unless more than one instance of B has the opportunity to process a message taken from the channel, hence sharing the workload.
The good thing about channels is that you can have a number of producer/receiver goroutines which are able to push messages onto the queue, or consume from the queue and process them accordingly.
If you think linearly, like a single-core CPU, concurrency is basically like having a million jobs to do. A single-core CPU can only do one thing at a time, and yet it gives the illusion that lots of things are happening at the same time. When executing some code, the time the OS spends waiting for something to come back from the network, disk, keyboard, mouse, etc., or for some process which sleeps for a while, gives it the opportunity to do something else in the meantime. This all happens extremely quickly, creating the illusion of parallelism.
Parallelism, on the other hand, is different in that a job can run on a completely different CPU, independent of what's going on with the other CPUs, and therefore doesn't run under the same constraints (although most OSes do a pretty good job of ensuring workloads are evenly distributed across all of their CPUs - with perhaps the exception of CPU-hungry, uncooperative, non-OS-yielding code, but even then the OS tames it).
The point is, having multi-core CPUs means more parallelism and more concurrency can occur.
Imagine a single queue at a bank which fans-out to a number of tellers who can help you. If no customers are being served by any teller, one teller elects to handle the next customer and becomes busy, until they all become busy. Whenever a customer walks away from a teller, that teller is able to handle the next customer in the queue.
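That bank queue maps directly onto one channel with several consumer goroutines; a minimal sketch (three tellers and ten customers are arbitrary choices):

package main

import (
    "fmt"
    "sync"
)

// One queue (channel), several tellers (goroutines): whichever teller
// is free receives the next customer from the shared channel.
func main() {
    customers := make(chan int)
    var wg sync.WaitGroup
    for t := 1; t <= 3; t++ {
        wg.Add(1)
        go func(teller int) {
            defer wg.Done()
            for c := range customers {
                fmt.Printf("teller %d serves customer %d\n", teller, c)
            }
        }(t)
    }
    for c := 1; c <= 10; c++ {
        customers <- c // customers join the queue
    }
    close(customers) // bank closes; tellers finish when the queue drains
    wg.Wait()
}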

Order of Goroutine Unblocking on Single Channel

Does the order in which Goroutines block on a channel determine the order in which they will unblock? I'm not concerned with the order of the messages that are sent (they're guaranteed to be ordered), but with the order of the Goroutines that'll unblock.
Imagine an empty channel ch shared between multiple Goroutines (1, 2, and 3), with each Goroutine trying to receive a message on ch. Since ch is empty, each Goroutine will block. When I send a message to ch, will Goroutine 1 unblock first? Or could 2 or 3 possibly receive the first message? (Or vice versa, with the Goroutines trying to send.)
I have a playground that seems to suggest that the order in which Goroutines block is the order in which they are unblocked, but I'm not sure if this is an undefined behavior because of the implementation.
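The experiment looks something like this (a sketch, not the exact playground; the sleeps are a crude way to control the order in which the goroutines block, and the output order reflects the current implementation, not any guarantee):

package main

import (
    "fmt"
    "time"
)

func main() {
    ch := make(chan int)
    // Start receivers one at a time so they block on ch in a known order.
    for i := 1; i <= 3; i++ {
        go func(id int) {
            fmt.Printf("goroutine %d received %d\n", id, <-ch)
        }(i)
        time.Sleep(10 * time.Millisecond)
    }
    for v := 1; v <= 3; v++ {
        ch <- v
    }
    time.Sleep(100 * time.Millisecond) // let the prints finish
}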
This is a good question - it touches on some important issues when doing concurrent design. As has already been stated, the answer to your specific question is, according to the current implementation, FIFO based. It's unlikely ever to be different, except perhaps if the implementers decided, say, a LIFO was better for some reason.
There is no guarantee, though. So you should avoid creating code that relies on a particular implementation.
The broader question concerns non-determinism, fairness and starvation.
Perhaps surprisingly, non-determinism in a CSP-based system does not come from things happening in parallel. Concurrency makes it possible, but does not by itself cause it. Instead, non-determinism arises when a choice is made. In the formal algebra of CSP, this is modelled mathematically. Fortunately, you don't need to know the maths to be able to use Go. But formally, two goroutines could execute in parallel and the outcome could still be deterministic, provided all the choices are eliminated.
Go allows choices that introduce non-determinism explicitly via select and implicitly via ends of channels being shared between goroutines. If you have point-to-point (one reader, one writer) channels, the second kind does not arise. So if it's important in a particular situation, you have a design choice you can make.
Fairness and starvation are typically opposite sides of the same coin. Starvation is one of those dynamic problems (along with deadlock, livelock and race conditions) that result perhaps in poor performance, but more likely in wrong behaviour. These dynamic problems are un-testable (more on this) and need some level of analysis to solve. Clearly, if part of a system is unresponsive because it is starved of access to certain resources, then there is a need for greater fairness in governing those resources.
Shared access to channel ends may well provide a degree of fairness because of the current FIFO behaviour and this may appear sufficient. But if you want it guaranteed (regardless of implementation uncertainties), it is possible instead to use a select and a bundle of point-to-point channels in an array. Fair indexing is easy to achieve by always preferring them in an order that puts the last-selected at the bottom of the pile. This solution can guarantee fairness, but probably with a small performance penalty.
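A sketch of that fair-indexing idea (the polling pass plus reflect.Select is just one way to express an ordered preference, since Go's native select never prioritises its cases):

package main

import (
    "fmt"
    "reflect"
)

// fairRecv receives one value, preferring channels in rotated order
// starting at start. The caller advances start past whichever channel
// was served, so the last-selected ends up at the bottom of the pile.
func fairRecv(chs []chan int, start int) (int, int) {
    n := len(chs)
    for i := 0; i < n; i++ { // non-blocking pass in preference order
        j := (start + i) % n
        select {
        case v := <-chs[j]:
            return j, v
        default:
        }
    }
    cases := make([]reflect.SelectCase, n) // nothing ready: block on all
    for i, ch := range chs {
        cases[i] = reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ch)}
    }
    j, v, _ := reflect.Select(cases)
    return j, int(v.Int())
}

func main() {
    chs := []chan int{make(chan int, 1), make(chan int, 1), make(chan int, 1)}
    for i, ch := range chs {
        ch <- 10 + i
    }
    start := 0
    for k := 0; k < 3; k++ {
        j, v := fairRecv(chs, start)
        fmt.Printf("served channel %d: %d\n", j, v)
        start = (j + 1) % len(chs) // last-selected goes to the bottom
    }
}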
(aside: see "Wot No Chickens" for a somewhat-amusing discovery made by researchers in Canterbury, UK concerning a fairness flaw in the Java Virtual Machine - which has never been rectified!)
I believe it's unspecified because the memory model document only says "A send on a channel happens before the corresponding receive from that channel completes." The spec sections on send statements and the receive operator don't say anything about what unblocks first. Right now the gc toolchain uses an orderly FIFO queue to control which goroutine unblocks, but I don't see any promises in the spec that it must always be so.
(Just for general background, note that Playground code runs with GOMAXPROCS=1, i.e., on one core, so some types of concurrency-related unpredictability just won't come up.)
The order is not specified, but current implementations use a FIFO queue for waiting goroutines.
The authoritative document is the Go Memory Model. The memory model does not define a happens-before relationship for two goroutines sending to the same channel, therefore the order is not specified. Ditto for receive.

How do you mitigate proposal-number overflow attacks in Byzantine Paxos?

I've been doing a lot of research into Paxos recently, and there's one thing I've always wondered about that I'm not seeing any answers to, which means I have to ask.
Paxos includes an increasing proposal number (and possibly also a separate round number, depending on who wrote the paper you're reading). And of course, two would-be leaders can get into duels where each tries to out-increment the other in a vicious cycle. But as I'm working in a Byzantine, P2P environment, it makes me wonder what to do about proposers that would attempt to set the proposal number extremely high - for example, to the maximum 32-bit or 64-bit value.
How should a language-agnostic, platform-agnostic Paxos-based protocol deal with integer maximums for proposal number and/or round number? Especially intentional/malicious cases, which make the modular-arithmetic approach of overflowing back to 0 a bit unattractive?
From what I've read, I think this is still an open question that isn't addressed in literature.
Byzantine Proposer Fast Paxos addresses denial of service, but only of the sort that would delay message sending through attacks not related to flooding with incrementing (proposal) counters.
Having said that, integer overflow is probably the least of your problems. Instead of thinking about integer overflow, you might want to consider membership attacks first (via DoS). Learning about membership after consensus from several nodes may be a viable strategy, but probably still vulnerable to Sybil attacks at some level.
Another strategy may be to incorporate some proof-of-work system for proposals to limit the flood of requests. However, it's difficult to know what metric to balance the cost of that work against (for example, the free currency you get when you mine the blockchain in Bitcoin). It really depends on what type of system you're trying to build. You should consider the value of information in your system, then create a proof-of-work system that requires slightly more cost to circumvent.
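A toy version of such a proposal proof-of-work (SHA-256 over proposal-plus-nonce with a leading-zero-bits difficulty is an assumed construction, borrowed from hashcash-style schemes, not something from the Paxos literature):

package main

import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
    "math/bits"
)

// mine finds a nonce such that SHA-256(proposal || nonce) starts with
// at least `difficulty` zero bits (difficulty <= 64 here). Verifying
// costs one hash; producing it costs ~2^difficulty hashes on average,
// which rate-limits a flood of proposals.
func mine(proposal []byte, difficulty int) uint64 {
    buf := make([]byte, len(proposal)+8)
    copy(buf, proposal)
    for nonce := uint64(0); ; nonce++ {
        binary.BigEndian.PutUint64(buf[len(proposal):], nonce)
        sum := sha256.Sum256(buf)
        if bits.LeadingZeros64(binary.BigEndian.Uint64(sum[:8])) >= difficulty {
            return nonce
        }
    }
}

func main() {
    fmt.Println("nonce:", mine([]byte("proposal 42 from node A"), 16))
}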
However, once you have the ability to slow down a proposal counter, you still need to worry about integer maximums in any system with a high number of (valid) operations. You should have a strategy for number wrapping, or a multiple-precision scheme in place, and you should be able to clearly determine how many years/decades your network can run without blowing out a fixed-precision counter. If you can determine that your system will run for 100 years (or whatever) without blowing out your fixed-precision counter, even with malicious entities, then you can choose to simplify things.
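If you do accept wrapping, the comparison itself must wrap too; serial number arithmetic in the style of RFC 1982 is the usual trick. A sketch (it assumes legitimate proposal numbers never diverge by half the number space - a Byzantine proposer can still jump far ahead, which is what the rate-limiting above is for):

package main

import "fmt"

// laterProposal reports whether a is "after" b under wraparound:
// a is later if the forward distance from b to a, computed with
// unsigned subtraction mod 2^32, is less than half the number space.
func laterProposal(a, b uint32) bool {
    return a != b && a-b < 1<<31
}

func main() {
    fmt.Println(laterProposal(5, 3))          // true
    fmt.Println(laterProposal(3, 5))          // false
    fmt.Println(laterProposal(2, 0xFFFFFFF0)) // true: 2 wrapped past the max
}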
On another (important) note, the system model used in most papers doesn't reflect everything that makes a real-life implementation practical (Raft is a nice exception to this). If anything, some authors are guilty of creating a system model that is designed to avoid a hard problem that they haven't found an answer to. So, if someone says that X will solve everything, please be aware that they only mean it solves everything in the very specific system model that they defined. On the other side of this, you should consider that any statement of the form "Y is impossible" is also closely tied to a system model. A nice example to explain this concept is the completely asynchronous message passing of the Ben-Or consensus algorithm, which uses nondeterminism in the system model's state machine to avoid the limits specified by the FLP impossibility result (which specifies that consensus requires partially asynchronous message passing when the system model's state machine is deterministic).
So, you should continue to consider the "impossible" after you read a proof that says it can't be done. Nancy Lynch did a nice writeup on this concept.
I guess what I'm really saying is that a good solution to your question doesn't really exist yet. If you figure it out, please publish it (or let me know if you find an existing paper).

When not to use MPI

This is not a question on a specific technical coding aspect of MPI. I am NEW to MPI, and not wanting to make a fool of myself by using the library in the wrong way, I am posting the question here.
As far as I understand, MPI is an environment for building parallel applications on a distributed-memory model.
I have a system that's interconnected with InfiniBand, for the sole purpose of doing some very time-consuming operations. I've already broken the algorithm out to do it in parallel, so I am really only using MPI to transmit data (results of the intermediate steps) between multiple nodes over InfiniBand, which I believe one could simply use OpenIB to do.
Am I using MPI the right way? Or am I bending the original intention of the system?
It's fine to use just MPI_Send & MPI_Recv in your algorithm. As your algorithm evolves, you gain more experience, etc., you may find use for the more "advanced" MPI features such as barriers and collective communication such as Gather, Reduce, etc.
The fewer and simpler the MPI constructs you need to use to get your work done, the better a match MPI is to your problem -- you can say that about most libraries and languages, as a practical matter and arguably a matter of abstractions.
Yes, you could write raw OpenIB calls to do your work too, but what happens when you need to move to an ethernet cluster, or huge shared-memory machine, or whatever the next big interconnect is? MPI is middleware, and as such, one of its big selling points is that you don't have to spend time writing network-level code.
At the other end of the complexity spectrum, the time not to use MPI is when your problem or solution technique presents enough dynamism that MPI usage (most specifically, its process model) is a hindrance. A system like Charm++ (disclosure: I'm a developer of Charm++) lets you do problem decomposition in terms of finer grained units, and its runtime system manages the distribution of those units to processors to ensure load balance, and keeps track of where they are to direct communication appropriately.
Another not-uncommon issue is dynamic data access patterns, where something like Global Arrays or a PGAS language would be much easier to code.

Distributed array in MPI for parallel numerics

In many distributed-computing applications, you maintain a distributed array of objects. Each process manages a set of objects that it may read and write exclusively, and furthermore a set of objects that it may only read (the content of which is authored by, and frequently received from, other processes).
This is very basic and is likely to have been done a zillion times until now - for example, with MPI. Hence I suppose there is something like an open-source extension for MPI, which provides the basic capabilities of a distributed array for computing.
Ideally, it would be written in C(++) and mimic the official MPI standard interface style. Does anybody know anything like that? Thank you.
From what I gather from your question, you're looking for a mechanism for allowing a global view (read-only) of the problem space, but each process has ownership (read-write) of a segment of the data.
MPI is simply an API specification for inter-process communication for parallel applications and any implementation of it will work at a level lower than what you are looking for.
It is quite common in HPC applications to perform data decomposition in the way you mentioned, with MPI used to synchronise the shared data with other processes. However, each application has different sharing patterns and requirements (some may wish to exchange only halo regions with neighbouring nodes, perhaps using non-blocking calls to overlap communication with computation), so as to improve performance by making use of knowledge of the problem domain.
The thing is, using MPI to sync data across processes is simple, but implementing a layer above it to handle general-purpose distributed array synchronisation that is easy to use yet flexible enough to handle different use cases can be rather tricky.
Apologies for taking so long to get to the point, but to answer your question, AFAIK there isn't an extension to MPI, or a library, that can efficiently handle all use cases while still being easier to use than MPI itself. However, it is possible to work above the level of MPI while maintaining distributed data. For example:
Use the PGAS model to work with your data. You can then use libraries such as Global Arrays (interfaces for C, C++, Fortran, Python) or languages that support PGAS such as UPC or Co-Array Fortran (soon to be included in the Fortran standard). There are also languages designed specifically for this form of parallelism, e.g. Fortress, Chapel, X10.
Roll your own. For example, I've worked on a library that uses MPI to do all the dirty work but hides the complexity by providing custom data types for the application domain, exposing APIs such as:
X_Create(MODE, t_X) : instantiate the array, called by all processes with the MODE indicating if the current process will require READ-WRITE or READ-ONLY access
X_Sync_start(t_X) : non-blocking call to initiate synchronisation in the background.
X_Sync_complete(t_X) : called when the data is required; blocks if synchronisation has not completed.
... and other calls to delete data as well as perform domain specific tasks that may require MPI calls.
To be honest, in most cases it is often simpler to stick with basic MPI or OpenMP or, if one exists, to use a parallel solver written for the application domain. This of course depends on your requirements.
For dense arrays, see Global Arrays and Elemental (Google will find them for you).
For sparse arrays, see PETSc.
I know this is a really short answer, but there is too much documentation of these elsewhere to bother repeating it.
