MPI sends that "can't be buffered by the network"

I was following a tutorial on http://mpitutorial.com/tutorials/point-to-point-communication-application-random-walk/
and there is a sentence that says " If the sends eventually can’t be buffered by the network, they will block until a matching receive is posted."
When is MPI_Send() not able to be buffered by the network?
I would guess this question has been asked before but I honestly don't know what keywords to search for here.

Long story short, it depends.
Longer story, it depends on your MPI library, the message size, the interconnect being used, how many messages have been buffered before, the tuning options you are using, and other factors.
Keep in mind that a program that assumes MPI_Send() will return even when no matching receive has been posted is incorrect with respect to the MPI standard.
Another way to put it: a correct MPI program will not deadlock if you replace every MPI_Send() with MPI_Ssend().

Related

Why is a buffer byte array of length 512, aka [512]byte, preferred in Go for reading and writing?

I was studying Go from "Network Programming with Go" by Jan Newmarch, and I noticed that almost all his examples use [512]byte as the buffer for reading from and writing to a connection.
I tried to search online but failed to get an answer. I suspect it might have something to do with I/O, but I am not sure what the exact reason behind this design is.
Could anyone elaborate a bit on the choice of the buffer size?
Some sample code from the book:
func handleConn(c net.Conn) {
    defer c.Close()
    var buf [512]byte
    for {
        // Read up to 512 bytes from the connection.
        n, err := c.Read(buf[0:])
        if err != nil {
            return
        }
        // Echo back only the n bytes that were actually read.
        _, err2 := c.Write(buf[:n])
        if err2 != nil {
            return
        }
    }
}
Not a direct answer but some background in addition to what other folks stated in comments.
The Go types which wrap files and sockets are relatively thin, in the sense that any Read() and Write() call done on them results in a syscall being performed (with sockets it's a bit more involved, as they use async I/O via the system-provided poller such as epoll, kqueue, IOCP, etc.). This means that reading from a file or the network in chunks of 1 byte is woefully ineffective.
To consider the other extreme, it's possible to allocate, say, a 100 MiB buffer and attempt to pass it to Read(). While the kernel's syscall will happily accept a destination of that size, it should be noted that contemporary OSes have internal buffers on network sockets of around 64 KiB¹ in size, so under most circumstances your Read() call will return having read just that much data or less. This means you will be wasting most of your buffer space.
Now comes another set of considerations: what pattern of reading data from a socket does your application have?
Say, when you're streaming data from a socket into an open file, you don't really care about buffering (you'd rather it were someone else's decision to pick the "right" size). For this case, just use io.Copy() (which currently (Go 1.6) uses an internal buffer of 32 KiB).
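For instance, a minimal sketch of that streaming case, assuming the usual io, net and os imports (the function name and file path are placeholders):
// saveStream copies everything arriving on the connection into a file,
// letting io.Copy choose its own internal buffer size.
func saveStream(c net.Conn, path string) error {
    defer c.Close()
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.Copy(f, c) // loops internally until EOF or an error
    return err
}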
Conversely, if you're parsing some application-level protocol that uses TCP as its transport, you often need to read the data in chunks of arbitrary fixed size. For this case, the best pattern is to wrap the socket in a bufio.Reader (to combat the "small reads" problem outlined above) and then use io.ReadFull() to read data into your local arrays/slices of the size you need (if possible, reuse your arrays and slices to lower the pressure on the garbage collector).
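A sketch of that pattern, assuming an illustrative 4-byte length-prefixed framing (imports: bufio, encoding/binary, io):
// readFrame reads one length-prefixed message. Wrapping the connection in a
// bufio.Reader means the small reads below are served from larger buffered
// reads on the socket.
func readFrame(br *bufio.Reader) ([]byte, error) {
    var hdr [4]byte
    // io.ReadFull blocks until exactly len(hdr) bytes have been read.
    if _, err := io.ReadFull(br, hdr[:]); err != nil {
        return nil, err
    }
    n := binary.BigEndian.Uint32(hdr[:])
    payload := make([]byte, int(n))
    if _, err := io.ReadFull(br, payload); err != nil {
        return nil, err
    }
    return payload, nil
}
// Usage: br := bufio.NewReader(conn); frame, err := readFrame(br)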
Another case is text-based "linewise" protocols such as SMTP or HTTP.
In these protocols, the maximum line length is typically fixed, and it makes sense to use buffers of the maximum size of the protocol's line to deal with them. (But anyway, to deal with such protocols, it's best to use the net/textproto standard package.)
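For instance, a tiny sketch of reading one line with net/textproto (imports: bufio, net, net/textproto):
// readLine reads one CRLF-terminated line, as used by SMTP-like protocols;
// textproto handles the line framing, so no manual buffer sizing is needed.
func readLine(c net.Conn) (string, error) {
    r := textproto.NewReader(bufio.NewReader(c))
    return r.ReadLine()
}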
As to your question per se, my stab at it is that 512 is just a nice-looking number (a small power of two) with no special meaning; when you write a book like this you have to pick some value anyway.
As you can see from my descriptions of real-world patterns of reading from the network, most of the time you simply have no business dealing with buffering yourself: let the standard tools do it for you. You should only resort to tuning that stuff when you're facing a real problem with the defaults provided by the standard packages.
TL;DR
The book you're reading merely explains the basic concepts, so it has to use some number.
Real-world code tends to use other numbers when it needs to buffer (usually higher)…
…but you should not concern yourself with these numbers until absolutely necessary: use the ready-made tools where possible.
¹ Of course, I can't speak for all operating systems, and they have different knobs to tweak this stuff, and "contemporary" might start to mean something different in a year or less, you know… still, I consider my estimate quite close to the truth.

Golang's main difference from the CSP language by Hoare

Look at this statement, taken from "The examples from Tony Hoare's seminal 1978 paper":
Go's design was strongly influenced by Hoare's paper. Although Go differs significantly from the example language used in the paper, the examples still translate rather easily. The biggest difference apart from syntax is that Go models the conduits of concurrent communication explicitly as channels, while the processes of Hoare's language send messages directly to each other, similar to Erlang. Hoare hints at this possibility in section 7.3, but with the limitation that "each port is connected to exactly one other port in another process", in which case it would be a mostly syntactic difference.
I'm confused.
Processes in Hoare's language communicate directly with each other. Goroutines also communicate with each other, but through channels.
So what impact does this limitation have in Go? What is the real difference?
The answer requires a fuller understanding of Hoare's work on CSP. The progression of his work can be summarised in three stages:
Based on Dijkstra's semaphores, Hoare developed monitors. These are as used in Java, except that Java's implementation contains a mistake (see Welch's article "Wot No Chickens"). It's unfortunate that Java ignored Hoare's later work.
CSP grew out of this. Initially, CSP required direct exchange from process A to process B. This rendezvous approach is used by Ada and Erlang.
CSP was completed by 1985, when his book was first published. This final version of CSP includes channels as used in Go. Along with Hoare's team at Oxford, David May concurrently developed Occam, a language deliberately intended to blend CSP into a practical programming language. CSP and Occam influenced each other (for example in The Laws of Occam Programming). For years, Occam was only available on the Transputer processor, whose architecture was tailored to suit CSP. More recently, Occam has been developed to target other processors and has also absorbed the Pi calculus, along with other general synchronisation primitives.
So, to answer the original question, it is probably helpful to compare Go with both CSP and Occam.
Channels: CSP, Go and Occam all have the same semantics for channels. In addition, Go makes it easy to add buffering into channels (Occam does not).
Choices: CSP defines both the internal and external choice. However, both Go and Occam have a single kind of selection: select in Go and ALT in Occam. The fact that there are two kinds of CSP choice proved to be less important in practical languages.
Occam's ALT allows condition guards, but Go's select does not (there is a workaround: channel aliases can be set to nil to imitate the same behaviour).
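A sketch of that nil-channel workaround (the names here are illustrative, not from any particular program): a nil channel's case in a select never becomes ready, so it behaves like a guard that is currently false.
// buffer copies values from in to out while holding at most limit of them.
// Each select case is enabled only when its guard holds, by leaving the
// corresponding channel variable nil otherwise.
func buffer(in <-chan int, out chan<- int, limit int) {
    var pending []int
    for {
        var recvCh <-chan int // nil unless there is room for another value
        var sendCh chan<- int // nil unless there is something to send
        var next int
        if len(pending) < limit {
            recvCh = in
        }
        if len(pending) > 0 {
            sendCh = out
            next = pending[0]
        }
        select {
        case v := <-recvCh:
            pending = append(pending, v)
        case sendCh <- next:
            pending = pending[1:]
        }
    }
}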
Mobility: Go allows channel ends to be sent (along with other data) via channels. This creates a dynamically-changing topology and goes beyond what is possible in CSP, but Milner's Pi calculus was developed (out of his CCS) to describe such networks.
Processes: A goroutine is a forked process; it terminates when it wants to and it doesn't have a parent. This is less like CSP / Occam, in which processes are compositional.
An example will help here: firstly Occam (n.b. indentation matters)
SEQ
  PAR
    processA()
    processB()
  processC()
and secondly Go
go processA()
go processB()
processC()
In the Occam case, processC doesn't start until both processA and processB have terminated. In Go, processA and processB fork very quickly, then processC runs straightaway.
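To recover the Occam-style composition in Go you have to wait explicitly, for example with a sync.WaitGroup (a minimal sketch, assuming the sync import):
var wg sync.WaitGroup
wg.Add(2)
go func() { defer wg.Done(); processA() }()
go func() { defer wg.Done(); processB() }()
wg.Wait()  // plays the role of the end of the Occam PAR
processC() // now runs only after processA and processB have terminated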
Shared data: CSP is not really concerned with data directly. But it is interesting to note there is an important difference between Go and Occam concerning shared data. When multiple goroutines share a common set of data variables, race conditions are possible; Go's excellent race detector helps to eliminate problems. But Occam takes a different stance: shared mutable data is prevented at compilation time.
Aliases: related to the above, Go allows many pointers to refer to each data item. Such aliases are disallowed in Occam, so reducing the effort needed to detect race conditions.
The latter two points are less about Hoare's CSP and more about May's Occam. But they are relevant because they directly concern safe concurrent coding.
That's exactly the point: in the example language used in Hoare's initial paper (and also in Erlang), process A talks directly to process B, while in Go, goroutine A talks to channel C and goroutine B listens to channel C. I.e. in Go the channels are explicit while in Hoare's language and Erlang, they are implicit.
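In Go terms, the point can be shown in a few lines (a minimal sketch; the fmt import is assumed): neither side names the other, both only hold the channel.
// The sender and the receiver never refer to each other; the channel c is
// the explicit conduit between them.
c := make(chan string)
go func() { c <- "hello" }() // the sender knows only c
msg := <-c                   // the receiver knows only c
fmt.Println(msg)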
See this article for more info.
Recently, I've been working quite intensively with Go's channels, and have been working with concurrency and parallelism for many years, although I could never profess to know everything about this.
I think what you're asking is what's the subtle difference between sending a message to a channel and sending directly to each other? If I understand you, the quick answer is simple.
Sending to a channel gives the opportunity for parallelism/concurrency on both sides of the channel. Beautiful, and scalable.
We live in a concurrent world. Sending a long continuous stream of messages from A to B (asynchronously) means that B will need to process the messages at pretty much the same pace as A sends them, unless more than one instance of B has the opportunity to process a message taken from the channel, hence sharing the workload.
The good thing about channels is that you can have a number of producer/consumer goroutines which are able to push messages onto the queue, or take messages from the queue and process them accordingly.
If you think linearly, like a single-core CPU, concurrency is basically like having a million jobs to do. A single-core CPU can only do one thing at a time, yet it still gives the illusion that lots of things are happening at once. While executing some code, the OS often has to wait for something to come back from the network, disk, keyboard, mouse, etc., or for a process that is sleeping for a while; that waiting gives it the opportunity to do something else in the meantime. This all happens extremely quickly, creating the illusion of parallelism.
Parallelism, on the other hand, is different in that a job can run on a completely different CPU, independent of what's going on with the other CPUs, and therefore doesn't run under the same constraints (although most OSes do a pretty good job of distributing workloads evenly across all of their CPUs, with the possible exception of CPU-hungry, uncooperative code that never yields to the OS; even then, the OS tames it).
The point is, having multi-core CPUs means more parallelism and more concurrency can occur.
Imagine a single queue at a bank which fans-out to a number of tellers who can help you. If no customers are being served by any teller, one teller elects to handle the next customer and becomes busy, until they all become busy. Whenever a customer walks away from a teller, that teller is able to handle the next customer in the queue.
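In Go, that bank queue maps naturally onto a single channel with several receiving goroutines (a sketch; the handle function and the names are illustrative, and the sync import is assumed):
// customers is the shared queue; each teller goroutine takes the next
// waiting customer as soon as it becomes free.
func serve(customers <-chan string, tellers int) {
    var wg sync.WaitGroup
    for i := 0; i < tellers; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for c := range customers { // blocks until a customer is queued
                handle(id, c) // placeholder for the real work
            }
        }(i)
    }
    wg.Wait() // all tellers finish once the customers channel is closed
}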

Order of Goroutine Unblocking on Single Channel

Does the order in which goroutines block on a channel determine the order in which they will unblock? I'm not concerned with the order of the messages that are sent (they're guaranteed to be ordered), but with the order in which the goroutines unblock.
Imagine an empty channel ch shared between multiple goroutines (1, 2, and 3), with each goroutine trying to receive a message on ch. Since ch is empty, each goroutine will block. When I send a message to ch, will goroutine 1 unblock first? Or could 2 or 3 possibly receive the first message? (Or vice versa, with the goroutines trying to send.)
I have a playground that seems to suggest that the order in which goroutines block is the order in which they unblock, but I'm not sure whether that is just an artefact of the implementation.
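A hypothetical reconstruction of that kind of experiment (not the original playground code; the fmt and time imports are assumed) might look like this. On current implementations it tends to report the receivers in the order they blocked, but as the answers below explain, that is not guaranteed:
ch := make(chan int)
done := make(chan string)
for i := 1; i <= 3; i++ {
    go func(id int) {
        v := <-ch // blocks until a value arrives
        done <- fmt.Sprintf("goroutine %d received %d", id, v)
    }(i)
    time.Sleep(10 * time.Millisecond) // crude way to pin down the blocking order
}
for v := 1; v <= 3; v++ {
    ch <- v
}
for i := 0; i < 3; i++ {
    fmt.Println(<-done)
}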
This is a good question - it touches on some important issues when doing concurrent design. As has already been stated, the answer to your specific question is, according to the current implementation, FIFO based. It's unlikely ever to be different, except perhaps if the implementers decided, say, a LIFO was better for some reason.
There is no guarantee, though. So you should avoid creating code that relies on a particular implementation.
The broader question concerns non-determinism, fairness and starvation.
Perhaps surprisingly, non-determinism in a CSP-based system does not come from things happening in parallel. It is possible because of concurrency, but not because of parallelism. Instead, non-determinism arises when a choice is made. In the formal algebra of CSP, this is modelled mathematically. Fortunately, you don't need to know the maths to be able to use Go. But formally, two goroutines could execute in parallel and the outcome could still be deterministic, provided all the choices have been eliminated.
Go allows choices that introduce non-determinism explicitly via select and implicitly via ends of channels being shared between goroutines. If you have point-to-point (one reader, one writer) channels, the second kind does not arise. So if it's important in a particular situation, you have a design choice you can make.
Fairness and starvation are typically opposite sides of the same coin. Starvation is one of those dynamic problems (along with deadlock, livelock and race conditions) that result perhaps in poor performance, and more likely in wrong behaviour. These dynamic problems are un-testable and need some level of analysis to solve. Clearly, if part of a system is unresponsive because it is starved of access to certain resources, then there is a need for greater fairness in governing those resources.
Shared access to channel ends may well provide a degree of fairness because of the current FIFO behaviour and this may appear sufficient. But if you want it guaranteed (regardless of implementation uncertainties), it is possible instead to use a select and a bundle of point-to-point channels in an array. Fair indexing is easy to achieve by always preferring them in an order that puts the last-selected at the bottom of the pile. This solution can guarantee fairness, but probably with a small performance penalty.
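One way to sketch that "last-selected goes to the bottom" idea for two point-to-point channels is to poll the previously losing channel with a non-blocking select before falling back to a blocking select over both (purely illustrative; remember that a bare Go select chooses among ready cases pseudo-randomly):
// recvFair receives one value from a or b, giving first chance to the
// channel that was not chosen last time.
func recvFair(a, b <-chan int, lastWasA bool) (v int, fromA bool) {
    if lastWasA { // a won last time, so give b a head start
        select {
        case v = <-b:
            return v, false
        default:
        }
    } else {
        select {
        case v = <-a:
            return v, true
        default:
        }
    }
    // Neither preferred channel was ready: block on both.
    select {
    case v = <-a:
        return v, true
    case v = <-b:
        return v, false
    }
}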
(aside: see "Wot No Chickens" for a somewhat-amusing discovery made by researchers in Canterbury, UK concerning a fairness flaw in the Java Virtual Machine - which has never been rectified!)
I believe it's unspecified because the memory model document only says "A send on a channel happens before the corresponding receive from that channel completes." The spec sections on send statements and the receive operator don't say anything about what unblocks first. Right now the gc toolchain uses an orderly FIFO queue to control which goroutine unblocks, but I don't see any promises in the spec that it must always be so.
(Just for general background note that Playground code runs with GOMAXPROCS=1, i.e., on one core, so some types of concurrency-related unpredictability just won't come up.)
The order is not specified, but current implementations use a FIFO queue for waiting goroutines.
The authoritative document is the Go Memory Model. The memory model does not define a happens-before relationship for two goroutines sending to the same channel, therefore the order is not specified. Ditto for receive.

MPI Alltoallv or better individual Send and Recv? (Performance)

I have a number of processes (on the order of 100 to 1000), and each of them has to send some data to some (say about 10) of the other processes. (Typically, but not necessarily always, if A sends to B, B also sends to A.) Every process knows how much data it has to receive from which process.
So I could just use MPI_Alltoallv, with many or most of the message lengths zero.
However, I heard that for performance reasons it would be better to use several MPI_send and MPI_recv communications rather than the global MPI_Alltoallv.
What I do not understand: if a series of send and receive calls is more efficient than one Alltoallv call, why is Alltoallv not just implemented as a series of sends and receives?
It would be much more convenient for me (and others?) to use just one global call. I might also have to worry about running into a deadlock with several Sends and Recvs (fixable by some odd-even strategy, something more complex, or by using buffered sends/receives?).
Would you agree that MPI_Alltoallv is necessarily slower than, say, the 10 MPI_Send and MPI_Recv calls; and if yes, why and by how much?
Usually the default advice with collectives is the opposite: use a collective operation when possible instead of coding your own. The more information the MPI library has about the communication pattern, the more opportunities it has to optimize internally.
Unless special hardware support is available, collective calls are in fact implemented internally in terms of sends and receives. But the actual communication pattern will probably not be just a series of sends and receives. For example, using a tree to broadcast a piece of data can be faster than having the same rank send it to a bunch of receivers. A lot of work goes into optimizing collective communications, and it is difficult to do better.
Having said that, MPI_Alltoallv is somewhat different. It can be difficult to optimize for all irregular communication scenarios at the MPI level, so it is conceivable that some custom communication code could do better. For example, an implementation of MPI_Alltoallv might be synchronizing: it could require that all processes "check in", even if they only have 0-length messages to send. I would have thought such an implementation unlikely, but here is one in the wild.
So the real answer is "it depends". If the library implementation of MPI_Alltoallv is a bad match for the task, custom communication code will win. But before going down that path, check if the MPI-3 neighbor collectives are a good fit for your problem.

Is Go's buffered channel lockless?

Go's buffered channel is essentially a thread-safe FIFO queue. (See "Is it possible to use Go's buffered channel as a thread-safe queue?")
I am wondering how it's implemented. Is it lock-free, as described in "Is there such a thing as a lockless queue for multiple read or write threads?"?
grepping in Go's src directory (grep -r Lock .|grep chan) gives the following output:
./pkg/runtime/chan.c: Lock;
./pkg/runtime/chan_test.go: m.Lock()
./pkg/runtime/chan_test.go: m.Lock() // wait
./pkg/sync/cond.go: L Locker // held while observing or changing the condition
It doesn't seem to be locking on my machine (MacOS, Intel x86_64), though. Is there any official resource to validate this?
If you read the runtime·chansend function in chan.c, you will see that runtime·lock is called before the check to see if the channel is buffered if(c->dataqsiz > 0).
In other words, buffered channels (and all channels in general) use locks.
The reason your search did not find it was you were looking for "Lock" with a capital L. The lock function used for channels is a non-exported C function in the runtime.
You can write lock-free (and even wait-free!) implementations for everything you like. Modern hardware primitives like CMPXCHG (compare-and-swap) are universal, so they are enough to build such algorithms. But writing and verifying such algorithms isn't one of the easiest tasks. In addition, much faster algorithms might exist: lock-free algorithms are just a very small subset of algorithms in general.
As far as I remember, Dmitry Vyukov wrote a lock-free MPMC (multi-producer/multi-consumer) channel implementation for Go in the past, but the patch was abandoned because of some problems with Go's select statement. Supporting that statement efficiently seems to be really hard.
The main goal of Go's channel type is, however, to provide a high-level concurrency primitive that is easily usable for a broad range of problems. Even developers who aren't experts at concurrent programming should be able to write correct programs that can be easily reviewed and maintained in larger software projects. If you are interested in squeezing out every last bit of performance, you would have to write a specialized queue implementation that suits your needs.

Resources