Order of Goroutine Unblocking on Single Channel - go

Does the order in which goroutines block on a channel determine the order in which they will unblock? I'm not concerned with the order of the messages that are sent (they're guaranteed to be ordered), but with the order of the goroutines that will unblock.
Imagine an empty channel ch shared between multiple goroutines (1, 2, and 3), with each goroutine trying to receive a message on ch. Since ch is empty, each goroutine will block. When I send a message to ch, will goroutine 1 unblock first? Or could 2 or 3 possibly receive the first message? (Or vice versa, with the goroutines trying to send.)
I have a playground example that seems to suggest that the order in which goroutines block is the order in which they are unblocked, but I'm not sure whether that is just an artifact of the implementation rather than guaranteed behaviour.
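For reference, a minimal sketch of the kind of experiment described (the actual playground code isn't shown here, so the names and timings below are illustrative):

package main

import (
    "fmt"
    "time"
)

func main() {
    ch := make(chan int)

    // Start three receivers that block on the empty channel in a known order.
    for i := 1; i <= 3; i++ {
        i := i
        go func() {
            v := <-ch // blocks until a value arrives
            fmt.Printf("goroutine %d received %d\n", i, v)
        }()
        time.Sleep(10 * time.Millisecond) // encourage a predictable blocking order
    }

    // Send three values; with the current runtime the receivers typically
    // unblock in FIFO order, but (as the answers below explain) this is not
    // a documented guarantee.
    for v := 100; v < 103; v++ {
        ch <- v
    }
    time.Sleep(100 * time.Millisecond) // give the receivers time to print
}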

This is a good question - it touches on some important issues when doing concurrent design. As has already been stated, the answer to your specific question is, according to the current implementation, FIFO based. It's unlikely ever to be different, except perhaps if the implementers decided, say, a LIFO was better for some reason.
There is no guarantee, though. So you should avoid creating code that relies on a particular implementation.
The broader question concerns non-determinism, fairness and starvation.
Perhaps surprisingly, non-determinism in a CSP-based system does not come from things happening in parallel. It is made possible by concurrency, but it is not caused by concurrency. Instead, non-determinism arises when a choice is made. In the formal algebra of CSP, this is modelled mathematically. Fortunately, you don't need to know the maths to be able to use Go. But formally, two goroutines could execute in parallel and the outcome could still be deterministic, provided all the choices are eliminated.
Go allows choices that introduce non-determinism explicitly via select and implicitly via ends of channels being shared between goroutines. If you have point-to-point (one reader, one writer) channels, the second kind does not arise. So if it's important in a particular situation, you have a design choice you can make.
Fairness and starvation are typically opposite sides of the same coin. Starvation is one of those dynamic problems (along with deadlock, livelock and race conditions) that result perhaps in poor performance, but more likely in wrong behaviour. These dynamic problems are essentially untestable and need some level of analysis to solve. Clearly, if part of a system is unresponsive because it is starved of access to certain resources, then there is a need for greater fairness in governing those resources.
Shared access to channel ends may well provide a degree of fairness because of the current FIFO behaviour and this may appear sufficient. But if you want it guaranteed (regardless of implementation uncertainties), it is possible instead to use a select and a bundle of point-to-point channels in an array. Fair indexing is easy to achieve by always preferring them in an order that puts the last-selected at the bottom of the pile. This solution can guarantee fairness, but probably with a small performance penalty.
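One way to approximate that fair-indexing idea in Go is sketched below; the rotation policy, the non-blocking first pass and the reflect-based fallback are my own choices for illustration, not something prescribed by the answer above:

package main

import (
    "fmt"
    "reflect"
)

// fairServe receives n values in total from a set of point-to-point
// channels, always trying the channel just after the last-served one
// first, so no channel can be starved by its neighbours.
func fairServe(chans []chan int, n int, handle func(idx, val int)) {
    start := 0 // index to try first on the next pass
    for served := 0; served < n; served++ {
        picked := -1
        // Pass 1: non-blocking checks in rotated (fair) order.
        for k := 0; k < len(chans) && picked < 0; k++ {
            i := (start + k) % len(chans)
            select {
            case v := <-chans[i]:
                handle(i, v)
                picked = i
            default:
            }
        }
        if picked < 0 {
            // Pass 2: nothing was ready, so block until any channel has data.
            cases := make([]reflect.SelectCase, len(chans))
            for i, ch := range chans {
                cases[i] = reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ch)}
            }
            chosen, v, _ := reflect.Select(cases)
            handle(chosen, int(v.Int()))
            picked = chosen
        }
        start = (picked + 1) % len(chans) // the last-selected goes to the back of the order
    }
}

func main() {
    chans := []chan int{make(chan int), make(chan int), make(chan int)}
    for i := range chans {
        i := i
        go func() {
            for v := 0; ; v++ {
                chans[i] <- v // each producer owns exactly one channel end
            }
        }()
    }
    fairServe(chans, 9, func(idx, val int) {
        fmt.Printf("served channel %d, value %d\n", idx, val)
    })
}

The non-blocking pass gives first refusal to the channel just after the last-served one, which is what prevents any single channel from being starved; the blocking reflect.Select is only reached when nothing is ready.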
(aside: see "Wot No Chickens" for a somewhat-amusing discovery made by researchers in Canterbury, UK concerning a fairness flaw in the Java Virtual Machine - which has never been rectified!)

I believe it's unspecified because the memory model document only says "A send on a channel happens before the corresponding receive from that channel completes." The spec sections on send statements and the receive operator don't say anything about what unblocks first. Right now the gc toolchain uses an orderly FIFO queue to control which goroutine unblocks, but I don't see any promises in the spec that it must always be so.
(Just for general background note that Playground code runs with GOMAXPROCS=1, i.e., on one core, so some types of concurrency-related unpredictability just won't come up.)
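For reference, the only documented guarantee is that happens-before relationship between a send and the completion of the corresponding receive; a minimal sketch of the kind of code that guarantee does cover (variable names are illustrative):

package main

import "fmt"

var msg string

func main() {
    done := make(chan bool)
    go func() {
        msg = "hello"
        done <- true // the send happens before the receive below completes
    }()
    <-done
    fmt.Println(msg) // guaranteed to print "hello"
}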

The order is not specified, but current implementations use a FIFO queue for waiting goroutines.
The authoritative document is the Go Memory Model. The memory model does not define a happens-before relationship for two goroutines sending to the same channel, therefore the order is not specified. Ditto for receive.

Related

How to synchronize multiple goroutines on one critical section?

I need to synchronize a number of goroutines 1..n that run code in an interpreted language (written in Go) on a critical section of interpreted code running in a goroutine k, which may be any of 1..n at runtime. I.e., if the code in k is running in the critical section, all other goroutines i (i≠k; 1≤k,i≤n) need to be suspended until k is finished. The solution also needs to deal with erroneous nesting of critical sections, either by processing at the outermost level or by signalling an error.
The problem is that each of these goroutines may run code in a tight loop and I have no direct control over this code - it's interpreted, supplied by the user. Things I've considered:
1. Mess with the interpreter to include some kind of yield or check. This seems like a very inefficient way of dealing with it. The language allows for recursion, so I'd have to check every function call.
2. Suspend the goroutines i while they are running, from k, in a non-cooperative way. Is it even possible to suspend a goroutine in Go "from the outside" (i.e. from k)?
3. Supply a way in the target language to co-operatively check whether k is running and halt if needed. I'm reluctant to do this, since it is basically a leaky abstraction over channels. It would also make programming very complicated and I want my interpreted language to remain simple.
What would you recommend? Is option 2 even possible? Is there another way to do this? Any help is greatly appreciated!

Golang main difference from CSP-Language by Hoare

Look at this statement, taken from "The examples from Tony Hoare's seminal 1978 paper":
Go's design was strongly influenced by Hoare's paper. Although Go differs significantly from the example language used in the paper, the examples still translate rather easily. The biggest difference apart from syntax is that Go models the conduits of concurrent communication explicitly as channels, while the processes of Hoare's language send messages directly to each other, similar to Erlang. Hoare hints at this possibility in section 7.3, but with the limitation that "each port is connected to exactly one other port in another process", in which case it would be a mostly syntactic difference.
I'm confused.
Processes in Hoare's language communicate directly with each other. Goroutines also communicate directly with each other, but using channels.
So what impact does that limitation have in Go? What is the real difference?
The answer requires a fuller understanding of Hoare's work on CSP. The progression of his work can be summarised in three stages:
Based on Dijkstra's semaphores, Hoare developed monitors. These are what Java uses, except that Java's implementation contains a mistake (see Welch's article Wot No Chickens). It's unfortunate that Java ignored Hoare's later work.
CSP grew out of this. Initially, CSP required direct exchange from process A to process B. This rendezvous approach is used by Ada and Erlang.
CSP was completed by 1985, when his book was first published. This final version of CSP includes channels as used in Go. Along with Hoare's team at Oxford, David May concurrently developed Occam, a language deliberately intended to blend CSP into a practical programming language. CSP and Occam influenced each other (for example in The Laws of Occam Programming). For years, Occam was only available on the Transputer processor, which had its architecture tailored to suit CSP. More recently, Occam has been developed to target other processors and has also absorbed the pi-calculus, along with other general synchronisation primitives.
So, to answer the original question, it is probably helpful to compare Go with both CSP and Occam.
Channels: CSP, Go and Occam all have the same semantics for channels. In addition, Go makes it easy to add buffering into channels (Occam does not).
Choices: CSP defines both the internal and external choice. However, both Go and Occam have a single kind of selection: select in Go and ALT in Occam. The fact that there are two kinds of CSP choice proved to be less important in practical languages.
Occam's ALT allows condition guards, but Go's select does not (there is a workaround: a channel alias can be set to nil to imitate the same behaviour, as sketched below).
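A minimal sketch of that nil-channel workaround, with illustrative channel names that aren't from the original answer:

package main

import "fmt"

func main() {
    a := make(chan int, 1)
    b := make(chan int, 1)
    a <- 1
    b <- 2

    enableB := false // plays the role of a condition guard
    for i := 0; i < 2; i++ {
        bb := b
        if !enableB {
            bb = nil // receiving from a nil channel blocks forever, so this case can never fire
        }
        select {
        case v := <-a:
            fmt.Println("from a:", v)
            enableB = true // from now on the b case is allowed
        case v := <-bb:
            fmt.Println("from b:", v)
        }
    }
}

Because a receive from a nil channel blocks forever, assigning nil to the alias effectively disables that case of the select, just as a false ALT guard would in Occam.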
Mobility: Go allows channel ends to be sent (along with other data) via channels. This creates a dynamically-changing topology and goes beyond what is possible in CSP, but Milner's Pi calculus was developed (out of his CCS) to describe such networks.
Processes: A goroutine is a forked process; it terminates when it wants to and it doesn't have a parent. This is less like CSP / Occam, in which processes are compositional.
An example will help here: firstly Occam (n.b. indentation matters)
SEQ
  PAR
    processA()
    processB()
  processC()
and secondly Go
go processA()
go processB()
processC()
In the Occam case, processC doesn't start until both processA and processB have terminated. In Go, processA and processB fork very quickly, then processC runs straightaway.
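If the Occam-like composition is wanted in Go (processC only after both processA and processB have terminated), a sync.WaitGroup recovers it; a sketch, assuming the three processes are ordinary functions:

package main

import (
    "fmt"
    "sync"
)

func processA() { fmt.Println("A done") }
func processB() { fmt.Println("B done") }
func processC() { fmt.Println("C runs last") }

func main() {
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); processA() }()
    go func() { defer wg.Done(); processB() }()
    wg.Wait() // corresponds to the end of the Occam PAR
    processC()
}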
Shared data: CSP is not really concerned with data directly. But it is interesting to note there is an important difference between Go and Occam concerning shared data. When multiple goroutines share a common set of data variables, race conditions are possible; Go's excellent race detector helps to eliminate problems. But Occam takes a different stance: shared mutable data is prevented at compilation time.
Aliases: related to the above, Go allows many pointers to refer to each data item. Such aliases are disallowed in Occam, so reducing the effort needed to detect race conditions.
The latter two points are less about Hoare's CSP and more about May's Occam. But they are relevant because they directly concern safe concurrent coding.
That's exactly the point: in the example language used in Hoare's initial paper (and also in Erlang), process A talks directly to process B, while in Go, goroutine A talks to channel C and goroutine B listens to channel C. I.e. in Go the channels are explicit while in Hoare's language and Erlang, they are implicit.
See this article for more info.
Recently, I've been working quite intensively with Go's channels, and have been working with concurrency and parallelism for many years, although I could never profess to know everything about this.
I think what you're asking is what's the subtle difference between sending a message to a channel and sending directly to each other? If I understand you, the quick answer is simple.
Sending to a channel gives the opportunity for parallelism/concurrency on both sides of the channel. Beautiful, and scalable.
We live in a concurrent world. Sending a long continuous stream of messages from A to B (asynchronously) means that B will need to process the messages at pretty much the same pace as A sends them, unless more than one instance of B has the opportunity to process a message taken from the channel, hence sharing the workload.
The good thing about channels is that you can have a number of producer/consumer goroutines that push messages onto the queue, or consume from the queue and process them accordingly.
If you think linearly, like a single-core CPU, concurrency is basically like having a million jobs to do. A single-core CPU can only do one thing at a time, yet it gives the illusion that lots of things are happening at the same time. Whenever executing code has to wait for something to come back from the network, disk, keyboard, mouse, etc., or a process sleeps for a while, the OS gets the opportunity to do something else in the meantime. This all happens extremely quickly, creating the illusion of parallelism.
Parallelism, on the other hand, is different in that a job can run on a completely different CPU core, independent of what's going on with the other cores, and therefore doesn't run under the same constraints (although most OSes do a pretty good job of ensuring workloads are evenly distributed across all of their cores, with the possible exception of CPU-hungry, uncooperative code that never yields to the OS; but even then the OS tames it).
The point is, having multi-core CPUs means more parallelism and more concurrency can occur.
Imagine a single queue at a bank which fans-out to a number of tellers who can help you. If no customers are being served by any teller, one teller elects to handle the next customer and becomes busy, until they all become busy. Whenever a customer walks away from a teller, that teller is able to handle the next customer in the queue.
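A short sketch of that bank-teller analogy as Go code; the queue size, the number of tellers and all the names are illustrative, not from the original answer:

package main

import (
    "fmt"
    "sync"
)

func main() {
    queue := make(chan int) // the single queue of customers
    var wg sync.WaitGroup

    // Three tellers each take the next customer whenever they are free.
    for t := 1; t <= 3; t++ {
        t := t
        wg.Add(1)
        go func() {
            defer wg.Done()
            for customer := range queue {
                fmt.Printf("teller %d serving customer %d\n", t, customer)
            }
        }()
    }

    for c := 1; c <= 10; c++ {
        queue <- c
    }
    close(queue) // no more customers; the tellers finish and go home
    wg.Wait()
}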

MPI Alltoallv or better individual Send and Recv? (Performance)

I have a number of processes (of the order of 100 to 1000), and each of them has to send some data to some (say about 10) of the other processes. (Typically, but not necessarily always, if A sends to B, B also sends to A.) Every process knows how much data it has to receive from which process.
So I could just use MPI_Alltoallv, with many or most of the message lengths zero.
However, I have heard that for performance reasons it would be better to use several MPI_Send and MPI_Recv communications rather than the global MPI_Alltoallv.
What I do not understand: if a series of send and receive calls are more efficient than one Alltoallv call, why is Alltoallv not just implemented as a series of sends and receives?
It would be much more convenient for me (and others?) to use just one global call. Also, I might have to worry about running into a deadlock situation with several Send and Recv calls (fixable by some odd-even strategy, something more complex, or by using buffered sends/receives?).
Would you agree that MPI_Alltoallv is necessarily slower than, say, the 10 MPI_Send and MPI_Recv calls; and if yes, why and by how much?
Usually the default advice with collectives is the opposite: use a collective operation when possible instead of coding your own. The more information the MPI library has about the communication pattern, the more opportunities it has to optimize internally.
Unless special hardware support is available, collective calls are in fact implemented internally in terms of sends and receives. But the actual communication pattern will probably not be just a series of sends and receives. For example, using a tree to broadcast a piece of data can be faster than having the same rank send it to a bunch of receivers. A lot of work goes into optimizing collective communications, and it is difficult to do better.
Having said that, MPI_Alltoallv is somewhat different. It can be difficult to optimize for all irregular communication scenarios at the MPI level, so it is conceivable that some custom communication code can do better. For example, an implementation of MPI_Alltoallv might be synchronizing: it could require that all processes "check in", even if they only have a 0-length message to send. I would have thought such an implementation unlikely, but here is one in the wild.
So the real answer is "it depends". If the library implementation of MPI_Alltoallv is a bad match for the task, custom communication code will win. But before going down that path, check if the MPI-3 neighbor collectives are a good fit for your problem.

Is Go's buffered channel lockless?

Go's buffered channel is essentially a thread-safe FIFO queue. (See Is it possible to use Go's buffered channel as a thread-safe queue?)
I am wondering how it's implemented. Is it lock-free like described in Is there such a thing as a lockless queue for multiple read or write threads??
Grepping in Go's src directory (grep -r Lock .|grep chan) gives the following output:
./pkg/runtime/chan.c: Lock;
./pkg/runtime/chan_test.go: m.Lock()
./pkg/runtime/chan_test.go: m.Lock() // wait
./pkg/sync/cond.go: L Locker // held while observing or changing the condition
It doesn't seem to be locking on my machine (macOS, Intel x86_64) though. Is there any official resource to validate this?
If you read the runtime·chansend function in chan.c, you will see that runtime·lock is called before the check to see if the channel is buffered if(c->dataqsiz > 0).
In other words, buffered channels (and all channels in general) use locks.
The reason your search did not find it was you were looking for "Lock" with a capital L. The lock function used for channels is a non-exported C function in the runtime.
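To make the idea concrete, here is a rough, simplified sketch of what a lock-based buffered channel looks like conceptually. It is emphatically not the actual runtime implementation (the real code also has to park and wake goroutines, and handle select, closing, and so on); it just illustrates the mutex-plus-condition-variable structure:

package main

import (
    "fmt"
    "sync"
)

type boundedQueue struct {
    mu       sync.Mutex
    notFull  *sync.Cond
    notEmpty *sync.Cond
    buf      []int
    cap      int
}

func newBoundedQueue(capacity int) *boundedQueue {
    q := &boundedQueue{cap: capacity}
    q.notFull = sync.NewCond(&q.mu)
    q.notEmpty = sync.NewCond(&q.mu)
    return q
}

func (q *boundedQueue) Send(v int) {
    q.mu.Lock()
    for len(q.buf) == q.cap {
        q.notFull.Wait() // analogous to a sender blocking on a full channel
    }
    q.buf = append(q.buf, v)
    q.notEmpty.Signal()
    q.mu.Unlock()
}

func (q *boundedQueue) Recv() int {
    q.mu.Lock()
    for len(q.buf) == 0 {
        q.notEmpty.Wait() // analogous to a receiver blocking on an empty channel
    }
    v := q.buf[0]
    q.buf = q.buf[1:]
    q.notFull.Signal()
    q.mu.Unlock()
    return v
}

func main() {
    q := newBoundedQueue(2)
    go func() {
        for i := 0; i < 5; i++ {
            q.Send(i)
        }
    }()
    for i := 0; i < 5; i++ {
        fmt.Println(q.Recv())
    }
}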
You can write lock-free (and even wait-free!) implementations of everything you like. Modern hardware primitives like CMPXCHG (compare-and-swap) are universal enough to make that possible. But writing and verifying such algorithms isn't one of the easiest tasks. In addition to that, much faster algorithms might exist: lock-free algorithms are just a very small subset of algorithms in general.
As far as I remember, Dmitry Vyukov has written a lock-free MPMC (multi-producer/multi-consumer) channel implementation for Go in the past, but the patch was abandoned because of some problems with Go's select statement. Supporting that statement efficiently seems to be really hard.
The main goal of Go's channel type is, however, to provide a high-level concurrency primitive that is easily usable for a broad range of problems. Even developers who aren't experts at concurrent programming should be able to write correct programs that can be easily reviewed and maintained in larger software projects. If you are interested in squeezing out every last bit of performance, you would have to write a specialized queue implementation that suits your needs.

Event Scheduling in Twisted

This is a fairly basic question, but I am new to Twisted. If the reactor loop encounters two callLaters with the exact same timeout value and also encounters an incoming packet, how will it schedule the three?
The callLaters would fire in the order that you registered them. The packet arrival could fire before or after the callLaters depending on the point of execution in the event loop when the packet arrives.
There is no definitive rule here. Different reactors may implement different strategies. In general these implementations are somewhat ad-hoc and not particularly well designed, but there isn't a lot of motivation to fix them, because most applications with deep ordering dependencies on different event sources are actually just buggy, and should be fixed not to care what order these fundamentally non-deterministic events arrive in.

Resources