Is Go's buffered channel lockless? - thread-safety

Go's buffered channel is essentially a thread-safe FIFO queue. (See Is it possible to use Go's buffered channel as a thread-safe queue?)
I am wondering how it's implemented. Is it lock-free like described in Is there such a thing as a lockless queue for multiple read or write threads??
Grepping in Go's src directory (grep -r Lock .|grep chan) gives the following output:
./pkg/runtime/chan.c: Lock;
./pkg/runtime/chan_test.go: m.Lock()
./pkg/runtime/chan_test.go: m.Lock() // wait
./pkg/sync/cond.go: L Locker // held while observing or changing the condition
It doesn't seem to be locking on my machine (MacOS, Intel x86_64), though. Is there any official resource to validate this?

If you read the runtime·chansend function in chan.c, you will see that runtime·lock is called before the check to see if the channel is buffered if(c->dataqsiz > 0).
In other words, buffered channels (and all channels in general) use locks.
The reason your search did not find it was you were looking for "Lock" with a capital L. The lock function used for channels is a non-exported C function in the runtime.
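As a conceptual model only (this is not the runtime's code, and every name in it is hypothetical), "a buffered channel uses a lock" can be pictured as a bounded FIFO whose send and receive paths both take the same mutex before touching the buffer:

```go
package main

import (
	"fmt"
	"sync"
)

// lockedQueue is a hypothetical bounded FIFO guarded by a single mutex.
// It only illustrates the idea; the real channel code lives in the runtime.
type lockedQueue struct {
	mu       sync.Mutex
	notEmpty *sync.Cond
	notFull  *sync.Cond
	buf      []int
	capacity int
}

func newLockedQueue(capacity int) *lockedQueue {
	q := &lockedQueue{capacity: capacity}
	q.notEmpty = sync.NewCond(&q.mu)
	q.notFull = sync.NewCond(&q.mu)
	return q
}

// Send takes the lock first, then blocks while the buffer is full.
func (q *lockedQueue) Send(v int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.buf) == q.capacity {
		q.notFull.Wait()
	}
	q.buf = append(q.buf, v)
	q.notEmpty.Signal()
}

// Recv takes the lock first, then blocks while the buffer is empty.
func (q *lockedQueue) Recv() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.buf) == 0 {
		q.notEmpty.Wait()
	}
	v := q.buf[0]
	q.buf = q.buf[1:]
	q.notFull.Signal()
	return v
}

func main() {
	q := newLockedQueue(2)
	go func() {
		for i := 1; i <= 5; i++ {
			q.Send(i)
		}
	}()
	sum := 0
	for i := 0; i < 5; i++ {
		sum += q.Recv()
	}
	fmt.Println(sum) // 15
}
```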

You can write lock-free (and even wait-free!) implementations for everything you like. Modern hardware primitives like CMPXCHG are enough to be universally usable. But writing and verifying such algorithms isn't one of the easiest tasks. In addition to that, much faster algorithms might exist: lock-free algorithms are just a very small subset of algorithms in general.
As far as I remember, Dmitry Vyukov has written a lock-free MPMC (multi-producer/multi-consumer) channel implementation for Go in the past, but the patch was abandoned because of some problems with Go's select statement. Supporting this statement efficiently seems to be really hard.
The main goal of Go's channel type is, however, to provide a high-level concurrency primitive that is easily usable for a broad range of problems. Even developers who aren't experts at concurrent programming should be able to write correct programs that can be easily reviewed and maintained in larger software projects. If you are interested in squeezing out every last bit of performance, you would have to write a specialized queue implementation that suits your needs.
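To give a flavour of what such a specialized queue might look like, here is a minimal sketch of a single-producer/single-consumer ring buffer that uses only atomic loads and stores. The type and method names are made up for illustration, it assumes Go 1.19 or later for the typed atomics, and it is only safe with exactly one pushing goroutine and one popping goroutine:

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// spscRing is a hypothetical bounded queue for exactly one producer
// goroutine and one consumer goroutine. It uses no mutex.
type spscRing struct {
	buf  []int
	head atomic.Uint64 // next slot to read; written only by the consumer
	tail atomic.Uint64 // next slot to write; written only by the producer
}

func newSPSCRing(size int) *spscRing {
	return &spscRing{buf: make([]int, size)}
}

// TryPush stores v and returns true, or returns false if the ring is full.
func (r *spscRing) TryPush(v int) bool {
	t := r.tail.Load()
	if t-r.head.Load() == uint64(len(r.buf)) {
		return false
	}
	r.buf[t%uint64(len(r.buf))] = v
	r.tail.Store(t + 1) // publish the filled slot to the consumer
	return true
}

// TryPop removes the oldest value, or returns false if the ring is empty.
func (r *spscRing) TryPop() (int, bool) {
	h := r.head.Load()
	if h == r.tail.Load() {
		return 0, false
	}
	v := r.buf[h%uint64(len(r.buf))]
	r.head.Store(h + 1) // hand the slot back to the producer
	return v, true
}

func main() {
	r := newSPSCRing(4)
	const n = 1000
	go func() {
		for i := 1; i <= n; {
			if r.TryPush(i) {
				i++
			} else {
				runtime.Gosched() // ring full: let the consumer run
			}
		}
	}()
	sum := 0
	for got := 0; got < n; {
		if v, ok := r.TryPop(); ok {
			sum += v
			got++
		} else {
			runtime.Gosched() // ring empty: let the producer run
		}
	}
	fmt.Println(sum) // 500500
}
```

Unlike a channel, this sketch never blocks and cannot take part in a select, which is the kind of trade-off a specialized queue accepts.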

Related

How to learn internals of the Go Programming Language? [closed]

Recently I've participated in several Go job interviews. The first one asked me How is a channel implemented?, then the second one asked How is a goroutine implemented?. Well, as you can guess, the next one asked How is a Go interface implemented?.
I've been using Go for six months, but to be honest I never cared about or knew these Go internals.
I tried to learn these by reading the source code of Go, but can't really understand the quintessence.
So the question is, for a noob in Go, how do I learn the Go internals?
The most organized collection of internal resources links is probably this:
Golang Internals Resources
Other than that, answers to these questions aren't collected in one place; they are scattered across different blog posts.
Slice internals: The Go Blog: usage and internals
String internals: The Go Blog: Strings, bytes, runes and characters in Go
Constants internals: The Go Blog: Constants
Reflection insight: The Go Blog: The Laws of Reflection
Interface internals: RSC: Go Data Structures: Interfaces
Channel implementation: Overview on SO: How are Go channels implemented?
Channel internals: Go channel on steroids
Map implementation: Overview on SO: Golang map internal implementation - how does it search the map for a key?; also related: Go's maps under the hood
Map internals: Macro View of Map Internals In Go
Let me warn you that you may be missing the real point of the interviewers.
(Disclaimer: I do job interviews of Go programmers from time to time,
for a somewhat demanding project, so all of the below is my personal
world view. Still, it is shared by my cow-orkers ;-)).
Most of the time, it's rather worthless for an employee to know precisely how this or that bit of the runtime (or the compiler) is implemented—in part because this may change in any future release, and in part because there
do exist at least two up-to-date implementations of Go ("the gc suite", and
a part of GCC), and they all are free to implement a particular feature
in any way they wish.
What a (sensible) interviewer should really be interested in is
whether you understand "the why" of a particular core feature.
Say, when they ask you to explain how a channel is implemented,
what they should be interested to hear from you is that a channel
provides synchronization and may also provide buffering.
So you may tell them that a channel is like
a variable protected by a mutex—for the case of an unbuffered channel,—or
like a slice protected by a mutex.
And then add that an upshot of using a channel instead of a hand-crafted
solution involving a mutex is that operations on channels
can be easily combined using the select statement, while implementing
a matching functionality without channels is possible (and by that time
they would probably like to hear from you about sync.Cond)
but is really cumbersome and error-prone.
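A small sketch of the select point above: one statement waits on two channels and a timeout at once. The function name firstOf is invented for this example.

```go
package main

import (
	"fmt"
	"time"
)

// firstOf waits for whichever happens first: a value on a, a value on b,
// or the timeout expiring.
func firstOf(a, b <-chan string, timeout time.Duration) string {
	select {
	case v := <-a:
		return "a:" + v
	case v := <-b:
		return "b:" + v
	case <-time.After(timeout):
		return "timeout"
	}
}

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	b <- "hello"
	fmt.Println(firstOf(a, b, time.Second))         // b:hello
	fmt.Println(firstOf(a, b, 10*time.Millisecond)) // timeout
}
```

Doing the same with mutexes and sync.Cond would mean one condition variable shared by every event source plus hand-written timeout handling.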
It's not actually completely worthless to know nitty-gritty details
of channels—say to know that their implementation tries hard to lower
the price paid for the synchronization in the "happy case"
(there is no contention at the
moment a goroutine accesses a channel), and that it is also clever about
not jumping straight into the kernel to sleep on a lock in the
"unhappy case", but I see no point in knowing these details by heart.
The same applies to goroutines. You should maintain a clear picture of
the differences between an OS process and a thread running in it,
and what context belongs to a thread—that is, what is needed to be saved
and restored when switching between the threads.
And then the differences between an OS thread and a "green thread" which
a goroutine mostly is. It's okay to just know who schedules OS threads
and who schedules goroutines, and why the latter is faster.
And what are the benefits of having goroutines (the main one is the network
poller integrated into the scheduler (see this),
the second is dynamic stacks, the third is low context switching overhead
most of the time).
My recommendation is to read through the list presented by @icza.
And in addition to what you've asked about,
I'd present the following list of what a good candidate
should be familiar with—in the order from easiest to hardest (to grok):
The mechanics of slices and append. You should know arrays exist
and how they are different from slices.
How interfaces are implemented.
The dualistic nature of Go strings (given s contains a string,
what is the difference between iterating over its contents via
for i := 0; i < len(s); i++ { c := s[i] } and for i, c := range s {}).
Also: what kinds of data strings may contain—you should know that it's
perfectly okay to contain arbitrary binary data in them, UTF-8 is not
a requirement.
The differences between string and []byte.
How blocking I/O is implemented (for the network; that's about
the netpoller integrated into the runtime).
Knowing about the difference in handling non-network blocking I/O and
syscalls in general is a bonus.
How the scheduler is implemented (those Ps running Gs on Ms).
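The string item in the list above can be checked with a few lines of Go. The helper name below is made up; the sketch counts how many iterations each loop style performs on the same string:

```go
package main

import "fmt"

// byteAndRuneCounts reports how many iterations an index loop and a range
// loop perform over the same string.
func byteAndRuneCounts(s string) (nBytes, nRunes int) {
	for i := 0; i < len(s); i++ {
		_ = s[i] // s[i] is a single byte
		nBytes++
	}
	for range s { // range decodes UTF-8 and yields one rune per iteration
		nRunes++
	}
	return nBytes, nRunes
}

func main() {
	// "é" takes two bytes in UTF-8, so the two loops disagree.
	fmt.Println(byteAndRuneCounts("héllo")) // 6 5

	// Strings may hold arbitrary bytes. Each invalid byte is decoded by
	// range as U+FFFD, one byte at a time.
	fmt.Println(byteAndRuneCounts("\xff\xfe")) // 2 2
}
```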
This GitHub repo would be helpful for getting to know Go internals.
You can get all go-internals related resources here.
A collection of articles and videos to understand Golang internals.
A book about the internals of the Go programming language

Golang main difference from CSP-Language by Hoare

Look at this statement taken from The examples from Tony Hoare's seminal 1978 paper:
Go's design was strongly influenced by Hoare's paper. Although Go differs significantly from the example language used in the paper, the examples still translate rather easily. The biggest difference apart from syntax is that Go models the conduits of concurrent communication explicitly as channels, while the processes of Hoare's language send messages directly to each other, similar to Erlang. Hoare hints at this possibility in section 7.3, but with the limitation that "each port is connected to exactly one other port in another process", in which case it would be a mostly syntactic difference.
I'm confused.
Processes in Hoare's language communicate directly with each other. Goroutines also communicate directly with each other, but using channels.
So what impact does the limitation have in Go? What is the real difference?
The answer requires a fuller understanding of Hoare's work on CSP. The progression of his work can be summarised in three stages:
Based on Dijkstra's semaphores, Hoare developed monitors. These are as used in Java, except Java's implementation contains a mistake (see Welch's article Wot No Chickens). It's unfortunate that Java ignored Hoare's later work.
CSP grew out of this. Initially, CSP required direct exchange from process A to process B. This rendezvous approach is used by Ada and Erlang.
CSP was completed by 1985, when his Book was first published. This final version of CSP includes channels as used in Go. Along with Hoare's team at Oxford, David May concurrently developed Occam, a language deliberately intended to blend CSP into a practical programming language. CSP and Occam influenced each other (for example in The Laws of Occam Programming). For years, Occam was only available on the Transputer processor, which had its architecture tailored to suit CSP. More recently, Occam has developed to target other processors and has also absorbed Pi calculus, along with other general synchronisation primitives.
So, to answer the original question, it is probably helpful to compare Go with both CSP and Occam.
Channels: CSP, Go and Occam all have the same semantics for channels. In addition, Go makes it easy to add buffering into channels (Occam does not).
Choices: CSP defines both the internal and external choice. However, both Go and Occam have a single kind of selection: select in Go and ALT in Occam. The fact that there are two kinds of CSP choice proved to be less important in practical languages.
Occam's ALT allows condition guards, but Go's select does not (there is a workaround: channel aliases can be set to nil to imitate the same behaviour).
Mobility: Go allows channel ends to be sent (along with other data) via channels. This creates a dynamically-changing topology and goes beyond what is possible in CSP, but Milner's Pi calculus was developed (out of his CCS) to describe such networks.
Processes: A goroutine is a forked process; it terminates when it wants to and it doesn't have a parent. This is less like CSP / Occam, in which processes are compositional.
An example will help here: firstly Occam (n.b. indentation matters)
SEQ
  PAR
    processA()
    processB()
  processC()
and secondly Go
go processA()
go processB()
processC()
In the Occam case, processC doesn't start until both processA and processB have terminated. In Go, processA and processB fork very quickly, then processC runs straightaway.
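If the compositional PAR behaviour is wanted in Go, it has to be built by hand. One possible sketch uses sync.WaitGroup; the helper name par is invented here and is not a standard library function:

```go
package main

import (
	"fmt"
	"sync"
)

// par runs every proc concurrently and returns only when all of them have
// finished, roughly like Occam's PAR.
func par(procs ...func()) {
	var wg sync.WaitGroup
	for _, p := range procs {
		wg.Add(1)
		go func(p func()) {
			defer wg.Done()
			p()
		}(p)
	}
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	var order []string
	record := func(name string) func() {
		return func() {
			mu.Lock()
			order = append(order, name)
			mu.Unlock()
		}
	}

	par(record("A"), record("B")) // A and B run concurrently
	record("C")()                 // C starts only after both have finished

	fmt.Println(order[2]) // C
}
```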
Shared data: CSP is not really concerned with data directly. But it is interesting to note there is an important difference between Go and Occam concerning shared data. When multiple goroutines share a common set of data variables, race conditions are possible; Go's excellent race detector helps to eliminate problems. But Occam takes a different stance: shared mutable data is prevented at compilation time.
Aliases: related to the above, Go allows many pointers to refer to each data item. Such aliases are disallowed in Occam, so reducing the effort needed to detect race conditions.
The latter two points are less about Hoare's CSP and more about May's Occam. But they are relevant because they directly concern safe concurrent coding.
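The nil-channel workaround mentioned under Choices can be sketched as follows. A receive from a nil channel blocks forever, so a select case whose channel is nil is never chosen, which acts like a false guard. The helper names are hypothetical:

```go
package main

import "fmt"

// guard returns ch when enabled is true and nil otherwise. A nil channel is
// never ready, so select will not pick a case that uses it.
func guard(enabled bool, ch <-chan int) <-chan int {
	if enabled {
		return ch
	}
	return nil
}

// recvGuarded receives from a or b, but only from the channels whose guard
// is true. If both guards are false it blocks forever.
func recvGuarded(aOK, bOK bool, a, b <-chan int) (string, int) {
	select {
	case v := <-guard(aOK, a):
		return "a", v
	case v := <-guard(bOK, b):
		return "b", v
	}
}

func main() {
	a := make(chan int, 1)
	b := make(chan int, 1)
	a <- 1
	b <- 2
	fmt.Println(recvGuarded(false, true, a, b)) // b 2
	fmt.Println(recvGuarded(true, false, a, b)) // a 1
}
```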
That's exactly the point: in the example language used in Hoare's initial paper (and also in Erlang), process A talks directly to process B, while in Go, goroutine A talks to channel C and goroutine B listens to channel C. I.e. in Go the channels are explicit while in Hoare's language and Erlang, they are implicit.
See this article for more info.
Recently, I've been working quite intensively with Go's channels, and have been working with concurrency and parallelism for many years, although I could never profess to know everything about this.
I think what you're asking is what's the subtle difference between sending a message to a channel and sending directly to each other? If I understand you, the quick answer is simple.
Sending to a channel gives the opportunity for parallelism / concurrency on both sides of the channel. Beautiful, and scalable.
We live in a concurrent world. Sending a long continuous stream of messages from A to B (asynchronously) means that B will need to process the messages at pretty much the same pace as A sends them, unless more than one instance of B has the opportunity to process a message taken from the channel, hence sharing the workload.
The good thing about channels is that you can have a number of producer/receiver goroutines which are able to push messages to the queue, or consume from the queue and process them accordingly.
If you think linearly, like a single-core CPU, concurrency is basically like having a million jobs to do. A single-core CPU can only do one thing at a time, yet it gives the illusion that lots of things are happening at the same time. While some code waits for something to come back from the network, disk, keyboard, mouse, etc., or while a process sleeps for a while, the OS has the opportunity to do something else in the meantime. This all happens extremely quickly, creating the illusion of parallelism.
Parallelism, on the other hand, is different in that the job can run on a completely different CPU, independent of what's going on with the other CPUs, and therefore doesn't run under the same constraints as the other CPU (although most OSes do a pretty good job of ensuring workloads are evenly distributed across all of their CPUs, with perhaps the exception of CPU-hungry, uncooperative code that does not yield to the OS, but even then the OS tames it).
The point is, having multi-core CPUs means more parallelism and more concurrency can occur.
Imagine a single queue at a bank which fans-out to a number of tellers who can help you. If no customers are being served by any teller, one teller elects to handle the next customer and becomes busy, until they all become busy. Whenever a customer walks away from a teller, that teller is able to handle the next customer in the queue.
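The bank analogy maps onto a small worker pool. In this sketch (function and variable names are illustrative only) several "teller" goroutines receive from one shared channel, and whichever teller is free takes the next customer:

```go
package main

import (
	"fmt"
	"sync"
)

// serve sends every customer down one shared queue, lets the given number of
// teller goroutines pick them up, and returns the sum of the results.
// It assumes tellers is at least 1.
func serve(customers []int, tellers int) int {
	queue := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for t := 0; t < tellers; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range queue { // a free teller takes the next customer
				results <- c * 2 // stand-in for real work
			}
		}()
	}

	go func() {
		for _, c := range customers {
			queue <- c
		}
		close(queue)   // no more customers
		wg.Wait()      // wait until every teller has gone home
		close(results) // then let the collector loop end
	}()

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(serve([]int{1, 2, 3, 4, 5}, 3)) // 30
}
```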

Order of Goroutine Unblocking on Single Channel

Does order in which the Goroutines block on a channel determine the order they will unblock? I'm not concerned with the order of the messages that are sent (they're guaranteed to be ordered), but the order of the Goroutines that'll unblock.
Imagine an empty channel ch shared between multiple Goroutines (1, 2, and 3), with each Goroutine trying to receive a message on ch. Since ch is empty, each Goroutine will block. When I send a message to ch, will Goroutine 1 unblock first? Or could 2 or 3 possibly receive the first message? (Or vice-versa, with the Goroutines trying to send)
I have a playground that seems to suggest that the order in which Goroutines block is the order in which they are unblocked, but I'm not sure if this is an undefined behavior because of the implementation.
This is a good question - it touches on some important issues when doing concurrent design. As has already been stated, the answer to your specific question is, according to the current implementation, FIFO based. It's unlikely ever to be different, except perhaps if the implementers decided, say, a LIFO was better for some reason.
There is no guarantee, though. So you should avoid creating code that relies on a particular implementation.
The broader question concerns non-determinism, fairness and starvation.
Perhaps surprisingly, non-determinism in a CSP-based system does not come from things happening in parallel. Concurrency makes it possible, but concurrency is not what causes it. Instead, non-determinism arises when a choice is made. In the formal algebra of CSP, this is modelled mathematically. Fortunately, you don't need to know the maths to be able to use Go. But formally, two goroutines could execute in parallel and the outcome could still be deterministic, provided all the choices are eliminated.
Go allows choices that introduce non-determinism explicitly via select and implicitly via ends of channels being shared between goroutines. If you have point-to-point (one reader, one writer) channels, the second kind does not arise. So if it's important in a particular situation, you have a design choice you can make.
Fairness and starvation are typically opposite sides of the same coin. Starvation is one of those dynamic problems (along with deadlock, livelock and race conditions) that result perhaps in poor performance, more likely in wrong behaviour. These dynamic problems are un-testable (more on this) and need some level of analysis to solve. Clearly, if part of a system is unresponsive because it is starved of access to certain resources, then there is a need for greater fairness in governing those resources.
Shared access to channel ends may well provide a degree of fairness because of the current FIFO behaviour and this may appear sufficient. But if you want it guaranteed (regardless of implementation uncertainties), it is possible instead to use a select and a bundle of point-to-point channels in an array. Fair indexing is easy to achieve by always preferring them in an order that puts the last-selected at the bottom of the pile. This solution can guarantee fairness, but probably with a small performance penalty.
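One way to sketch the fair-indexing idea is shown below. Note the substitution: rather than a single blocking select, it polls the channels with non-blocking receives in a rotating order, so the channel served last is checked last on the next round. The function name fairPoll is invented, and a blocking variant would need more machinery (for example reflect.Select over the rotated slice).

```go
package main

import "fmt"

// fairPoll checks the channels in rotating order, starting just after the
// one served last, with a non-blocking receive on each. It returns the
// value, the index that was served, and false if no channel was ready.
func fairPoll(chs []chan int, last int) (val, idx int, ok bool) {
	n := len(chs)
	for i := 1; i <= n; i++ {
		idx = (last + i) % n
		select {
		case val = <-chs[idx]:
			return val, idx, true
		default:
			// not ready; try the next channel in the rotation
		}
	}
	return 0, last, false
}

func main() {
	chs := []chan int{make(chan int, 4), make(chan int, 4), make(chan int, 4)}
	for i := 0; i < 4; i++ {
		chs[0] <- 0 // channel 0 is much busier than the others
	}
	chs[1] <- 1
	chs[2] <- 2

	last := len(chs) - 1 // so that the first poll starts at index 0
	for {
		_, idx, ok := fairPoll(chs, last)
		if !ok {
			break
		}
		fmt.Print(idx, " ") // serves indices 0 1 2 0 0 0
		last = idx
	}
	fmt.Println()
}
```

The busy channel 0 cannot starve channels 1 and 2: each of them is served before channel 0 gets a second turn.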
(aside: see "Wot No Chickens" for a somewhat-amusing discovery made by researchers in Canterbury, UK concerning a fairness flaw in the Java Virtual Machine - which has never been rectified!)
I believe it's unspecified because the memory model document only says "A send on a channel happens before the corresponding receive from that channel completes." The spec sections on send statements and the receive operator don't say anything about what unblocks first. Right now the gc toolchain uses an orderly FIFO queue to control which goroutine unblocks, but I don't see any promises in the spec that it must always be so.
(Just for general background note that Playground code runs with GOMAXPROCS=1, i.e., on one core, so some types of concurrency-related unpredictability just won't come up.)
The order is not specified, but current implementations use a FIFO queue for waiting goroutines.
The authoritative document is the Go Memory Model. The memory model does not define a happens-before relationship for two goroutines sending to the same channel, therefore the order is not specified. Ditto for receive.

MPI Alltoallv or better individual Send and Recv? (Performance)

I have a number of processes (of the order of 100 to 1000) and each of them has to send some data to some (say about 10) of the other processes. (Typically, but not necessarily always, if A sends to B, B also sends to A.) Every process knows how much data it has to receive from which process.
So I could just use MPI_Alltoallv, with many or most of the message lengths zero.
However, I heard that for performance reasons it would be better to use several MPI_Send and MPI_Recv communications rather than the global MPI_Alltoallv.
What I do not understand: if a series of send and receive calls are more efficient than one Alltoallv call, why is Alltoallv not just implemented as a series of sends and receives?
It would be much more convenient for me (and others?) to use just one global call. Also I might have to be concerned about not running into a deadlock situation with several Send and Recv (fixable by some odd-even strategy or more complex? or by using buffered send/recv?).
Would you agree that MPI_Alltoallv is necessarily slower than, say, 10 MPI_Send and MPI_Recv calls; and if yes, why and by how much?
Usually the default advice with collectives is the opposite: use a collective operation when possible instead of coding your own. The more information the MPI library has about the communication pattern, the more opportunities it has to optimize internally.
Unless special hardware support is available, collective calls are in fact implemented internally in terms of sends and receives. But the actual communication pattern will probably not be just a series of sends and receives. For example, using a tree to broadcast a piece of data can be faster than having the same rank send it to a bunch of receivers. A lot of work goes into optimizing collective communications, and it is difficult to do better.
Having said that, MPI_Alltoallv is somewhat different. It can be difficult to optimize for all irregular communication scenarios at the MPI level, so it is conceivable that some custom communication code can do better. For example, an implementation of MPI_Alltoallv might be synchronizing: it could require that all processes "check in", even if they have to send a 0-length message. I thought that such an implementation is unlikely, but here is one in the wild.
So the real answer is "it depends". If the library implementation of MPI_Alltoallv is a bad match for the task, custom communication code will win. But before going down that path, check if the MPI-3 neighbor collectives are a good fit for your problem.

Is it possible to create thread-safe collections without locks?

This question is purely out of interest; any sort of answer is welcome.
So is it possible to create thread-safe collections without any locks? By locks I mean any thread synchronization mechanisms, including Mutex, Semaphore, and even Interlocked, all of them. Is it possible at user level, without calling system functions? OK, maybe the implementation is not efficient; I am interested in the theoretical possibility. If not, what is the minimum means to do it?
EDIT: Why immutable collections don't work.
Think of a class Stack with a method Add that returns another Stack.
Now here is program:
Stack stack = new ...;
ThreadedMethod()
{
    loop
    {
        // Do the loop
        stack = stack.Add(element);
    }
}
The expression stack = stack.Add(element) is not atomic, so one thread can overwrite a new stack that another thread has just stored.
Thanks,
Andrey
There seem to be misconceptions by even guru software developers about what constitutes a lock.
One has to make a distinction between atomic operations and locks. Atomic operations like compare-and-swap perform an operation (which would otherwise require two or more instructions) as a single uninterruptible instruction. Locks are built from atomic operations; however, they can result in threads busy-waiting or sleeping until the lock is unlocked.
In most cases, if you manage to implement a parallel algorithm with atomic operations without resorting to locking, you will often find that it is much faster, sometimes by orders of magnitude. This is why there is so much interest in wait-free and lock-free algorithms.
There has been a ton of research done on implementing various wait-free data structures. While the code tends to be short, it can be notoriously hard to prove that it really works, due to the subtle race conditions that arise. Debugging is also a nightmare. However, a lot of work has been done and you can find wait-free/lock-free hashmaps, queues (Michael and Scott's lock-free queue), stacks, lists, trees, the list goes on. If you're lucky you'll also find some open-source implementations.
Just google 'lock-free my-data-structure' and see what you get.
For further reading on this interesting subject, start with The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit.
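To make the atomic-operation-versus-lock distinction concrete, here is a minimal sketch of a Treiber-style lock-free stack, written in Go for brevity even though the question is not Go-specific. It assumes Go 1.19 or later for atomic.Pointer, and the type names are illustrative. The retry loop around compare-and-swap is also the usual fix for the non-atomic stack = stack.Add(element) assignment from the question:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type node struct {
	val  int
	next *node
}

// Stack is a Treiber-style lock-free stack: the only shared mutable state is
// the head pointer, and it is changed only by compare-and-swap.
type Stack struct {
	head atomic.Pointer[node]
}

// Push retries until it swaps the head it observed for the new node.
func (s *Stack) Push(v int) {
	n := &node{val: v}
	for {
		old := s.head.Load()
		n.next = old
		if s.head.CompareAndSwap(old, n) {
			return
		}
		// another goroutine changed head in the meantime; try again
	}
}

// Pop retries until it swaps the head it observed for that node's successor.
func (s *Stack) Pop() (int, bool) {
	for {
		old := s.head.Load()
		if old == nil {
			return 0, false
		}
		if s.head.CompareAndSwap(old, old.next) {
			return old.val, true
		}
	}
}

func main() {
	var s Stack
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				s.Push(1)
			}
		}()
	}
	wg.Wait()

	count := 0
	for {
		if _, ok := s.Pop(); !ok {
			break
		}
		count++
	}
	fmt.Println(count) // 4000
}
```

No goroutine ever waits for another to release anything; a failed compare-and-swap only means some other goroutine made progress. In a garbage-collected language a node is not reused while it is still referenced, which sidesteps the ABA problem that the same algorithm faces with manual memory management.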
Yes, immutable collections! :)
Yes, it is possible to do concurrency without any support from the system. You can use Peterson's algorithm or the more general bakery algorithm to emulate a lock.
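A sketch of Peterson's algorithm for two parties follows, again in Go. One caveat is an assumption on my part rather than something stated above: on modern compilers and CPUs plain variables are not enough, because reads and writes may be reordered, so this version uses atomic loads and stores (but no compare-and-swap and no OS mutex). It assumes Go 1.19 or later, and the names are illustrative.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// peterson is a two-party mutual-exclusion lock built only from atomic
// loads and stores. Party ids are 0 and 1.
type peterson struct {
	want [2]atomic.Bool
	turn atomic.Int32
}

func (p *peterson) lock(me int) {
	other := 1 - me
	p.want[me].Store(true)
	p.turn.Store(int32(other)) // politely give the other party priority
	for p.want[other].Load() && p.turn.Load() == int32(other) {
		runtime.Gosched() // spin, but let other goroutines run
	}
}

func (p *peterson) unlock(me int) {
	p.want[me].Store(false)
}

// countWithPeterson lets two goroutines increment a plain int n times each,
// using the Peterson lock as the only protection.
func countWithPeterson(n int) int {
	var p peterson
	var wg sync.WaitGroup
	counter := 0
	for id := 0; id < 2; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				p.lock(id)
				counter++
				p.unlock(id)
			}
		}(id)
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countWithPeterson(10000)) // 20000
}
```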
It really depends on how you define the term (as other commenters have discussed) but yes, it's possible for many data structures, at least, to be implemented in a non-blocking way (without the use of traditional mutual-exclusion locks).
I strongly recommend, if you're interested in the topic, that you read the blog of Cliff Click -- Cliff is the head guru at Azul Systems, who produce hardware + a custom JVM to run Java systems on massive and massively parallel machines (think up to around 1000 cores and hundreds of gigabytes of RAM), and obviously in those kinds of systems locking can be death (disclaimer: not an employee or customer of Azul, just an admirer of their work).
Dr Click has famously come up with a non-blocking HashTable, which is basically a complex (but quite brilliant) state machine using atomic CompareAndSwap operations.
There is a two-part blog post describing the algorithm (part one, part two) as well as a talk given at Google (slides, video) -- the latter in particular is a fantastic introduction. Took me a few goes to 'get' it -- it's complex, let's face it! -- but if you persevere (or if you're smarter than me in the first place!) you'll find it very rewarding.
I don't think so.
The problem is that at some point you will need some mutual exclusion primitive (perhaps at the machine level) such as an atomic test-and-set operation. Otherwise, you could always devise a race condition. Once you have a test-and-set, you essentially have a lock.
That being said, on older hardware that did not have any support for this in the instruction set, you could disable interrupts and thus prevent another "process" from taking over, effectively putting the system into a serialized mode and forcing a sort of mutual exclusion for a while.
At the very least you need atomic operations. There are lock-free algorithms for single CPUs. I'm not sure about multiple CPUs.

Resources