Go goroutine under the hood

I'm trying to understand Go's architecture and what "lightweight thread" means. I've already read a bit, but want to ask a question to clarify it.
Am I right in saying that the go keyword, under the hood, just puts the function that follows it into the queue of an internal thread pool, while to the user it looks like creating a thread?

This is copied from the Go FAQ:
Why goroutines instead of threads?
Goroutines are part of making concurrency easy to use. The idea, which has been around for a while, is to multiplex independently executing functions—coroutines—onto a set of threads. When a coroutine blocks, such as by calling a blocking system call, the run-time automatically moves other coroutines on the same operating system thread to a different, runnable thread so they won't be blocked. The programmer sees none of this, which is the point. The result, which we call goroutines, can be very cheap: they have little overhead beyond the memory for the stack, which is just a few kilobytes.
What's lacking here is the definition of thread. If we resort to Wikipedia, we find:
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, ...
but that's just a description of, well, the same thing that a goroutine is. The problem here is that the word thread tends to refer to kernel thread and/or user thread (both defined on that same Wikipedia page) and these threads are heavier-weight than the goroutine threads. Which brings us right back to this:
I'm trying to understand golang architecture and what "lightweight thread" means ...
To cut to the chase, this means "lighter than the OS-provided ones". That's really all it means. There are OS-provided threads (on multiple OSes on which Go runs), but they generally do too much and cost too much to switch between so Go provides its own language-level ones that it calls "goroutines" that are much lighter.
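To make that concrete, here is a minimal sketch (my illustration, not from the FAQ) of what the go statement does from the user's perspective: it hands the function to the runtime scheduler and returns immediately, which is close to the "queue in an internal pool" intuition from the question:

package main

import (
    "fmt"
    "time"
)

func work(name string) {
    fmt.Println("running:", name)
}

func main() {
    // Each go statement hands work() to the runtime scheduler and
    // returns immediately; no dedicated OS thread is created per call.
    go work("first")
    go work("second")

    // Crude wait so main doesn't exit before the goroutines run;
    // a real program would use sync.WaitGroup or channels.
    time.Sleep(100 * time.Millisecond)
}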
From comments:
Why [do we] need to move tasks from one thread to another by some planner ...
This is an implementation detail, which involves another aspect of the OS-provided kernel threads:
I can't understand how [a goroutine] can be preempted if single thread process [is] blocked by [a] system call to read [a] long file
The current Go runtime goroutine / thread / processor scheduler (see What is relationship between goroutine and thread in kernel and user state and note that there have been more than just the current implementation) predicts that some system call will block, and makes sure to assign that system call its own OS-level kernel thread (see also JimB's comment). These threads do not count against the GOMAXPROCS setting. This is in fact sometimes a problem, as it's possible for the Go runtime to try to spin off more threads than the OS allows: it might be nice if there were a system-call-thread-pool here (though there are also obvious problems with this).
So, the current runtime creates up to GOMAXPROCS kernel-style OS-level threads and uses those to multiplex up to that many goroutines onto the CPUs, but creates extra kernel-style OS-level threads whenever it wants to. As the blog post linked in the question above notes, the P entities act as queues to hold goroutines (Gs) on a per-processor basis for localized cache lookup (remember that on some systems, especially NUMA ones, it's expensive to reach out "across" CPUs: the scheduler is still willing to do this, but won't do it too often, for some definition of "too often").
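For reference, runtime.GOMAXPROCS doubles as a getter: passing any value below 1 reports the current setting without changing it, so a quick check looks like this:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // A value < 1 queries the current setting without changing it.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    fmt.Println("NumCPU:", runtime.NumCPU())
}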
Earlier versions of the current scheduler required explicit yields (runtime.Gosched() calls) or various other runtime operations to cause a switch from the current goroutine to some other goroutine. See What exactly does runtime.Gosched do? for example. Since Go 1.14, the runtime provides automatic (asynchronous) goroutine preemption on most OSes; see Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?
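To illustrate (a sketch of mine, not from the answer): on pre-1.14 runtimes, a tight CPU-bound loop with no function calls could hog its thread, and an explicit yield was the usual workaround:

// Pre-Go-1.14 workaround for a tight CPU-bound loop: yield now and
// then so other goroutines on this thread can run. Since Go 1.14 the
// runtime can preempt such loops asynchronously on most OSes.
for i := 0; i < 1e9; i++ {
    // ... CPU-bound work ...
    if i%1e6 == 0 {
        runtime.Gosched() // explicit yield point
    }
}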

Related

Why are goroutines much cheaper than threads in other languages?

In his talk Concurrency is not parallelism (https://blog.golang.org/concurrency-is-not-parallelism), Rob Pike says that goroutines are similar to threads but much, much cheaper. Can someone explain why?
See "How goroutines work".
They are cheaper in:

Memory consumption: an OS thread starts with a large stack (around 1 MB by default on many systems), as opposed to a few KB for a goroutine.

Setup and teardown costs: creating and destroying OS threads is expensive, which is why you have to maintain a pool of threads.

Switching costs: threads are scheduled preemptively, and during a thread switch the scheduler needs to save/restore ALL registers. In Go, by contrast, the runtime manages goroutines throughout, from creation to scheduling to teardown, and the number of registers to save is lower.
Plus, as mentioned in "Go’s march to low-latency GC", a GC is easier to implement when the runtime is in charge of managing goroutines:
Since the introduction of its concurrent GC in Go 1.5, the runtime has kept track of whether a goroutine has executed since its stack was last scanned. The mark termination phase would check each goroutine to see whether it had recently run, and would rescan the few that had.
In Go 1.7, the runtime maintains a separate short list of such goroutines. This removes the need to look through the entire list of goroutines while user code is paused, and greatly reduces the number of memory accesses that can trigger the kernel’s NUMA-related memory migration code.
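The difference is easy to feel in practice. Here is a small demo (my sketch, not from the linked posts) that starts 100,000 goroutines, which would be hopeless with one OS thread each:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    // 100,000 OS threads would need ~100 GB of stack at 1 MB each;
    // goroutines start with only a few KB, so this is routine.
    for i := 0; i < 100000; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            _ = n * n // trivial work
        }(i)
    }
    wg.Wait()
    fmt.Println("all goroutines finished")
}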

golang: how to handle blocking tasks optimally?

As is commonly said, a goroutine is a synchronous but non-blocking processing unit.
The Go scheduler handles non-blocking tasks very well, e.g. sockets, timers, signals, or other events from character devices.
But what about block-device IO or CPU-intensive tasks? They can't be interrupted until they finish, and they aren't multiplexed. The OS thread running such a goroutine freezes until the goroutine returns or yields. In that case, the scheduling granularity becomes poor.
Of course, you could split the tasks into smaller sub-tasks in your code: for example, instead of copying a 1 GB file in one go, copy the first 10 MB, yield, copy another 10 MB, and so on, so that other goroutines on the same OS thread get a chance to run. Another example, for a CPU-bound task: zip a file part by part and merge the parts at the end.
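A hedged sketch of that chunked-copy idea (my illustration; imports io and runtime assumed):

func copyInChunks(dst io.Writer, src io.Reader) error {
    buf := make([]byte, 10<<20) // 10 MB chunks, as in the example above
    for {
        n, err := src.Read(buf)
        if n > 0 {
            if _, werr := dst.Write(buf[:n]); werr != nil {
                return werr
            }
        }
        if err == io.EOF {
            return nil // done
        }
        if err != nil {
            return err
        }
        runtime.Gosched() // manual scheduling point between chunks
    }
}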
But that breaks the convenience of sequential programming, and such manual scheduling is hard to balance evenly, compared to the OS's scheduling of OS threads.
Nginx has a similar issue: it is a multi-worker-process program, one process per CPU core, similar to the GOMAXPROCS best practice. It brought in a thread pool to handle blocking tasks. Maybe that would be good for Go too.
I am curious why Go has no OS threading API, which would be a good supplement to goroutines for blocking tasks.
Go has specifically chosen to not directly expose OS threads to the user, and instead chose an M:N threading model. Your unit of execution in Go is the goroutine, which will be multiplexed on N number of OS threads.
In the rare case you have a CPU-intensive calculation that contains no preemption points and insufficient OS threads to continue running other goroutines, you have two choices: increase GOMAXPROCS, or insert runtime.Gosched() calls to yield to other goroutines.
In the case of blocking syscalls, the Go scheduler will automatically dispatch a new OS thread (the time limit to consider a syscall "blocking" has been 20us), and since non-network IO is a series of blocking syscalls, it will almost always be assigned to a dedicated OS thread. Since Go already uses an M:N threading model, the user is usually unaware of the underlying scheduler choices, and can write the program the same as if the runtime used asynchronous IO.
There is an open issue to consider using asynchronous file IO, but there are many issues to overcome, like shortcomings in the Linux aio api, cross-platform compatibility, and interactions with all the various filesystems and devices with which you can do IO.
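In other words, file IO is simply written sequentially. A sketch (mine; imports os, log, and sync assumed, and process is a hypothetical handler) where each blocking read is parked on its own OS thread by the runtime:

func readFiles(paths []string) {
    var wg sync.WaitGroup
    for _, p := range paths {
        wg.Add(1)
        go func(path string) {
            defer wg.Done()
            // os.ReadFile issues blocking syscalls; the runtime hands
            // this goroutine a dedicated OS thread while it blocks.
            data, err := os.ReadFile(path)
            if err != nil {
                log.Println(err)
                return
            }
            process(data) // hypothetical handler
        }(p)
    }
    wg.Wait()
}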

Goroutines 8 KB and Windows OS thread 1 MB

As a Windows user, I know that OS threads consume ~1 MB of memory, because by default Windows allocates 1 MB for each thread's user-mode stack. How does Go manage with ~8 KB of memory per goroutine, if an OS thread is so much more gluttonous? Are goroutines a sort of virtual thread?
Goroutines are not threads, they are (from the spec):
...an independent concurrent thread of control, or goroutine, within the same address space.
Effective Go defines them as:
They're called goroutines because the existing terms—threads, coroutines, processes, and so on—convey inaccurate connotations. A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
Goroutines don't have their own threads. Instead, multiple goroutines are (or may be) multiplexed onto the same OS threads, so if one blocks (e.g. waiting for I/O or on a blocking channel operation), the others continue to run.
The actual number of threads executing goroutines simultaneously can be set with the runtime.GOMAXPROCS() function. Quoting from the runtime package documentation:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
Note that before Go 1.5, GOMAXPROCS defaulted to 1, so by default only one thread executed goroutines at a time; since Go 1.5 it defaults to the number of CPU cores available.
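A small sketch of that multiplexing (my example): even restricted to a single executing thread, a goroutine blocked on a channel doesn't stop the others:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1) // a single thread executes goroutines

    ch := make(chan string)
    done := make(chan struct{})

    go func() {
        // This receive blocks the goroutine, not the thread: the
        // scheduler switches to the other runnable goroutine below.
        fmt.Println(<-ch)
        close(done)
    }()
    go func() {
        ch <- "hello from a second goroutine"
    }()

    <-done // wait for the first goroutine to finish
}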
1 MiB is the default, as you correctly noted. You can pick your own stack size easily (however, the minimum is still a lot higher than ~8 kiB).
That said, goroutines aren't threads. They're just tasks with coöperative multi-tasking, similar to Python's. The goroutine itself is just the code and data required to do what you want; there's also a separate scheduler (which runs on one or more OS threads), which actually executes that code.
In Go-flavored pseudo-code, the scheduler's core loop is:

for {
    job := <-queue // take a job from the queue
    job()          // execute it
}
Of course, the execute job part can be very simple, or very complicated. The simplest thing you can do is just execute a given delegate (if your language supports something like that). In effect, this is simply a method call. In more complicated scenarios, there can be also stuff like restoring some kind of context, handling continuations and coöperative task yields, for example.
This is a very light-weight approach, and very useful when doing asynchronous programming (which is almost everything nowadays :)). Many languages now support something similar - Python is the first one I've seen with this ("tasklets", in Stackless Python), long before Go. Of course, in an environment without pre-emptive multi-threading, this was pretty much the default.
In C#, for example, there's Tasks. They're not entirely the same as goroutines, but in practice, they come pretty close - the main difference being that Tasks use threads from the thread pool (usually), rather than separate dedicated "scheduler" threads. This means that if you start 1000 tasks, it is possible for them to be run by 1000 separate threads; in practice, it would require you to write very bad Task code (e.g. using only blocking I/O, sleeping threads, waiting on wait handles etc.). If you use Tasks for asynchronous non-blocking I/O and CPU work, they come pretty close to goroutines - in actual practice. The theory is a bit different :)
EDIT:
To clear up some confusion, here is what a typical C# asynchronous method might look like:
async Task<string> GetData()
{
    // GetStringAsync returns the response body as a string
    // (assumes an HttpClient instance field named httpClient).
    var html = await httpClient.GetStringAsync("http://www.google.com");
    var parsedStructure = Parse(html);
    var dbData = await DataLayer.GetSomeStuffAsync(parsedStructure.ElementId);
    return dbData.First().Description;
}
From point of view of the GetData method, the entire processing is synchronous - it's just as if you didn't use the asynchronous methods at all. The crucial difference is that you're not using up threads while you're doing the "waiting"; but ignoring that, it's almost exactly the same as writing synchronous blocking code. This also applies to any issues with shared state, of course - there isn't much of a difference between multi-threading issues in await and in blocking multi-threaded I/O. It's easier to avoid with Tasks, but just because of the tools you have, not because of any "magic" that Tasks do.
The main difference from goroutines in this aspect is that Go doesn't really have blocking methods in the usual sense of the word. Instead of blocking, they queue their particular asynchronous request, and yield. When the OS (and any other layers in Go - I don't have deep knowledge about the inner workings) receives the response, it posts it to the goroutine scheduler, which in turn knows that the goroutine that "waits" for the response is now ready to resume execution; when it actually gets a slot, it will continue on from the "blocking" call as if it had really been blocking - but in effect, it's very similar to what C#'s await does. There's no fundamental difference - there's quite a few differences between C#'s approach and Go's, but they're not all that huge.
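For comparison, here is a rough Go counterpart of the C# sketch above (my sketch; imports net/http and io assumed, and Parse, DataLayer and friends stay hypothetical, as in the C# version). It is written entirely as straight-line code, because "blocking" calls only park the goroutine:

func getData() (string, error) {
    // http.Get looks blocking, but only this goroutine is parked;
    // the thread underneath keeps running other goroutines.
    resp, err := http.Get("http://www.google.com")
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    html, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }

    parsed := Parse(html)                                   // hypothetical
    dbData, err := DataLayer.GetSomeStuff(parsed.ElementID) // hypothetical
    if err != nil {
        return "", err
    }
    return dbData[0].Description, nil
}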
And also note that this is fundamentally the same approach used on old Windows systems without pre-emptive multi-tasking - any "blocking" method would simply yield the thread's execution back to the scheduler. Of course, on those systems, you only had a single CPU core, so you couldn't execute multiple threads at once, but the principle is still the same.
Goroutines are what we call "green threads". They are not OS threads; the Go scheduler is responsible for them. This is why they can have a much smaller memory footprint.

Is it possible to force a goroutine to run on a specific CPU?

I am reading about the Go package runtime and see that, among other things, I can set the number of CPUs used to run my program with func GOMAXPROCS(n int). Can I force a goroutine to run on a specific CPU of my choice?
In modern Go, I wouldn't lock goroutines to threads for efficiency. Go 1.5 added goroutine scheduling affinity, to minimize how often goroutines switch between OS threads. And any cost of the remaining migrations between CPUs has to be weighed against the benefit of the user-mode scheduler avoiding context switches into kernel mode. Finally, when switching costs are a real problem, sometimes a better focus is changing your program logic so it needs to switch less, like by communicating batches of work instead of individual work items.
But even considering all that, sometimes you simply have to lock a goroutine, like when a C API requires it, and I'll assume that's the case below.
If the whole program runs with GOMAXPROCS=1, then it's relatively simple to set a CPU affinity by calling out to the taskset utility from the schedutils package.
I had thought you were out of luck if GOMAXPROCS > 1 because then goroutines are migrated between OS threads at runtime. In fact, James Henstridge points out you can use runtime.LockOSThread() to keep your goroutine from migrating.
That doesn't solve locking the OS thread to a CPU. @yerden points out in a comment that the SchedSetaffinity function in the golang.org/x/sys/unix package, called with 0 as the pid, ought to lock the calling thread to its current CPU.
In the "C API requires locking" use case, it might also work to call pthread_setaffinity_np from C code.
I haven't tested either of those ways to lock threads to CPUs, and details will vary by OS there.
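For completeness, here is an untested Linux-only sketch of mine combining the two steps (LockOSThread plus SchedSetaffinity; as noted, details vary by OS):

import (
    "runtime"

    "golang.org/x/sys/unix"
)

// pinToCPU locks the calling goroutine to its OS thread, then
// restricts that thread (pid 0 = calling thread) to a single CPU.
func pinToCPU(cpu int) error {
    runtime.LockOSThread() // keep this goroutine on its current thread

    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    return unix.SchedSetaffinity(0, &set)
}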
It depends on your workload, but sometimes it's beneficial to start one Go process per CPU, set GOMAXPROCS to 1, and pin each process to a CPU with taskset. Here is an excerpt on that topic from the awesome fasthttp library:
Use reuseport listener.
Run a separate server instance per CPU core with GOMAXPROCS=1.
Pin each server instance to a separate CPU core using taskset.
Ensure the interrupts of a multiqueue network card are evenly distributed between CPU cores. See this article for details.
Use Go 1.6, as it provides some considerable performance improvements.
Source: https://github.com/valyala/fasthttp#performance-optimization-tips-for-multi-core-systems
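A minimal sketch of that per-core pattern with the standard library (my illustration; the reuseport wiring is omitted, so as written only one instance can bind the port, and the taskset invocations are shown as comments):

package main

import (
    "net/http"
    "runtime"
)

func main() {
    // Launch one instance per core, pinned from the shell, e.g.:
    //   taskset -c 0 ./server &
    //   taskset -c 1 ./server &
    runtime.GOMAXPROCS(1) // each pinned instance runs Go code on one thread

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    http.ListenAndServe(":8080", nil)
}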

WIN32: Yielding execution to another (given) thread

I am looking for a way to yield the remainder of the thread execution's scheduled time slice to a different thread. There is a SwitchToThread function in WINAPI, but it doesn't let the caller specify the thread it wants to switch to. I browsed MSDN for quite some time and haven't found anything that would offer just that.
For an operating-system-internals layman like me, it seems that the yielding thread should be able to specify which thread it wants to pass execution to. Is that possible, or is it just my imagination?
The reason you can't yield processor time-slices to a designated thread is that Windows features a preemptive scheduling kernel which pretty much places the responsibility and authority of scheduling the processor time in the hands of the kernel and only the kernel.
As such, threads don't have any control over when they run, whether they run, and even less over which thread is switched to after their time slice is up.
However, there are a few ways you may influence context switches:
By increasing the priority of a certain thread you may force the scheduler to schedule it more often, to the detriment of other threads (obviously the reverse applies as well - you can lower the priority of other threads).
You can code your process to place threads in kernel wait mode when they don't have work to do, in order to help the scheduler do its job. When using proper kernel wait constructs such as Critical Sections, Mutexes, Semaphores, and Timers, you effectively tell the kernel that a certain thread doesn't need to be scheduled until a certain condition is met.
Note: there is rarely a reason you should tamper with thread priorities, so USE WITH CAUTION.
You might use 'fibers' instead of 'threads': for example there's a Win32 API named SwitchToFiber which lets you specify the fiber to be scheduled.
Take a look at UMS (User-mode scheduling) threads in Windows 7
http://msdn.microsoft.com/en-us/library/dd627187(VS.85).aspx
The second thread can simply wait for the yielding thread either by calling WaitForSingleObject() on its handle or periodically polling GetExitCodeThread(). The other answers are correct about altering the operating system's scheduling mechanisms - it is better to design the threads properly in the first place.
This is not possible. Only the kernel can decide what code runs next, though you can influence it by reducing the number of non-waiting threads it has to choose from, and by setting thread priorities with SetThreadPriority.
You can use regular synchronization primitives like events, semaphores, etc. to serialize your two threads. This does not in any way prevent the kernel from scheduling other threads in between, or in parallel on another CPU core, or virtually simultaneously on the same core. This is due to the preemptive multitasking nature of modern general-purpose operating systems.
If you want to do your own scheduling under Windows, you can use fibers, which essentially are threads that you have to schedule yourself. However, given that you describe yourself as a layman to the OS internals world, that would probably be a bad idea, as fibers are something of an advanced feature.
Can I ask why you want to use SwitchToThread?
If, for example, it's because thread X is computing some value that you want to wait for on thread Y, then I'd really suggest looking at the Parallel Patterns Library or the Asynchronous Agents Library in Visual Studio 2010, which let you do this either with message blocks (receive an asynchronous value) or simply via tasks: wait for a set of tasks to complete and inline their execution while waiting...
// i.e. on an arbitrary thread
concurrency::task_group tasks;
tasks.run([] { /* some functor */ });

A call to tasks.wait() will wait for, and inline, any tasks still running.
