Goroutines at ~8 KB vs. Windows OS threads at ~1 MB

As a Windows user, I know that OS threads consume ~1 MB of memory because, by default, Windows allocates 1 MB for each thread's user-mode stack. How does Go manage with only ~8 KB of memory per goroutine when an OS thread is so much more gluttonous? Are goroutines a sort of virtual thread?

Goroutines are not threads, they are (from the spec):
...an independent concurrent thread of control, or goroutine, within the same address space.
Effective Go defines them as:
They're called goroutines because the existing terms—threads, coroutines, processes, and so on—convey inaccurate connotations. A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
Goroutines don't have their own threads. Instead, multiple goroutines may be multiplexed onto the same OS threads, so if one should block (e.g. while waiting for I/O or on a blocking channel operation), the others continue to run.
The actual number of threads executing goroutines simultaneously can be set with the runtime.GOMAXPROCS() function. Quoting from the runtime package documentation:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
Note that early implementations used only one thread to execute goroutines by default; since Go 1.5, GOMAXPROCS defaults to the number of logical CPUs available.
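To see how cheap goroutines are in practice, here is a minimal sketch (the exact numbers vary by Go version and platform; 100,000 is an arbitrary count): it parks 100,000 goroutines on a channel and estimates the memory the runtime obtained from the OS per goroutine.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)

    const n = 100000
    var wg sync.WaitGroup
    done := make(chan struct{})
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            <-done // park here so all n goroutines are alive at once
        }()
    }

    runtime.ReadMemStats(&after)
    fmt.Println("goroutines alive:", runtime.NumGoroutine())
    fmt.Println("approx. bytes per goroutine:", (after.Sys-before.Sys)/n)

    close(done)
    wg.Wait()
}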

1 MiB is the default, as you correctly noted. You can pick your own stack size easily (however, the minimum is still a lot higher than ~8 kiB).
That said, goroutines aren't threads. They're just tasks with coöperative multi-tasking, similar to Python's. The goroutine itself is just the code and data required to do what you want; there's also a separate scheduler (which runs on one or more OS threads), which actually executes that code.
In pseudo-code:
loop forever
take job from queue
execute job
end loop
Of course, the execute job part can be very simple, or very complicated. The simplest thing you can do is just execute a given delegate (if your language supports something like that). In effect, this is simply a method call. In more complicated scenarios, there can be also stuff like restoring some kind of context, handling continuations and coöperative task yields, for example.
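As a concrete (and deliberately naive) illustration in Go, the loop above can be a single worker pulling function values off a channel; the real Go scheduler is far more sophisticated, but the shape is the same:

package main

import "fmt"

func main() {
    jobs := make(chan func(), 16) // the "queue" from the pseudo-code

    done := make(chan struct{})
    go func() { // the "loop forever" worker
        for job := range jobs { // take job from queue
            job() // execute job
        }
        close(done)
    }()

    for i := 1; i <= 3; i++ {
        i := i // capture loop variable (needed before Go 1.22)
        jobs <- func() { fmt.Println("job", i) }
    }
    close(jobs)
    <-done
}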
This is a very light-weight approach, and very useful when doing asynchronous programming (which is almost everything nowadays :)). Many languages now support something similar - Python is the first one I've seen with this ("tasklets"), long before Go. Of course, in an environment without pre-emptive multi-threading, this was pretty much the default.
In C#, for example, there's Tasks. They're not entirely the same as goroutines, but in practice, they come pretty close - the main difference being that Tasks use threads from the thread pool (usually), rather than separate, dedicated "scheduler" threads. This means that if you start 1000 tasks, it is possible for them to be run by 1000 separate threads; in practice, that would require you to write very bad Task code (e.g. using only blocking I/O, sleeping threads, waiting on wait handles etc.). If you use Tasks for asynchronous non-blocking I/O and CPU work, they come pretty close to goroutines - in actual practice. The theory is a bit different :)
EDIT:
To clear up some confusion, here is how a typical C# asynchronous method might look:
async Task<string> GetData()
{
    // httpClient is an HttpClient instance; GetStringAsync returns the body as a string
    var html = await httpClient.GetStringAsync("http://www.google.com");
    var parsedStructure = Parse(html);
    var dbData = await DataLayer.GetSomeStuffAsync(parsedStructure.ElementId);
    return dbData.First().Description;
}
From the point of view of the GetData method, the entire processing is synchronous - it's just as if you hadn't used the asynchronous methods at all. The crucial difference is that you're not using up threads while you're doing the "waiting"; but ignoring that, it's almost exactly the same as writing synchronous blocking code. This also applies to any issues with shared state, of course - there isn't much of a difference between multi-threading issues with await and with blocking multi-threaded I/O. It's easier to avoid with Tasks, but just because of the tools you have, not because of any "magic" that Tasks do.
The main difference from goroutines in this respect is that Go doesn't really have blocking methods in the usual sense of the word. Instead of blocking, they queue their particular asynchronous request, and yield. When the OS (and any other layers in Go - I don't have deep knowledge about the inner workings) receives the response, it posts it to the goroutine scheduler, which in turn knows that the goroutine that's "waiting" for the response is now ready to resume execution; when it actually gets a slot, it continues on from the "blocking" call as if it had really been blocking - but in effect, it's very similar to what C#'s await does. There's no fundamental difference - there are quite a few differences between C#'s approach and Go's, but none of them is all that huge.
And also note that this is fundamentally the same approach used on old Windows systems without pre-emptive multi-tasking - any "blocking" method would simply yield the thread's execution back to the scheduler. Of course, on those systems, you only had a single CPU core, so you couldn't execute multiple threads at once, but the principle is still the same.
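To make the Go side of that comparison concrete, here is a minimal sketch: the channel receive at the bottom looks like blocking code, but the runtime merely parks that goroutine and keeps running the others on the same thread until the value arrives.

package main

import (
    "fmt"
    "time"
)

func main() {
    result := make(chan string)

    go func() {
        time.Sleep(100 * time.Millisecond) // stand-in for slow I/O
        result <- "response"
    }()

    go func() {
        for i := 0; i < 3; i++ {
            fmt.Println("other goroutine keeps running", i)
            time.Sleep(30 * time.Millisecond)
        }
    }()

    // Looks blocking, but only this goroutine is parked;
    // the thread is free to run the goroutines above.
    fmt.Println("got:", <-result)
}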

Goroutines are what we call green threads. They are not OS threads; the Go scheduler is responsible for them. This is why they can have much smaller memory footprints.

Related

Go goroutine under the hood

I'm trying to understand golang architecture and what "lightweight thread" means. I've already read some material, but I want to ask a question to clarify it.
Am I right to say that, under the hood, the go keyword just puts the following function call into the queue of an internal thread pool, while to the user it looks like the creation of a thread?
This is copied from the Go FAQ:
Why goroutines instead of threads?
Goroutines are part of making concurrency easy to use. The idea, which has been around for a while, is to multiplex independently executing functions—coroutines—onto a set of threads. When a coroutine blocks, such as by calling a blocking system call, the run-time automatically moves other coroutines on the same operating system thread to a different, runnable thread so they won't be blocked. The programmer sees none of this, which is the point. The result, which we call goroutines, can be very cheap: they have little overhead beyond the memory for the stack, which is just a few kilobytes.
What's lacking here is the definition of thread. If we resort to Wikipedia, we find:
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, ...
but that's just a description of, well, the same thing that a goroutine is. The problem here is that the word thread tends to refer to kernel thread and/or user thread (both defined on that same Wikipedia page) and these threads are heavier-weight than the goroutine threads. Which brings us right back to this:
I'm trying to understand golang architecture and what "lightweight thread" means ...
To cut to the chase, this means "lighter than the OS-provided ones". That's really all it means. There are OS-provided threads (on the multiple OSes on which Go runs), but they generally do too much and cost too much to switch between, so Go provides its own language-level ones, called "goroutines", that are much lighter.
From comments:
Why need to move tasks from one thread to another by some planner ...
This is an implementation detail, which involves another aspect of the OS-provided kernel threads:
I can't understand how [a goroutine] can be preempted if single thread process [is] blocked by [a] system call to read [a] long file
The current Go runtime goroutine / thread / processor scheduler (see What is relationship between goroutine and thread in kernel and user state and note that there have been more than just the current implementation) predicts that some system call will block, and makes sure to assign that system call its own OS-level kernel thread (see also JimB's comment). These threads do not count against the GOMAXPROCS setting. This is in fact sometimes a problem, as it's possible for the Go runtime to try to spin off more threads than the OS allows: it might be nice if there were a system-call-thread-pool here (though there are also obvious problems with this).
So, the current runtime creates up to GOMAXPROCS kernel-style OS-level threads and uses those to multiplex up to that many goroutines onto the CPUs, but creates extra kernel-style OS-level threads whenever it wants to. As the blog post linked in the question above notes, the P entities act as queues to hold goroutines (Gs) on a per-processor basis for localized cache lookup (remember that on some systems, especially NUMA ones, it's expensive to reach out "across" CPUs: the scheduler is still willing to do this, but won't do it too often, for some definition of "too often").
Earlier versions of the current scheduler required explicit yields (runtime.Gosched() calls) or various other runtime operations to cause a switch from the current goroutine to some other goroutine. See What exactly does runtime.Gosched do? for example. Since Go 1.14, the runtime preempts goroutines automatically on most OSes; see Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?
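For those older runtimes (or wherever you want the hand-off to be explicit), a yield looks like this minimal sketch; forcing GOMAXPROCS to 1 makes the cooperative interleaving easy to observe:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1) // single OS thread, so goroutines must take turns

    done := make(chan struct{})
    go func() {
        for i := 0; i < 3; i++ {
            fmt.Println("worker:", i)
            runtime.Gosched() // explicitly yield so main's goroutine can run
        }
        close(done)
    }()

    for i := 0; i < 3; i++ {
        fmt.Println("main:", i)
        runtime.Gosched()
    }
    <-done
}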

Forcing a goroutine to run X times a second

Is there a way to force that a goroutine will run X times a second, no matter if there are other goroutines which may be doing a CPU intensive operation?
A little background on why: I am working on a game server written in Go. I have a goroutine that handles the game loop, and the game is updated at X ticks per second. Of course, some of the operations the server does are expensive (for example, terrain generation). Currently I just spawn a goroutine and let it handle the generation so that it doesn't block the game-loop goroutine, but after testing on a server with a single vcore, I saw that CPU-intensive operations still block the game loop.
After searching online I found out that Go will not reschedule a goroutine while it is not in a blocking syscall. I could, as suggested, manually ask the scheduler to reschedule, but that has two problems: it makes the CPU-intensive code messier, since I need to handle timeouts at specific points, and even after a manual reschedule the scheduler could just pick another CPU-intensive goroutine instead of the game loop...
Is there a way to force that a goroutine will run X times a second, no matter if there are other goroutines which may be doing a CPU intensive operation?
No
after testing on a server with a single vcore, I saw that it still blocks the gameloop while doing CPU intensive operations.
What else do you expect to happen? You have one core and two operations to be performed.
After searching online I found out that go would not reschedule a goroutine while it is not in a blocking syscall
Not true.
From the Go 1.14 release notes:
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.
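What you can do is pace the loop with a time.Ticker and accept that, on a single busy core, ticks may still be delayed; a minimal sketch (the tick rate is made up for illustration):

package main

import (
    "fmt"
    "time"
)

func main() {
    const ticksPerSecond = 20
    ticker := time.NewTicker(time.Second / ticksPerSecond)
    defer ticker.Stop()

    for i := 0; i < 5; i++ {
        <-ticker.C // fires ~every 50 ms, but only if a CPU is free to run us
        fmt.Println("tick", i, time.Now().Format("15:04:05.000"))
    }
}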

Does task switching in concurrent code result in faster code than synchronous execution?

I understand that concurrency is not parallelism, but I believe that is my source of confusion about the speed of concurrency in environments that only use a single thread (go/node).
If everything is running in a single process, and a scheduler is constantly switching between different concurrent tasks, wouldn't the overhead generated by this constant switching lead to slower execution than if everything was done synchronously?
I know that concurrency has its advantages when you want non-blocking code - for example a web server that switches between servicing thousands of requests instead of focusing on just one - and it shines in that regard; however, I'm having difficulty understanding whether it actually is faster, or whether concurrency just appears to be faster.
Concurrent code is efficient when there are IO-bound activities (e.g. sending data to and receiving data from the network). Without concurrency, your single thread has to wait, doing nothing, for the call to complete. Pure CPU-bound activities do not benefit from concurrency on a single thread (which may add unnecessary overhead) but can benefit from multi-threading if the workload can be distributed across multiple CPUs working in parallel.
Another advantage of async IO is that it is threadless. That saves memory and OS resources. It's the only way to solve, for instance, the C10M problem.
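A minimal sketch of that IO-bound win (the URLs are placeholders): the three fetches overlap their waiting instead of queuing behind each other, so the total wall time is roughly that of the slowest request rather than the sum.

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

func main() {
    urls := []string{ // placeholder URLs for illustration
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    }

    start := time.Now()
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            resp, err := http.Get(u) // this goroutine parks here; the others keep going
            if err != nil {
                fmt.Println(u, "error:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(u, resp.Status)
        }(u)
    }
    wg.Wait()
    fmt.Println("all done in", time.Since(start)) // ≈ slowest request, not the sum
}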

Benefits of green threads vs a simple loop

Is there any benefit to using green threads / lightweight threads over a simple loop or sequential code, assuming only non-blocking operations are used in both?
for i := 0; i < 5; i++ {
    go doSomethingExpensive() // using golang example
}
// versus
for i := 0; i < 5; i++ {
    doSomethingExpensive()
}
As far as I can think of
- green threads help avoid a bit of callback hell on async operations
- allow scheduling of M green threads on N kernel threads
But
- add a bit of complexity and a performance cost, since they require a scheduler
- cross-thread communication is only easier when the language supports it and execution is actually split across different CPUs (otherwise sequential code is simpler)
No, green threads have no performance benefit at all.
If the threads are performing non-blocking operations:
- Multiple threads have no benefit if you have only one physical core (since the same core has to execute everything, threads only make things slower because of the overhead)
- Up to as many threads as you have CPU cores, there is a performance benefit, since multiple cores can execute your threads in actual parallel (see the Play! framework)
- Green threads have no benefit, since they are all run on the same single real thread by a sub-scheduler, so effectively green threads == 1 thread
If the threads are performing blocking operations, things may look different:
- multiple threads make sense, since one thread can be blocked while the others go on, so blocking slows down only one thread
- you can avoid callback hell by simply implementing your partially blocking process as one thread; since you're free to block from one thread while e.g. waiting for IO, you get much simpler code (see the sketch below)
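In Go, that second point looks like the following minimal sketch (fetchUser and fetchOrders are hypothetical stand-ins for blocking IO calls): each step reads as straight-line blocking code inside its goroutine, and a channel replaces nested callbacks.

package main

import (
    "fmt"
    "time"
)

// Hypothetical stand-ins for blocking IO calls.
func fetchUser(id int) string       { time.Sleep(10 * time.Millisecond); return "user" }
func fetchOrders(user string) []int { time.Sleep(10 * time.Millisecond); return []int{1, 2} }

func main() {
    result := make(chan []int)

    go func() {
        // Reads top-to-bottom like synchronous code: no callbacks needed,
        // because "blocking" only parks this goroutine.
        user := fetchUser(42)
        result <- fetchOrders(user)
    }()

    fmt.Println("orders:", <-result)
}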
Green threads
Green threads are not real threads by design, so they won't be split amongst multiple CPUs and are not intended to work in parallel. This can give a false understanding that you can avoid synchronization - however, once you upgrade to real threads, the lack of proper synchronization will introduce a good set of issues.
Green threads were widely used in the early Java days, when the JVM did not support real OS threads. A variant of green threads, called fibers, is part of the Windows operating system, and e.g. MS SQL Server uses them heavily to handle various blocking scenarios without the heavy overhead of real threads.
You can choose not only between green threads and real threads; you may also consider continuations (https://www.playframework.com/documentation/1.3.x/asynchronous).
Continuations give you the best of both worlds:
- your code logically reads as linear code, with no callback hell
- in reality the code is executed by real threads; however, if a thread would block, it suspends execution of your code and can switch to executing other code. Once the blocking condition signals, the thread can switch back and continue your code.
This approach is quite resource-friendly. The Play! framework uses only as many threads as you have CPU cores (4-8), yet it beats all high-end Java application servers in terms of performance.

golang: how to handle blocking tasks optimally?

As is known, a goroutine is a synchronous but non-blocking processing unit.
The Go scheduler handles non-blocking tasks - e.g. sockets, timers, signals, or other events from character devices - very well.
But what about block-device IO or CPU-intensive tasks? They can't be interrupted until they finish, and they aren't multiplexed. The OS thread running such a goroutine would freeze until the goroutine returns or yields. In that case, the scheduling granularity becomes bad.
Of course, you could split the task into smaller sub-tasks in your code - for example, don't copy a 1 GB file in one go; instead, copy the first 10 MB, yield, copy another 10 MB, and so on - so that the other goroutines on the same OS thread get a chance to run (see the sketch below). Another example, for a CPU-bound task: zip a file part by part and merge the parts at the end.
But that breaks the convenience of sequential programming, and such manual scheduling is hard to balance evenly, compared to OS scheduling of OS threads.
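For what it's worth, the chunk-and-yield pattern described above might look like the following sketch (file names and chunk size are made up; as the answer below notes, modern Go runs blocking file IO on its own thread anyway, so the explicit yield is rarely needed):

package main

import (
    "io"
    "log"
    "os"
    "runtime"
)

func main() {
    src, err := os.Open("big.input") // hypothetical input file
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    dst, err := os.Create("big.output") // hypothetical output file
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    const chunk = 10 << 20 // copy 10 MB per step
    for {
        _, err := io.CopyN(dst, src, chunk)
        if err == io.EOF {
            break // reached the end of the input
        }
        if err != nil {
            log.Fatal(err)
        }
        runtime.Gosched() // yield so other goroutines on this thread get a turn
    }
}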
Nginx has a similar issue: it's a multi-worker-process program, one process per CPU core - similar to the best practice behind GOMAXPROCS - and it brings in a thread pool to handle blocking tasks. Maybe that would be good for Go too.
I am curious why Go has no OS-threading API, which would be a good supplement to goroutines for blocking tasks.
Go has specifically chosen to not directly expose OS threads to the user, and instead chose an M:N threading model. Your unit of execution in Go is the goroutine, which will be multiplexed on N number of OS threads.
In the rare case that you have a CPU-intensive calculation that contains no preemption points, and there are not enough OS threads to continue running other goroutines, you have 2 choices: increase GOMAXPROCS, or insert runtime.Gosched() calls to yield to other goroutines.
In the case of blocking syscalls, the Go scheduler will automatically dispatch a new OS thread (the time limit to consider a syscall "blocking" has been 20us), and since non-network IO is a series of blocking syscalls, it will almost always be assigned to a dedicated OS thread. Since Go already uses an M:N threading model, the user is usually unaware of the underlying scheduler choices, and can write the program the same as if the runtime used asynchronous IO.
There is an open issue to consider using asynchronous file IO, but there are many issues to overcome, like shortcomings in the Linux aio api, cross-platform compatibility, and interactions with all the various filesystems and devices with which you can do IO.
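Until then, if unbounded blocking file IO risks spawning too many OS threads, a common workaround is to bound the number of in-flight blocking calls yourself; a minimal sketch using a buffered channel as a semaphore (the file names and the limit of 8 are arbitrary):

package main

import (
    "fmt"
    "os"
    "sync"
)

func main() {
    sem := make(chan struct{}, 8) // at most 8 blocking reads in flight
    var wg sync.WaitGroup

    paths := []string{"a.txt", "b.txt", "c.txt"} // hypothetical files
    for _, p := range paths {
        wg.Add(1)
        go func(p string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it when done
            data, err := os.ReadFile(p) // blocking syscall: gets its own OS thread
            if err != nil {
                fmt.Println(p, err)
                return
            }
            fmt.Println(p, len(data), "bytes")
        }(p)
    }
    wg.Wait()
}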
