Why are goroutines much cheaper than threads in other languages?


In his talk (https://blog.golang.org/concurrency-is-not-parallelism), Rob Pike says that goroutines are similar to threads but much, much cheaper. Can someone explain why?

See "How goroutines work".
They are cheaper in:
Memory consumption:
A thread starts with a large block of memory for its stack, whereas a goroutine starts with only a few KB.
Setup and teardown costs:
(That is why you have to maintain a pool of threads.)
Switching costs:
Threads are scheduled preemptively, and during a thread switch the scheduler needs to save and restore ALL registers.
In Go, by contrast, the runtime manages goroutines throughout, from creation to scheduling to teardown, and the number of registers to save is lower.
Plus, as mentioned in "Go’s march to low-latency GC", a GC is easier to implement when the runtime is in charge of managing goroutines:
Since the introduction of its concurrent GC in Go 1.5, the runtime has kept track of whether a goroutine has executed since its stack was last scanned. The mark termination phase would check each goroutine to see whether it had recently run, and would rescan the few that had.
In Go 1.7, the runtime maintains a separate short list of such goroutines. This removes the need to look through the entire list of goroutines while user code is paused, and greatly reduces the number of memory accesses that can trigger the kernel’s NUMA-related memory migration code.
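The memory and setup-cost points above can be checked roughly on your own machine. The following is a minimal sketch, not a benchmark: measureGoroutines is a name made up for this example, and it assumes that the growth of runtime.MemStats.StackInuse while the goroutines are parked is a fair approximation of their stack cost. The exact figure depends on the Go version and platform.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// measureGoroutines (hypothetical helper) starts n goroutines that block
// until released. It reports how many goroutines were alive at once and
// the approximate stack memory in use per goroutine, derived from the
// growth of runtime.MemStats.StackInuse. n must be positive.
func measureGoroutines(n int) (live int, stackBytesEach uint64) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	release := make(chan struct{})
	var started, finished sync.WaitGroup
	started.Add(n)
	finished.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer finished.Done()
			started.Done() // signal that this goroutine is running
			<-release      // park until the measurement has been taken
		}()
	}

	started.Wait()
	live = runtime.NumGoroutine()
	runtime.ReadMemStats(&after)

	close(release)
	finished.Wait()

	if after.StackInuse > before.StackInuse {
		stackBytesEach = (after.StackInuse - before.StackInuse) / uint64(n)
	}
	return live, stackBytesEach
}

func main() {
	live, each := measureGoroutines(100000)
	fmt.Printf("%d goroutines alive, roughly %d bytes of stack each\n", live, each)
}
```

Trying the same thing with 100,000 OS threads would typically hit a system limit long before it finished, which is the practical meaning of "cheaper" here.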

Related

Go goroutine under the hood

I'm trying to understand golang architecture and what "lightweight thread" means. I've already read something, but want to ask question to clarify it.
Am I right in saying that the "go" keyword, under the hood, just puts the following function call in the queue of an internal thread pool, while to the user it looks like the creation of a thread?

This is copied from the Go FAQ:
Why goroutines instead of threads?
Goroutines are part of making concurrency easy to use. The idea, which has been around for a while, is to multiplex independently executing functions—coroutines—onto a set of threads. When a coroutine blocks, such as by calling a blocking system call, the run-time automatically moves other coroutines on the same operating system thread to a different, runnable thread so they won't be blocked. The programmer sees none of this, which is the point. The result, which we call goroutines, can be very cheap: they have little overhead beyond the memory for the stack, which is just a few kilobytes.
What's lacking here is the definition of thread. If we resort to Wikipedia, we find:
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, ...
but that's just a description of, well, the same thing that a goroutine is. The problem here is that the word thread tends to refer to kernel thread and/or user thread (both defined on that same Wikipedia page) and these threads are heavier-weight than the goroutine threads. Which brings us right back to this:
I'm trying to understand golang architecture and what "lightweight thread" means ...
To cut to the chase, this means "lighter than the OS-provided ones". That's really all it means. There are OS-provided threads (on multiple OSes on which Go runs), but they generally do too much and cost too much to switch between so Go provides its own language-level ones that it calls "goroutines" that are much lighter.
From comments:
Why need to move tasks from one thread to another by some planner ...
This is an implementation detail, which involves another aspect of the OS-provided kernel threads:
I can't understand how [a goroutine] can be preempted if single thread process [is] blocked by [a] system call to read [a] long file
The current Go runtime goroutine / thread / processor scheduler (see What is relationship between goroutine and thread in kernel and user state and note that there have been more than just the current implementation) predicts that some system call will block, and makes sure to assign that system call its own OS-level kernel thread (see also JimB's comment). These threads do not count against the GOMAXPROCS setting. This is in fact sometimes a problem, as it's possible for the Go runtime to try to spin off more threads than the OS allows: it might be nice if there were a system-call-thread-pool here (though there are also obvious problems with this).
So, the current runtime creates up to GOMAXPROCS kernel-style OS-level threads and uses those to multiplex up to that many goroutines onto the CPUs, but creates extra kernel-style OS-level threads whenever it wants to. As the blog post linked in the question above notes, the P entities act as queues to hold goroutines (Gs) on a per-processor basis for localized cache lookup (remember that on some systems, especially NUMA ones, it's expensive to reach out "across" CPUs: the scheduler is still willing to do this, but won't do it too often, for some definition of "too often").
Earlier versions of the current scheduler required explicit yields (runtime.Gosched() calls) or various other runtime operations to cause a switch from the current goroutine to some other goroutine. See What exactly does runtime.Gosched do? for example. As of Go 1.14, the runtime provides automatic goroutine preemption on some OSes; see Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?
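To illustrate what an explicit yield looks like, here is a small sketch. The function name yieldingWorkers is invented for this example. It restricts the program to a single P with runtime.GOMAXPROCS(1) and has two goroutines call runtime.Gosched() after each step. The order in which the steps are recorded is decided by the scheduler and is not guaranteed, so the sketch only relies on every step being recorded.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// yieldingWorkers (hypothetical helper) runs two goroutines on a single P.
// Each goroutine records a step and then calls runtime.Gosched() to hand
// the processor to another runnable goroutine. It returns the recorded steps.
func yieldingWorkers(rounds int) []string {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev) // restore the previous setting on return

	var (
		mu    sync.Mutex
		steps []string
		wg    sync.WaitGroup
	)
	for _, name := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < rounds; i++ {
				mu.Lock()
				steps = append(steps, fmt.Sprintf("%s%d", name, i))
				mu.Unlock()
				runtime.Gosched() // explicit yield point
			}
		}(name)
	}
	wg.Wait()
	return steps
}

func main() {
	// The exact interleaving is up to the scheduler.
	fmt.Println(yieldingWorkers(3))
}
```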

Is it necessary to limit the number of goroutines in an entirely CPU-bound workload?

If yes, how does one determine that maximum? That is the most important part to me. I'd really like to set it manually. I considered using runtime.GOMAXPROCS(0), as I doubt that more parallelism will yield any additional benefits. The comment seems to suggest that it is marked for deprecation at some point.
From what I gather, the only limiting factor when it comes to goroutines is memory, as a sleeping goroutine still requires memory for its stack.

It's not strictly necessary. The number of threads running these goroutines is by default equal to the number of CPU cores on the machine (configurable through GOMAXPROCS), so there will be no contention at the thread level.
However, you might get performance benefits from having fewer goroutines ready to run, because of memory caching effects. For example, on an 8-core machine, if you have 1000 active goroutines that all touch significant amounts of memory, by the time a goroutine gets to run again, the needed memory pages have probably already been evicted from your CPU caches. With fewer goroutines, the odds of a cache hit are better.
As always with performance questions: the only way to be sure is to measure it yourself with a representative workload.

In our testing, we determined that it is best to spawn a fixed number of worker routines and use those to perform all the work. The creation and destruction of goroutines is lightweight, but not entirely free of overhead. That overhead is usually insignificant if the goroutines spend any amount of time blocked.
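A fixed pool like the one described above might look like the following sketch. The names sumSquares, jobs and partials are made up for illustration, and sizing the pool with runtime.GOMAXPROCS(0) is an assumption that suits CPU-bound work, not a rule. Measure with your own workload.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares (hypothetical example workload) computes the sum of x*x over
// inputs using a fixed pool of worker goroutines instead of one goroutine
// per input.
func sumSquares(inputs []int) int {
	workers := runtime.GOMAXPROCS(0) // query the current setting without changing it
	jobs := make(chan int)
	partials := make(chan int, workers) // one slot per worker, so sends never block

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := 0
			for x := range jobs { // each worker pulls items until jobs is closed
				local += x * x
			}
			partials <- local
		}()
	}

	for _, x := range inputs {
		jobs <- x
	}
	close(jobs)
	wg.Wait()
	close(partials)

	total := 0
	for p := range partials {
		total += p
	}
	return total
}

func main() {
	inputs := make([]int, 100)
	for i := range inputs {
		inputs[i] = i + 1
	}
	fmt.Println(sumSquares(inputs)) // prints 338350
}
```

Each worker keeps a local partial sum, so the only shared traffic is one channel receive per item and one send per worker.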

Goroutines are very lightweight, so it depends entirely on the system you are running on. An average process should have no problems with fewer than a million concurrent goroutines in 4 GB of RAM. Whether this goes for your target platform is, of course, something we can't answer without knowing what that platform is.
See this article and this one; they are useful.

Is it possible to force a goroutine to run on a specific CPU?

I am reading about the Go package "runtime" and see that I can, among other things, set the number of CPUs that can be used to run my program (func GOMAXPROCS(n int)). Can I force a goroutine to run on a specific CPU of my choice?

In modern Go, I wouldn't lock goroutines to threads for efficiency. Go 1.5 added goroutine scheduling affinity, to minimize how often goroutines switch between OS threads. And any cost of the remaining migrations between CPUs has to be weighed against the benefit of the user-mode scheduler avoiding context switches into kernel mode. Finally, when switching costs are a real problem, sometimes a better focus is changing your program logic so it needs to switch less, like by communicating batches of work instead of individual work items.
But even considering all that, sometimes you simply have to lock a goroutine, like when a C API requires it, and I'll assume that's the case below.
If the whole program runs with GOMAXPROCS=1, then it's relatively simple to set a CPU affinity by calling out to the taskset utility from the schedutils package.
I had thought you were out of luck if GOMAXPROCS > 1 because then goroutines are migrated between OS threads at runtime. In fact, James Henstridge points out you can use runtime.LockOSThread() to keep your goroutine from migrating.
That doesn't solve locking the OS thread to a CPU. @yerden points out in a comment that the SchedSetaffinity function in the golang.org/x/sys/unix package, using 0 as the pid, ought to restrict the calling thread to the CPUs in the set you pass it.
In the "C API requires locking" use case, it might also work to call pthread_setaffinity_np from C code.
I haven't tested either of those ways to lock threads to CPUs, and details will vary by OS there.
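For the runtime.LockOSThread() part, a minimal sketch using only the standard library could look like this. The helper name runOnLockedThread is invented for the example. The actual CPU pinning is deliberately left as a comment, because it needs an OS-specific call (for instance unix.SchedSetaffinity from golang.org/x/sys/unix on Linux), which is an external module and is untested here.

```go
package main

import (
	"fmt"
	"runtime"
)

// runOnLockedThread (hypothetical helper) runs work in a new goroutine that
// is wired to a single OS thread while it executes, and waits for it to finish.
func runOnLockedThread(work func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)

		// From here on, this goroutine always runs on the same OS thread,
		// and no other goroutine runs on that thread.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()

		// Assumption: an OS-specific affinity call for the current thread
		// would go here, e.g. unix.SchedSetaffinity(0, &set) on Linux.

		work()
	}()
	<-done
}

func main() {
	runOnLockedThread(func() {
		fmt.Println("running on a locked OS thread")
	})
}
```

If you do change the thread's affinity, consider leaving out the UnlockOSThread call: when a goroutine exits while still locked, the runtime terminates the thread instead of reusing it for other goroutines.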

It depends on your workload, but sometimes it's beneficial to start a Go process per CPU, set GOMAXPROCS to 1, and pin the process to the CPU with taskset. Here is an excerpt on that topic from the awesome fasthttp library:
Use reuseport listener.
Run a separate server instance per CPU core with GOMAXPROCS=1.
Pin each server instance to a separate CPU core using taskset.
Ensure the interrupts of multiqueue network card are evenly distributed between CPU cores. See this article for details.
Use Go 1.6 as it provides some considerable performance improvements.
Source: https://github.com/valyala/fasthttp#performance-optimization-tips-for-multi-core-systems

How to reserve a core for one thread on windows?

I am working on a very time sensitive application which polls a region of shared memory taking action when it detects a change has occurred. Changes are rare but I need to minimize the time from change to action. Given the infrequency of changes I think the CPU cache is getting cold. Is there a way to reserve a core for my polling thread so that it does not have to compete with other threads for either cache or CPU?

Thread affinity alone (SetThreadAffinityMask) will not be enough. It does not reserve a CPU core; it does the opposite: it binds the thread to only the cores that you specify (that is not the same thing!).
By constraining the CPU affinity, you reduce the likelihood that your thread will run. If another thread with higher priority runs on the same core, your thread will not be scheduled until that other thread is done (this is how Windows schedules threads).
Without constraining affinity, your thread has a chance of being migrated to another core (taking the last time it was run as metric for that decision). Thread migration is undesirable if it happens often and soon after the thread has run (or while it is running) but it is a harmless, beneficial thing if a couple of dozen milliseconds have passed since it was last scheduled (caches will have been overwritten by then anyway).
You can "kind of" assure that your thread will run by giving it a higher priority class (no guarantee, but high likelihood). If you then use SetThreadAffinityMask as well, you have a reasonable chance that the cache is always warm on most common desktop CPUs (which luckily are normally VIPT and PIPT). For the TLB, you will probably be less lucky, but there's nothing you can do about it.
The problem with a high priority thread is that it will starve other threads because scheduling is implemented so it serves higher priority classes first, and as long as these are not satisfied, lower classes get zero. So, the solution in this case must be to block. Otherwise, you may impair the system in an unfavorable way.
Try this:
create a semaphore and share it with the other process
set priority to THREAD_PRIORITY_TIME_CRITICAL
block on the semaphore
in the other process, after writing data, call SignalObjectAndWait on the semaphore with a timeout of 1 (or even zero timeout)
if you want, you can experiment binding them both to the same core
This will create a thread that will be the first (or among the first) to get CPU time, but it is not running.
When the writer thread calls SignalObjectAndWait, it atomically signals and blocks (even if it waits for "zero time" that is enough to reschedule). The other thread will wake from the Semaphore and do its work. Thanks to its high priority, it will not be interrupted by other "normal" (that is, non-realtime) threads. It will keep hogging CPU time until done, and then block again on the semaphore. At this point, SignalObjectAndWait returns.

Using the Task Manager, you can set the "affinity" of processes.
You would have to set the affinity of your time-critical app to core 4, and the affinity of all the other processes to cores 1, 2, and 3. Assuming four cores of course.

You could call the SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and use it on your process to set it to run just on this core (or, even better, SetThreadAffinityMask just on the thread that does the time-critical task).
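The masks passed to SetProcessAffinityMask and SetThreadAffinityMask are bit masks in which bit i stands for logical processor i, counting from zero. The arithmetic for "everything except my core" can be sketched as follows. The function affinityMasks is made up for this example and only computes the values; it does not call the Win32 API.

```go
package main

import "fmt"

// affinityMasks (hypothetical helper) returns the affinity mask for the one
// reserved core and the mask covering every other core. Cores are numbered
// from zero, and numCores must be at most 64.
func affinityMasks(numCores, reserved uint) (mine, others uint64) {
	all := uint64(1)<<numCores - 1 // low numCores bits set
	mine = uint64(1) << reserved   // only the reserved core's bit
	others = all &^ mine           // all cores with the reserved bit cleared
	return mine, others
}

func main() {
	// Four cores, reserving the last one (zero-based index 3).
	mine, others := affinityMasks(4, 3)
	fmt.Printf("time-critical: %#x, everything else: %#x\n", mine, others)
	// prints "time-critical: 0x8, everything else: 0x7"
}
```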

Given the infrequency of changes I think the CPU cache is getting cold.
That sounds very strange.
Let's assume your polling thread and the writing thread are on different cores.
The polling thread will be reading the shared memory address and so will be caching the data. That cache line is probably marked as exclusive. Then the write thread finally writes; first, it reads the cache line of memory in (so that line is now marked as shared on both cores) and then it writes. Writing causes the polling thread CPU's cache line to be marked as invalid. The polling thread then comes to read again; if it reads while the writing thread still has the data cached, it will read from the second core's cache, invalidating its cache line and taking ownership for itself. There's a lot of bus traffic overhead to do this.
Another issue is that the writing thread, if it doesn't write often, will almost certainly lose the TLB entry for the page with the shared memory address. Recalculating the physical address is a long, slow process. Since the polling thread polls often, possibly that page is always in that core's TLB; and in that sense, you might well do better, in latency terms, to have both threads on the same core. (Although if they're both compute intensive, they might interfere destructively and that cost could be much higher; I can't know, as I don't know what the threads are doing.)
One thing you could do is use a hyperthread on the writing thread core; if you know early on you're going to write, get the hyperthread to read the shared memory address. This will load the TLB and cache while the writing thread is still busy computing, giving you parallelism.

The Win32 function SetThreadAffinityMask() is what you are looking for.

Does the Task Parallel Library (or PLINQ) take other processes into account?

In particular, I'm looking at using TPL to start (and wait for) external processes. Does the TPL look at total machine load (both CPU and I/O) before deciding to start another task (hence -- in my case -- another external process)?
For example:
I've got about 100 media files that need to be encoded or transcoded (e.g. from WAV to FLAC or from FLAC to MP3). The encoding is done by launching an external process (e.g. FLAC.EXE or LAME.EXE). Each file takes about 30 seconds. Each process is mostly CPU-bound, but there's some I/O in there. I've got 4 cores, so the worst case (transcoding by piping the decoder into the encoder) still only uses 2 cores. I'd like to do something like:
Parallel.ForEach(sourceFiles,
sourceFile =>
TranscodeUsingPipedExternalProcesses(sourceFile));
Will this kick off 100 tasks (and hence 200 external processes competing for the CPU)? Or will it see that the CPU's busy and only do 2-3 at a time?

You're going to run into a couple of issues here. The starvation avoidance mechanism of the scheduler will see your tasks as blocked as they wait on processes. It will find it hard to distinguish between a deadlocked thread and one simply waiting for a process to complete. As a result, it may schedule new tasks if your tasks run for a long time (see below). The hill-climbing heuristic should take into account the overall load on the system, both from your application and others. It simply tries to maximize work done, so it will add more work until the overall throughput of the system stops increasing and then it will back off. I don't think this will affect your application, but the starvation avoidance issue probably will.
You can find more detail as to how this all works in Parallel Programming with Microsoft®.NET, Colin Campbell, Ralph Johnson, Ade Miller, Stephen Toub (an earlier draft is online).
"The .NET thread pool automatically manages the number of worker
threads in the pool. It adds and removes threads according to built-in
heuristics. The .NET thread pool has two main mechanisms for injecting
threads: a starvation-avoidance mechanism that adds worker
threads if it sees no progress being made on queued items and a hill-climbing
heuristic that tries to maximize throughput while using as
few threads as possible.
The goal of starvation avoidance is to prevent deadlock. This kind
of deadlock can occur when a worker thread waits for a synchronization
event that can only be satisfied by a work item that is still pending
in the thread pool’s global or local queues. If there were a fixed
number of worker threads, and all of those threads were similarly
blocked, the system would be unable to ever make further progress.
Adding a new worker thread resolves the problem.
A goal of the hill-climbing heuristic is to improve the utilization
of cores when threads are blocked by I/O or other wait conditions
that stall the processor. By default, the managed thread pool has one
worker thread per core. If one of these worker threads becomes
blocked, there’s a chance that a core might be underutilized, depending
on the computer’s overall workload. The thread injection logic
doesn’t distinguish between a thread that’s blocked and a thread
that’s performing a lengthy, processor-intensive operation. Therefore,
whenever the thread pool’s global or local queues contain pending
work items, active work items that take a long time to run (more than
a half second) can trigger the creation of new thread pool worker
threads.
The .NET thread pool has an opportunity to inject threads every
time a work item completes or at 500 millisecond intervals, whichever
is shorter. The thread pool uses this opportunity to try adding threads
(or taking them away), guided by feedback from previous changes in
the thread count. If adding threads seems to be helping throughput,
the thread pool adds more; otherwise, it reduces the number of
worker threads. This technique is called the hill-climbing heuristic.
Therefore, one reason to keep individual tasks short is to avoid
“starvation detection,” but another reason to keep them short is to
give the thread pool more opportunities to improve throughput by
adjusting the thread count. The shorter the duration of individual
tasks, the more often the thread pool can measure throughput and
adjust the thread count accordingly.
To make this concrete, consider an extreme example. Suppose
that you have a complex financial simulation with 500 processor-intensive
operations, each one of which takes ten minutes on average
to complete. If you create top-level tasks in the global queue for each
of these operations, you will find that after about five minutes the
thread pool will grow to 500 worker threads. The reason is that the
thread pool sees all of the tasks as blocked and begins to add new
threads at the rate of approximately two threads per second.
What’s wrong with 500 worker threads? In principle, nothing, if
you have 500 cores for them to use and vast amounts of system
memory. In fact, this is the long-term vision of parallel computing.
However, if you don’t have that many cores on your computer, you are
in a situation where many threads are competing for time slices. This
situation is known as processor oversubscription. Allowing many
processor-intensive threads to compete for time on a single core adds
context switching overhead that can severely reduce overall system
throughput. Even if you don’t run out of memory, performance in this
situation can be much, much worse than in sequential computation.
(Each context switch takes between 6,000 and 8,000 processor cycles.)
The cost of context switching is not the only source of overhead.
A managed thread in .NET consumes roughly a megabyte of stack
space, whether or not that space is used for currently executing functions.
It takes about 200,000 CPU cycles to create a new thread, and
about 100,000 cycles to retire a thread. These are expensive operations.
As long as your tasks don’t each take minutes, the thread pool’s
hill-climbing algorithm will eventually realize it has too many threads
and cut back on its own accord. However, if you do have tasks that
occupy a worker thread for many seconds or minutes or hours, that
will throw off the thread pool’s heuristics, and at that point you
should consider an alternative.
The first option is to decompose your application into shorter
tasks that complete fast enough for the thread pool to successfully
control the number of threads for optimal throughput.
A second possibility is to implement your own task scheduler
object that does not perform thread injection. If your tasks are of long
duration, you don’t need a highly optimized task scheduler because
the cost of scheduling will be negligible compared to the execution
time of the task. MSDN® developer program has an example of a
simple task scheduler implementation that limits the maximum degree
of concurrency. For more information, see the section, “Further Reading,”
at the end of this chapter.
As a last resort, you can use the SetMaxThreads method to
configure the ThreadPool class with an upper limit for the number
of worker threads, usually equal to the number of cores (this is the
Environment.ProcessorCount property). This upper limit applies for
the entire process, including all AppDomains."

The short answer is: no.
Internally, the TPL uses the standard ThreadPool to schedule its tasks. So you're actually asking whether the ThreadPool takes machine load into account and it doesn't. The only thing that limits the number of tasks simultaneously running is the number of threads in the thread pool, nothing else.
Is it possible to have the external processes report back to your application once they are ready? In that case you do not have to wait for them (keeping threads occupied).

Ran a test using TPL/ThreadPool to schedule a great number of tasks doing looped spins. Using an external app I've loaded one of the cores to 100% using proc affinity. The number of active tasks never decreased.
Even better, I ran multiple instances of the same CPU intensive .NET TPL enabled app. The number of threads for all the apps was the same, and never went below the number of cores, even though my machine was barely usable.
So theory aside, TPL uses the number of cores available, but never checks on their actual load. A very poor implementation in my opinion.
