I was playing with Go 1.7.3 on a MacBook and on Ubuntu and found that runtime.GOMAXPROCS is limited to 256. Does anyone know where this limit comes from? Is it documented anywhere, and why would there be a limit? Is this an implementation optimization?
The only reference to 256 I could find is on the page describing Go's runtime package: https://golang.org/pkg/runtime/. The runtime.MemStats struct has a couple of stat arrays of size 256:
type MemStats struct {
...
PauseNs [256]uint64 // circular buffer of recent GC pause durations, most recent at [(NumGC+255)%256]
PauseEnd [256]uint64 // circular buffer of recent GC pause end times
Here's the example Go code I used:
package main

import (
	"log"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1000)
	log.Printf("GOMAXPROCS %d\n", runtime.GOMAXPROCS(-1))
}
This prints:
GOMAXPROCS 256
P.S.
Also, can someone point me to documentation on how GOMAXPROCS relates to the OS thread count used by the Go scheduler (if at all)? Should we expect Go-compiled code to run GOMAXPROCS OS threads?
EDIT: Thanks @twotwotwo for pointing out how GOMAXPROCS relates to OS threads. Still, it's interesting that the documentation does not mention this 256 limit (other than in the MemStats struct, which may or may not be related).
I wonder if anyone is aware of the true reason for this 256 number.
Note that, starting with Go 1.10 (Q1 2018), GOMAXPROCS will be limited by... nothing.
The runtime no longer artificially limits GOMAXPROCS (previously it was limited to 1024).
See commit ee55000 by Austin Clements (aclements), which fixes issue 15131.
Now that allp is dynamically allocated, there's no need for a hard cap on GOMAXPROCS.
allp is defined here.
See also commit e900e27:
runtime: clean up loops over allp
allp now has length gomaxprocs, which means none of allp[i] are nil or in state _Pdead. This lets us replace several different styles of loops over allp with normal range loops.
for i := 0; i < gomaxprocs; i++ { ... } loops can simply range over allp. Likewise, range loops over allp[:gomaxprocs] can just range over allp.
Loops that check for p == nil || p.state == _Pdead don't need to check this any more.
Loops that check for p == nil don't have to check this if dead Ps don't affect them. I checked that all such loops are, in fact, unaffected by dead Ps. One loop was potentially affected, which this fixes by zeroing p.gcAssistTime in procresize.
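To illustrate the cleanup that commit message describes, here is a schematic before/after sketch; it is not the actual runtime source, and allp, gomaxprocs, and _Pdead are runtime-internal names:

// Before: index-based loop with an explicit bound and nil/dead checks.
for i := 0; i < gomaxprocs; i++ {
	p := allp[i]
	if p == nil || p.status == _Pdead {
		continue
	}
	// ... use p ...
}

// After: allp has length gomaxprocs with no nil or dead Ps,
// so a plain range loop suffices.
for _, p := range allp {
	// ... use p ...
}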
The package runtime docs clarify how GOMAXPROCS relates to OS threads:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
So you could see more than GOMAXPROCS OS threads (because some are blocked in system calls, and there's no limit to how many), or fewer (because GOMAXPROCS is only documented to limit the number of threads, not prescribe it exactly).
I think capping GOMAXPROCS is consistent with the spirit of that documentation--you specified you were OK with 1000 OS threads running Go code, but the runtime decided to 'only' run 256. That doesn't limit the number of goroutines active because they're multiplexed onto OS threads--when one goroutine blocks (waiting for a network read to complete, say) Go's internal scheduler starts other work on the same OS thread.
The Go team might have made this choice to minimize the chance that Go programs end up running many times more OS threads than most machines today have cores; that would cause more OS context switches, which can be slower than the user-mode goroutine switches that would occur if GOMAXPROCS were kept down to the number of CPU cores present. Or it might just have been convenient for the design of Go's internal scheduler to have an upper bound on GOMAXPROCS.
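If you want to see this for yourself, here's a rough demo comparing GOMAXPROCS with the number of OS threads the runtime has created so far, using the "threadcreate" profile (the exact thread count will vary by OS and runtime version):

package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	// Counts OS threads created so far, including any parked or
	// blocked in system calls.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}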
The article Goroutines vs Threads is not perfect (for example, goroutines no longer use segmented stacks), but it may help you understand what's going on here under the hood.
Related
If yes, how does one determine that maximum? That is the most important part to me; I'd really like to be able to set it manually. I considered using runtime.GOMAXPROCS(0), as I doubt that more parallelism will yield any additional benefit. The comment seems to suggest that it is marked for deprecation at some point.
From what I gather, the only limiting factor when it comes to goroutines is memory, as a sleeping goroutine still requires memory for its stack.
It's not strictly necessary. The number of threads running these goroutines is by default equal to the number of CPU cores on the machine (configurable through GOMAXPROCS), so there will be no contention at the thread level.
However, you might get performance benefits from having fewer goroutines ready to run, because of memory caching effects. For example, on an 8-core machine, if you have 1000 active goroutines that all touch significant amounts of memory, by the time a goroutine gets to run again, the needed memory pages have probably already been evicted from your CPU caches. With fewer goroutines, the odds of a cache hit are better.
As always with performance questions: the only way to be sure is to measure it yourself with a representative workload.
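A sketch of such a measurement using Go's built-in benchmarking, placed in a _test.go file; process here is a placeholder for your representative unit of work, and running it with go test -bench=. -cpu=1,2,4,8 compares different GOMAXPROCS values:

package main

import "testing"

func process() {
	// representative unit of work goes here
}

func BenchmarkProcess(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			process()
		}
	})
}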
In our testing, we determined that it is best to spawn a fixed number of worker routines and use those to perform all the work. The creation and destruction of goroutines is lightweight, but not entirely free of overhead. That overhead is usually insignificant if the goroutines spend any amount of time blocked.
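A minimal worker-pool sketch along those lines (the int job payload and the worker count of 8 are illustrative, not from the original answer):

package main

import "sync"

func main() {
	const numWorkers = 8
	jobs := make(chan int)
	var wg sync.WaitGroup

	// Spawn a fixed number of workers that live for the whole run.
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				_ = j * j // placeholder for real work
			}
		}()
	}

	// Feed all the work through the same goroutines.
	for j := 0; j < 1000; j++ {
		jobs <- j
	}
	close(jobs)
	wg.Wait()
}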
Goroutines are very lightweight, so it depends entirely on the system you are running on. An average process should have no problem with fewer than a million concurrent goroutines in 4 GB of RAM. Whether this holds for your target platform is, of course, something we can't answer without knowing what that platform is.
See this article and this one; they are useful.
We all know that runtime.GOMAXPROCS is set to the number of CPU cores by default. What happens if this property is set too large?
Will the program have more context switches?
Will the garbage collector be triggered more frequently?
GOMAXPROCS is set to the number of available logical CPUs by default for a reason: this gives the best performance in most cases.
GOMAXPROCS only limits the number of "active" threads; if a thread's goroutine gets blocked (e.g. by a syscall), a new thread might be started. There is no direct correlation; see Number of threads used by Go runtime.
If GOMAXPROCS is greater than the number of available CPUs, then there may be more active threads than CPU cores, which means active threads have to be "multiplexed" onto the available processing units. So yes, there will be more context switches if there are more active threads than cores, but that is not necessarily the case (your app may not keep that many threads active at once).
Garbage collections are not directly related to the number of threads, so you shouldn't worry about that. Quoting from package runtime:
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
If you have more threads that don't allocate / release memory, that shouldn't affect how frequently collections are triggered.
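For completeness, a tiny example of changing that percentage at run time; debug.SetGCPercent(200) makes collections less frequent at the cost of more memory, and is equivalent to running with GOGC=200:

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	old := debug.SetGCPercent(200)
	fmt.Println("previous GOGC value:", old)
}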
There might be cases when setting GOMAXPROCS above the number of CPUs increases the performance of your app, but they are rare. Measure to find out if it helps in your case.
I know most Go beginners ask how to make goroutines / concurrency performant; I got past that point a few weeks ago. :-)
I have a really fast trans-coder that uses every available cycle of my 4+4 (i7 with Hyper-Threading) CPU. It reads a file into a slice of pointers to structs, does calculations on these, and writes the result back to disk. I am using bufio. Coming from VB, I find the performance of Go unbelievable.
I tried to add minimal sleeps (via time.Sleep()), but that drastically decreased performance.
While my trans-coder is working, the whole system lags. I must change the Go process's priority to low or idle to be able to work again.
How could I implement something that keeps the system responsive?
Right now I start thousands of goroutines (looping over a slice of pointers). Should I limit the number of goroutines?
Lowering the process priority is arguably the correct way to do this. Use your OS's scheduler. That's what it's for. Per this question you can start your process with a specified priority like so:
start "MyApp" /low "C:\myapp.exe"
You may also be able to set process priority from within the application per this question:
// Lower the process priority; on Unix, positive nice values mean lower priority.
err := syscall.Setpriority(syscall.PRIO_PROCESS, syscall.Getpid(), 10)
Lastly, you can use GOMAXPROCS to configure how many CPUs the process is allowed to use. You can pass it in as an environment variable at runtime, or call runtime.GOMAXPROCS() within your code to override it.
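For example, a minimal sketch of the GOMAXPROCS approach for this situation, leaving one logical CPU free so the rest of the system stays responsive:

package main

import "runtime"

func main() {
	// Leave one logical CPU for the rest of the system.
	if n := runtime.NumCPU(); n > 1 {
		runtime.GOMAXPROCS(n - 1)
	}
	// ... start the trans-coding work here ...
}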
Probably the simplest solution is to limit the number of concurrent goroutines using a semaphore, for example:
sem := make(chan int, 10) // capacity 10: at most 10 goroutines run at once

for {
	sem <- 1 // blocks while 10 goroutines are already running
	go doThis()
}

func doThis() {
	// ... do the work ...
	<-sem // free up a slot for the next goroutine
}
The "sem <- 1" inside the for loop blocks until a goroutine "slot" is freed up by doThis receiving from the sem channel.
I am reading about the Go runtime package and see that, among other things (func GOMAXPROCS(n int)), I can set the number of CPU units used to run my program. Can I force a goroutine to run on a specific CPU of my choice?
In modern Go, I wouldn't lock goroutines to threads for efficiency. Go 1.5 added goroutine scheduling affinity to minimize how often goroutines switch between OS threads. And any cost of the remaining migrations between CPUs has to be weighed against the benefit of the user-mode scheduler avoiding context switches into kernel mode. Finally, when switching costs are a real problem, sometimes a better approach is changing your program logic so it needs to switch less, for example by communicating batches of work instead of individual work items.
But even considering all that, sometimes you simply have to lock a goroutine, like when a C API requires it, and I'll assume that's the case below.
If the whole program runs with GOMAXPROCS=1, then it's relatively simple to set a CPU affinity by calling out to the taskset utility from the schedutils package.
I had thought you were out of luck if GOMAXPROCS > 1 because then goroutines are migrated between OS threads at runtime. In fact, James Henstridge points out you can use runtime.LockOSThread() to keep your goroutine from migrating.
That doesn't by itself lock the OS thread to a CPU, though. @yerden points out in a comment that the SchedSetaffinity function in the golang.org/x/sys/unix package, using 0 as the pid, ought to lock the calling thread to its current CPU.
In the "C API requires locking" use case, it might also work to call pthread_setaffinity_np from C code.
I haven't tested either of those ways to lock threads to CPUs, and details will vary by OS there.
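In that spirit, here is an untested, Linux-only sketch combining the two steps (runtime.LockOSThread plus unix.SchedSetaffinity with pid 0, which applies to the calling thread); the function name pinToCPU and the choice of CPU 2 are illustrative:

package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU locks the calling goroutine to its OS thread and then
// restricts that thread to the given CPU.
func pinToCPU(cpu int) error {
	runtime.LockOSThread()

	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinToCPU(2); err != nil {
		panic(err)
	}
	// ... CPU-bound or C-API work here ...
}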
Depending on your workload, it is sometimes beneficial to start one Go process per CPU, set GOMAXPROCS to 1, and pin each process to its CPU with taskset. Here is an excerpt on that topic from the awesome fasthttp library:
- Use reuseport listener.
- Run a separate server instance per CPU core with GOMAXPROCS=1.
- Pin each server instance to a separate CPU core using taskset.
- Ensure the interrupts of multiqueue network card are evenly distributed between CPU cores. See this article for details.
- Use Go 1.6 as it provides some considerable performance improvements.
Source: https://github.com/valyala/fasthttp#performance-optimization-tips-for-multi-core-systems
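A hedged sketch of that pattern, assuming the reuseport package that ships with fasthttp; you would run one such process per core with GOMAXPROCS=1 and pin each with taskset:

package main

import (
	"github.com/valyala/fasthttp"
	"github.com/valyala/fasthttp/reuseport"
)

func main() {
	// SO_REUSEPORT lets several processes listen on the same port.
	ln, err := reuseport.Listen("tcp4", ":8080")
	if err != nil {
		panic(err)
	}
	handler := func(ctx *fasthttp.RequestCtx) {
		ctx.WriteString("hello")
	}
	if err := fasthttp.Serve(ln, handler); err != nil {
		panic(err)
	}
}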
If it is absolutely required for all the threads in a block to be at the same point in the code, do we still need the __syncthreads function when the number of threads being launched equals the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
__shared__ volatile int sdata[16];

int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
Updated with more information about using volatile
Presumably you want all threads to be at the same point since they are reading data written by other threads into shared memory; if you are launching a single warp (in each block), then you know that all threads are executing together. On the face of it, this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.
Remember that the compiler will assume it can optimise as long as the intra-thread semantics remain correct, including delaying stores to memory where the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read it. Using volatile causes the compiler to perform the memory write rather than keeping the data in registers; however, this has some risks and is more of a hack (meaning I don't know how this will be affected in the future).
Technically, you should always use __syncthreads() to conform with the CUDA Programming Model
The warp size is and always has been 32, but you can:
- At compile time, use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version).
- At run time, use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual).
Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
You still need __syncthreads() even if warps are being executed in parallel. The actual execution in hardware may not be parallel because the number of cores within an SM (Streaming Multiprocessor) can be less than 32. For example, the GT200 architecture has 8 cores per SM, so you can never be sure all threads are at the same point in the code.