Factors affecting goroutine performance - go

I'm experiencing some odd behavior with goroutines; please advise on anything I might be doing wrong. Here is my setup:
I am using goroutines to run a simulation concurrently, but the time spent increases with the number of goroutines, to the point that the simulation cannot finish in a reasonable time. Here is what I notice:
For every 10k goroutines started, the calculation time increases by 5 seconds.
I don't see a shortage of CPU or memory. CPU usage increases only slightly for every 10k goroutines; for example, with 200k goroutines, CPU usage is around 70%.
I'm not using disks.
I ran the simulation without starting the goroutines and it finished very fast, so the slowness is inside, or caused by, the goroutines.
In one place I tried starting additional goroutines inside each goroutine to run some workload in parallel. CPU usage rose to 100%, but overall speed dropped by 50%...
I am passing some large structs to the goroutines via pointers; all goroutines share the same data.
Does anyone have a clue about what I might need to optimize, or a test I could run? Thanks!

Start a few "worker" goroutines and send "jobs" to them via a channel. You save many memory allocations, which consume a lot of CPU.

Related

Is there some reason why I shouldn't use runtime.NumGoroutine to wait for exit?

I'm new to Go, and I'm writing some concurrent practice code using goroutines.
I've seen a lot of examples of worker pools using things like time.Sleep(), wait groups, atomic counters, and channels as various ways of determining when a pool of independent goroutines has completed execution before ending a program.
Going through the Go reference I found the following library function:
runtime.NumGoroutine(), which returns the number of goroutines that currently exist.
The following line:
for runtime.NumGoroutine() > 1 {}
allows me to wait until all the goroutines have completed without any synchronization code or speculative sleep durations.
In my testing, it works perfectly to wait until all threads complete.
Is there something wrong with this technique that I'm unaware of? It seems like the simplest possible method, but since I have never seen it used in any example code for this very common problem, I'm suspicious that there's a reliability problem with it.
I would not recommend this practice, because it is a busy loop that burns an unnecessary amount of CPU. It is also unreliable: the count includes goroutines you didn't start (libraries and the runtime can create their own), so it may never drop to 1.
When working with goroutines I always recommend sync.WaitGroup.
You know the number of goroutines, so you can wait for them to finish without using 100% of a CPU.

Is it necessary to limit the number of goroutines in an entirely CPU-bound workload?

If yes, how does one determine that maximum? That is the most important part to me; I'd really like to have it be set manually. I considered using runtime.GOMAXPROCS(0), as I doubt that more parallelism would yield any additional benefit. Its doc comment seems to suggest that it is marked for deprecation at some point.
From what I gather, the only limiting factor for goroutines is memory, since even a sleeping goroutine still needs memory for its stack.
It's not strictly necessary. The number of threads running these goroutines is by default equal to the number of CPU cores on the machine (configurable through GOMAXPROCS), so there will be no contention at the thread level.
However, you might get performance benefits from having fewer goroutines ready to run, because of memory caching effects. For example, on an 8-core machine, if you have 1000 active goroutines that all touch significant amounts of memory, by the time a goroutine gets to run again, the needed memory pages have probably already been evicted from your CPU caches. With fewer goroutines, the odds of a cache hit are better.
As always with performance questions: the only way to be sure is to measure it yourself with a representative workload.
In our testing, we determined that it is best to spawn a fixed number of worker routines and use those to perform all the work. The creation and destruction of goroutines is lightweight, but not entirely free of overhead. That overhead is usually insignificant if the goroutines spend any amount of time blocked.
Goroutines are very lightweight, so it depends entirely on the system you are running on. An average process should have no problem with fewer than a million concurrent goroutines in 4 GB of RAM. Whether this holds for your target platform is, of course, something we can't answer without knowing what that platform is.

Forcing a goroutine to run X times a second

Is there a way to guarantee that a goroutine runs X times a second, no matter whether other goroutines are doing CPU-intensive operations?
A little background on why: I am working on a game server written in Go. I have a goroutine that handles the game loop, and the game is updated at X ticks per second. Some of the operations the server does are expensive (terrain generation, for example). Currently I just spawn a goroutine and let it handle the generation so that it doesn't block the game-loop goroutine, but after testing on a server with a single vcore, I saw that it still blocks the gameloop while doing CPU intensive operations.
After searching online I found out that go would not reschedule a goroutine while it is not in a blocking syscall. I could, as suggested, manually yield from the goroutine, but that has two problems: it makes the CPU-intensive code messier, since I'd need to handle timeouts at specific points, and even after a manual yield the scheduler could just pick another CPU-intensive goroutine instead of the game loop...
Is there a way to guarantee that a goroutine runs X times a second, no matter whether other goroutines are doing CPU-intensive operations?
No
after testing on a server with a single vcore, I saw that it still blocks the gameloop while doing CPU intensive operations.
What else do you expect to happen? You have one core and two operations to be performed.
After searching online I found out that go would not reschedule a goroutine while it is not in a blocking syscall
Not true.
From the Go 1.14 release notes:
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.

Java 8 Concurrent Hash Map get Performance/Alternative

I have a high-throughput, low-latency application (3000 requests/sec, 100 ms per request), and we heavily use Java 8 ConcurrentHashMap for lookups. Usually these maps are updated by a single background thread while multiple threads read from them.
I am seeing a performance bottleneck; profiling shows ConcurrentHashMap.get as the hotspot, taking the majority of the time.
In another case, I see ConcurrentHashMap.computeIfAbsent as the hotspot, although the mapping function has very low latency and the profile shows computeIfAbsent spending 90% of its time executing itself and very little time executing the mapping function.
My question: is there any way I could improve the performance? I have around 80 threads concurrently reading from the CHM.
I have around 80 threads concurrently reading from CHM.
The simplest things to do are:
If you have a CPU-bound process, don't have more active threads than you have CPUs; otherwise this only adds overhead, and if those threads hold a lock while not running (because you have too many threads), it really will not help.
Increase the number of partitions. You want at least 4x as many segments/partitions as you have threads accessing a single map. However, you will see strange behaviour in CHM when accessing it with more than about 40 threads, due to the way cache coherency works; I suggest a more efficient data structure for higher degrees of concurrency. In Java 8 the concurrencyLevel is only a hint, but it is better than leaving the default initial size of 16.
Don't spend so much time in the CHM. Find a way to do useful work without hitting a shared resource, and your threads will run much more efficiently.
If you have latencies you can see in a low-latency system, you have a problem IMHO.

Merge sort with goroutines vs normal Mergesort

I wrote two versions of mergesort in Go: one with goroutines and one without. I am comparing the performance of each, and I keep seeing the goroutine version come out far slower.
https://github.com/denniss/goplayground/blob/master/src/example/sort.go#L69
That's the one using goroutines. And this is the one without
https://github.com/denniss/goplayground/blob/master/src/example/sort.go#L8
I have been trying to figure out why the goroutine implementation performs so much worse than the one without. These are the numbers I see locally:
go run src/main.go
[5 4 3 2 1]
Normal Mergesort
[1 2 3 4 5]
Time: 724
Mergesort with Goroutines
[1 2 3 4 5]
Time: 26690
Yet I still haven't been able to figure out why. I'm wondering if you can give me suggestions or ideas on what to do or look at. It seems to me that the implementation with goroutines should perform at least somewhat better, mainly because of the following lines:
go MergeSortAsync(numbers[0:m], lchan)
go MergeSortAsync(numbers[m:l], rchan)
Using concurrency does not necessarily make an algorithm run faster. In fact, unless the algorithm is inherently parallel, it can slow down the execution.
A processor core can only do one thing at a time, even if, to us, it seems to be doing two things at once. The instructions of two goroutines may be interleaved, but that does not make them run any faster than a single goroutine: at any given moment, only a single instruction from one of the goroutines is being executed (with some very low-level exceptions, depending on hardware features) -- unless your program is running on more than one core.
As far as I know, the standard merge sort algorithm isn't inherently parallel; some modifications are needed to adapt it for parallel execution, and even then you only benefit if multiple processors are actually available.
These optimizations usually relate to the use of channels. I wouldn't agree that "writing to channels has a big overhead" of itself (Go makes it very performant), however, it does introduce the likely possibility that a goroutine will block. It's not the actual writing to a channel that slows down the program significantly, it's scheduling/synchronizing: the waiting and waking of either goroutine to write or read from the channel is probably the bottleneck.
To complement Not_a_Golfer's answer, I will agree that goroutines definitely shine when executing I/O operations--even on a single core--since these occur far away from the CPU. While one goroutine is waiting on I/O, the scheduler can dispatch another CPU-bound goroutine to run in the meantime. However, goroutines also shine for CPU-intensive operations when deployed across multiple processors/cores.
As others have explained, there is a cost to parallelism. You need enough benefit to compensate for that cost, and that only happens when the unit of work is larger than the overhead of creating the channels and goroutines and receiving their results.
You can experiment to determine what the unit of work should be. Suppose the unit of work is sorting 1000 elements; in that case, you can change your code very easily, like so:
func MergeSortAsync(numbers []int, resultChan chan []int) {
    l := len(numbers)
    if l <= 1000 {
        resultChan <- Mergesort(numbers)
        return
    }
    // ... otherwise split and recurse with goroutines as before ...
}
In other words, once the unit of work is too small to justify using goroutines and channels, use your simple Mergesort without those costs.
There are two main reasons:
Writing to channels has a big overhead. Just as a reference: I tried using channels and goroutines as iterators, and they were ~100 times slower than calling methods repeatedly. Of course, if the operation being piped through the channel takes a long time (say, crawling a web page), that difference is negligible.
Goroutines really shine for I/O-based concurrency, and less so for CPU parallelism.
Considering these two issues, you'll need a lot of CPUs, or longer and less blocking operations per goroutine, to make this faster.
