Long-running goroutines

I have 12 to 13 long-running goroutines in my app, and they are responsible for spawning a few thousand short-lived goroutines that come and go.
Other than calling runtime.Gosched() periodically, is there anything else I need to do in the long-running ones?
Note: currently those long-running goroutines supervise collections of resources every 15 to 30 seconds (some every few minutes) and then sleep.

No, there's no ongoing maintenance needed for goroutines. They're managed by the Go runtime and will continue to run until they return or the main goroutine exits. You shouldn't even be calling runtime.Gosched(): it's only needed when a goroutine won't yield on its own, and yours spend most of their time sleeping.
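For illustration, here's a minimal sketch of such a supervisor built on time.Ticker; the interval, check function, and stop channel are placeholders rather than the asker's actual code:

package main

import (
	"time"
)

// superviseResources is a hypothetical long-running supervisor: it wakes
// on every tick, inspects a collection of resources, and goes back to
// waiting. While it is blocked on the ticker the goroutine is parked by
// the runtime and costs nothing; no runtime.Gosched() calls are needed.
func superviseResources(interval time.Duration, check func(), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			check() // inspect the resource collection
		case <-stop:
			return // allow a clean shutdown
		}
	}
}

func main() {
	stop := make(chan struct{})
	go superviseResources(15*time.Second, func() { /* check resources */ }, stop)
	// ... run the rest of the app, then close(stop) on shutdown.
	time.Sleep(time.Minute)
	close(stop)
}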

Related

Is there some reason why I shouldn't use runtime.NumGoroutine to wait for exit?

I'm new to Go, and I'm writing some concurrent practice code using goroutines.
I've seen a lot of examples of worker pools that use time.Sleep(), wait groups, atomic counters, and channels as different ways of determining when a pool of independent goroutines has finished executing before ending a program.
Going through the Go reference I found the following library function:
runtime.NumGoroutine(), which returns the number of goroutines that currently exist.
The following line:
for runtime.NumGoroutine() > 1 {}
lets me wait until all the goroutines have completed without any synchronization code or speculative sleep durations.
In my testing, it works perfectly for waiting until all goroutines complete.
Is there something wrong with this technique that I'm unaware of? It seems like the simplest possible method, but since I have never seen it used in any example code for this very common problem, I suspect there's a reliability problem with it.
I would not recommend this practice: that is a busy loop, and it burns an unnecessary amount of CPU.
When working with goroutines I always recommend a WaitGroup.
You know the number of goroutines, so you can wait for them to finish without using 100% of a CPU.
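For reference, a minimal sketch of that WaitGroup pattern (the worker count and body are placeholders):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1) // register each goroutine before starting it
		go func(id int) {
			defer wg.Done() // signal completion even on early return
			fmt.Println("worker", id, "finished")
		}(i)
	}
	wg.Wait() // blocks without spinning, unlike the NumGoroutine loop
}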

Forcing a goroutine to run X times a second

Is there a way to force a goroutine to run X times a second, no matter whether other goroutines are doing CPU-intensive operations?
A little background on why: I'm working on a game server written in Go. One goroutine handles the game loop, and the game is updated at X ticks per second. Some of the server's operations are expensive (terrain generation, for example). Currently I just spawn a goroutine and let it handle the generation so it won't block the game-loop goroutine, but after testing on a server with a single vCore, I saw that it still blocks the game loop while doing CPU-intensive operations.
After searching online I found claims that Go will not reschedule a goroutine while it is not in a blocking syscall. Now, I could follow the suggestion to manually trigger rescheduling from the goroutine, but that has two problems: it makes the CPU-intensive code messier, since yields have to be handled at specific points, and even after a manual reschedule the scheduler could simply pick another CPU-intensive goroutine instead of the game loop...
Is there a way to force a goroutine to run X times a second, no matter whether other goroutines are doing CPU-intensive operations?
No
after testing on a server with a single vCore, I saw that it still blocks the game loop while doing CPU-intensive operations.
What else do you expect to happen? You have one core and two operations to be performed.
After searching online I found claims that Go will not reschedule a goroutine while it is not in a blocking syscall
Not true.
From the Go 1.14 release notes:
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.
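For completeness: the usual way to aim for X ticks per second is a time.Ticker loop, sketched below with an illustrative tick rate. Note that the ticker only schedules wake-ups; as the answer above says, nothing can guarantee the loop actually gets CPU time on a saturated single core.

package main

import (
	"fmt"
	"time"
)

func main() {
	const ticksPerSecond = 20 // illustrative tick rate
	ticker := time.NewTicker(time.Second / ticksPerSecond)
	defer ticker.Stop()
	for range ticker.C {
		// If an update runs long, time.Ticker drops intermediate ticks
		// rather than queueing them, so the loop slows down gracefully
		// instead of falling ever further behind.
		fmt.Println("tick", time.Now().Format("15:04:05.000")) // stand-in for the game update
	}
}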

Factors affecting goroutine performance

I'm experiencing some weird behavior using goroutines, and I'd appreciate advice or comments on anything I might be doing wrong. Here is my situation:
I'm using goroutines to run a simulation concurrently, but for some reason the elapsed time grows with the number of goroutines I add, to the point where the simulation can't finish in a reasonable time. Here is what I've noticed:
For every 10k goroutines started, the calculation time increases by 5 seconds.
I don't see a shortage of CPU or memory; CPU usage rises only slightly with every 10k goroutines. For example, with 200k goroutines, CPU usage is around 70%.
I'm not using the disk.
I ran the simulation without starting the goroutines and it finished very fast, so the slowness is inside, or caused by, the goroutines.
In one place I tried launching additional goroutines inside each goroutine to run some of the workload in parallel. CPU usage jumped to 100%, but overall speed dropped by 50%...
I'm passing some large structs to the goroutines via pointers. All goroutines use the same data.
Does anyone have a clue about what I might need to optimize, or any test I could perform? Thanks!
Start a few "worker" goroutines and send "jobs" to them via a channel. You save many memory allocations, which consume a lot of CPU.
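A minimal sketch of that worker-pool pattern, with an illustrative job type and worker count:

package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int)
	var wg sync.WaitGroup

	// A fixed pool of workers instead of one goroutine per job: the
	// per-goroutine stack and scheduler overhead is paid 8 times,
	// not 200k times.
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				fmt.Println("processed job", j) // stand-in for one simulation step
			}
		}()
	}

	for j := 0; j < 100; j++ {
		jobs <- j // queue the work
	}
	close(jobs) // lets the workers' range loops exit
	wg.Wait()
}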

In Windows, what does the CPU do while blocking?

A program makes blocking calls whenever it is waiting for some system to respond, e.g. waiting for an internet request. Is the CPU literally wasting time during these calls? (I don't know whether there are machine instructions other than no-ops that would correspond to the CPU literally wasting time.) If not, what is it doing?
The thread is simply skipped when the operating system scheduler looks for work to hand off to a core, with the very common outcome that nothing needs to be done. The processor core then executes the HLT instruction.
In the HALT state it consumes (almost) no power. An interrupt is required to bring it back alive. Most typically that will be the clock interrupt, it ticks 64 times per second by default. It could be a device interrupt. The scheduler then again looks for work to do. Rinse and repeat.
Basically, the kernel maintains run queues or something similar to schedule threads. Each thread receives a time slice in which it gets to execute, until the slice expires or the thread voluntarily yields it. When a thread yields or its slice expires, the scheduler decides which thread gets to execute next.
A blocking system call results in a yield. It also results in the thread being removed from the run queue and placed in a sleep/suspend queue, where it is not eligible to receive time slices. It remains in the sleep/suspend queue until some criterion is met (e.g. a timer tick, data available on a socket, etc.). Once the criterion is met, it is placed back into the run queue.
Sleep(1); // Yield, install a timer, and place the thread in a sleep queue.
As long as there are tasks in any of the run queues (there may be more than one, commonly one per processor core), the scheduler will keep handing out time slices. Depending on scheduler design and hardware constraints, these time slices may vary in length.
When there are no tasks in the run queue, the core can enter a power-saving state until an interrupt is received.
In essence, the processor never wastes time. It's either executing other threads, servicing interrupts, or in a power-saving state (even for very short durations).
While a thread is blocked, especially if it is blocked on an efficient wait object that puts the blocked thread to sleep, the CPU is busy servicing other threads in the system. If no application threads are running, there are always system threads running. The CPU is never truly idle.
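Tying this back to Go: a goroutine blocked on a channel receive is parked by the runtime in just this way, and the backing OS thread sleeps in the kernel. A small sketch (the durations are arbitrary) you can watch in a process monitor:

package main

import (
	"fmt"
	"time"
)

func main() {
	done := make(chan struct{})
	go func() {
		<-done // parked here: no CPU is used while blocked
		fmt.Println("woken by close(done)")
	}()
	// While this sleeps, the process should sit at ~0% CPU,
	// in contrast to the busy-wait loop criticized earlier.
	time.Sleep(3 * time.Second)
	close(done)
	time.Sleep(100 * time.Millisecond) // give the goroutine time to print
}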

Why is a yielding mutex implementation not recommended?

When implementing a mutex there are several architectural choices, such as:
Spinning mutex (spinlock)
Sleeping mutex (a FIFO sleep queue is maintained while WAITING)
Yielding mutex (call the scheduler to run another process when WAITING)
Why is the yielding mutex the least preferred, and how severe would the consequences of using it be?
The sleeping mutex is fairer; a yielding mutex can cause starvation.
The problem with the yielding model is that a process may be asked to yield over and over while other processes scoop up the mutex (see also: barging) or wait a much shorter time.
Depending on how new processes are added to the queue, it could even happen that every time a certain process reaches its turn, it is forced to yield because the mutex is already taken, starving the process.
The FIFO model ensures that waiting processes are served on a first-come/first-served basis. It is harder to implement in the OS, but more fair.
The models can get more complicated. For instance, the OS may also have priorities for processes, and the priorities can change over time. Then the queue system can get even more tricky to implement.
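For concreteness, here's a toy yielding mutex in Go, assuming an atomic flag and runtime.Gosched() as the yield. It exhibits exactly the barging problem described above: whichever goroutine happens to run after a yield can grab the flag, so there is no FIFO fairness.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// YieldLock is a toy yielding mutex: on contention it yields the processor
// (runtime.Gosched) instead of spinning hot or sleeping on a FIFO queue
// the way a sleeping mutex would.
type YieldLock struct {
	held atomic.Bool
}

func (l *YieldLock) Lock() {
	// CompareAndSwap either takes the lock or fails; on failure we yield
	// and retry. Nothing orders the waiters, so an unlucky goroutine can
	// lose this race over and over (starvation).
	for !l.held.CompareAndSwap(false, true) {
		runtime.Gosched()
	}
}

func (l *YieldLock) Unlock() {
	l.held.Store(false)
}

func main() {
	var lock YieldLock
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				lock.Lock()
				counter++ // protected by the toy lock
				lock.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("counter:", counter) // 4000
}

For comparison, Go's own sync.Mutex is a hybrid of these models: it spins briefly, then parks waiters, and hands the lock off FIFO to any waiter that has been blocked for more than about a millisecond (its "starvation mode").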
