Is there a way to force that a goroutine will run X times a second, no matter if there are other goroutines which may be doing a CPU intensive operation?
Some background on why: I am working on a game server written in Go. One goroutine handles the game loop, which updates the game at X ticks per second. Some of the operations the server does are expensive (for example, terrain generation). Currently I just spawn a goroutine and let it handle the generation in a way that should not block the game-loop goroutine, but after testing on a server with a single vcore, I saw that it still blocks the gameloop while doing CPU intensive operations.
After searching online I found out that go would not reschedule a goroutine while it is not in a blocking syscall. I could do as suggested and manually yield from the goroutine, but that has two problems: it makes the CPU-intensive code messier, because it has to handle timeouts at specific points, and even after a manual yield the scheduler could just pick another CPU-intensive goroutine instead of the game loop.
Is there a way to force that a goroutine will run X times a second, no matter if there are other goroutines which may be doing a CPU intensive operation?
No
after testing on a server with a single vcore, I saw that it still blocks the gameloop while doing CPU intensive operations.
What else do you expect to happen? You have one core and two operations to be performed.
After searching online I found out that go would not reschedule a goroutine while it is not in a blocking syscall
Not true.
From the Go 1.14 release notes (runtime section):
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.
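To see the effect of asynchronous preemption on a game loop, here is a minimal sketch. It assumes Go 1.14 or later on a platform that supports asynchronous preemption. The names countTicks, tickInterval and runFor are hypothetical, and the empty busy loop only stands in for something like terrain generation. GOMAXPROCS is forced to 1 to imitate a single-vcore server.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// Hypothetical settings for the sketch.
const (
	tickInterval = 20 * time.Millisecond
	runFor       = 500 * time.Millisecond
)

// countTicks runs a ticker-driven loop for the given duration while a
// CPU-bound goroutine competes for the only scheduler slot, and reports
// how many ticks the loop managed to handle.
func countTicks(interval, duration time.Duration) int {
	prev := runtime.GOMAXPROCS(1) // imitate a single-vcore server
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	done := make(chan struct{})
	go func() {
		defer close(done)
		for atomic.LoadInt32(&stop) == 0 {
			// Busy work standing in for an expensive computation.
		}
	}()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.After(duration)

	ticks := 0
	for {
		select {
		case <-ticker.C:
			ticks++ // a real server would update the game state here
		case <-deadline:
			atomic.StoreInt32(&stop, 1) // tell the busy goroutine to exit
			<-done
			return ticks
		}
	}
}

func main() {
	n := countTicks(tickInterval, runFor)
	fmt.Println("ticks handled under CPU load:", n)
}
```

The ticks can still arrive late, because the busy goroutine is only preempted periodically, so this shows that the loop keeps running, not that it meets a hard X-ticks-per-second guarantee.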
I'm new to Go, and I'm writing some concurrent practice code using goroutines.
I've seen a lot of examples of worker pools that use things like time.Sleep(), wait groups, atomic counters, and channels to determine when a pool of independent goroutines has completed its execution before ending a program.
Going through the Go reference I found the following library function:
runtime.NumGoroutine(), which returns the number of goroutines that currently exist.
The following line:
for runtime.NumGoroutine() > 1 {}
Allows me to wait until all the goroutines have completed without dealing with any synchronizing code or speculative sleep durations.
In my testing, it's working perfectly to wait until all threads complete.
Is there something wrong with this technique that I'm unaware of? It seems like the simplest possible method, but since I have never seen it used in any example code for this very common problem, I suspect there's a reliability problem with it.
I would not recommend this practice, because you have a busy loop there that uses an unnecessary amount of CPU power.
When working with goroutines I always recommend WaitGroup.
You should know the number of goroutines, then wait for them to finish without using 100% CPU.
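A minimal sketch of the WaitGroup approach, for comparison with the busy loop. The function squareAll and its inputs are hypothetical; the point is only that wg.Wait() blocks the caller instead of spinning on runtime.NumGoroutine().

```go
package main

import (
	"fmt"
	"sync"
)

// squareAll squares every input in its own goroutine and blocks on a
// WaitGroup until all of them have finished.
func squareAll(inputs []int) []int {
	results := make([]int, len(inputs))
	var wg sync.WaitGroup
	for i, v := range inputs {
		wg.Add(1) // register the goroutine before starting it
		go func(i, v int) {
			defer wg.Done()
			results[i] = v * v // each goroutine writes only its own slot
		}(i, v)
	}
	wg.Wait() // the caller is parked here and uses no CPU while waiting
	return results
}

func main() {
	fmt.Println(squareAll([]int{1, 2, 3, 4})) // prints [1 4 9 16]
}
```

Unlike the NumGoroutine check, this also keeps working when the program has unrelated goroutines running in the background.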
I'm seeing some weird behavior with goroutines; please give me advice or comments on anything I might be doing wrong. Here is my process:
I am using goroutines to run a simulation concurrently, but for some reason the run time increases with the number of goroutines I start, to the point that the simulation cannot finish in a reasonable time. Here is what I have noticed:
For every 10k goroutines initiated, the calculation time increases by 5 seconds
I don't see a shortage in CPU or memory. However CPU usage increases only a little bit for every 10k goroutines. For example when I put 200k goroutines, CPU usage is around 70%
I'm not using disks
I ran the simulation without triggering the goroutines and it finishes very fast, so the slowness is inside or due to goroutines
I tried to use additional goroutines in 1 occasion inside each goroutine to run some workload in parallel. CPU usage is boosted to 100% but the overall speed decreased by 50%...
I am passing some large structs to goroutines using pointers. All goroutines use the same data.
Does anyone have a clue what I might need to optimize, or can anyone suggest a test I could perform? Thanks!
Start a few "worker" goroutines and send "jobs" to them via a channel. That saves many memory allocations, which consume a lot of CPU.
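A sketch of what such a worker pool could look like. The job and result types, the runPool function, and the "double the input" workload are all hypothetical stand-ins for one unit of simulation work; the sketch assumes workers is at least 1.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// job and result are hypothetical types standing in for one unit of work.
type job struct{ id, input int }
type result struct{ id, output int }

// runPool starts a fixed number of workers that pull jobs from a channel,
// instead of starting one goroutine per job. It assumes workers >= 1.
func runPool(workers int, inputs []int) []int {
	jobs := make(chan job)
	results := make(chan result)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs { // each worker loops until jobs is closed
				results <- result{j.id, j.input * 2} // placeholder workload
			}
		}()
	}

	// Feed the jobs, then signal that no more are coming.
	go func() {
		for i, in := range inputs {
			jobs <- job{i, in}
		}
		close(jobs)
	}()

	// Close results once every worker has exited.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make([]int, len(inputs))
	for r := range results {
		out[r.id] = r.output
	}
	return out
}

func main() {
	fmt.Println(runPool(runtime.NumCPU(), []int{1, 2, 3, 4, 5})) // prints [2 4 6 8 10]
}
```

Sizing the pool by runtime.NumCPU() is a common starting point for CPU-bound work, not a rule; the right number depends on the workload.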
As is well known, a goroutine is a synchronous but non-blocking unit of execution.
The Go scheduler handles non-blocking tasks very well, e.g. sockets, timers, signals, and other events from character devices.
But what about block device I/O or CPU-intensive tasks? They cannot be interrupted until they finish, and they are not multiplexed. The OS thread that runs the goroutine is tied up until the goroutine returns or yields. In that case, the scheduling granularity becomes poor.
Of course, you could split the tasks into smaller sub-tasks in your code. For example, do not copy a 1 GB file in one go; instead, copy the first 10 MB, yield, copy another 10 MB, and so on, so that the other goroutines on the same OS thread get a chance to run. Another example, for a CPU-bound task: zip a file part by part and merge the parts at the end.
But that breaks the convenience of sequential programming, and manual scheduling is hard to balance evenly compared to OS scheduling of OS threads.
nginx has a similar issue: it is a multi-worker-process program with one process per CPU core, similar to the best practice for GOMAXPROCS. It introduced a thread pool to handle blocking tasks. Maybe that would be good for Go too.
I am curious why Go has no OS threading API, which would be a good supplement to goroutines for blocking tasks.
Go has specifically chosen to not directly expose OS threads to the user, and instead chose an M:N threading model. Your unit of execution in Go is the goroutine, which will be multiplexed on N number of OS threads.
In the rare case you have a CPU intensive calculation that contains no preemption points and insufficient OS threads to continue running other goroutines, you have 2 choices; increase GOMAXPROCS, or insert runtime.Gosched() calls to yield to other goroutines.
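The second option, inserting runtime.Gosched() calls, could be sketched as follows. sumWithYield and its yieldEvery parameter are hypothetical names, and the summation only stands in for a real CPU-bound computation.

```go
package main

import (
	"fmt"
	"runtime"
)

// sumWithYield adds the integers 0..n-1 and yields the processor to other
// goroutines once every yieldEvery iterations. A yieldEvery of 0 or less
// disables yielding.
func sumWithYield(n, yieldEvery int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i
		if yieldEvery > 0 && i%yieldEvery == yieldEvery-1 {
			runtime.Gosched() // let other runnable goroutines have a turn
		}
	}
	return total
}

func main() {
	fmt.Println(sumWithYield(1000, 100)) // prints 499500
}
```

Yielding does not let you choose which goroutine runs next; the scheduler picks from whatever is runnable.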
In the case of blocking syscalls, the Go scheduler will automatically dispatch a new OS thread (the time limit to consider a syscall "blocking" has been 20us), and since non-network IO is a series of blocking syscalls, it will almost always be assigned to a dedicated OS thread. Since Go already uses an M:N threading model, the user is usually unaware of the underlying scheduler choices, and can write the program the same as if the runtime used asynchronous IO.
There is an open issue to consider using asynchronous file IO, but there are many issues to overcome, like shortcomings in the Linux aio api, cross-platform compatibility, and interactions with all the various filesystems and devices with which you can do IO.
As a Windows user, I know that OS threads consume ~1 MB of memory each, because by default Windows allocates 1 MB of memory for each thread's user-mode stack. How does Go get by with ~8 KB of memory per goroutine if an OS thread is so much more gluttonous? Are goroutines a sort of virtual thread?
Goroutines are not threads, they are (from the spec):
...an independent concurrent thread of control, or goroutine, within the same address space.
Effective Go defines them as:
They're called goroutines because the existing terms—threads, coroutines, processes, and so on—convey inaccurate connotations. A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
Goroutines don't have their own threads. Instead multiple goroutines are (may be) multiplexed onto the same OS threads so if one should block (e.g. waiting for I/O or a blocking channel operation), others continue to run.
The actual number of threads executing goroutines simultaneously can be set with the runtime.GOMAXPROCS() function. Quoting from the runtime package documentation:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
Note that since Go 1.5, GOMAXPROCS defaults to the number of logical CPUs available; in earlier releases only 1 thread was used to execute goroutines by default.
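The value in effect can be checked at run time. In this small sketch, schedulerLimits is a hypothetical helper; runtime.NumCPU and runtime.GOMAXPROCS are the standard library calls, and passing 0 to GOMAXPROCS queries the setting without changing it.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerLimits reports the number of logical CPUs visible to the process
// and the current GOMAXPROCS value.
func schedulerLimits() (cpus, procs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0) // 0 = query only
}

func main() {
	cpus, procs := schedulerLimits()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d\n", cpus, procs)
}
```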
1 MiB is the default, as you correctly noted. You can pick your own stack size easily (however, the minimum is still a lot higher than ~8 kiB).
That said, goroutines aren't threads. They're just tasks with coöperative multi-tasking, similar to Python's. The goroutine itself is just the code and data required to do what you want; there's also a separate scheduler (which runs on one or more OS threads), which actually executes that code.
In pseudo-code:
loop forever
take job from queue
execute job
end loop
Of course, the execute job part can be very simple, or very complicated. The simplest thing you can do is just execute a given delegate (if your language supports something like that). In effect, this is simply a method call. In more complicated scenarios, there can be also stuff like restoring some kind of context, handling continuations and coöperative task yields, for example.
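As a toy illustration of that loop, and not of how the real Go scheduler is implemented, the simplest "execute a given delegate" case could be written in Go like this. runQueue is a hypothetical name, the jobs are plain func() values, and the loop ends when the queue is closed rather than running forever.

```go
package main

import "fmt"

// runQueue takes jobs from the queue and executes them one at a time on the
// calling goroutine. It returns the number of jobs it executed.
func runQueue(queue chan func()) int {
	executed := 0
	for job := range queue { // take job from queue, until it is closed
		job() // execute job
		executed++
	}
	return executed
}

func main() {
	queue := make(chan func(), 3)
	for i := 1; i <= 3; i++ {
		i := i // capture the current value for the closure
		queue <- func() { fmt.Println("job", i) }
	}
	close(queue)
	fmt.Println("executed:", runQueue(queue))
}
```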
This is a very light-weight approach, and very useful when doing asynchronous programming (which is almost everything nowadays :)). Many languages now support something similar - Python is the first one I've seen with this ("tasklets"), long before go. Of course, in an environment without pre-emptive multi-threading, this was pretty much the default.
In C#, for example, there's Tasks. They're not entirely the same as goroutines, but in practice, they come pretty close - the main difference being that Tasks use threads from the thread pool (usually), rather than a separate dedicated "scheduler" threads. This means that if you start 1000 tasks, it is possible for them to be run by 1000 separate threads; in practice, it would require you to write very bad Task code (e.g. using only blocking I/O, sleeping threads, waiting on wait handles etc.). If you use Tasks for asynchronous non-blocking I/O and CPU work, they come pretty close to goroutines - in actual practice. The theory is a bit different :)
EDIT:
To clear up some confusion, here is what a typical C# asynchronous method might look like:
async Task<string> GetData()
{
    var html = await HttpClient.GetAsync("http://www.google.com");
    var parsedStructure = Parse(html);
    var dbData = await DataLayer.GetSomeStuffAsync(parsedStructure.ElementId);
    return dbData.First().Description;
}
From point of view of the GetData method, the entire processing is synchronous - it's just as if you didn't use the asynchronous methods at all. The crucial difference is that you're not using up threads while you're doing the "waiting"; but ignoring that, it's almost exactly the same as writing synchronous blocking code. This also applies to any issues with shared state, of course - there isn't much of a difference between multi-threading issues in await and in blocking multi-threaded I/O. It's easier to avoid with Tasks, but just because of the tools you have, not because of any "magic" that Tasks do.
The main difference from goroutines in this aspect is that Go doesn't really have blocking methods in the usual sense of the word. Instead of blocking, they queue their particular asynchronous request, and yield. When the OS (and any other layers in Go - I don't have deep knowledge about the inner workings) receives the response, it posts it to the goroutine scheduler, which in turn knows that the goroutine that "waits" for the response is now ready to resume execution; when it actually gets a slot, it will continue on from the "blocking" call as if it had really been blocking - but in effect, it's very similar to what C#'s await does. There's no fundamental difference - there are quite a few differences between C#'s approach and Go's, but they're not all that huge.
And also note that this is fundamentally the same approach used on old Windows systems without pre-emptive multi-tasking - any "blocking" method would simply yield the thread's execution back to the scheduler. Of course, on those systems, you only had a single CPU core, so you couldn't execute multiple threads at once, but the principle is still the same.
Goroutines are what we call green threads. They are not OS threads; the Go scheduler is responsible for them. This is why they can have much smaller memory footprints.
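A rough way to get a feel for how cheap they are is to start a large number of goroutines that do nothing but wait. In this sketch parkGoroutines is a hypothetical helper and the count of 10,000 is arbitrary; starting that many OS threads with 1 MB stacks would be far more expensive.

```go
package main

import (
	"fmt"
	"runtime"
)

// parkGoroutines starts n goroutines that block on a channel. It returns how
// many additional goroutines the runtime reports, plus a function that
// releases them all.
func parkGoroutines(n int) (extra int, release func()) {
	before := runtime.NumGoroutine()
	stop := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-stop }() // waits until stop is closed
	}
	return runtime.NumGoroutine() - before, func() { close(stop) }
}

func main() {
	extra, release := parkGoroutines(10_000)
	defer release()
	fmt.Println("goroutines alive at once:", extra)
}
```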
One has blocking calls whenever the CPU is waiting for some system to respond, e.g. waiting for an internet request. Is the CPU literally wasting time during these calls (I don't know whether there are machine instructions other than no-op that would correspond to the CPU literally wasting time). If not, what is it doing?
The thread is simply skipped when the operating system scheduler looks for work to hand off to a core, very commonly with the outcome that nothing needs to be done. The processor core then executes the HLT instruction.
In the HALT state it consumes (almost) no power. An interrupt is required to bring it back alive. Most typically that will be the clock interrupt, it ticks 64 times per second by default. It could be a device interrupt. The scheduler then again looks for work to do. Rinse and repeat.
Basically, the kernel maintains run queues or something similar to schedule threads. Each thread receives a time slice in which it gets to execute until the slice expires or the thread voluntarily yields it. When a thread yields or its slice expires, the scheduler decides which thread gets to execute next.
A blocking system call would result in a yield. It would also result in the thread being removed from the run queue and placed in a sleep/suspend queue where it is not eligible to receive time slices. It would remain in the sleep/suspend queue until some criterion is met (e.g. timer tick, data available on socket, etc.). Once the criterion is met, it'd be placed back into the run queue.
Sleep(1); // Yield, install a timer, and place the thread in a sleep queue.
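The same pattern can be observed one level up, with goroutines instead of kernel threads. This is only an analogue, sketched in Go: the waiter below is parked by the Go runtime on a channel receive, not by the kernel, and wakeOrder is a hypothetical name. While the waiter is blocked, the worker gets to run; once the wake-up condition is met, the waiter becomes runnable again.

```go
package main

import (
	"fmt"
	"sync"
)

// wakeOrder starts a waiter that blocks on a channel and a worker that does
// some computation and then wakes the waiter. It returns the events in the
// order they happened.
func wakeOrder() []string {
	var events []string
	ready := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)

	// The waiter is parked on the receive; it does not spin while waiting.
	go func() {
		defer wg.Done()
		<-ready
		events = append(events, "waiter woke")
	}()

	// The worker uses the CPU while the waiter is blocked.
	go func() {
		defer wg.Done()
		sum := 0
		for i := 0; i < 1000; i++ {
			sum += i
		}
		events = append(events, fmt.Sprintf("worker finished, sum=%d", sum))
		close(ready) // the wake-up condition for the waiter
	}()

	wg.Wait()
	return events
}

func main() {
	for _, e := range wakeOrder() {
		fmt.Println(e)
	}
}
```

The worker's append happens before close(ready), and the waiter's append happens after the receive, so the two writes to events are ordered and the worker's entry always comes first.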
As long as there are tasks in any of the run queues (there may be more than one, commonly one per processor core), the scheduler will keep handing out time slices. Depending on scheduler design and hardware constraints, these time slices may vary in length.
When there are no tasks in the run queue, the core can enter a powersaving state until an interrupt is received.
In essence, the processor never wastes time. It's either executing other threads, servicing interrupts, or in a power-saving state (even for very short durations).
While a thread is blocked, especially if it is blocked on an efficient wait object that puts the blocked thread to sleep, the CPU is busy servicing other threads in the system. If there are no application threads running, there are always system threads running. The CPU is never truly idle.