Kotlin Coroutines' utilisation of IO threads - kotlin-coroutines

I need to understand: if I have a single IO thread in a system and I run multiple IO operations on multiple coroutines, can those coroutines share that same thread in a suspended manner (meaning that while coroutine A is waiting for its IO result, coroutine B can use the thread for its own IO operation), or will the thread be blocked by the first IO operation?

It depends on what kind of IO operation you are doing. An asynchronous (suspending) IO operation does not block the thread: while one coroutine is suspended waiting for its result, other coroutines can use the thread. A blocking IO operation holds the thread for its entire duration, so other coroutines cannot use it until the operation completes.

Related

Waiting for GUI events and IOCP notifications at once

Can I wait for GUI events — that is, pump a message loop — and on an I/O completion port at the same time? I would like to integrate libuv with the Windows GUI.
There are two solutions that I know of. One works on all versions of Windows, but involves multiple threads. The other is faster, but only supports Windows 10+ (thank you @RbMm for this fact).
In the first solution, another thread calls GetQueuedCompletionStatusEx in a loop and sends messages to the main thread with SendMessage. The main thread reads the messages from its message loop, notes the custom message type, and dispatches the I/O completions.
This solution works on all versions of Windows, but is slower. However, if one is willing to trade latency for throughput, one can increase the GetQueuedCompletionStatusEx receive buffer to recover nearly the same throughput as the second solution. For best performance, both threads should use the same CPU, to avoid playing cache ping-pong with the I/O completions.
In the second solution, the main thread uses MsgWaitForMultipleObjectsEx to wait for the completion port to be signaled or user input to arrive. Once it is signaled, the main thread calls GetQueuedCompletionStatusEx with a zero timeout.
This assumes that an IOCP that is used by only one thread becomes signaled precisely when an I/O completion arrives. This is only true on Windows 10 and up. Otherwise, you will busyloop, since the IOCP will always be signaled. On systems that support this method, it should be faster, since it reduces scheduling overhead.

golang: how to handle blocking tasks optimally?

As is known, a goroutine is a synchronous but non-blocking processing unit.
The Go scheduler handles non-blocking tasks, e.g. sockets, timers, signals, or other events from character devices, very well.
But what about block-device IO or CPU-intensive tasks? They cannot be interrupted until they finish, and cannot be multiplexed. The OS thread that runs such a goroutine would freeze until the goroutine returns or yields. In that case, the scheduling granularity becomes poor.
Of course, you could split the tasks into smaller sub-tasks in your code. For example, instead of copying a 1 GB file all at once, copy the first 10 MB, yield, copy another 10 MB, and so on, so that other goroutines on the same OS thread get a chance to run. Another example, for a CPU-bound task: zip a file part by part and merge the parts at the end.
But that breaks the convenience of sequential programming, and such manual scheduling is hard to balance evenly compared to OS scheduling of OS threads.
nginx has a similar issue: it is a multi-worker-process program, one process per CPU core, similar to the best practice for GOMAXPROCS. It brings in a thread pool to handle blocking tasks. Maybe that would be good for Go too.
I am curious why Go has no OS threading API, which would be a good supplement to goroutines for blocking tasks.
Go has specifically chosen to not directly expose OS threads to the user, and instead chose an M:N threading model. Your unit of execution in Go is the goroutine, which will be multiplexed on N number of OS threads.
In the rare case you have a CPU-intensive calculation that contains no preemption points and insufficient OS threads to continue running other goroutines, you have two choices: increase GOMAXPROCS, or insert runtime.Gosched() calls to yield to other goroutines.
In the case of blocking syscalls, the Go scheduler will automatically dispatch a new OS thread (the time limit to consider a syscall "blocking" has been 20us), and since non-network IO is a series of blocking syscalls, it will almost always be assigned to a dedicated OS thread. Since Go already uses an M:N threading model, the user is usually unaware of the underlying scheduler choices, and can write the program the same as if the runtime used asynchronous IO.
There is an open issue to consider using asynchronous file IO, but there are many issues to overcome, like shortcomings in the Linux aio api, cross-platform compatibility, and interactions with all the various filesystems and devices with which you can do IO.

Goroutines 8kb and windows OS thread 1 mb

As a Windows user, I know that OS threads consume ~1 MB of memory each because, by default, Windows allocates 1 MB for each thread's user-mode stack. How does Go manage with only ~8 KB per goroutine, when an OS thread is so much more gluttonous? Are goroutines a sort of virtual thread?
Goroutines are not threads, they are (from the spec):
...an independent concurrent thread of control, or goroutine, within the same address space.
Effective Go defines them as:
They're called goroutines because the existing terms—threads, coroutines, processes, and so on—convey inaccurate connotations. A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
Goroutines don't have their own threads. Instead, multiple goroutines are (or may be) multiplexed onto the same OS threads, so if one should block (e.g. waiting for I/O or on a blocking channel operation), the others continue to run.
The actual number of threads executing goroutines simultaneously can be set with the runtime.GOMAXPROCS() function. Quoting from the runtime package documentation:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
Note that before Go 1.5 only one thread was used to execute goroutines by default; since Go 1.5, the default value of GOMAXPROCS is the number of CPU cores available.
1 MiB is the default, as you correctly noted. You can pick your own stack size easily (however, the minimum is still a lot higher than ~8 kiB).
That said, goroutines aren't threads. They're just tasks with coöperative multi-tasking, similar to Python's. The goroutine itself is just the code and data required to do what you want; there's also a separate scheduler (which runs on one or more OS threads), which actually executes that code.
In pseudo-code:
loop forever
    take job from queue
    execute job
end loop
Of course, the execute job part can be very simple, or very complicated. The simplest thing you can do is just execute a given delegate (if your language supports something like that). In effect, this is simply a method call. In more complicated scenarios, there can be also stuff like restoring some kind of context, handling continuations and coöperative task yields, for example.
This is a very light-weight approach, and very useful when doing asynchronous programming (which is almost everything nowadays :)). Many languages now support something similar - Python is the first one I've seen with this ("tasklets"), long before go. Of course, in an environment without pre-emptive multi-threading, this was pretty much the default.
In C#, for example, there's Tasks. They're not entirely the same as goroutines, but in practice, they come pretty close - the main difference being that Tasks use threads from the thread pool (usually), rather than separate dedicated "scheduler" threads. This means that if you start 1000 tasks, it is possible for them to be run by 1000 separate threads; in practice, it would require you to write very bad Task code (e.g. using only blocking I/O, sleeping threads, waiting on wait handles etc.). If you use Tasks for asynchronous non-blocking I/O and CPU work, they come pretty close to goroutines - in actual practice. The theory is a bit different :)
EDIT:
To clear up some confusion, here is how a typical C# asynchronous method might look like:
// httpClient is assumed to be an HttpClient instance; Parse and
// DataLayer are application code.
async Task<string> GetData()
{
    var html = await httpClient.GetStringAsync("http://www.google.com");
    var parsedStructure = Parse(html);
    var dbData = await DataLayer.GetSomeStuffAsync(parsedStructure.ElementId);
    return dbData.First().Description;
}
From the point of view of the GetData method, the entire processing is synchronous - it's just as if you didn't use the asynchronous methods at all. The crucial difference is that you're not using up threads while you're doing the "waiting"; but ignoring that, it's almost exactly the same as writing synchronous blocking code. This also applies to any issues with shared state, of course - there isn't much of a difference between multi-threading issues in await and in blocking multi-threaded I/O. It's easier to avoid with Tasks, but just because of the tools you have, not because of any "magic" that Tasks do.
The main difference from goroutines in this aspect is that Go doesn't really have blocking methods in the usual sense of the word. Instead of blocking, they queue their particular asynchronous request, and yield. When the OS (and any other layers in Go - I don't have deep knowledge about the inner workings) receives the response, it posts it to the goroutine scheduler, which in turn knows that the goroutine that "waits" for the response is now ready to resume execution; when it actually gets a slot, it will continue on from the "blocking" call as if it had really been blocking - but in effect, it's very similar to what C#'s await does. There's no fundamental difference - there are quite a few differences between C#'s approach and Go's, but they're not all that huge.
And also note that this is fundamentally the same approach used on old Windows systems without pre-emptive multi-tasking - any "blocking" method would simply yield the thread's execution back to the scheduler. Of course, on those systems, you only had a single CPU core, so you couldn't execute multiple threads at once, but the principle is still the same.
Goroutines are what we call green threads. They are not OS threads; the Go scheduler is responsible for them. This is why they can have much smaller memory footprints.

How to check if a thread is waiting?

For purposes of profiling and monitoring, I would like to know if a thread is currently active (using CPU time) or if it is waiting or sleeping.
Is there a way to find out if a thread is currently in one of the various Windows kernel waiting functions?
From WaitForSingleObject to various mutex, sleep, critical section, IOCP GetQueuedCompletionStatus, and other I/O functions etc. there are quite a few functions that can result in a thread waiting.
Is there a standard way to know if a thread is waiting?
The Wait Chain Traversal API does what you ask. And a whole lot more besides.

I/O Completion Ports vs. RegisterWaitForSingleObject?

What's the difference between using I/O completion ports, versus just using RegisterWaitForSingleObject to have a thread pool thread wait for I/O to complete?
Is one of them faster, and if so, why?
IOCPs are generally the fastest-performing IO turn-around mechanism you will find, for one reason above all else: blocking detection.
The simple example of this is a server that is responsible for serving up files from a disk. An IOCP is generally made up of three primary things:
The pool of N threads for servicing the IOCP requests.
A limit of M threads (M is always < N) that tells the IOCP how many concurrent, non-blocked threads to allow.
A completion-status loop that all threads run on.
The difference between N and M in this is very important. The general philosophy is to configure M to be the number of cores on the machine, and N to be larger. How much larger depends on the amount of time your worker threads spend in a blocked-state. If you're reading disk files, your threads will be bound to the speed of the disk IO channel. When you make that call to ReadFile() you've just introduced a blocking call. If M == N, then as soon as you hit all threads reading disk files, you're utterly stalled, with all threads on the disk IO channel.
But what if there was a way for some fancy scheduler to "know" that this thread is (a) participating in an IOCP thread pool, and (b) just stalled because it issued an API call that will be time consuming? What if, when that happens, that fancy scheduler could temporarily "move" that thread into a special "running-but-stalled" group, and then "release" an extra thread that has volunteered to work while there are threads stalled?
That is exactly what IOCP brings. When N is greater than M, The IOCP will put the thread that just issued the stall into a special running-but-stalled state, and then temporarily "borrow" an additional thread from your pool of N. It will continue to do this until the N pool is exhausted, or threads that were stalled begin returning from their blocking requests.
So under that light, an IOCP configured to have, say 8 threads concurrently running on an 8-core machine could actually have a few hundred threads in the real pool. Only 8 will ever be "allowed" to be concurrently running in non-blocked state, though you may pop over that temporarily when blocked threads return from their blocks and you already have borrowed threads servicing additional requests.
Finally, though not as important for your cause, it is still important: An IOCP thread will NOT block, nor context switch, if there is pending work on the queue when it finishes its current work and issues its next GetQueuedCompletionStatus() call. If there is work waiting, it will pick it up and continue executing with no mandated preemption. Of course the OS scheduler may preempt anyway, but only as part of the general scheduler; not because of the specific call to GetQueuedCompletionStatus(). The lone exception to this is if there are already over M threads running and non-blocked. In that case, GetQueuedCompletionStatus() will block the calling thread until it is needed again for slack-work when enough threads once again become blocked.
The description you gave indicates you will be heavily disk-io-bound. For absolute performance-critical io-server architectures, it is near-impossible to beat the benefits of IOCP, especially the OS-level block-detection that allows the scheduler to know it can temporarily release extra threads from your master-pool to keep things pumping while other threads are stalled.
You simply cannot replicate that specific feature of IOCPs using Windows thread pools. If all of your threads were number crunchers with little or no IO, I would say thread-pools would be a better fit, but your specificity of disk-IO tells me you should be using an IOCP instead.
