Why do we need `CreateThreadpoolIo` and `StartThreadpoolIo` for IOCP? - winapi

I'm trying to use IOCP via the Windows API functions CreateThreadpoolIo and StartThreadpoolIo, but I found that the thread pool only parallelizes the code that runs after the I/O has completed. The asynchronous I/O submit operations are still executed sequentially in the main thread. So why do we need this? I think making the I/O submit operations parallel would improve throughput even though they are asynchronous operations, right?
The other cost is that if we make them parallel, we might need to lock something to guarantee data consistency (thread-safe operation).

It is possible to do IOCP without using CreateThreadpoolIo / StartThreadpoolIo; in that case you have to manage calling GetQueuedCompletionStatus yourself (whether in a self-managed thread pool or otherwise - it is even conceivable that it could be interleaved into the actions of the thread that started the I/O, but in that case why bother with IOCP?). StartThreadpoolIo is needed in order to have a pool thread waiting on GetQueuedCompletionStatus instead of WaitForMultipleObjects (or one of its variants). CancelThreadpoolIo decrements a counter of how many IOCP operations are outstanding, and if that counter reaches 0 the thread pool knows it can stop waiting on GetQueuedCompletionStatus.
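To illustrate, a minimal self-managed completion loop might look like the sketch below (HandleCompletion is a placeholder for your own dispatch logic, not a real API). This is essentially the loop the system thread pool runs for you when you use CreateThreadpoolIo / StartThreadpoolIo:

```cpp
#include <windows.h>

// Placeholder dispatch routine - not a real API, just part of this sketch.
void HandleCompletion(ULONG_PTR key, OVERLAPPED* pov, DWORD bytes, BOOL ok);

DWORD WINAPI CompletionWorker(LPVOID param)
{
    HANDLE hPort = (HANDLE)param;   // a port created earlier with CreateIoCompletionPort
    for (;;)
    {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* pov = nullptr;
        BOOL ok = GetQueuedCompletionStatus(hPort, &bytes, &key, &pov, INFINITE);
        if (!ok && pov == nullptr)
            break;                  // the port was closed or the wait itself failed
        HandleCompletion(key, pov, bytes, ok);  // ok == FALSE with pov != nullptr means a failed I/O
    }
    return 0;
}
```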

CreateThreadpoolIo creates a TP_IO object and calls ZwSetInformationFile with FileCompletionInformation and a FILE_COMPLETION_INFORMATION structure to set the CompletionContext in the FILE_OBJECT. As a result, when an I/O operation on the file finishes (provided no synchronous error was returned and a non-zero ApcContext was passed), the system queues a packet to the I/O completion port (the one supplied in FILE_COMPLETION_INFORMATION) with the Key from FILE_COMPLETION_INFORMATION and the ApcContext from the concrete I/O call (the Win32 API always passes a pointer to the OVERLAPPED structure here). The address of the user callback (IoCompletionCallback) is stored inside the TP_IO.
StartThreadpoolIo increments the reference count on the TP_IO, and CancelThreadpoolIo (as well as CloseThreadpoolIo) decrements it. This is needed to manage the lifetime of the TP_IO: before you start any I/O operation, you need to increment the reference count on the TP_IO. When the I/O finishes, a packet is queued to the I/O port; one of the threads from the pool pops this packet, takes the Key (lpCompletionKey), converts it to a pointer to the TP_IO, and calls the user callback IoCompletionCallback. After the callback returns, the system decrements the reference count on the TP_IO. If the I/O fails synchronously, there will be no packet and no callback, so you have to decrement the reference count on the TP_IO yourself - that is what CancelThreadpoolIo is for.
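A sketch of the resulting pattern around each I/O call (assuming hFile was opened with FILE_FLAG_OVERLAPPED and pio was obtained from CreateThreadpoolIo for that handle):

```cpp
BYTE buffer[4096];
OVERLAPPED ov = {};                 // must stay valid until the completion callback runs

StartThreadpoolIo(pio);             // take a reference on the TP_IO before starting the I/O
if (!ReadFile(hFile, buffer, sizeof(buffer), nullptr, &ov) &&
    GetLastError() != ERROR_IO_PENDING)
{
    // Synchronous failure: no completion packet will arrive and the callback will
    // never run, so release the reference we just took.
    CancelThreadpoolIo(pio);
}
```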

Related

How to allocate a thread pool while using asyncio ProactorEventLoop

I'm currently using asyncio in Python 3.7 and writing a TCP server with the asyncio.start_server() function,
following this example: https://docs.python.org/3/library/asyncio-stream.html
I also tried asyncio.ProactorEventLoop, which uses “I/O Completion Ports” (IOCP).
According to this official Microsoft doc, https://learn.microsoft.com/en-ca/windows/win32/fileio/i-o-completion-ports, I/O completion ports are used with a pre-allocated thread pool, but I cannot find where the number of threads can be set.
Where can I set the number of threads in the thread pool?
Can anyone please help me here? Thanks a lot!
First, some general information about I/O completion ports (IOCP) and thread pools. We have two options here:
Create everything yourself:
create the IOCP yourself via CreateIoCompletionPort (or NtCreateIoCompletion);
create the threads yourself; they will call GetQueuedCompletionStatus (or NtRemoveIoCompletion);
bind every file to your IOCP yourself, via NtSetInformationFile with FileCompletionInformation and FILE_COMPLETION_INFORMATION, or via CreateIoCompletionPort (this Win32 API combines the functionality of NtCreateIoCompletion and NtSetInformationFile).
Use the system IOCP(s) and thread pool(s):
The system (ntdll.dll) creates a default thread pool (currently named TppPoolpGlobalPool) at process startup. You have only weak control over this pool: you cannot get a direct PTP_POOL pointer to it. There is an undocumented TpSetDefaultPoolMaxThreads (for setting the maximum number of threads in this pool), but nothing for the minimum.
If you want, you can create additional thread pools via the CreateThreadpool function.
After creating the new thread pool, you can (but do not have to) call SetThreadpoolThreadMaximum to specify the maximum number of threads that the pool can allocate and SetThreadpoolThreadMinimum to specify the minimum number of threads available in the pool.
The thread pool maintains an I/O completion port; this IOCP is created inside the call to CreateThreadpool, and we have no direct access to it.
So initially the process has one global/default thread pool (TppPoolpGlobalPool) and its IOCP (on Windows 10 the parallel loader creates one more thread pool, LdrpThreadPool, but that is of course only for internal use while DLLs are loading).
Finally, you bind your files to the IOCP by calling CreateThreadpoolIo.
Note that the MSDN documentation is wrong here:
Creates a new I/O completion object.
In reality, CreateThreadpoolIo does not create a new I/O completion object - that object is created only inside the call to CreateThreadpool. This API binds a file (not a handle, but the file!) to the I/O completion object associated with a pool. Which pool? Look at the last parameter - an optional pointer to a TP_CALLBACK_ENVIRON.
You can specify a thread pool as follows: allocate a callback environment, call InitializeThreadpoolEnvironment on it, and then call SetThreadpoolCallbackPool.
If you do not specify a thread pool in the call to CreateThreadpoolIo, the global thread pool is used, so the file is bound to the default/global process IOCP.
In this case you do not need to call GetQueuedCompletionStatus (or NtRemoveIoCompletion) yourself - the system does this for you from the pool and then calls your IoCompletionCallback callback function, the one you passed to CreateThreadpoolIo. The pieces fit together as in the sketch below.
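A rough sketch of case 2 with a custom pool (hFile is assumed to be a handle opened with FILE_FLAG_OVERLAPPED; MyIoCompletionCallback is a placeholder name; error handling is omitted):

```cpp
#include <windows.h>

// Placeholder callback with the PTP_WIN32_IO_CALLBACK signature.
VOID CALLBACK MyIoCompletionCallback(PTP_CALLBACK_INSTANCE instance, PVOID context,
                                     PVOID overlapped, ULONG ioResult,
                                     ULONG_PTR bytesTransferred, PTP_IO io)
{
    // Handle the completed OVERLAPPED here.
}

void BindFileToPrivatePool(HANDLE hFile)
{
    PTP_POOL pool = CreateThreadpool(nullptr);
    SetThreadpoolThreadMaximum(pool, 4);        // optional limits, as described above
    SetThreadpoolThreadMinimum(pool, 1);

    TP_CALLBACK_ENVIRON env;
    InitializeThreadpoolEnvironment(&env);
    SetThreadpoolCallbackPool(&env, pool);      // route callbacks to our pool, not the global one

    PTP_IO pio = CreateThreadpoolIo(hFile, MyIoCompletionCallback, nullptr, &env);
    // Call StartThreadpoolIo(pio) before every overlapped operation on hFile,
    // and CancelThreadpoolIo(pio) if such an operation fails synchronously.
}
```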
We can also use the system's global thread pool and IOCP via BindIoCompletionCallback (or RtlSetIoCompletionCallback) - it associates the I/O completion port owned by the global (TppPoolpGlobalPool) thread pool with the specified file handle. This is an older API and a variant of case 2; here we cannot use a custom pool, only the process-global one.
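A minimal sketch of that older route (FileIoCompletion is a placeholder name with the LPOVERLAPPED_COMPLETION_ROUTINE signature; hFile must again be an overlapped handle):

```cpp
#include <windows.h>

// Placeholder routine with the LPOVERLAPPED_COMPLETION_ROUTINE signature.
VOID CALLBACK FileIoCompletion(DWORD errorCode, DWORD bytesTransferred, LPOVERLAPPED overlapped)
{
    // Runs on a thread of the process-global pool.
}

void AssociateWithGlobalPool(HANDLE hFile)
{
    // Associates hFile with the completion port owned by the global thread pool.
    BindIoCompletionCallback(hFile, FileIoCompletion, 0);
}
```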
Now let's get back to the concrete Python code. Which case does it use? Does it create the IOCP and thread pool itself? Or does it use a system thread pool? If it uses a system pool, is it the global one or a custom pool allocated with CreateThreadpool? If you don't know this, nothing can be done here. And even if you know, either the library exposes a special API/interface (or whatever this is called in Python) for controlling this (in case a self-made or custom pool is used), or you can only use it as is. And it is really hard to decide how many threads you actually need in the pool.

I/O Completion Ports: when to increase/decrease the RefCount of the per-socket structure in a multi-threaded design?

I read this question
I/O Completion Ports *LAST* called callback, or: where it's safe to cleanup things
And I cannot get my issue solved; the answer does not fully cover this method.
I have also searched a lot here and on Google but cannot find a solution, so I am opening a question here and hope it is not a duplicate.
In a multi-threaded I/O completion ports design, when should I increase the RefCount of the per-socket structure, i.e. the CompletionKey? Currently I increase it before calling WSARecv, and if the return value of the call is not 0 and the last error is not ERROR_IO_PENDING, I decrease it and call a cleanup function; this function checks whether the RefCount is 0 and, if it is, frees the per-socket structure, otherwise it only frees the per-I/O structure (the one containing the OVERLAPPED). I also increase the RefCount before issuing any WSASend, in the same way as above. The RefCount is made atomic using a CRITICAL_SECTION. Upon returning from GetQueuedCompletionStatus I also decrease the RefCount.
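For reference, the pattern described above looks roughly like the sketch below; the PerSocket/PerIo structures and the FreePerSocket/FreePerIo helpers are placeholder names, and Interlocked* is used here in place of a CRITICAL_SECTION:

```cpp
#include <winsock2.h>
#include <windows.h>

struct PerIo                         // placeholder per-operation context
{
    OVERLAPPED Overlapped;
    WSABUF     WsaBuf;               // assumed to be initialized by the caller
    DWORD      Flags;
};

struct PerSocket                     // placeholder per-socket context (the CompletionKey)
{
    SOCKET        Socket;
    volatile LONG RefCount;
};

void FreePerSocket(PerSocket* p);    // hypothetical cleanup helpers
void FreePerIo(PerIo* p);

void IssueRecv(PerSocket* pPerSocket, PerIo* pPerIo)
{
    InterlockedIncrement(&pPerSocket->RefCount);            // reference taken before the I/O

    int rc = WSARecv(pPerSocket->Socket, &pPerIo->WsaBuf, 1, nullptr,
                     &pPerIo->Flags, &pPerIo->Overlapped, nullptr);
    if (rc != 0 && WSAGetLastError() != WSA_IO_PENDING)
    {
        // Synchronous failure: no completion packet will arrive, so undo the reference.
        if (InterlockedDecrement(&pPerSocket->RefCount) == 0)
            FreePerSocket(pPerSocket);
        FreePerIo(pPerIo);
    }
}
```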
However, I have some questions about this method.
I have a function that sends files from the main thread. The function reads the file and issues a PostQueuedCompletionStatus to do a send using WSASend on the I/O worker threads; it sends the file in chunks, and when each chunk completes, the I/O worker threads inform the main thread with PostMessage so it can issue a send of the next chunk.
Now where am I supposed to increase this RefCount? In the main thread, just before issuing the call to PostQueuedCompletionStatus? But what if GetQueuedCompletionStatus returns and frees the per-socket structure while the main thread is still using it (for example, the main thread is executing the send function but has not yet increased the RefCount)? I tried to increase the RefCount in the WSASend call on the I/O worker threads instead, but it has the same issue.
For instance: what if a thread wakes up from GetQueuedCompletionStatus because of a socket closure (caused by the outstanding WSARecv), decrements the RefCount, and it reaches 0, so it frees the per-socket structure while a WSASend is executing on another I/O worker thread that has not yet increased the RefCount? Then, obviously, the thread that is about to issue the WSASend call will crash with an access violation as soon as it tries to enter the critical section.
Any idea how to synchronize access to this structure between the I/O worker threads and the main thread?

How to force GetQueuedCompletionStatus() to return immediately?

I have a hand-made thread pool. The threads read from a completion port and do some other work. One particular thread has to be ended. How do I interrupt its wait if it is blocked in GetQueuedCompletionStatus() or GetQueuedCompletionStatusEx()?
A finite timeout (100-1000 ms) plus an exit flag is far from elegant, causes delays, and is left as a last resort.
CancelIo(completionPortHandle) within an APC in the target thread fails with ERROR_INVALID_HANDLE.
CancelSynchronousIo(completionPortHandle) fails with ERROR_NOT_FOUND.
PostQueuedCompletionStatus() with a termination packet doesn't let me choose which thread receives it.
A rough TerminateThread() with a mutex should work (I haven't tested it), but is it a sound approach?
I tried waiting on a special event and the completion port together. WaitForMultipleObjects() returned immediately, as if the completion port were signalled, but GetQueuedCompletionStatus() showed that nothing had actually been queued.
I read Overlapped I/O: How to wake a thread on a completion port event or a normal event? and googled a lot.
Probably the problem itself - ending a specific thread's work - is a sign of bad design, and all my threads should be equal and grouped into a normal thread pool. In that case the PostQueuedCompletionStatus() approach should work (although I doubt that this approach is elegant and concise, especially if the threads use GetQueuedCompletionStatusEx() to get multiple packets at once).
If you just want to reduce the size of the thread pool it doesn't matter which thread exits.
However, if for some reason you need to signal to a particular thread that it should exit, rather than allowing any thread to exit, you can use this method.
If you use GetQueuedCompletionStatusEx you can do an alertable wait, by passing TRUE for fAlertable. You can then use QueueUserAPC to queue an APC to the thread you want to quit.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms684954(v=vs.85).aspx
If the thread is busy then you will still have to wait for the current work item to be completed.
Certainly don't call TerminateThread.
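A sketch of that approach (hPort, Worker and ExitApc are illustrative names; the controlling thread queues the APC with QueueUserAPC on the worker's thread handle):

```cpp
#include <windows.h>

static volatile LONG g_stop = 0;          // illustrative exit flag

// Queued by the controlling thread with QueueUserAPC(ExitApc, hWorkerThread, 0).
VOID CALLBACK ExitApc(ULONG_PTR /*param*/)
{
    InterlockedExchange(&g_stop, 1);
}

DWORD WINAPI Worker(LPVOID param)
{
    HANDLE hPort = (HANDLE)param;         // the completion port handle
    OVERLAPPED_ENTRY entries[16];
    for (;;)
    {
        ULONG removed = 0;
        BOOL ok = GetQueuedCompletionStatusEx(hPort, entries, ARRAYSIZE(entries),
                                              &removed, INFINITE, TRUE /*fAlertable*/);
        if (!ok)
        {
            if (GetLastError() == WAIT_IO_COMPLETION)   // an APC ran during the alertable wait
            {
                if (g_stop)
                    break;                              // asked to exit by ExitApc
                continue;
            }
            break;                                      // port closed or other failure
        }
        for (ULONG i = 0; i < removed; ++i)
        {
            // process entries[i] ...
        }
    }
    return 0;
}
```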
Unfortunately, I/O completion port handles are always in a signaled state and as such cannot really be used in WaitFor* functions.
GetQueuedCompletionStatus[Ex] is the only way to block on the completion port. With an empty queue, the function will return only if the thread becomes alerted. As mentioned by @Ben, QueueUserAPC will make the thread alerted and cause GetQueuedCompletionStatus to return.
However, QueueUserAPC allocates memory and thus can fail in low-memory conditions or when memory quotas are in effect. The same holds for PostQueuedCompletionStatus. As such, using any of these functions on an exit path is not a good idea.
Unfortunately, the only robust way seems to be calling the undocumented NtAlertThread exported by ntdll.dll.
extern "C" NTSTATUS __stdcall NtAlertThread(HANDLE hThread);
Link with ntdll.lib. This function will put the target thread into an alerted state without queuing anything.

When a thread that calls SetWaitableTimer exits while another thread is waiting on the timer, is the timer cancelled?

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686289%28v=vs.85%29.aspx
According to MSDN, in the Remarks section, it states:
"If the thread that set the timer terminates and there is an associated completion routine, the timer is canceled. However, the state of the timer remains unchanged. If there is no completion routine, then terminating the thread has no effect on the timer."
Then further down, it states:
"If the thread that called SetWaitableTimer exits, the timer is canceled. This stops the timer before it can be set to the signaled state and cancels outstanding APCs; it does not change the signaled state of the timer."
Hence my question,
if I have one thread calling SetWaitableTimer without an associated completion routine and another thread calling WaitForMultipleObjects (passing in the timer object handle), and the thread that called SetWaitableTimer exits shortly thereafter, would the timer object be cancelled or would it still become signaled when the period expires?
To give more information directly from the implementation of waitable timers: if you use a CompletionRoutine, the timer is placed on a linked list chained off the thread which called SetWaitableTimer. When the thread is terminated, the kernel walks the dying thread's linked list and cancels any timers which are still queued.
If you're not using a completion routine, the timer is never added to any thread's linked list and thus isn't cancelled when any particular thread dies.
The documentation is somewhat unclear. I think the best you can do is test it yourself. I believe, however, that the timer is cancelled automatically only if the completion routine is used.
I can give some "theoretical" background about Windows APCs to justify my (educated) guess.
APC = "asynchronous procedure call". In Windows, every user-mode thread is equipped with a so-called APC queue, a system-managed queue of procedures that must be called on this thread. A thread may deliberately enter a so-called "alertable wait" state, during which it may execute one or more of the procedures in this queue. You may either put a procedure call in the APC queue manually, or issue an I/O which on completion will "put" the procedure call there.
In simple words, the scenario is the following: you issue several I/Os and then wait for any of them to complete (or fail), and perhaps for some other events as well, by calling one of the alertable-wait functions: SleepEx, WaitForMultipleObjectsEx, or similar.
Important note: this mechanism is designed to support single-threaded concurrency. That is, the same thread issues several I/Os, waits for something to happen, and responds appropriately. All the APC routines are guaranteed to be called on the same thread. Hence, if this thread exits, there is no way to call them, and hence all the outstanding I/Os are also cancelled.
There are several Windows API functions that deal with asynchronous I/O and allow a choice of completion mechanism (ReadFileEx, for example): an APC, setting an event, or posting a completion to an I/O completion port. If such a function is used with an APC, it automatically cancels the I/O if the issuing thread exits.
Hence my guess that a waitable timer auto-cancels only when used with an APC.
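As a quick way to test the no-completion-routine case from the question (names and timings are illustrative only): thread A sets the timer and exits immediately, thread B waits on the handle. If the answers above are right, the wait in thread B still completes, because the timer was never chained to thread A.

```cpp
#include <windows.h>

static HANDLE g_hTimer;                      // manual-reset waitable timer, shared by both threads

DWORD WINAPI SetterThread(LPVOID)
{
    LARGE_INTEGER due;
    due.QuadPart = -50000000LL;              // relative due time: 5 seconds, in 100 ns units
    SetWaitableTimer(g_hTimer, &due, 0, nullptr, nullptr, FALSE);  // no completion routine
    return 0;                                // the setting thread exits right away
}

DWORD WINAPI WaiterThread(LPVOID)
{
    WaitForSingleObject(g_hTimer, INFINITE); // expected to return once the timer signals
    return 0;
}

int main()
{
    g_hTimer = CreateWaitableTimer(nullptr, TRUE /*manual reset*/, nullptr);
    HANDLE threads[2] = {
        CreateThread(nullptr, 0, SetterThread, nullptr, 0, nullptr),
        CreateThread(nullptr, 0, WaiterThread, nullptr, 0, nullptr),
    };
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);
    return 0;
}
```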

Correct usage of boost::asio for a multi-client process

I'm trying to use boost::asio for the first time, to write a process that connects to N servers and reads data from them.
My question concerns the way asynchronicity works. My design goal is to connect to all servers in parallel, and also to read data from every server in parallel. This would be done with async_connect and async_read, calling io_service::run() N times, and then reading the results. The question is: is it enough to call io_service::run() from a single thread, sequentially, N times, in order to achieve parallelism?
Note that this is a matter of how asio is implemented: specifically, when calling async_connect and async_read, does the call signal the OS to begin connecting/reading before returning, or does it simply delegate a synchronous connect/read task to a worker thread and return immediately - in which case calling io_service::run() from a single thread means serial execution of the tasks?
My guess is the former, of course, but I need someone to please confirm. I find it odd that the documentation for the async operations (http://think-async.com/Asio/boost_asio_1_3_1/doc/html/boost_asio/overview/core/basics.html) doesn't mention when the async_xxx calls return, which would clarify my question.
The heart of asio is an event loop, which begins with the call to io_service::run(), which is a blocking call. When you call async_connect, you queue up the connect operation in the io_service's event queue. To achieve parallelism, you must create a thread pool and have each thread call run() on the same io_service instance.
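For example, a minimal sketch of that thread-pool arrangement (pre-1.66 io_service naming, matching the answer above; the number of threads and the work object are illustrative):

```cpp
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service io;
    boost::asio::io_service::work work(io);      // keeps run() from returning while the queue is empty

    // ... create sockets bound to io and start async_connect / async_read operations here ...

    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back([&io] { io.run(); });   // each thread services the same event loop

    for (auto& t : pool)
        t.join();                                 // not reached while the work object is alive
    return 0;
}
```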
