I'm broadcasting an integer that triggers termination via MPI_Bcast. The root sets a variable called "running" to zero and calls MPI_Bcast. The broadcast on the root seems to complete, but I can't see the value ever reaching the other processes; they appear to be stuck waiting for an MPI_Scatter to complete, a call they shouldn't even be able to reach.
I have done a lot of research on MPI_Bcast, and from what I understand it should be blocking. This confuses me, since the MPI_Bcast on the root seems to complete even though I can't find the matching (receiving) MPI_Bcast calls on the other processes. I have surrounded all of my MPI_Bcast calls with printfs, and those printfs 1) do print and 2) print the correct values on the root.
The root looks as follows:
while (running || ...) {
    /* Do stuff */
    if (...) {
        running = 0;
        printf("Running = %d and Bcast from root\n", running);
        MPI_Bcast(&running, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Root 0 Bcast complete. Running %d\n", running);
        /* Do some more stuff and eventually reach Finalize */
        printf("Root is Finalizing\n");
        MPI_Finalize();
    }
}
The other processes have the following code:
while (running) {
    doThisFunction(rank);
    printf("Waiting on BCast from root with myRank: %d\n", rank);
    MPI_Bcast(&running, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("P%d received running = %d\n", rank, running);
    if (running == 0) { // just to make sure.
        break;
    }
}
MPI_Finalize();
I also have the following in the function "doThisFunction()". This is where the processes seem to be waiting for process 0:
int doThisFunction(...) {
    /* Do stuff */
    printf("P%d waiting on Scatter\n", rank);
    MPI_Scatter(buffer, 130, MPI_BYTE, encoded, 130, MPI_BYTE, 0, MPI_COMM_WORLD);
    printf("P%d done with Scatter\n", rank);
    /* Do stuff */
    printf("P%d waiting on gather\n", rank);
    MPI_Gather(encoded, 1, MPI_INT, buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("P%d done with gather\n", rank);
    /* Do Stuff */
    return aValue;
}
The output in the command line looks as follows:
P0 waiting on Scatter
P0 done with Scatter
P0 waiting on gather
P0 done with gather
Waiting on BCast from root with myRank: 1
P1 received running = 1
P1 waiting on Scatter
P0 waiting on Scatter
P0 done with Scatter
P0 waiting on gather
P0 done with gather
P1 done with Scatter
P1 waiting on gather
P1 done with gather
Waiting on BCast from root with myRank: 1
P1 received running = 1
P1 waiting on Scatter
Running = 0 and Bcast from root
Root 0 Bcast complete. Running 0
/* Why does it say the Bcast is complete
   even though P1 didn't output that it received it? */
Root is Finalizing
/* Deadlocked... */
I'm expecting that P1 receives running as zero and then goes into MPI_Finalize(), but instead it gets stuck at the scatter, which will never be matched by the root because the root is already trying to finalize.
In actuality, the program deadlocks and MPI never terminates.
I doubt the scatter is somehow consuming the Bcast value; that wouldn't even make sense, since the root never calls a matching scatter at that point.
Does anyone have any tips on how to resolve this problem?
Your help is greatly appreciated.
Why does it say the Bcast is complete even though P1 didn't output that it received it?
Note the following definitions from the MPI Standard:
Collective operations can (but are not required to) complete as soon as the caller's participation in the collective communication is finished. ... The completion of a collective operation indicates that the caller is free to modify locations in the communication buffer. It does not indicate that other processes in the group have completed or even started the operation (unless otherwise implied by the description of the operation). Thus, a collective communication operation may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier operation.
According to this definition, the MPI_Bcast on your root process can complete even if no matching MPI_Bcast has yet been called by the slaves.
(For point-to-point operations, we have different communication modes, such as the synchronous one, to address these issues. Unfortunately, there is no synchronous mode for collectives.)
There is, then, a problem with the order of operations in your code: the root called MPI_Bcast, but process #1 had not, and was instead waiting on MPI_Scatter, as your log output shows. The fix is to restructure the loops so that every rank executes the same sequence of collective calls in every iteration; see the sketch below.
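A minimal sketch of one such restructuring, reusing the names from your code plus a hypothetical should_stop() standing in for your real termination test (this is an outline, not a drop-in fix):

/* Sketch: every rank runs the same loop body, so each collective call
 * on the root matches one on every other rank. should_stop() is a
 * hypothetical stand-in for the real termination condition. */
int running = 1;
while (running) {
    MPI_Scatter(buffer, 130, MPI_BYTE, encoded, 130, MPI_BYTE, 0, MPI_COMM_WORLD);
    /* ... per-rank work ... */
    MPI_Gather(encoded, 1, MPI_INT, buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        running = should_stop() ? 0 : 1;  /* only the root decides */
    MPI_Bcast(&running, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* every rank hears the decision */
}
MPI_Finalize();

Because the broadcast is the last collective in every iteration on every rank, no process can ever reach MPI_Scatter after the root has stopped participating.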
NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and have finally understood (and tested, to prove it), with help from RbMm, the meaning of the NumberOfConcurrentThreads parameter within CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern however, and that is that if all of the threads that are processing a completion packet get bogged down, it will mean no more packets can be dequeued and processed. That is what I am simulating with my infinite loop - the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeuing packets (or as many threads as CPUs), then when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all threads in the IOCP being bogged down thus no more packets being dequeued?
I ask this as all the examples of IO completion port code I have seen use a method similar to what I show below, not using a separate thread pool which I propose. This is what makes me think that I am missing something because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus to process a completion packet if another running thread associated with the same I/O completion port enters a wait state for other reasons, for example the SuspendThread function. When the thread in the wait state begins running again, there may be a brief period when the number of active threads exceeds the concurrency value. However, the system quickly reduces this number by not allowing any new active threads until the number of active threads falls below the concurrency value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause the thread to enter a wait state, thus I can't predict if one or more of my threads will ever get bogged down! Hence my idea of a thread pool; at least context switching would mean that other packets get a chance to be dequeued!
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <functional>   // for mem_fn
#include <atomic>
#include <ctime>
#include <iostream>

using namespace std;

int main()
{
    HANDLE hCompletionPort1;
    if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
    {
        return -1;
    }

    vector<thread> vecAllThreads;
    atomic_bool bStop(false);

    // Fill our vector with 10 threads, each of which waits on our IOCP.
    generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop] {
        thread t([hCompletionPort1, &bStop]()
        {
            // Thread body
            while (true)
            {
                DWORD dwBytes = 0;
                LPOVERLAPPED pOverlapped = 0;
                ULONG_PTR uKey;
                if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
                {
                    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
                        break; // Special completion packet; end processing.

                    //cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!

                    if (uKey > 4)
                        cout << "Started processing packet ID > 4!" << endl;
                    while (!bStop)
                        ; // INFINITE LOOP
                }
            }
        });
        return move(t);
    });

    // Queue 8 completion packets to our IOCP...only four will be processed until we set our bool
    for (int i = 1; i <= 8; ++i)
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED);
    }

    while (!bStop)
    {
        int nVal;
        cout << "Enter 1 to cause current processing threads to end: ";
        cin >> nVal;
        bStop = (nVal == 1);
    }

    for (int i = 0; i < 10; ++i) // Tell all 10 threads to stop processing on the IOCP
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, 0, 0); // Special packet marking end of IOCP usage
    }
    for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
    return 0;
}
EDIT #1
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
    void SetTask(LPOVERLAPPED pO) { /* start processing pO */ }
private:
    thread m_thread; // Actual thread object
};
// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet which they then process!
class ThreadPool
{
public:
    void Initialise() { /* create 100 worker threads and add them to some internal storage */ }
    myThread& GetNextFreeThread() { /* return one of the 100 worker threads we created */ }
} g_threadPool;
The code for each of my four threads associated with the IOCP then changes to:
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
        break; // Special completion packet; end processing.

    // Pick a new thread from a pool of pre-created threads and assign it the packet to process
    myThread& thr = g_threadPool.GetNextFreeThread();
    thr.SetTask(pOverlapped);

    // Now, this thread can immediately return to the IOCP; it doesn't matter if the
    // packet we dequeued would take forever to process; that is happening in the
    // separate thread thr *that will not interfere with packets being dequeued from the IOCP!*
}
This way, I can never end up in the situation where no more packets are being dequeued!
It seems there is conflicting opinion on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, packets can stop being dequeued from the IOCP if the processing of a packet never enters a wait state; granted, the infinite loop is perhaps unrealistic, but it does demonstrate the point.
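To make the proposal concrete, here is a minimal sketch of the hand-off structure I have in mind (illustrative names only; a real pool would pre-create its workers and manage their lifetime):

#include <windows.h>
#include <condition_variable>
#include <deque>
#include <mutex>

// Sketch of a hand-off queue: the four IOCP threads Push() each dequeued
// packet and return straight to GetQueuedCompletionStatus; the pool
// workers block in Pop() and do the potentially long-running processing.
class PacketQueue
{
public:
    void Push(LPOVERLAPPED p)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_work.push_back(p);
        }
        m_cv.notify_one();
    }

    LPOVERLAPPED Pop() // blocks until a packet is available
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_work.empty(); });
        LPOVERLAPPED p = m_work.front();
        m_work.pop_front();
        return p;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<LPOVERLAPPED> m_work;
} g_packetQueue;

Each IOCP thread would then replace its processing with g_packetQueue.Push(pOverlapped) and loop straight back to GetQueuedCompletionStatus, while each of the (say) 100 pool workers loops on g_packetQueue.Pop(); whether this actually buys anything over simply raising the IOCP concurrency limit is exactly my question.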
I have implemented an interprocess message queue in shared memory for one producer and one consumer on Windows.
I am using one named semaphore to count empty slots, one named semaphore to count full slots and one named mutex to protect the data structure in shared memory.
Consider, for example the consumer side. The producer side is similar.
First it waits on the full semaphore; then (1) it takes a message from the queue under the mutex; and finally (2) it signals the empty semaphore.
The problem:
If the consumer process crashes between (1) and (2) then effectively the number of slots in the queue that can be used by the process is reduced by one.
Assume that while the consumer is down, the producer can handle the queue getting filled up. (it can either specify a timeout when waiting on the empty semaphore or even specify 0 for no wait).
When the consumer restarts it can continue to read data from the queue. Data will not have been overrun but even after it empties all full slots, the producer will have one less empty slot to use.
After multiple such restarts the queue will have no slots that can be used and no messages can be sent.
Question:
How can this situation be avoided or recovered from?
Here's an outline of one simple approach, using events rather than semaphores:
DWORD increment_offset(DWORD offset)
{
    offset++;
    if (offset == QUEUE_LENGTH*2) offset = 0;
    return offset;
}

void consumer(void)
{
    for (;;)
    {
        DWORD current_write_offset = InterlockedCompareExchange(write_offset, 0, 0);

        if ((current_write_offset != *read_offset + QUEUE_LENGTH) &&
            (current_write_offset + QUEUE_LENGTH != *read_offset))
        {
            // Queue is not full, make sure producer is awake
            SetEvent(signal_producer_event);
        }

        if (*read_offset == current_write_offset)
        {
            // Queue is empty, wait for producer to add a message
            WaitForSingleObject(signal_consumer_event, INFINITE);
            continue;
        }

        MemoryBarrier();
        _ReadWriteBarrier();

        consume((*read_offset) % QUEUE_LENGTH);

        InterlockedExchange(read_offset, increment_offset(*read_offset));
    }
}

void producer(void)
{
    for (;;)
    {
        DWORD current_read_offset = InterlockedCompareExchange(read_offset, 0, 0);

        if (current_read_offset != *write_offset)
        {
            // Queue is not empty, make sure consumer is awake
            SetEvent(signal_consumer_event);
        }

        if ((*write_offset == current_read_offset + QUEUE_LENGTH) ||
            (*write_offset + QUEUE_LENGTH == current_read_offset))
        {
            // Queue is full, wait for consumer to remove a message
            WaitForSingleObject(signal_producer_event, INFINITE);
            continue;
        }

        produce((*write_offset) % QUEUE_LENGTH);

        MemoryBarrier();
        _ReadWriteBarrier();

        InterlockedExchange(write_offset, increment_offset(*write_offset));
    }
}
Notes:
The code as posted compiles (given the appropriate declarations) but I have not otherwise tested it.
read_offset is a pointer to a DWORD in shared memory, indicating which slot should be read from next. Similarly, write_offset points to a DWORD in shared memory indicating which slot should be written to next.
An offset of QUEUE_LENGTH + x refers to the same slot as an offset of x so as to disambiguate between a full queue and an empty queue. That's why the increment_offset() function checks for QUEUE_LENGTH*2 rather than just QUEUE_LENGTH and why we take the modulo when calling the consume() and produce() functions. (One alternative to this approach would be to modify the producer to never use the last available slot, but that wastes a slot.)
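To make the offset scheme concrete, here is a small worked example (QUEUE_LENGTH of 4 is purely illustrative, and the helper functions are hypothetical; they just restate the tests used in the code above):

#include <windows.h>

#define QUEUE_LENGTH 4  /* illustrative value only */

/* Offsets run 0 .. 2*QUEUE_LENGTH-1; offsets x and x+QUEUE_LENGTH both
 * refer to slot x % QUEUE_LENGTH. */
int queue_is_empty(DWORD read_offset, DWORD write_offset)
{
    return read_offset == write_offset;
}

int queue_is_full(DWORD read_offset, DWORD write_offset)
{
    return (write_offset == read_offset + QUEUE_LENGTH) ||
           (write_offset + QUEUE_LENGTH == read_offset);
}

/* Examples: read = 2, write = 6 -> full (slots 2, 3, 0, 1 occupied), and
 * both offsets refer to slot 2; read = 6, write = 2 -> also full, after
 * wraparound; read = 6, write = 6 -> empty. */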
signal_consumer_event and signal_producer_event must be automatic-reset events. Note that setting an event that is already set is a no-op.
The consumer only waits on its event if the queue is actually empty, and the producer only waits on its event if the queue is actually full.
When either process is woken, it must recheck the state of the queue, because there is a race condition that can lead to a spurious wakeup.
Because I use interlocked operations, and because only one process at a time is using any particular slot, there is no need for a mutex. I've included memory barriers to ensure that the changes the producer writes to a slot will be seen by the consumer. If you're not comfortable with lock-free code, you'll find that it is trivial to convert the algorithm shown to use a mutex instead.
Note that InterlockedCompareExchange(pointer, 0, 0); looks a bit complicated but is just a thread-safe equivalent to *pointer, i.e., it reads the value at the pointer. Similarly, InterlockedExchange(pointer, value); is the same as *pointer = value; but thread-safe. Depending on the compiler and target architecture, interlocked operations may not be strictly necessary, but the performance impact is negligible so I recommend programming defensively.
Consider the case when the consumer crashes during (or before) the call to the consume() function. When the consumer is restarted, it will pick up the same message again and process it as normal. As far as the producer is concerned, nothing unusual has happened, except that the message took longer than usual to be processed. An analogous situation occurs if the producer crashes while creating a message; when restarted, the first message generated will overwrite the incomplete one, and the consumer won't be affected.
Obviously, if the crash occurs after the call to InterlockedExchange but before the call to SetEvent in either the producer or consumer, and if the queue was previously empty or full respectively, then the other process will not be woken up at that point. However, it will be woken up as soon as the crashed process is restarted. You cannot lose slots in the queue, and the processes cannot deadlock.
I think the simple multiple-producer single-consumer case would look something like this:
void producer(void)
{
    for (;;)
    {
        DWORD current_read_offset = InterlockedCompareExchange(read_offset, 0, 0);

        if (current_read_offset != *write_offset)
        {
            // Queue is not empty, make sure consumer is awake
            SetEvent(signal_consumer_event);
        }

        produce_in_local_cache();

        claim_mutex();

        // read offset may have changed, re-read it
        current_read_offset = InterlockedCompareExchange(read_offset, 0, 0);

        if ((*write_offset == current_read_offset + QUEUE_LENGTH) ||
            (*write_offset + QUEUE_LENGTH == current_read_offset))
        {
            // Queue is full; release the mutex (so other producers are not
            // blocked behind a sleeping one) and wait for the consumer to
            // remove a message
            release_mutex();
            WaitForSingleObject(signal_producer_event, INFINITE);
            continue;
        }

        copy_from_local_cache_to_shared_memory((*write_offset) % QUEUE_LENGTH);

        MemoryBarrier();
        _ReadWriteBarrier();

        InterlockedExchange(write_offset, increment_offset(*write_offset));

        release_mutex();
    }
}
If the active producer crashes, the mutex will be detected as abandoned; you can treat this case as if the mutex were properly released. If the crashed process got as far as incrementing the write offset, the entry it added will be processed as usual; if not, it will be overwritten by whichever producer next claims the mutex. In neither case is any special action needed.
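A hedged sketch of how claim_mutex (the hypothetical helper used in the outline above) might handle the abandoned case; hQueueMutex is assumed to be a named mutex shared by the producers:

/* Sketch only: treat WAIT_ABANDONED the same as WAIT_OBJECT_0. */
void claim_mutex(void)
{
    DWORD result = WaitForSingleObject(hQueueMutex, INFINITE);
    if (result == WAIT_OBJECT_0 || result == WAIT_ABANDONED)
    {
        /* We now own the mutex. WAIT_ABANDONED means a producer died
         * while holding it; per the discussion above, no special
         * recovery is needed, because the write offset is only
         * advanced after a slot has been completely written. */
        return;
    }
    /* WAIT_FAILED: handle the error as appropriate. */
}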
I am trying to understand how wait_event is implemented in the Linux kernel. There is a code example in LDD3 where the internal implementation is explained using prepare_to_wait (http://www.makelinux.net/ldd3/chp-6-sect-2).
static int scull_getwritespace(struct scull_pipe *dev, struct file *filp)
{
    while (spacefree(dev) == 0) {
        DEFINE_WAIT(wait);

        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n", current->comm);
        prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
        if (spacefree(dev) == 0) // Why is this check necessary ??
            schedule();
        finish_wait(&dev->outq, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    return 0;
}
In the book, it is explained as below.
Then comes the obligatory check on the buffer; we must handle the case in which space becomes available in the buffer after we have entered the while loop (and dropped the semaphore) but before we put ourselves onto the wait queue. Without that check, if the reader processes were able to completely empty the buffer in that time, we could miss the only wakeup we would ever get and sleep forever. Having satisfied ourselves that we must sleep, we can call schedule.
I am not able to understand this piece of the explanation. How would we go into an indefinite sleep if the if (spacefree(dev) == 0) check were not done before calling schedule()?
Even if this obligatory check were not present, wakeup() would still reset the process state to TASK_RUNNING and schedule() would return, as explained in the next paragraph.
It is worth looking again at this case: what happens if the wakeup happens between the test in the if statement and the call to schedule? In that case, all is well. The wakeup resets the process state to TASK_RUNNING and schedule returns, although not necessarily right away. As long as the test happens after the process has put itself on the wait queue and changed its state, things will work.
The important thing is that the (last) check is done after prepare_to_wait() was called.
prepare_to_wait() puts a pointer to the current process into the wait queue. If the wakeup happens before the prepare_to_wait() call, the wakeup would not be able to affect the current process.
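To see why only that window is dangerous, here is an illustrative interleaving (a sketch in comment form, not real kernel code) of what could happen if the re-check were removed:

/*
 * Interleaving WITHOUT the re-check before schedule() (sketch only):
 *
 *   writer (this process)               reader (another process)
 *   ---------------------               ------------------------
 *   spacefree(dev) == 0 -> enter loop
 *   up(&dev->sem);
 *                                       reads data, freeing space
 *                                       wake_up(&dev->outq);
 *                                         -> writer is NOT yet on the
 *                                            wait queue: wakeup is lost
 *   prepare_to_wait(&dev->outq, ...);
 *   schedule();  <- sleeps; if no reader ever wakes the queue again,
 *                   it sleeps forever, because its only wakeup has
 *                   already happened
 *
 * WITH the re-check, spacefree(dev) != 0 at that point, so schedule()
 * is skipped. Any wakeup issued AFTER prepare_to_wait() finds the
 * process on the wait queue and resets it to TASK_RUNNING, which is
 * why only the pre-registration window needs guarding.
 */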
I use MPI non-blocking communication (MPI_Irecv, MPI_Isend) to monitor the slaves' idle states; the code is shown below.
rank 0:
int dest = -1;
while (dest <= 0) {
    int i;
    for (i = 1; i <= slaves_num; i++) {
        printf("slave %d, now is %d \n", i, idle_node[i]);
        if (idle_node[i] == 1) {
            idle_node[i] = 0;
            dest = i;
            break;
        }
    }
    if (dest <= 0) {
        MPI_Irecv(&idle_node[1], 1, MPI_INT, 1, MSG_IDLE, MPI_COMM_WORLD, &request);
        MPI_Irecv(&idle_node[2], 1, MPI_INT, 2, MSG_IDLE, MPI_COMM_WORLD, &request);
        MPI_Irecv(&idle_node[3], 1, MPI_INT, 3, MSG_IDLE, MPI_COMM_WORLD, &request);
        // MPI_Wait(&request, &status);
    }
    usleep(100000);
}
idle_node[dest] = 0; // indicates this slave is busy now
rank 1,2,3:
while (1)
{
    ... // do something
    MPI_Isend(&idle, 1, MPI_INT, 0, MSG_IDLE, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
}
it works, but I want it to be faster, so I delete the line:
usleep(100000);
then rank 0 gets stuck in an endless loop like this:
slave 1, now is 0
slave 2, now is 0
slave 3, now is 0
slave 1, now is 0
slave 2, now is 0
slave 3, now is 0
...
So does this indicate that MPI_Irecv merely tells MPI that I want to receive a message here (without actually receiving it yet), and that MPI needs additional time to receive the real data? Or is there some other reason?
The use of non-blocking operations has been discussed over and over again here. From the MPI specification (section Nonblocking Communication):
Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.
(the text is quoted verbatim from the standard; the emphasis on the last sentence is mine)
The key sentence is the last one. The standard does not give any guarantee that a non-blocking receive operation will ever complete (or even start) unless MPI_WAIT[ALL|SOME|ANY] or MPI_TEST[ALL|SOME|ANY] was called (with MPI_TEST* setting a value of true for the completion flag).
By default, Open MPI is built as a single-threaded library, and without special hardware acceleration the only way to progress non-blocking operations is to periodically call into MPI, either via non-blocking calls (the primary example being MPI_TEST*) or via a blocking one (the primary example being MPI_WAIT*).
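For illustration, a minimal sketch of the polling approach using MPI_Test, reusing the names from the question (idle_node, MSG_IDLE):

int flag = 0;
MPI_Request request;
MPI_Status status;

MPI_Irecv(&idle_node[1], 1, MPI_INT, 1, MSG_IDLE, MPI_COMM_WORLD, &request);
while (!flag) {
    /* Each MPI_Test call gives the library a chance to progress the
     * transfer, and tells us when the data has actually arrived. */
    MPI_Test(&request, &flag, &status);
    /* ... do useful work between polls ... */
}
/* idle_node[1] is now valid and the request has been freed. */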
Your code also contains a nasty leak that will sooner or later result in resource exhaustion: you are calling MPI_Irecv multiple times with the same request variable, overwriting the handle each time and losing the reference to the previously started requests. Requests that are never waited upon are never freed and therefore remain in memory.
There is absolutely no need to use non-blocking operations in your case. If I understand the logic correctly, you can achieve what you want with code as simple as:
MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MSG_IDLE, MPI_COMM_WORLD, &status);
idle_node[status.MPI_SOURCE] = 0;
If you'd like to process more than one worker process at the same time, it is a bit more involved:
MPI_Request reqs[slaves_num];
int indices[slaves_num], num_completed;

for (i = 0; i < slaves_num; i++)
    reqs[i] = MPI_REQUEST_NULL;

while (1)
{
    // Repost all completed (or never started) receives
    for (i = 1; i <= slaves_num; i++)
        if (reqs[i-1] == MPI_REQUEST_NULL)
            MPI_Irecv(&idle_node[i], 1, MPI_INT, i, MSG_IDLE,
                      MPI_COMM_WORLD, &reqs[i-1]);

    MPI_Waitsome(slaves_num, reqs, &num_completed, indices, MPI_STATUSES_IGNORE);

    // Examine num_completed and indices and feed the workers with data
    ...
}
After the call to MPI_Waitsome there will be one or more completed requests. The exact number will be in num_completed and the indices of the completed requests will be filled in the first num_completed elements of indices[]. The completed requests will be freed and the corresponding elements of reqs[] will be set to MPI_REQUEST_NULL.
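For instance, the elided processing step might look something like this (a sketch; mark_worker_idle() is a hypothetical bookkeeping helper):

int i;
for (i = 0; i < num_completed; i++)
{
    int slave = indices[i] + 1;   /* reqs[i-1] receives from rank i */
    /* reqs[indices[i]] is now MPI_REQUEST_NULL and will be reposted
     * at the top of the loop. */
    mark_worker_idle(slave);      /* hypothetical bookkeeping */
}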
Also, there appears to be a common misconception about using non-blocking operations. A non-blocking send can be matched by a blocking receive and also a blocking send can be equally matched by a non-blocking receive. That makes such constructs nonsensical:
// Receiver
MPI_Irecv(..., &request);
... do something ...
MPI_Wait(&request, &status);
// Sender
MPI_Isend(..., &request);
MPI_Wait(&request, MPI_STATUS_IGNORE);
MPI_Isend immediately followed by MPI_Wait is equivalent to MPI_Send and the following code is perfectly valid (and easier to understand):
// Receiver
MPI_Irecv(..., &request);
... do something ...
MPI_Wait(&request, &status);
// Sender
MPI_Send(...);
I'm trying and failing to cancel a call to WNetAddConnection2 with CancelSynchronousIo.
The call to CancelSynchronousIo succeeds but nothing is actually cancelled.
I'm using a 32-bit console app running on Windows 7 x64.
Has anyone done this successfully? Am I doing something dumb? Here's a sample console app (which needs to be linked with mpr.lib):
DWORD WINAPI ConnectThread(LPVOID param)
{
    NETRESOURCE nr;
    memset(&nr, 0, sizeof(nr));
    nr.dwType = RESOURCETYPE_ANY;
    nr.lpRemoteName = L"\\\\8.8.8.8\\bog";

    // result is ERROR_BAD_NETPATH (i.e. the call isn't cancelled)
    DWORD result = WNetAddConnection2(&nr, L"pass", L"user", CONNECT_TEMPORARY);
    return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    // Create a new thread to run WNetAddConnection2
    HANDLE hThread = CreateThread(0, 0, ConnectThread, 0, 0, 0);
    if (!hThread)
        return 1;

    // Retry the cancel until it fails; keep track of how often
    int count = 0;
    BOOL ok;
    do
    {
        // Sleep to give the thread a chance to start
        Sleep(1000);
        ok = CancelSynchronousIo(hThread);
        ++count;
    }
    while (ok);

    // count will equal two here (i.e. one successful cancellation and
    // one failed cancellation)

    // err is ERROR_NOT_FOUND (i.e. nothing to cancel) which makes
    // sense for the second call
    DWORD err = GetLastError();

    // Wait for the thread to finish; this takes ages (i.e. the
    // WNetAddConnection2 call is not cancelled)
    WaitForSingleObject(hThread, INFINITE);
    return 0;
}
According to Larry Osterman (I hope he doesn't mind me quoting him): "The question was answered in the comments: wnetaddconnection2 isn’t a simple IOCTL call." So the answer (unfortunately) is no.
First, WNetAddConnection2 is system-wide, not per-process. This is important, as calling WNetAddConnection2 many times can wreck system stability, particularly with Explorer.
I use WNetGetResourceInformation first to check whether the connection already exists before even thinking of calling it; my process may have previously run and then shut down, and the connection may still exist. When my Windows services need to add such a connection, I use a nasty little trick to prevent these totally non-abortable APIs from stalling my own service shutdown.
The trick is to run these calls in a separate process: they are system-wide, after all. You can normally wait for the process to complete as if you called the functions yourself but you can terminate the process and give up waiting if you need to abort in order to shutdown.
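A minimal sketch of that trick (netconnect.exe is a hypothetical helper binary that simply performs the WNetAddConnection2 call itself and exits; the 30-second timeout is illustrative):

/* Sketch: run the blocking WNet call in a disposable child process. */
STARTUPINFO si = { sizeof(si) };
PROCESS_INFORMATION pi;
wchar_t cmd[] = L"netconnect.exe \\\\server\\share";  /* hypothetical helper */

if (CreateProcess(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
{
    // Wait as if we had made the call ourselves, but give up (e.g. on
    // service shutdown) by killing the child rather than our own process.
    if (WaitForSingleObject(pi.hProcess, 30000) == WAIT_TIMEOUT)
        TerminateProcess(pi.hProcess, 1);  // abandon the stuck call

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
}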
Sadly, however, certain Windows resources, such as named pipe handles and handles to files open on remote computers, can take about 16 seconds to close following the failure or shutdown of a remote machine. CancelSynchronousIo does not seem to help with those either, and will likely just add another long delay.