Should InterlockedExchange be used for every write to a variable? - windows

I'm using InterlockedExchange on Windows, and I have two questions that, taken together, are basically my title question.
InterlockedExchange operates on type LONG (32 bits). According to Microsoft's documentation on Interlocked Variable Access, "simple reads and writes to 32-bit variables are atomic operations" without InterlockedExchange. According to the InterlockedExchange function documentation, "This function is atomic with respect to calls to other interlocked functions." So is it or is it not atomic to read/write a LONG in Windows without the interlocked functions?
I am reviewing some code in which one thread sets a variable, and then all further accesses to that variable, by that thread or any threads it creates, use InterlockedExchange. To keep it simple, consider a thread running main() that creates a thread running other():
#include <windows.h>

LONG foo;

DWORD WINAPI other(LPVOID unused);
void cleanup(void);

int main(void)
{
    foo = TRUE;
    CreateThread(NULL, 0, other, NULL, 0, NULL);
    /* do other things */
    if (InterlockedExchange(&foo, 0))
    {
        cleanup();
    }
    return 0;
}

DWORD WINAPI other(LPVOID unused)
{
    /* do other things */
    if (InterlockedExchange(&foo, 0))
    {
        cleanup();
    }
    return 0;
}

void cleanup(void)
{
    /* it's expected this is only called once */
}
Is it possible in this example that foo will not appear TRUE to either of the InterlockedExchange calls? If main is busy doing other things so that the first InterlockedExchange call is made from the other thread, does that mean foo is guaranteed to have been written by the main thread and visible to the other thread at that time?
Sorry if this is unclear; I don't know how to phrase it better.

Is it or is it not atomic to read/write a LONG in Windows without the interlocked functions?
Yes to both, assuming default alignment. This follows from the quoted statement "simple reads and writes to properly-aligned 32-bit variables are atomic operations", because LONG is a 32-bit signed integer in Windows, regardless of the bitness of the OS.
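As a quick sanity check of that last point (my sketch, not part of the original answer), LONG's width can be asserted at compile time; Windows uses the LLP64 data model, so LONG stays 32 bits even on 64-bit builds:

#include <Windows.h>

// LONG is a 32-bit signed integer on every Windows platform (ILP32 and
// LLP64 alike), so a properly-aligned LONG read or write is one atomic access.
static_assert(sizeof(LONG) == 4, "LONG is 32 bits on Windows");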
Is it possible in this example that foo will not appear TRUE to either of the InterlockedExchange calls?
No, not possible if both calls are reached. That's because within a single thread the foo = TRUE; write is guaranteed to be visible to the InterlockedExchange call that comes after it. So the InterlockedExchange call in main will see either the TRUE value previously set in main, or the FALSE value reset in the other thread. Therefore, one of the InterlockedExchange calls must read the foo value as TRUE.
However, if the /* do other things */ code in main is an infinite loop while(1); then that would leave only one InterlockedExchange outstanding in other, and it may be possible for that call to see foo as FALSE, for the same reason as...
If main is busy doing other things so that the first InterlockedExchange call is made from the other thread, does that mean foo is guaranteed to have been written by the main thread and visible to the other thread at that time?
Not necessarily. The foo = TRUE; write is visible to the main thread at the point the secondary thread is created, but may not necessarily be visible to the other thread when it starts, or even when it gets to the InterlockedExchange call.
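If that initial visibility matters, one option (my sketch, reusing foo, other, and cleanup from the question's example) is to publish the initial value with an interlocked write as well, so every access to foo carries full-barrier semantics:

int main(void)
{
    /* Hedged variant: an interlocked (full-barrier) write publishes TRUE
       before the second thread exists, so its visibility to other() is
       not in question. */
    InterlockedExchange(&foo, TRUE);
    CreateThread(NULL, 0, other, NULL, 0, NULL);
    /* do other things */
    if (InterlockedExchange(&foo, 0))
    {
        cleanup();
    }
    return 0;
}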

Is there a race between starting and seeing yourself in WinApi's EnumProcesses()?

I just found this code in the wild:
def _scan_for_self(self):
    win32api.Sleep(2000)  # sleep to give time for process to be seen in system table.
    basename = self.cmdline.split()[0]
    pids = win32process.EnumProcesses()
    if not pids:
        UserLog.warn("WindowsProcess", "no pids", pids)
    for pid in pids:
        try:
            handle = win32api.OpenProcess(
                win32con.PROCESS_QUERY_INFORMATION | win32con.PROCESS_VM_READ,
                pywintypes.FALSE, pid)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
        try:
            modlist = win32process.EnumProcessModules(handle)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
This line caught my eye:
win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
It suggests that if you call EnumProcesses() too fast after starting, you won't see yourself. Is there any truth to this?
There is a race, but it's not the race the code tried to protect against.
A successful call to CreateProcess returns only after the kernel object representing the process has been created and enqueued into the kernel's process list. A subsequent call to EnumProcesses accesses the same list, and will immediately observe the newly created process object.
That is, unless the process object has since been destroyed. This isn't entirely unusual since processes in Windows are initialized in-process. The documentation even makes note of that:
Note that the function returns before the process has finished initialization. If a required DLL cannot be located or fails to initialize, the process is terminated.
What this means is that if a call to EnumProcesses immediately following a successful call to CreateProcess doesn't observe the newly created process, it does so because it was late rather than early. If you are late already then adding a delay will only make you more late.
Which swiftly leads to the actual race here: Process IDs uniquely identify processes only for a finite time interval. Once a process object is gone, its ID is up for grabs, and the system will reuse it at some point. The only reliable way to identify a process is by holding a handle to it.
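To illustrate (my sketch, not from the original answer): CreateProcess already hands back such a handle, so there is no need to go fishing for the PID afterwards. The child command line here is a hypothetical stand-in:

#include <windows.h>

int main()
{
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    wchar_t cmd[] = L"notepad.exe";  // hypothetical child process

    if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE, 0,
                        nullptr, nullptr, &si, &pi))
        return 1;

    // pi.hProcess identifies exactly this process for as long as the
    // handle is held, even if the process exits and its PID is reused.
    WaitForSingleObject(pi.hProcess, INFINITE);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}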
Now it's anyone's guess what the author of _scan_for_self was trying to accomplish. As written, the code takes more time to do something that's probably altogether wrong¹ anyway.
¹ Turns out my gut feeling was correct. This is just your average POSIX developer who, in the process of learning that POSIX is insufficient, would rather call out Microsoft than actually use an all-around superior API.
The documentation for EnumProcesses (Win32 API - EnumProcesses function) does not mention anything about a delay needed to see the current process in the list it returns.
Microsoft's example of how to use EnumProcesses to enumerate all running processes (Enumerating All Processes) also does not contain any delay before calling EnumProcesses.
A small test application I created in C++ (see below) always reports that the current process is in the list (tested on Windows 10):
#include <Windows.h>
#include <Psapi.h>
#include <iostream>

const DWORD MAX_NUM_PROCESSES = 4096;
DWORD aProcesses[MAX_NUM_PROCESSES];

int main(void)
{
    // Get the list of running process Ids:
    DWORD cbNeeded;
    if (!EnumProcesses(aProcesses, MAX_NUM_PROCESSES * sizeof(DWORD), &cbNeeded))
    {
        return 1;
    }

    // Check if current process is in the list:
    DWORD curProcId = GetCurrentProcessId();
    bool bFoundCurProcId{ false };
    DWORD numProcesses = cbNeeded / sizeof(DWORD);
    for (DWORD i = 0; i < numProcesses; ++i)
    {
        if (aProcesses[i] == curProcId)
        {
            bFoundCurProcId = true;
        }
    }

    std::cout << "bFoundCurProcId: " << bFoundCurProcId << std::endl;
    return 0;
}
Note: I am aware that the fact that the program reported the expected result does not mean there is no race. Maybe I just couldn't catch it manifesting. But trying to run code like that can sometimes give you a hint (especially if the result had shown that there is a race).
The fact that I never had a problem running this test (I did it many times), together with the lack of any mention of a needed delay in Microsoft's documentation, makes me believe it is not required.
My conclusion is that either:
There is a unique issue when using it from Python (I doubt it), or
The code you found is doing something unnecessary.
There is no race.
EnumProcesses calls an NT API function that switches to kernel mode to walk the linked list of processes. Your own process has been added to the list before it starts running.

Does set variable modify variable values across threads?

Say I have a function foo
void foo() {
    bool block = true;
    while (block) {}
}
And I have 5 threads running this function, all blocked in the while loop.
If I attach gdb to this process, and jump over to one of these 5 threads, and do the following
(gdb) set variable block=false
My question is what is the full effect of the above set variable statement?
Does it change the value of block to false on all threads, or just the current thread in gdb? Does running the above statement unblock only one or all of the 5 threads stuck in the while loop?
Does it change the value of block to false on all threads, or just the current thread in gdb?
Just the current thread. You have 5 separate and independent stack variables named block; it is unreasonable to expect GDB to change more than one of them.
Does running the above statement unblock only one or all of the 5 threads stuck in the while loop?
It follows that only one thread will be unblocked.
If you want an ability to unblock all threads at once, make the variable global (and volatile, since you are going to be modifying it from "outside" of the program).
P.S. Your spin loops are going to burn a lot of CPU, and the compiler is allowed to optimize them out entirely. Inserting a usleep() for a short duration may help with both problems.
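A sketch of that suggestion (my example, not from the original answer), with the flag made global and volatile and a usleep() added as mentioned above:

#include <unistd.h>  // usleep

// One shared flag instead of five per-thread stack variables; volatile
// because GDB will modify it from outside the program's control flow.
volatile bool block = true;

void foo() {
    while (block) {
        usleep(1000);  // don't burn the CPU while spinning
    }
}

With this version, a single (gdb) set variable block=false releases all five threads at once.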

KSPIN_LOCK blocks when acquiring from Driver's main thread

I have a KSPIN_LOCK which is shared between a Windows driver's main thread and some threads I created with PsCreateSystemThread. The problem is that the main thread blocks when I try to acquire the spinlock, and never unblocks. I'm very confused as to why this happens; it's probably somehow connected to the fact that the main thread runs at the driver's IRQL, while the other threads run at PASSIVE_LEVEL, as far as I know.
NOTE: If I only run the main thread, acquiring/releasing the lock works just fine.
NOTE: I'm using the functions KeAcquireSpinLock and KeReleaseSpinLock to acquire/release the lock.
Here's my checklist for a "stuck" spinlock:
Make sure the spinlock was initialized with KeInitializeSpinLock. If the KSPIN_LOCK holds uninitialized garbage, then the first attempt to acquire it will likely spin forever.
Check that you're not acquiring it recursively/nested. KSPIN_LOCK does not support recursion, and if you try it, it will spin forever.
Normal spinlocks must be acquired at IRQL <= DISPATCH_LEVEL. If you need something that works at DIRQL, check out [1] and [2].
Check for leaks. If one processor acquires the spinlock, but forgets to release it, then the next processor will spin forever when trying to acquire the lock.
Ensure there's no memory-safety issues. If code randomly writes a non-zero value on top of the spinlock, that'll cause it to appear to be acquired, and the next acquisition will spin forever.
Some of these issues can be caught easily and automatically with Driver Verifier; use it if you're not using it already. Other issues can be caught if you encapsulate the spinlock in a little helper that adds your own asserts. For example:
typedef struct _MY_LOCK {
    KSPIN_LOCK Lock;
    ULONG OwningProcessor;
    KIRQL OldIrql;
} MY_LOCK;

void MyInitialize(MY_LOCK *lock) {
    KeInitializeSpinLock(&lock->Lock);
    lock->OwningProcessor = (ULONG)-1;
}

void MyAcquire(MY_LOCK *lock) {
    ULONG current = KeGetCurrentProcessorIndex();
    NT_ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
    NT_ASSERT(current != lock->OwningProcessor); // check for recursion
    KeAcquireSpinLock(&lock->Lock, &lock->OldIrql);
    NT_ASSERT(lock->OwningProcessor == (ULONG)-1); // check lock was inited
    lock->OwningProcessor = current;
}

void MyRelease(MY_LOCK *lock) {
    NT_ASSERT(KeGetCurrentProcessorIndex() == lock->OwningProcessor);
    lock->OwningProcessor = (ULONG)-1;
    KeReleaseSpinLock(&lock->Lock, lock->OldIrql);
}
Wrappers around KSPIN_LOCK are common. The KSPIN_LOCK is like a race car that has all the optional features stripped off to maximize raw speed. If you aren't counting microseconds, you might reasonably decide to add back the heated seats and FM radio by wrapping the low-level KSPIN_LOCK in something like the above. (And with the magic of #ifdefs, you can always take the airbags out of your retail builds, if you need to.)
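A hypothetical usage of the wrapper, just to show the call pattern (the names here are mine, not from the original answer):

MY_LOCK g_stateLock;   // hypothetical lock; call MyInitialize once, e.g. in DriverEntry
LONG g_sharedCounter;  // hypothetical state it protects

void TouchSharedState(void)
{
    MyAcquire(&g_stateLock);
    g_sharedCounter++;  // only ever touched while holding the lock
    MyRelease(&g_stateLock);
}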

How to check which index in a loop is executing without slow down process?

What is the best way to check which index a loop is currently executing, without slowing down the process too much?
For example I want to find all long fancy numbers and have a loop like
for (long i = 1; i > 0; i++) {
    //block
}
and I want to know which i is executing in real time.
Several ways I know of are printing i on every iteration, checking if (i % 10000 == 0), or adding a listener.
Which of these ways is the fastest? Or what do you do in similar cases? Is there any way to access the value of i on demand, from outside the loop?
Most of my recent experience is with Java, so I'd write something like this
import java.util.concurrent.atomic.AtomicLong;

public class Example {
    public static void main(String[] args) {
        AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
        LoopMonitor lm = new LoopMonitor(atomicLong);
        Thread t = new Thread(lm);
        t.start(); // start LoopMonitor
        while (atomicLong.get() > 0) {
            long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
            //block
        }
    }

    private static class LoopMonitor implements Runnable {
        private final AtomicLong atomicLong;

        public LoopMonitor(AtomicLong atomicLong) {
            this.atomicLong = atomicLong;
        }

        public void run() {
            while (true) {
                try {
                    System.out.println(atomicLong.longValue()); // print the current value
                    Thread.sleep(1000); // sleep for one second
                } catch (InterruptedException ex) {}
            }
        }
    }
}
Most AtomicLong implementations can be set in one clock cycle even on 32-bit platforms, which is why I used one here instead of a primitive long (you don't want to inadvertently print a half-set long); look into your compiler/platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified loop doesn't take much of an efficiency hit: you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory with another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might do something like probe the value once a minute, sleep()ing in between, and so allowing the first thread to run uninterrupted.
To get essentially zero-cost reporting (on multi-CPU machines): make your index a "global" property (class-wide, for instance), and have a separate thread read and report its value.
The report could be timer-based (five times per second or so).
Note: you may also need a boolean stating "are we in the loop?".
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to use a register for the loop variable. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
Real real-time constraint?
That begs the question of just how real-time your loop actually is. I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles, for two reasons: a) no modern OS will ever come close to guaranteeing that, you'd have to be running on bare metal; b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real-time interval.
Feature-rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
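Here is a hedged C++ sketch combining the suggestions above (an atomic index, a separate reporter thread, and an "are we in the loop?" flag), using std::atomic rather than an explicit semaphore since only a single value is shared:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<long> currentIndex{0};
std::atomic<bool> inLoop{true};  // "are we in the loop?" flag

void monitor() {
    while (inLoop.load()) {
        std::cout << "i = " << currentIndex.load() << '\n';  // report progress
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main() {
    std::thread reporter(monitor);
    for (long i = 1; i <= 1000000000; ++i) {  // stand-in for the real search loop
        currentIndex.store(i, std::memory_order_relaxed);  // cheap probe point
        /* block */
    }
    inLoop = false;
    reporter.join();
}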

What does the term "interrupt safe" mean?

I come across this term every now and then.
Now I really need a clear explanation, as I wish to use some MPI routines that are said not to be interrupt-safe.
I believe it's another wording for reentrant: if a function is reentrant, it can be safely interrupted in the middle and called again.
For example:
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void function()
{
    pthread_mutex_lock(&mtx);
    /* code ... */
    pthread_mutex_unlock(&mtx);
}
This function can clearly be called by different threads (the mutex will protect the code inside). But if a signal arrives after pthread_mutex_lock(&mtx) returns, and the signal handler calls function() again, the second call deadlocks on the same (non-recursive) mutex. So the function is not interrupt-safe.
Code that is safe from concurrent access from an interrupt is said to be interrupt-safe.
Consider a situation where your process is inside a critical section when an asynchronous event preempts it, and the interrupt handler then accesses the very shared resource the process was using.
It is a major bug if an interrupt occurs in the middle of code that is manipulating a resource and the interrupt handler can access the same resource. Locking can save you!
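As a user-space sketch of that discipline (my example, assuming POSIX signals play the role of interrupts), blocking the signal for the duration of the critical section keeps the handler from ever re-entering it:

#include <pthread.h>
#include <signal.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void function(void)
{
    sigset_t blocked, previous;
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGUSR1);  /* hypothetical signal of interest */

    /* The handler for SIGUSR1 cannot run between these calls, so it can
       never find the mutex already held by its own thread. */
    pthread_sigmask(SIG_BLOCK, &blocked, &previous);
    pthread_mutex_lock(&mtx);
    /* code ... */
    pthread_mutex_unlock(&mtx);
    pthread_sigmask(SIG_SETMASK, &previous, NULL);
}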
