How to reserve a core for one thread on Windows?

I am working on a very time sensitive application which polls a region of shared memory taking action when it detects a change has occurred. Changes are rare but I need to minimize the time from change to action. Given the infrequency of changes I think the CPU cache is getting cold. Is there a way to reserve a core for my polling thread so that it does not have to compete with other threads for either cache or CPU?

Thread affinity alone (SetThreadAffinityMask) will not be enough. It does not reserve a CPU core; it does the opposite: it restricts the thread to run only on the cores you specify, which is not the same thing.
By constraining the CPU affinity, you reduce the likelihood that your thread will run. If another thread with higher priority runs on the same core, your thread will not be scheduled until that other thread is done (this is how Windows schedules threads).
Without constraining affinity, your thread has a chance of being migrated to another core (taking the last time it was run as a metric for that decision). Thread migration is undesirable if it happens often and soon after the thread has run (or while it is running), but it is a harmless, even beneficial, thing if a couple of dozen milliseconds have passed since it was last scheduled (the caches will have been overwritten by then anyway).
You can "kind of" assure that your thread will run by giving it a higher priority class (no guarantee, but high likelihood). If you then use SetThreadAffinityMask as well, you have a reasonable chance that the cache is always warm on most common desktop CPUs (which luckily are normally VIPT and PIPT). For the TLB, you will probably be less lucky, but there's nothing you can do about it.
The problem with a high-priority thread is that it will starve other threads, because the scheduler serves higher priority classes first, and as long as those are not satisfied, the lower classes get nothing. So the solution in this case must be to block; otherwise you may impair the system in an unfavorable way.
Try this:
create a semaphore and share it with the other process
set priority to THREAD_PRIORITY_TIME_CRITICAL
block on the semaphore
in the other process, after writing data, call SignalObjectAndWait on the semaphore with a timeout of 1 (or even zero timeout)
if you want, you can experiment binding them both to the same core
This gives you a thread that will be the first (or among the first) to get CPU time once it is ready, but which consumes none while it is blocked.
When the writer thread calls SignalObjectAndWait, it atomically signals and blocks (even waiting for "zero time" is enough to trigger a reschedule). The other thread wakes from the semaphore and does its work. Thanks to its high priority, it will not be interrupted by other "normal" (that is, non-realtime) threads. It will keep hogging CPU time until it is done, and then block again on the semaphore. At this point, SignalObjectAndWait returns.
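A minimal sketch of that arrangement (the semaphore name, the never-signaled dummy event used as the wait target, and the reaction to a change are all illustrative assumptions, not a definitive implementation):

#include <windows.h>

// Watcher process: a time-critical thread that blocks on a shared semaphore
// and reacts as soon as the writer signals a change.
DWORD WINAPI WatcherThread(LPVOID)
{
    // Both processes can call CreateSemaphoreW with the same name; the second
    // call simply opens the existing object.
    HANDLE sem = CreateSemaphoreW(nullptr, 0, 1, L"Local\\ShmChanged");
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    // Optional experiment: pin the watcher to one core.
    // SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 2);

    for (;;)
    {
        WaitForSingleObject(sem, INFINITE);   // blocked: no CPU used until signaled
        // ... react to the shared-memory change here ...
    }
}

// Writer process: after updating the shared memory, atomically signal the
// semaphore and wait (with a zero timeout) on a dummy event that never fires;
// the wait exists only to give the scheduler a chance to run the watcher now.
void PublishChange(HANDLE sem, HANDLE dummyEvent)
{
    // ... write to the shared memory region ...
    SignalObjectAndWait(sem, dummyEvent, 0, FALSE);
}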

Using the Task Manager, you can set the "affinity" of processes.
You would have to set the affinity of your time-critical app to core 4, and the affinity of all the other processes to cores 1, 2, and 3. Assuming four cores of course.

You could call SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and set your own process to run only on that core (or, even better, call SetThreadAffinityMask just on the thread that does the time-critical task).
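A rough sketch of that approach (error handling trimmed, the reserved core index is arbitrary, and protected or system processes will simply refuse the change):

#include <windows.h>
#include <tlhelp32.h>

// Reserve one core (bit 3 here) for our time-critical thread and push every
// other process onto the remaining cores.
void ReserveCore()
{
    const DWORD_PTR reserved = DWORD_PTR(1) << 3;   // assumed core index
    DWORD_PTR processMask = 0, systemMask = 0;
    GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
    const DWORD_PTR others = systemMask & ~reserved;

    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    PROCESSENTRY32W pe = { sizeof(pe) };
    for (BOOL ok = Process32FirstW(snap, &pe); ok; ok = Process32NextW(snap, &pe))
    {
        if (pe.th32ProcessID == GetCurrentProcessId())
            continue;
        HANDLE h = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pe.th32ProcessID);
        if (h)
        {
            SetProcessAffinityMask(h, others);   // may fail for protected processes
            CloseHandle(h);
        }
    }
    CloseHandle(snap);

    // Bind only the time-critical thread to the reserved core.
    SetThreadAffinityMask(GetCurrentThread(), reserved);
}

Note that processes started afterwards will not be constrained unless you repeat the sweep or set their affinity at creation time.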

Given the infrequency of changes I think the CPU cache is getting cold.
That sounds very strange.
Let's assume your polling thread and the writing thread are on different cores.
The polling thread will be reading the shared memory address and so will be caching the data. That cache line is probably marked as exclusive. Then the writing thread finally writes; first, it reads the cache line in (so that line is now marked as shared on both cores) and then it writes. Writing causes the polling thread's copy of the cache line to be marked as invalid. The polling thread then comes to read again; if it reads while the writing thread still has the data cached, it will read from the second core's cache, invalidating that line and taking ownership of it for itself. There's a lot of bus traffic overhead in doing this.
Another issue is that the writing thread, if it doesn't write often, will almost certainly lose the TLB entry for the page with the shared memory address. Recalculating the physical address is a long, slow process. Since the polling thread polls often, that page is probably always in that core's TLB; in that sense, you might well do better, in latency terms, to have both threads on the same core. (Although if they're both compute-intensive, they might interfere destructively and that cost could be much higher - I can't know, as I don't know what the threads are doing.)
One thing you could do is use a hyperthread on the writing thread's core: if you know early on that you're going to write, have the hyperthread read the shared memory address. This will load the TLB and cache while the writing thread is still busy computing, giving you parallelism.
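A sketch of that trick (the flag, the core index, and the shared-memory pointer are assumptions; the writer would set writePending as soon as it knows a store is coming):

#include <windows.h>
#include <immintrin.h>   // _mm_prefetch, _mm_pause

volatile LONG writePending = 0;          // set by the writer once it knows it will publish
volatile char* sharedRegion = nullptr;   // assumed to point into the mapped shared memory

// Runs on the hyperthread sibling of the writer's core. Pre-touching the
// shared line warms the cache line and the TLB entry before the real write.
DWORD WINAPI PrefetchHelper(LPVOID)
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 1);  // sibling logical CPU, index assumed
    for (;;)
    {
        if (InterlockedCompareExchange(&writePending, 0, 1) == 1)
        {
            _mm_prefetch((const char*)sharedRegion, _MM_HINT_T0);
            volatile char sink = *sharedRegion;   // an actual read forces the page translation
            (void)sink;
        }
        _mm_pause();   // keep the spin polite to the sibling hyperthread
    }
}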

The Win32 function SetThreadAffinityMask() is what you are looking for.
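A bare-bones call looks like this (the core index is just an example; the function returns the previous mask, or 0 on failure):

#include <windows.h>

int main()
{
    // Restrict the calling thread to logical processor 2.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 2);
    return previous != 0 ? 0 : 1;
}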

Related

Windows thread-switching latency after IO completion - microseconds or milliseconds

I am trying to determine approximate time delay (Win 7, Vista, XP) to switch threads when an IO operation completes.
What I (think I) know is that:
a) Thread context switches are themselves computationally very fast. (By very fast, I mean typically way under 1ms, maybe even under 1us? - assuming a relatively fast, unloaded machine, etc.)
b) Round-robin time slice quanta are on the order of 10-15ms.
What I can't seem to find is information about the typical latency time from a (high priority) thread becoming active/signaled - via, say, a synchronous disk write completing - and that thread actually running again.
E.g., I have read in at least one place that all inactive threads remain asleep until the ~10ms system quantum expires and then (assuming they are ready to go) they all get reactivated almost synchronously together. But in another place I read that the delay between when a thread completes an I/O operation and when it becomes active/signaled and runs again is measured in microseconds, not milliseconds.
My context for asking is related to capture and continuous streaming write to a RAID array of SSDs, from a high speed camera, where unless I can start a new write after a prior one has finished in well under 1ms (it would be best if under 1/10ms, on average), it will be problematic.
Any information regarding this issue would be most appreciated.
Thanks,
David
Thread context switches cost between 2,000 and 10,000 CPU cycles, so a handful of microseconds.
An I/O completion is fast when a thread is blocking on the synchronization handle that signals completion. That makes the Windows thread scheduler temporarily boost the thread priority. Which in turn makes it likely (but not guaranteed) to be chosen as the thread that gets the processor love. So that's typically microseconds, not milliseconds.
Do note that disk writes normally go through the file system cache. Which makes the WriteFile() call a simple memory-to-memory copy that doesn't block the thread. This runs at memory bus speeds, 5 gigabytes per second and up. Data is then written to the disk in a lazy fashion, the thread isn't otherwise involved or delayed by that. You'll only get slow writes when the file system cache is filled to capacity and you don't use overlapped I/O. Which is certainly a possibility if you write video streams. The amount of RAM makes a great deal of difference. And SSD controllers are not made the same. Nothing you can reason out up front, you'll have to test.
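If you want a ballpark figure for your own machine rather than folklore, a crude sketch like this (iteration count arbitrary, timing not formally synchronized, but good enough for an order-of-magnitude answer) measures the signal-to-wake latency directly:

#include <windows.h>
#include <cstdio>

// One thread timestamps and sets an event; the other timestamps as soon as
// its wait returns. The difference is the wake-up latency.
HANDLE gEvent;
LARGE_INTEGER gSignalTime;

DWORD WINAPI Waiter(LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    LARGE_INTEGER freq, wake;
    QueryPerformanceFrequency(&freq);
    for (int i = 0; i < 1000; ++i)
    {
        WaitForSingleObject(gEvent, INFINITE);
        QueryPerformanceCounter(&wake);
        double us = (wake.QuadPart - gSignalTime.QuadPart) * 1e6 / freq.QuadPart;
        printf("wake latency: %.1f us\n", us);
    }
    return 0;
}

int main()
{
    gEvent = CreateEventW(nullptr, FALSE, FALSE, nullptr);   // auto-reset event
    HANDLE t = CreateThread(nullptr, 0, Waiter, nullptr, 0, nullptr);
    for (int i = 0; i < 1000; ++i)
    {
        Sleep(10);                            // let the waiter block again
        QueryPerformanceCounter(&gSignalTime);
        SetEvent(gEvent);
    }
    WaitForSingleObject(t, INFINITE);
    return 0;
}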

Is it advisable to use the Windows API 'SetThreadPriority' within a parallel_for_each loop

I would like to reduce the thread-priority of the threads servicing a parallel_for_each, because under heavy load conditions they consume too much processor time relative to other threads in my system.
Questions:
1) Do the servicing threads of a parallel_for_each inherit the thread-priority of the calling thread? In this case I could presumably call SetThreadPriority before and after the parallel_for_each, and everything should be fine.
2) Alternatively is it advisable to call SetThreadPriority within the parallel_for_each? This will clearly invoke the API multiple times for the same threads. Is there a large overhead of doing this?
2.b) Assuming that I do this, will it affect thread-priorities the next time that parallel_for_each is called - ie do I need to somehow reset the priority of each thread afterwards?
3) I'm wondering about thread-priorities in general. Would anyone like to comment: supposing that I had 2 threads contending for a single processor and one was set to "below-normal" while the other was "normal" priority. Roughly what percentage more processor time would the one thread get compared to the other?
All threads initially start at THREAD_PRIORITY_NORMAL. So you'd have to reduce the priority of each thread. Or reduce the priority of the owning process.
There is little overhead in calling SetThreadPriority. Once you have woken up a thread, the additional cost of calling SetThreadPriority is negligible. Once you set the thread's priority it will remain at that value until changed.
Suppose that you have one processor and two threads ready to run. The scheduler will always choose to run the thread with the higher priority. This means that in your scenario the below-normal threads would never run. In reality, there's a lot more to scheduling than that - for example, priority inversion. However, you can think of it like this: if all processors are busy with normal priority threads, then expect lower priority threads to be starved of CPU.
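For illustration, option 2 might look like this with the PPL's concurrency::parallel_for_each (the item type and the work function are placeholders); restoring the old priority at the end keeps later parallel work on the same pool thread unaffected:

#include <ppl.h>
#include <windows.h>
#include <vector>

struct Item { /* ... */ };
void Process(Item&) { /* hypothetical per-item work */ }

void RunLowPriority(std::vector<Item>& items)
{
    concurrency::parallel_for_each(items.begin(), items.end(), [](Item& item)
    {
        HANDLE self = GetCurrentThread();               // pseudo-handle, no CloseHandle needed
        int old = GetThreadPriority(self);
        SetThreadPriority(self, THREAD_PRIORITY_BELOW_NORMAL);
        Process(item);
        SetThreadPriority(self, old);                   // restore for whatever runs next on this thread
    });
}

Lowering the priority class of the whole process (SetPriorityClass) is the simpler alternative if everything in the process should yield to other applications.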

Why don't spinlocks work in uniprocessor (unicore) systems?

I know that spinlocks work by spinning, that different kernel paths exist, and that kernels are preemptive, so why don't spinlocks work in uniprocessor systems? (For example, in Linux.)
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to maintain the processor time slice with a busy loop than the cost of sleeping your thread and context-shifting to another thread or process. If you have to wait a very short amount of time, you can sleep and be reawakened almost immediately, but the cost of going down and up is more expensive than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
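To make the point concrete, here is a toy user-space spinlock in C++ (illustration only, not the kernel's implementation): on a single core, if the current holder has been preempted, the loop in lock() burns its entire time slice without ever giving the holder a chance to release the lock.

#include <atomic>

// Toy user-space spinlock, for illustration only.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Nothing in this loop yields the CPU, so on a uniprocessor the
        // preempted holder cannot run until the scheduler forcibly preempts us.
        while (flag.test_and_set(std::memory_order_acquire)) {
            /* busy wait */
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};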
There are different versions of spinlock:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
On a uniprocessor, spin_lock_irqsave() should be used when data needs to be shared between process context and interrupt context, since in that case the IRQs also get disabled. spin_lock_irqsave() works under all circumstances, but partly because it is safe it is also fairly slow.
However, if the data only needs to be protected across different CPUs (and not against interrupt context), it is better to use the versions below; they are cheaper because IRQs don't get disabled:
spin_lock(&lock);
...
spin_unlock(&lock);
On uniprocessor systems, calling spin_lock_irqsave(&xxx_lock, flags) has the same effect as just disabling interrupts, which provides the needed protection against interrupt concurrency without the unneeded SMP protection. On multiprocessor systems, however, it covers both the interrupt and the SMP concurrency issues.
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref: Linux Device Drivers, by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman
The following two paragraphs from Operating Systems: Three Easy Pieces might be helpful:
For spin locks, in the single CPU case, performance overheads can be quite painful; imagine the case where the thread holding the lock is preempted within a critical section. The scheduler might then run every other thread (imagine there are N − 1 others), each of which tries to acquire the lock. In this case, each of those threads will spin for the duration of a time slice before giving up the CPU, a waste of CPU cycles.
However, on multiple CPUs, spin locks work reasonably well (if the number of threads roughly equals the number of CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1) grabs the lock, and then Thread B tries to, B will spin (on CPU 2). However, presumably the critical section is short, and thus soon the lock becomes available, and is acquired by Thread B. Spinning to wait for a lock held on another processor doesn't waste many cycles in this case, and thus can be effective.

Does the Task Parallel Library (or PLINQ) take other processes into account?

In particular, I'm looking at using TPL to start (and wait for) external processes. Does the TPL look at total machine load (both CPU and I/O) before deciding to start another task (hence -- in my case -- another external process)?
For example:
I've got about 100 media files that need to be encoded or transcoded (e.g. from WAV to FLAC or from FLAC to MP3). The encoding is done by launching an external process (e.g. FLAC.EXE or LAME.EXE). Each file takes about 30 seconds. Each process is mostly CPU-bound, but there's some I/O in there. I've got 4 cores, so the worst case (transcoding by piping the decoder into the encoder) still only uses 2 cores. I'd like to do something like:
Parallel.ForEach(sourceFiles,
    sourceFile => TranscodeUsingPipedExternalProcesses(sourceFile));
Will this kick off 100 tasks (and hence 200 external processes competing for the CPU)? Or will it see that the CPU's busy and only do 2-3 at a time?
You're going to run into a couple of issues here. The starvation avoidance mechanism of the scheduler will see your tasks as blocked while they wait on the external processes. It finds it hard to distinguish between a deadlocked thread and one that is simply waiting for a process to complete, so it may schedule new tasks if your tasks run for a long time (see below). The hill-climbing heuristic should take into account the overall load on the system, both from your application and others. It simply tries to maximize work done, so it will add more work until the overall throughput of the system stops increasing and then it will back off. I don't think this will affect your application, but the starvation avoidance issue probably will.
You can find more detail on how this all works in Parallel Programming with Microsoft® .NET by Colin Campbell, Ralph Johnson, Ade Miller, and Stephen Toub (an earlier draft is online).
"The .NET thread pool automatically manages the number of worker
threads in the pool. It adds and removes threads according to built-in
heuristics. The .NET thread pool has two main mechanisms for injecting
threads: a starvation-avoidance mechanism that adds worker
threads if it sees no progress being made on queued items and a hillclimbing
heuristic that tries to maximize throughput while using as
few threads as possible.
The goal of starvation avoidance is to prevent deadlock. This kind
of deadlock can occur when a worker thread waits for a synchronization
event that can only be satisfied by a work item that is still pending
in the thread pool’s global or local queues. If there were a fixed
number of worker threads, and all of those threads were similarly
blocked, the system would be unable to ever make further progress.
Adding a new worker thread resolves the problem.
A goal of the hill-climbing heuristic is to improve the utilization
of cores when threads are blocked by I/O or other wait conditions
that stall the processor. By default, the managed thread pool has one
worker thread per core. If one of these worker threads becomes
blocked, there’s a chance that a core might be underutilized, depending
on the computer’s overall workload. The thread injection logic
doesn’t distinguish between a thread that’s blocked and a thread
that’s performing a lengthy, processor-intensive operation. Therefore,
whenever the thread pool’s global or local queues contain pending
work items, active work items that take a long time to run (more than
a half second) can trigger the creation of new thread pool worker
threads.
The .NET thread pool has an opportunity to inject threads every
time a work item completes or at 500 millisecond intervals, whichever
is shorter. The thread pool uses this opportunity to try adding threads
(or taking them away), guided by feedback from previous changes in
the thread count. If adding threads seems to be helping throughput,
the thread pool adds more; otherwise, it reduces the number of
worker threads. This technique is called the hill-climbing heuristic.
Therefore, one reason to keep individual tasks short is to avoid
“starvation detection,” but another reason to keep them short is to
give the thread pool more opportunities to improve throughput by
adjusting the thread count. The shorter the duration of individual
tasks, the more often the thread pool can measure throughput and
adjust the thread count accordingly.
To make this concrete, consider an extreme example. Suppose
that you have a complex financial simulation with 500 processor-intensive
operations, each one of which takes ten minutes on average
to complete. If you create top-level tasks in the global queue for each
of these operations, you will find that after about five minutes the
thread pool will grow to 500 worker threads. The reason is that the
thread pool sees all of the tasks as blocked and begins to add new
threads at the rate of approximately two threads per second.
What’s wrong with 500 worker threads? In principle, nothing, if
you have 500 cores for them to use and vast amounts of system
memory. In fact, this is the long-term vision of parallel computing.
However, if you don’t have that many cores on your computer, you are
in a situation where many threads are competing for time slices. This
situation is known as processor oversubscription. Allowing many
processor-intensive threads to compete for time on a single core adds
context switching overhead that can severely reduce overall system
throughput. Even if you don’t run out of memory, performance in this
situation can be much, much worse than in sequential computation.
(Each context switch takes between 6,000 and 8,000 processor cycles.)
The cost of context switching is not the only source of overhead.
A managed thread in .NET consumes roughly a megabyte of stack
space, whether or not that space is used for currently executing functions.
It takes about 200,000 CPU cycles to create a new thread, and
about 100,000 cycles to retire a thread. These are expensive operations.
As long as your tasks don’t each take minutes, the thread pool’s
hill-climbing algorithm will eventually realize it has too many threads
and cut back on its own accord. However, if you do have tasks that
occupy a worker thread for many seconds or minutes or hours, that
will throw off the thread pool’s heuristics, and at that point you
should consider an alternative.
The first option is to decompose your application into shorter
tasks that complete fast enough for the thread pool to successfully
control the number of threads for optimal throughput.
A second possibility is to implement your own task scheduler
object that does not perform thread injection. If your tasks are of long
duration, you don’t need a highly optimized task scheduler because
the cost of scheduling will be negligible compared to the execution
time of the task. MSDN® developer program has an example of a
simple task scheduler implementation that limits the maximum degree
of concurrency. For more information, see the section, “Further Reading,”
at the end of this chapter.
As a last resort, you can use the SetMaxThreads method to
configure the ThreadPool class with an upper limit for the number
of worker threads, usually equal to the number of cores (this is the
Environment.ProcessorCount property). This upper limit applies for
the entire process, including all AppDomains."
The short answer is: no.
Internally, the TPL uses the standard ThreadPool to schedule its tasks. So you're actually asking whether the ThreadPool takes machine load into account and it doesn't. The only thing that limits the number of tasks simultaneously running is the number of threads in the thread pool, nothing else.
Is it possible to have the external processes report back to your application once they are done? That way you do not have to block waiting for them (which keeps threads occupied).
Ran a test using TPL/ThreadPool to schedule a great number of tasks doing looped spins. Using an external app I've loaded one of the cores to 100% using proc affinity. The number of active tasks never decreased.
Even better, I ran multiple instances of the same CPU intensive .NET TPL enabled app. The number of threads for all the apps was the same, and never went below the number of cores, even though my machine was barely usable.
So theory aside, TPL uses the number of cores available, but never checks on their actual load. A very poor implementation in my opinion.

Are threads from multiple processes actually running at the same time

In a Windows operating system with 2 physical x86/amd64 processors (P0 + P1), running 2 processes (A + B), each with two threads (T0 + T1), is it possible (or even common) to see the following:
P0:A:T0 running at the same time as P1:B:T0
then, after 1 (or is that 2?) context switch(es?)
P0:B:T1 running at the same time as P1:A:T1
In a nutshell, I'd like to know if - on a multiple processor machine - the operating system is free to schedule any thread from any process at any time, regardless of what other threads from other processes are already running.
EDIT:
To clarify the silly example, imagine that process A's thread A:T0 has affinity to processor P0 (and A:T1 to P1), while process B's thread B:T0 has affinity to processor P1 (and B:T1 to P0). It probably doesn't matter whether these processors are cores or sockets.
Is there a first-class concept of a process context switch? Perfmon shows context switches under the Thread object, but nothing under the Process object.
Yes, it is possible, and it happens pretty often. The OS tries not to switch a thread between CPUs (you can make it try harder by setting the thread's preferred processor, or you can even lock it to a single processor via affinity). A Windows process is not an execution unit by itself - from this viewpoint, it's basically just a context for its threads.
EDIT (further clarifications)
There's nothing like a "process context switch". Basically, the OS scheduler assigns the threads via a (very adaptive) round-robin algorithm to any free processor/core (as the affinity allows), if the "previous" processor isn't immediately available, regardless of the processes (which means multi-threaded processes can steal much more CPU power).
This "jumping" may seem expensive, considering at least the L1 (and sometimes L2) caches are per-core (apart from different slot/package processors), but it's still cheaper than delays caused by waiting to the "right" processor and inability to do elaborate load-balancing (which the "jumping" scheme makes possible).This may not apply to the NUMA architecture, but there are much more considerations invoved (e.g. adapting all memory-allocations to be thread- and processor-bound and avoiding as much state/memory sharing as possible).
As for affinity: you can set affinity masks per thread or per process (the latter supersedes the settings of all the process's threads), but the OS enforces at least one logical processor per thread (you can never end up with a zero mask).
A process' default affinity mask is inherited from its parent process (which allows you to create single-core loaders for problematic legacy executables), and threads inherit the mask from the process they belong to.
You may not set a thread's affinity to a processor outside the process's affinity, but you can restrict it further.
By default, any thread will jump between the available logical processors (especially if it yields, calls into the kernel, etc.). It may jump even if it has a preferred processor set, but only if it has to; however, it will NOT jump to a processor outside its affinity mask (which may lead to considerable delays).
I'm not sure if the scheduler sees any difference between physical and hyper-threaded processors, but even if it doesn't (which I assume), the consequences are in most cases not of a concern, i.e. there should not be much difference between multiple threads sharing physical or logical processors if the thread count is just the same. Regardless, there are some reports of cache-thrashing in this scenario, mainly in high-performance heavily multithreaded applications, like SQL server or .NET and Java VMs, which may or may not benefit from HyperThreading turned off.
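For reference, the two knobs mentioned above (soft preference versus hard affinity) look like this in Win32; the processor indices are arbitrary:

#include <windows.h>

void PinCurrentThread()
{
    // Soft hint: prefer logical processor 1, but allow migration when it is busy.
    SetThreadIdealProcessor(GetCurrentThread(), 1);

    // Hard limit: never run anywhere except logical processors 0 and 1.
    // The mask must be a subset of the owning process's affinity mask.
    SetThreadAffinityMask(GetCurrentThread(), 0x3);
}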
I generally agree with the previous answer, however things are more complex.
Although processes are not execution units, threads belonging to the same process should be treated differently. There're two reasons for this:
Same address space. This means that when switching context between such threads there is no need to reload the address translation registers.
Threads of the same process are much more likely to access the same memory.
Point (2) has a great impact on the cache state. If the threads read the same memory location, they reuse the L2 cache, so the whole thing speeds up. There is, however, a drawback too: once a thread changes a memory location, that address is invalidated in the L1 and L2 caches of both processors, so the other processor invalidates its cached copy as well.
So there're pros and cons with running the threads of the same process simultaneously (on different processors). BTW this situation has a name: "Gang scheduling".
