Could a process running only on one processor have threads running on other processors? - Windows

Is it possible, in a multiprocessor environment (PC), for a Windows process to be configured to run on only one processor (affinity mask = 1, i.e. SetProcessAffinityMask(GetCurrentProcess(), 1)) while its threads are spawned on other processors?
This question came up in a discussion at a company about using synchronization objects (Events, Mutexes, Semaphores) and WinAPIs such as WaitForSingleObject, especially SignalObjectAndWait, for which MSDN states:
"Note that the "signal" and "wait" are not guaranteed to be performed as an atomic operation. Threads executing on other processors can observe the signaled state of the first object before the thread calling SignalObjectAndWait begins its wait on the second object."
Does this mean that on a single processor it is guaranteed to be atomic?
P.S. Is there any difference in Windows context switching between a machine with multiple processors and one with a single processor that has several physical cores?
P.P.S. Please bear with me if I haven't used exact, precise terms; this area is still not very familiar to me.

The set of processor cores a thread can run on is the intersection of the process affinity mask and the thread affinity mask.
To get the behavior you describe, one would set the thread affinity mask for the main thread, and not mess with the process mask.
As for your follow-up questions: if it isn't atomic, it isn't atomic. There are some additional guarantees for threads sharing a core, because preemption follows certain rules, but those rules are very complex, since relative priority and dynamic priority are important factors in thread scheduling. Because of that complexity, it is best to use proper synchronization.
Notably, race conditions between threads of equal priority certainly still exist on a single-core (or single-core-restricted) system, but they occur far less frequently and are therefore far more difficult to find and debug.
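The intersection rule can be modeled with plain bit masks. This is a portable sketch rather than Windows code: the helper names `effective_affinity` and `allowed_processors` are hypothetical, and reading bit i as logical processor i is the usual interpretation of a Windows affinity mask (on Windows the real values would come from GetProcessAffinityMask and SetThreadAffinityMask).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bit i of a mask stands for logical processor i (at most 64 in this sketch).
// A thread may run only where BOTH the process mask and its own mask allow it.
std::uint64_t effective_affinity(std::uint64_t process_mask,
                                 std::uint64_t thread_mask) {
    return process_mask & thread_mask;
}

// Lists the logical processor indices whose bits are set in `mask`.
std::vector<int> allowed_processors(std::uint64_t mask) {
    std::vector<int> cpus;
    for (int i = 0; i < 64; ++i) {
        if (mask & (std::uint64_t{1} << i)) {
            cpus.push_back(i);
        }
    }
    return cpus;
}
```

With the question's process mask of 1, any thread mask that includes bit 0 leaves processor 0 as the only place the thread can run.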

Is it possible, in a multiprocessor environment (PC), for a Windows process to be configured to run on only one processor (affinity mask = 1, i.e. SetProcessAffinityMask(GetCurrentProcess(), 1)) while its threads are spawned on other processors?
If CPU affinity is not restricted to a single core, can one process run on multiple cores?
What's the difference between processes and threads?
Can a thread have processes, or can a process have threads?
Can a process be seen from a thread's point of view, or vice versa?
What is the notion of atomicity?
When can the number 1 be seen as a multidimensional unit?
Can we divide 1 by 0? When can we, and when can't we?
Does this mean that on a single processor it is guaranteed to be atomic?
One CPU: do you remember "run and stay resident"? Good old times!
Then Unix: multiprocessing, multithreading, etc. :)
You can't ask a question without already knowing the answer to it.
Try asking about something you know nothing about; that's impossible! You're asking because you already have an answer. Look inside your question. The answer is evident. :)


Threadpool - CPU usage?

I am working on a Windows C++ application. We use the Boost library. I have an operation in my application that can be parallelized to run on multiple threads. The number of threads depends on the operation parameters each time and can be large (say, 50 or 70). I don't want to spawn as many threads as I can, since that risks the application becoming unresponsive to other operations (all the processor(s) could be occupied doing this). How can I make sure I don't create the situation I described? Would a thread pool help, and if so, how?
70 threads on modern hardware can easily be handled without any noticeable impact on system performance. Thread creation time, memory usage, scheduling and context-switch overhead can be a problem, but we don't know whether they are a problem in your particular case.
If creating 70 threads is not an option, consider using OpenMP (supported by all major compilers), as it's a very simple and often very efficient solution:
#pragma omp parallel for
for (int i = 1; i < 100; ++i)
    do_work(i);  // hypothetical per-iteration work; iterations must be independent
It uses a thread pool under the hood.
If OpenMP is not acceptable for some reason, you can go with an explicit thread pool. It can be a "home-made" thread pool (not recommended), the one from @sehe's answer, one provided by the OS (as @Hans Passant mentioned in his comment), or one from a 3rd-party library (e.g. Intel Threading Building Blocks).
Yes, a thread pool can help with responsiveness, though a typical thread pool implementation by default creates as many threads as there are logical CPU cores. This means all your cores can be busy doing your work, and that is not necessarily a problem: Windows uses preemptive multithreading, so it can handle a number of threads much greater than the number of CPUs and still stay responsive.
A thread pool can help because it's not possible to execute more tasks simultaneously than the number of logical CPU cores you have anyway. It can also be more efficient because of better cache use and fewer context switches, or because the same threads can be reused to execute your operation multiple times. To know for sure, profile your performance.
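A minimal sketch of such a pool, written with only the C++11 standard library (neither Boost nor the Windows thread pool); the class `ThreadPool` and its `submit` method are hypothetical names. Sizing it from std::thread::hardware_concurrency() mirrors the "threads == logical cores" default mentioned above; a real pool would also need exception handling and a way to return results.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Fixed number of worker threads pulling tasks from one shared queue,
// so submitting 70 tasks never means 70 running threads.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 1;  // hardware_concurrency() may return 0 if unknown
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Signals shutdown, then waits for the workers; they drain the queue first.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();  // wake one idle worker
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (stopping_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can dequeue
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stopping_ = false;
};
```

Submitting 70 tasks to a pool of 4 workers means at most 4 of them run at once; the destructor waits until the queue has been drained.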
Just create a thread pool, e.g. the one I posted here: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
Two more flavours here: c++ work queues with blocking (one using Asio, one using just C++11)
You can use std::async with the default launch policy. However, this is not the same as a thread pool.
In OpenMP, you can set a fixed number of threads and then use OpenMP tasks. Unfortunately, there is no such option in C++11. The Standard lets the implementation choose whether the function is invoked asynchronously, as if in a new thread, or deferred and invoked synchronously in the thread that calls wait or get on the corresponding std::future object; however, when asynchronous invocation is selected, the function must still run as if in a new thread.
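A sketch of the std::async route using only the standard library; `sum_range` and `parallel_sum` are hypothetical names standing in for "one chunk of your operation". Passing the policy explicitly makes the difference described above visible: std::launch::async runs each call as if on a new thread, std::launch::deferred runs it lazily on the thread calling get(), and the default (std::launch::async | std::launch::deferred) leaves the choice to the implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Sums v[begin, end); stands in for one chunk of the parallelizable operation.
long long sum_range(const std::vector<int>& v, std::size_t begin, std::size_t end) {
    return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(begin),
                           v.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
}

// Splits the work into roughly `chunks` pieces, launches each with std::async
// under the given policy, then collects the results through the futures.
long long parallel_sum(const std::vector<int>& v, std::size_t chunks,
                       std::launch policy) {
    assert(chunks > 0);
    std::vector<std::future<long long>> parts;
    const std::size_t step = (v.size() + chunks - 1) / chunks;  // ceiling division
    for (std::size_t b = 0; b < v.size(); b += step) {
        const std::size_t e = std::min(b + step, v.size());
        // std::cref avoids copying the vector into every task.
        parts.push_back(std::async(policy, sum_range, std::cref(v), b, e));
    }
    long long total = 0;
    for (auto& f : parts) total += f.get();  // a deferred task runs here, on this thread
    return total;
}
```

Note that nothing here caps the number of threads: with std::launch::async and 70 chunks you may get 70 threads, which is exactly why this is not a thread pool.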

OpenMP thread mapping to physical cores

I've looked around online for some time to no avail. I'm new to OpenMP and not sure of the terminology here, but is there a way to find out a specific machine's mapping from OpenMP thread numbers (given by omp_get_thread_num()) to the physical cores on which the threads will run?
I'm also interested in how exactly OpenMP assigns threads: for example, will thread 0 always run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
Most of the time Reed is correct: OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality, we have found many cases where assigning threads to cores increases the performance of OpenMP. Therefore, if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to each OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec and, if so, how best to do it.
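When threads are bound, the mapping is typically a fixed function of the thread number. The two placement functions below are hypothetical illustrations of common strategies (packing neighbouring threads together for cache reuse versus spreading them across sockets); the actual policy, and how it is selected, depends on the OpenMP implementation. Inside a parallel region, each thread would compute its slot from omp_get_thread_num() and pin itself, e.g. with SetThreadAffinityMask(GetCurrentThread(), mask) on Windows; that call is omitted so the sketch stays portable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical "compact" placement: thread t -> logical processor t % n.
int compact_cpu(int omp_thread_num, int n_logical) {
    assert(omp_thread_num >= 0 && n_logical > 0);
    return omp_thread_num % n_logical;
}

// Hypothetical "scatter" placement for `sockets` packages of `per_socket`
// logical processors, numbered socket by socket: consecutive threads land
// on different sockets, spreading load across them.
int scatter_cpu(int omp_thread_num, int sockets, int per_socket) {
    assert(omp_thread_num >= 0 && sockets > 0 && per_socket > 0);
    int t = omp_thread_num % (sockets * per_socket);
    return (t % sockets) * per_socket + t / sockets;
}

// Single-bit affinity mask for one logical processor (bit i = processor i).
std::uint64_t single_cpu_mask(int cpu) {
    assert(cpu >= 0 && cpu < 64);
    return std::uint64_t{1} << cpu;
}
```

With a fixed policy like either of these, thread 0 lands on the same processor on every run of the same configuration; without binding, the OS is free to move it.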

Are threads from multiple processes actually running at the same time?

In a Windows operating system with 2 physical x86/amd64 processors (P0 + P1), running 2 processes (A + B), each with two threads (T0 + T1), is it possible (or even common) to see the following:
P0:A:T0 running at the same time as P1:B:T0
then, after 1 (or is that 2?) context switch(es?)
P0:B:T1 running at the same time as P1:A:T1
In a nutshell, I'd like to know if - on a multiple processor machine - the operating system is free to schedule any thread from any process at any time, regardless of what other threads from other processes are already running.
To clarify the silly example, imagine that process A's thread A:T0 has affinity to processor P0 (and A:T1 to P1), while process B's thread B:T0 has affinity to processor P1 (and B:T1 to P0). It probably doesn't matter whether these processors are cores or sockets.
Is there a first-class concept of a process context switch? Perfmon shows context switches under the Thread object, but nothing under the Process object.
Yes, it is possible, and it happens pretty often. The OS tries not to switch one thread between CPUs (you can make it try harder by setting the thread's preferred processor, or you can even lock it to a single processor via affinity). A Windows process is not an execution unit by itself; from this viewpoint, it's basically just a context for its threads.
EDIT (further clarifications)
There's nothing like a "process context switch". Basically, the OS scheduler assigns the threads via a (very adaptive) round-robin algorithm to any free processor/core (as the affinity allows), if the "previous" processor isn't immediately available, regardless of the processes (which means multi-threaded processes can steal much more CPU power).
This "jumping" may seem expensive, considering that at least the L1 (and sometimes L2) caches are per-core (apart from processors in different slots/packages), but it's still cheaper than the delays caused by waiting for the "right" processor and the inability to do elaborate load-balancing (which the "jumping" scheme makes possible). This may not apply to NUMA architectures, but there are many more considerations involved there (e.g. adapting all memory allocations to be thread- and processor-bound and avoiding as much state/memory sharing as possible).
As for affinity: you can set affinity masks per thread or per process (which supersedes all of the process's threads' settings), but the OS enforces at least one logical processor per thread (you never end up with a zero mask).
A process' default affinity mask is inherited from its parent process (which allows you to create single-core loaders for problematic legacy executables), and threads inherit the mask from the process they belong to.
You may not set a thread's affinity to a processor outside the process's affinity, but you can further limit it.
By default, any thread will jump between the available logical processors (especially if it yields, calls into the kernel, etc.), and it may jump even if it has a preferred processor set, but only if it has to. It will NOT, however, jump to a processor outside its affinity mask (which may lead to considerable delays).
I'm not sure if the scheduler sees any difference between physical and hyper-threaded processors, but even if it doesn't (which I assume), the consequences are in most cases not of a concern, i.e. there should not be much difference between multiple threads sharing physical or logical processors if the thread count is just the same. Regardless, there are some reports of cache-thrashing in this scenario, mainly in high-performance heavily multithreaded applications, like SQL server or .NET and Java VMs, which may or may not benefit from HyperThreading turned off.
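The affinity and "jumping" rules above can be summarised in a toy model. It illustrates the described behaviour only and is NOT the real Windows scheduler, which also weighs priorities and much more; `thread_mask_is_valid` and `pick_processor` are hypothetical, and choosing the lowest-numbered idle processor is an arbitrary simplification.

```cpp
#include <cassert>
#include <cstdint>

// A thread mask is acceptable only if it is non-empty and a subset of the
// process mask (bit i = logical processor i).
bool thread_mask_is_valid(std::uint64_t process_mask, std::uint64_t thread_mask) {
    return thread_mask != 0 && (thread_mask & ~process_mask) == 0;
}

// Picks a processor for a ready thread: stay on the previous processor if it
// is idle and allowed; otherwise "jump" to the lowest-numbered idle processor
// inside the affinity mask. Returns -1 if none is idle (the thread waits).
int pick_processor(std::uint64_t affinity, std::uint64_t idle, int previous) {
    assert(affinity != 0);  // an empty affinity mask is never allowed
    std::uint64_t candidates = affinity & idle;
    if (candidates == 0) return -1;
    if (previous >= 0 && previous < 64 &&
        (candidates & (std::uint64_t{1} << previous)) != 0) {
        return previous;  // prefer where it last ran (warm caches)
    }
    for (int i = 0; i < 64; ++i) {
        if (candidates & (std::uint64_t{1} << i)) return i;
    }
    return -1;  // not reached: candidates is non-zero
}
```

The last case in the test below is the "considerable delay" mentioned above: the only allowed processor is busy, so the thread waits even though other processors are idle.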
I generally agree with the previous answer, however things are more complex.
Although processes are not execution units, threads belonging to the same process should be treated differently. There are two reasons for this:
1. Same address space. This means that when switching context between such threads, there is no need to set up the address translation registers.
2. Threads of the same process are much more likely to access the same memory.
Reason (2) has a great impact on the cache state. If the threads read the same memory location, they reuse the L2 cache, so the whole thing speeds up. There is a drawback too, however: once a thread changes a memory location, that address is invalidated in the other processor's cache, so the other processor has to reload it.
So there are pros and cons to running the threads of the same process simultaneously (on different processors). BTW, this situation has a name: "Gang scheduling".
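The invalidation drawback also bites when threads write to different variables that happen to share a cache line ("false sharing"). A common mitigation is to keep each thread's frequently written data on its own line. The sketch assumes 64-byte cache lines (typical for x86/amd64); `PaddedCounter` and `count_in_parallel` are hypothetical names, and std::vector only honours the over-alignment from C++17 on.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. alignas(64) gives every counter its own
// line, so a write by one thread does not invalidate another thread's counter.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Each thread increments only its own counter; the results are summed at the end.
long count_in_parallel(int n_threads, long iterations_per_thread) {
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&counters, t, iterations_per_thread] {
            for (long i = 0; i < iterations_per_thread; ++i) ++counters[t].value;
        });
    }
    for (auto& th : threads) th.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Removing the alignas does not change the result, only (potentially) the speed, which is why false sharing tends to show up in profiles rather than as bugs.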

WIN32: Yielding execution to another (given) thread

I am looking for a way to yield the remainder of a thread's scheduled time slice to a different thread. There is a SwitchToThread function in the WinAPI, but it doesn't let the caller specify the thread it wants to switch to. I browsed MSDN for quite some time and haven't found anything that offers just that.
For an operating-system-internals layman like me, it seems that a yielding thread should be able to specify which thread it wants to pass execution to. Is it possible, or is it just my imagination?
The reason you can't yield processor time-slices to a designated thread is that Windows features a preemptive scheduling kernel which pretty much places the responsibility and authority of scheduling the processor time in the hands of the kernel and only the kernel.
As such, threads don't have any control over when they run or whether they run, and even less over which thread is switched to after their time slice is up.
However, there are a few ways you may influence context switches:
by increasing the priority of a certain thread, you may force the scheduler to schedule it more often, to the detriment of other threads (obviously the reverse applies as well: you can lower the priority of other threads)
you can code your process to place threads in a kernel wait state when they don't have work to do, in order to help the scheduler do its job. When using proper kernel wait constructs such as Critical Sections, Mutexes, Semaphores, and Timers, you effectively tell the kernel a certain thread doesn't need to be scheduled until a certain condition is met.
Note: There is rarely a reason you should tamper with task priorities so USE WITH CAUTION
You might use 'fibers' instead of 'threads': for example there's a Win32 API named SwitchToFiber which lets you specify the fiber to be scheduled.
Take a look at UMS (User-mode scheduling) threads in Windows 7
The second thread can simply wait for the yielding thread either by calling WaitForSingleObject() on its handle or periodically polling GetExitCodeThread(). The other answers are correct about altering the operating system's scheduling mechanisms - it is better to design the threads properly in the first place.
This is not possible. Only the kernel can decide what code runs next though you can influence it by reducing the non-waiting threads it has to choose from to run next, and by setting thread priorities with SetThreadPriority.
You can use regular synchronization primitives like events, semaphores, etc. to serialize your two threads. This does not in any way prevent the kernel from scheduling other threads in between, or in parallel on another CPU core, or virtually simultaneously on the same core. This is due to the preemptive multitasking nature of modern general-purpose operating systems.
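A portable sketch of serializing two threads with a blocking primitive, using std::mutex and std::condition_variable as a stand-in for a Win32 event (SetEvent / WaitForSingleObject); `ValueHandoff` and `compute_on_other_thread` are hypothetical names. It only controls ordering: the waiting thread is not runnable until the value is published, but the kernel is still free to schedule other threads in between.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Stand-in for an event object: the waiter blocks instead of spinning,
// so the scheduler does not pick it until the value has been published.
class ValueHandoff {
public:
    void publish(int v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            value_ = v;
            ready_ = true;
        }
        cv_.notify_one();  // analogous to SetEvent: wakes the waiter
    }

    int wait_for_value() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return ready_; });  // analogous to WaitForSingleObject
        return value_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int value_ = 0;
    bool ready_ = false;
};

// Thread X (the producer) computes a value; thread Y (the caller) waits for it.
int compute_on_other_thread(int input) {
    ValueHandoff handoff;
    std::thread producer([&handoff, input] { handoff.publish(input * input); });
    int result = handoff.wait_for_value();
    producer.join();
    return result;
}
```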
If you want to do your own scheduling under Windows, you can use fibers, which essentially are threads that you have to schedule yourself. However, given that you describe yourself as a layman to the OS internals world, that would probably be a bad idea, as fibers are something of an advanced feature.
Can I ask why you want to use SwitchToThread?
If, for example, it's because thread X is computing some value that you want to wait for on thread Y, then I'd really suggest looking at the Parallel Patterns Library or the Asynchronous Agents Library in Visual Studio 2010, which allow you to do this either with message blocks (receive on an asynchronous value) or simply via tasks: wait for a set of tasks to complete and inline their execution while waiting...
#include <ppl.h>
// i.e. on an arbitrary thread
concurrency::task_group tasks;
tasks.run([] { /* some functor */ });
A call to tasks.wait() will wait for the tasks and may inline any of them that haven't started yet.

Win32 Thread scheduling

As I understand it, the Windows thread scheduler does not discriminate between threads belonging to two different processes, provided all of them have the same base priority. My question is: if I have two applications, one with only one thread and the other with, say, 50 threads, all with the same base priority, does that mean the second process enjoys more CPU time than the first one?
Scheduling in Windows is at the thread granularity. The basic idea behind this approach is that processes don't run but only provide resources and a context in which their threads run. Coming back to your question, because scheduling decisions are made strictly on a thread basis, no consideration is given to what process the thread belongs to. In your example, if process A has 1 runnable thread and process B has 50 runnable threads, and all 51 threads are at the same priority, each thread would receive 1/51 of the CPU time—Windows wouldn't give 50 percent of the CPU to process A and 50 percent to process B.
To understand the thread-scheduling algorithms, you must first understand the priority levels that Windows uses. You can refer here for quick reference.
Try reading Windows Internals for in depth understanding.
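The 1/51 figure follows from an idealized model, assuming a single processor, all 51 threads permanently runnable at the same priority, and no dynamic priority adjustments. With r_P the number of runnable threads in process P, each process's share is its thread count over the total:

```latex
\[
\mathrm{share}(P) = \frac{r_P}{\sum_Q r_Q}, \qquad
\mathrm{share}(A) = \frac{1}{1+50} = \frac{1}{51} \approx 1.96\,\%, \qquad
\mathrm{share}(B) = \frac{50}{51} \approx 98.04\,\%
\]
```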
All of the above is accurate, but if you're worried about the 50-thread process hogging all the CPU, there ARE techniques you can use to ensure that no single process overwhelms the CPU.
IMHO the best way to do this is to use job objects to manage the usage of a process. First call CreateJobObject, then SetInformationJobObject to limit the max CPU usage of the processes in the job object and AssignProcessToJobObject to assign the process with 50 threads to the job object. You can then let the OS ensure that the 50 thread process doesn't consume too much CPU time.
The unit of scheduling is a thread, not a process, so a process with 50 threads, all in a tight loop, will get much more of the cpu than a process with only a single thread, provided all are running at the same priority. This is normally not a concern since most threads in the system are not in a runnable state and will not be up for scheduling; they are waiting on I/O, waiting for input from the user, and so on.
Windows Internals is a great book for learning more about the Windows thread scheduler.
That depends on the behavior of the threads. In general with a 50 : 1 difference in thread count, yes, the application with more threads is going to get a lot more time. However, windows also uses dynamic thread prioritization, which can change this somewhat. Dynamic thread prioritization is described here:
Relevant excerpt:
The base priority of a thread is the base level from which these upward adjustments are made. The current priority of a thread is called its dynamic priority. Interactive threads that yield before their time slice is up will tend to be adjusted upward in priority from their base priority. Compute-bound threads that do not yield, consuming their entire time slice, will tend to have their priority decreased, but not below the base level. This arrangement is often called heuristic scheduling. It provides better interactive performance and tends to lessen the system impact of "CPU hog" threads.
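A toy model of the heuristic described in the excerpt, NOT Windows' actual boost rules: a thread that yields early is nudged up, a thread that uses its whole time slice is nudged down, and the dynamic priority never drops below the base. The `ThreadPriority` struct, the `after_time_slice` function, the +1/-1 step, and the `ceiling` field are all hypothetical illustration choices.

```cpp
#include <algorithm>
#include <cassert>

struct ThreadPriority {
    int base;     // base priority: the floor for downward adjustments
    int dynamic;  // current (dynamic) priority
    int ceiling;  // hypothetical upper limit for upward adjustments
};

// Called at the end of each time slice for one thread.
void after_time_slice(ThreadPriority& p, bool yielded_early) {
    assert(p.base <= p.dynamic && p.dynamic <= p.ceiling);
    if (yielded_early) {
        // Interactive behaviour: nudge the priority up, but not past the ceiling.
        p.dynamic = std::min(p.dynamic + 1, p.ceiling);
    } else {
        // Compute-bound behaviour: nudge it down, but never below the base.
        p.dynamic = std::max(p.dynamic - 1, p.base);
    }
}
```

After a few slices an interactive thread sits above a CPU-bound one with the same base priority, which is how the 50-thread "hog" can end up yielding to the single interactive thread more often than a pure thread count would suggest.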
There is a local 'advanced' setting that purportedly can be used to shade scheduling slightly in favor of the app with focus. With the 'services' setting, there is no preference. In previous versions of Windows, this setting used to be somewhat more granular than just 'applications with focus' (slight preference to the app with focus) and 'services' (all equally weighted).
As this can be set by the user on the target machine, it seems like asking for grief to depend on this setting...
