Thread pool - CPU usage? - Windows

I am working on a Windows C++ application. We use the Boost library. I have an operation in my application that can be parallelized to run on multiple threads. The number of threads depends on the operation's parameters each time and can be large (say, 50 or 70). I don't want to spawn the maximum number of threads I can, since that risks making the application unresponsive to other operations (all the processors could be occupied doing this work). How can I make sure I don't create the situation I described? Would a thread pool help, and if so, how?

70 threads on modern hardware can easily be handled without any noticeable impact on system performance. Thread creation time, memory usage, scheduling and context-switch overhead can be a problem, but we don't know whether they are in your particular case.
If creating 70 threads is not an option, consider using OpenMP (supported by all major compilers), as it's a very simple and often very efficient solution:
#pragma omp parallel for
for (int i = 1; i < 100; ++i)
{
    do_task(i);
}
It uses a thread pool under the hood.
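To keep the machine responsive (the original concern), you can also cap the number of workers explicitly. A minimal sketch using the standard num_threads clause, with do_task as the same placeholder as above:

void do_task(int i) { /* your parallelizable work */ }

void run_parallel()
{
    // Cap the team at 4 worker threads no matter how many iterations
    // there are, leaving the remaining cores free for the rest of the app.
    #pragma omp parallel for num_threads(4)
    for (int i = 1; i < 100; ++i)
    {
        do_task(i);
    }
}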
If OpenMP is not acceptable for some reason, you can go with an explicit thread pool. It can be a "home-made" thread pool (not recommended), one from #sehe's answer, one provided by the OS (as #Hans Passant mentioned in his comment), or one from a third-party library (e.g. Intel Threading Building Blocks).
Yes, a thread pool can help with responsiveness, though a typical thread pool implementation by default creates as many threads as there are logical CPU cores. This means all your cores can be busy doing your work, and that's not necessarily a problem: Windows uses preemptive multithreading, which means it can handle far more threads than CPUs and still remain responsive.
A thread pool can help because it makes it impossible to execute more tasks simultaneously than the number of logical CPU cores you have. A thread pool can also be more efficient because of better cache use and a reduced number of context switches, or because the same threads can be reused to execute your operation multiple times. To know for sure, profile your performance.
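Since the question already uses Boost, here is a minimal sketch of the capped-pool idea with boost::asio::thread_pool (available since Boost 1.66); do_task is a placeholder for the real operation:

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>

void do_task(int i) { /* the real work */ }

void run_tasks()
{
    // Four worker threads total, no matter how many tasks are posted;
    // excess tasks queue up until a worker becomes free.
    boost::asio::thread_pool pool(4);

    for (int i = 0; i < 70; ++i)
        boost::asio::post(pool, [i] { do_task(i); });

    pool.join();  // blocks until all posted tasks have run
}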

Just create a thread pool, e.g. the one I posted here: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
Two more flavours here: c++ work queues with blocking (one using Asio, one using just C++11)

You can use std::async with the default launch policy. However, this is not the same as a thread pool.
In OpenMP, you can set a fixed number of threads and then use OpenMP tasks. Unfortunately, there is no such option in C++11. The Standard says that the choice of whether the function will be invoked asynchronously in a new thread, or synchronously in the thread that calls wait or get on the corresponding std::future object, can be deferred; however, a new thread must still be created whenever asynchronous invocation is selected.
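For illustration, a minimal sketch of the default policy (std::launch::async | std::launch::deferred), where the implementation, not a pool, decides whether each call gets a new thread; do_task is a placeholder:

#include <future>
#include <vector>

int do_task(int i) { return i * 2; }  // placeholder work

void run_async()
{
    std::vector<std::future<int>> results;
    for (int i = 0; i < 70; ++i)
    {
        // Default policy: may start a new thread per task, or defer the
        // call until get()/wait(); no thread reuse is guaranteed.
        results.push_back(std::async(do_task, i));
    }
    for (auto& f : results)
        f.get();  // deferred tasks execute here, in the calling thread
}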

Related

May I have Project Loom Clarified?

Brian Goetz got me excited about Project Loom and, in order to fully appreciate it, I'll need some clarification on the status quo.
My understanding is as follows: currently, in order to have real parallelism, we need to have a thread per CPU/core. 1) Is there then any point in having n+1 threads on an n-core machine? Project Loom will bring us virtually limitless threads/fibres by relying on the JVM to carry out a task on a virtual thread, inside the JVM. 2) Will that be truly parallel? 3) How, specifically, will that differ from the aforementioned scenario of "n+1 threads on an n-core machine"?
Thanks for your time.
Virtual threads allow for concurrency (I/O bound), not parallelism (CPU bound). They represent causal simultaneity, but not resource-usage simultaneity.
In fact, if two virtual threads are in an I/O-bound* state (awaiting a return from a REST call, for example), then no thread is being used at all. Normal threads (if not using a reactive or completable semantic), by contrast, would both be blocked and unavailable for use until the calls complete.
*Except under certain conditions (e.g., use of synchronized vs. ReentrantLock, blocking that occurs in a native method, and possibly some other minor areas).
is there then any point in having n+1 threads on an n-core machine?
For one, most modern n-core machines have n*2 hardware threads because each core has 2 hardware threads.
Sometimes it does make sense to spawn more OS threads than hardware threads. That's the case when some OS threads are asleep waiting for something. For instance, on Linux, until io_uring arrived a couple of years ago, there was no good way to implement asynchronous I/O for files on local disks. Traditionally, disk-heavy applications spawned more threads than CPU cores and used blocking I/O.
Will that be truly parallel?
Depends on the implementation. Not just the language runtime, but also the I/O-related parts of the standard library. For instance, on Windows, when doing disk or network I/O in C# with async/await (an equivalent of Project Loom, released around 2012), these tasks are truly parallel: the OS kernel and drivers are indeed doing more work at the same time. AFAIK on Linux, async/await is only truly parallel for sockets but not files; for asynchronous file I/O it uses a pool of OS threads under the hood.
How, specifically, will that differ from the aforementioned scenario "n+1 threads on an n-core machine "?
OS threads are more expensive for a few reasons:
(1) They require a native stack, so each OS thread consumes memory.
(2) Memory is slow, and processors have caches to compensate; switching between OS threads increases memory traffic, because thread-specific cached data is invalidated after a context switch.
(3) OS schedulers have been improving for decades, but they are still not free. One reason is that saving/restoring thread state to/from memory takes time.
The higher-level cooperative multitasking implemented in C# async/await or Java’s Loom causes way less overhead when switching contexts, compared to switching OS threads. At least in theory, this should improve both throughput and latency for I/O heavy applications.

Could process running only on one processor have threads running on other processors?

Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run on only one processor (affinity mask = 1, or SetProcessAffinityMask(GetCurrentProcess(), 1)), but its threads are spawned on other processors?
(The question came from a discussion started in one company regarding the use of synchronization objects (Events, Mutexes, Semaphores) and WinAPIs like WaitForSingleObject, etc., especially SignalObjectAndWait, for which MSDN states:
"Note that the "signal" and "wait" are not guaranteed to be performed
as an atomic operation. Threads executing on other processors can
observe the signaled state of the first object before the thread
calling SignalObjectAndWait begins its wait on the second object"
Does that mean that for a single processor it's guaranteed to be atomic?)
P.S. Are there any differences in Windows context switching between having multiple processors and having a single processor with multiple real cores?
P.P.S. Please be patient with this question if I didn't use exact and concrete terms; this area is still not very well known to me.
No.
The set of processor cores a thread can run on is the intersection of the process affinity mask and the thread affinity mask.
To get the behavior you describe, one would set the thread affinity mask for the main thread, and not mess with the process mask.
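A minimal sketch of that approach, restricting only the calling (main) thread to the first logical processor while leaving the process mask untouched:

#include <windows.h>

bool pin_main_thread_to_cpu0()
{
    // Bit 0 = first logical processor. Threads created later keep the
    // full process affinity mask and may run on any core.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    return previous != 0;  // a return of 0 means the call failed
}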
For your follow-up questions: if it isn't atomic, it isn't atomic. There are additional guarantees for threads sharing a core, because preemption follows certain rules, but they are very complex, since relative priority and dynamic priority are important factors in thread scheduling. Because of this complexity, it is best to use proper synchronization.
Notably, race conditions between threads of equal priority certainly still exist on a single-core (or single-core-restricted) system, but they are far less frequent and therefore far more difficult to find and debug.
Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run on only one processor (affinity mask = 1, or SetProcessAffinityMask(GetCurrentProcess(), 1)), but its threads are spawned on other processors?
If CPU affinity is not set to only one core, could one process run on multiple cores?
What's the difference between processes and threads?
Could a thread have processes, or could a process have threads?
Could a process be seen from a thread's point of view, or vice versa?
What is the notion of atomicity?
When could the number 1 be seen as a multidimensional unit?
Could we divide 1/0 (by zero)? When could we, and when couldn't we?
Does it mean that for a single processor it's guaranteed to be atomic?
One CPU: do you remember "run and stay resident"? Good old times!
Then Unix: multiprocessing, multithreading, etc. :)
Note:
You couldn't ask a question without knowing the answer to that question.
Try to ask something you don't know; that's impossible! You're asking because you have an answer. Look inside your question; the answer is evident. :)

Instruct win32 threads to run on a single processor core

I have a test program which would be much simpler if it could rely on threads being scheduled in strict priority order on Windows. I'm seeing a low-priority thread running alongside higher-priority threads and wonder if this is happening because the different threads are being scheduled on different processor cores.
Is there a way to force all Win32 threads in a process to use a single processor core? SetThreadAffinityMask looks like it might be interesting, but its docs aren't entirely clear and I'm not sure how to use it.
SetThreadAffinityMask function: Sets a processor affinity mask for the specified thread.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247%28v=vs.85%29.aspx
SetThreadAffinityMask(GetCurrentThread(), (1 << CoreNumber));
Sets the current thread's affinity to the core given by the CoreNumber variable.
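If the goal really is to keep every thread in the process on one core (as the question asks), setting the process-wide mask is the simpler lever; a minimal sketch:

#include <windows.h>

// Restrict the whole process (all current and future threads) to the
// first logical processor so the priority experiment runs on one core.
bool pin_process_to_cpu0()
{
    return SetProcessAffinityMask(GetCurrentProcess(), 1) != FALSE;
}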
Even if you force all threads onto one virtual processor, you will still often have low-priority threads running while high-priority threads wait for them (priority inversion). Once a thread is scheduled by the Windows scheduler, it runs until it is preempted or sleeps (or makes some other sleep-inducing system call). You will have to change the design of your application so that it no longer assumes that no low-priority thread runs while a high-priority thread is ready to run.

OpenMP thread mapping to physical cores

So I've looked around online for some time to no avail. I'm new to using OpenMP and so not sure of the terminology here, but is there a way to figure out a specific machine's mapping between OpenMP threads (given by omp_get_thread_num()) and the physical cores on which those threads will run?
Also, I was interested in how exactly OpenMP assigns threads. For example, is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
Most of the time Reed is correct: OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality, we have found that there are many cases where binding threads to cores increases the performance of OpenMP. Therefore, if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to each OpenMP implementation to decide if and how it should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec, and if so, how it could best be done.
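As a sketch of doing such binding by hand on Windows (assuming the implementation maps each OpenMP thread to its own OS thread, which is typical but not guaranteed by the spec):

#include <omp.h>
#include <windows.h>

void bind_openmp_threads_to_cores()
{
    #pragma omp parallel
    {
        // Pin OpenMP thread k to logical processor k; assumes the machine
        // has at least omp_get_num_threads() logical processors.
        int k = omp_get_thread_num();
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << k);
    }
}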

WIN32: Yielding execution to another (given) thread

I am looking for a way to yield the remainder of the thread's scheduled time slice to a different thread. There is a SwitchToThread function in the Win32 API, but it doesn't let the caller specify the thread it wants to switch to. I browsed MSDN for quite some time and haven't found anything that offers just that.
For an operating-system-internals layman like me, it seems that the yielding thread should be able to specify which thread it wants to pass execution to. Is that possible, or is it just my imagination?
The reason you can't yield processor time slices to a designated thread is that Windows has a preemptive scheduling kernel, which places the responsibility and authority for scheduling processor time in the hands of the kernel, and only the kernel.
As such, threads don't have any control over when they run or whether they run, and even less over which thread is switched to after their time slice is up.
However, there are a few ways you may influence context switches:
by increasing the priority of a certain thread, you may force the scheduler to schedule it more often, to the detriment of other threads (obviously the reverse applies as well: you can lower the priority of other threads); a minimal sketch follows after this list
you can code your process to place threads in a kernel wait state when they don't have work to do, in order to help the scheduler do its job. When using proper kernel wait constructs such as Critical Sections, Mutexes, Semaphores, and Timers, you effectively tell the kernel that a certain thread doesn't need to be scheduled until a certain condition is met.
Note: there is rarely a reason you should tamper with thread priorities, so USE WITH CAUTION.
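For completeness, a minimal sketch of the priority knob mentioned in the first point above:

#include <windows.h>

// Nudge the scheduler: among ready threads, higher-priority ones are
// picked first. This influences, but does not control, what runs next.
void boost_current_thread_priority()
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
}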
You might use "fibers" instead of threads: for example, there's a Win32 API named SwitchToFiber which lets you specify the fiber to be scheduled.
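A minimal sketch of that (error handling omitted): the main thread converts itself to a fiber, creates a worker fiber, and the two hand execution to each other explicitly:

#include <windows.h>

LPVOID g_mainFiber = nullptr;

VOID CALLBACK WorkerFiber(LPVOID /*param*/)
{
    // ... do some work ...
    // A fiber must switch away rather than return, or the thread exits.
    SwitchToFiber(g_mainFiber);
}

int main()
{
    g_mainFiber = ConvertThreadToFiber(nullptr);
    LPVOID worker = CreateFiber(0, WorkerFiber, nullptr);

    SwitchToFiber(worker);  // runs WorkerFiber until it switches back

    DeleteFiber(worker);
    return 0;
}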
Take a look at UMS (User-mode scheduling) threads in Windows 7
http://msdn.microsoft.com/en-us/library/dd627187(VS.85).aspx
The second thread can simply wait for the yielding thread, either by calling WaitForSingleObject() on its handle or by periodically polling GetExitCodeThread(). The other answers are correct about altering the operating system's scheduling mechanisms: it is better to design the threads properly in the first place.
This is not possible. Only the kernel can decide what code runs next, though you can influence it by reducing the set of non-waiting threads it has to choose from, and by setting thread priorities with SetThreadPriority.
You can use regular synchronization primitives like events, semaphores, etc. to serialize your two threads. This does not in any way prevent the kernel from scheduling other threads in between, in parallel on another CPU core, or virtually simultaneously on the same core. This is due to the preemptive multitasking nature of modern general-purpose operating systems.
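A minimal sketch of such serialization with an event: thread B blocks until thread A signals that its phase is done, while the kernel still decides exactly when each thread runs:

#include <windows.h>

HANDLE g_done;  // auto-reset event, created in main()

DWORD WINAPI ThreadB(LPVOID)
{
    WaitForSingleObject(g_done, INFINITE);  // blocks until A signals
    // ... B's work, guaranteed to start after A's phase completed ...
    return 0;
}

int main()
{
    g_done = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    HANDLE b = CreateThread(nullptr, 0, ThreadB, nullptr, 0, nullptr);

    // ... A's work ...
    SetEvent(g_done);  // let B proceed; the scheduler picks the moment

    WaitForSingleObject(b, INFINITE);
    CloseHandle(b);
    CloseHandle(g_done);
    return 0;
}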
If you want to do your own scheduling under Windows, you can use fibers, which essentially are threads that you have to schedule yourself. However, given that you describe yourself as a layman to the OS internals world, that would probably be a bad idea, as fibers are something of an advanced feature.
Can I ask why you want to use SwitchToThread?
If, for example, it's because thread X is computing some value that you want to wait for on thread Y, then I'd really suggest looking at the Parallel Patterns Library or the Asynchronous Agents Library in Visual Studio 2010, which let you do this either with message blocks (receive an asynchronous value) or simply via tasks: wait for a set of tasks to complete and inline their execution while waiting...
// i.e. on an arbitrary thread
#include <ppl.h>
concurrency::task_group tasks;
tasks.run([] { /* some functor */ });
A call to tasks.wait() will block and inline any tasks still running.
