TBB: parallel_for and parallel_invoke

TBB: parallel_for and parallel_invoke - parallel-processing

I am looking to use parallel invoke on two functions, these two functions are themselves tbb::parallel_for functions.
My question is this even possible and if so what will be the effect of this on performance on an 8 CPU machine.
Thanks

Yes, this is possible. You will need to wrap parallel_for into functors or lambdas to pass to parallel_reduce.
The effect on performance depends on what the code does. But if your question is really about the number of thread and CPU utilization: there will be 8 threads running, one of those is the main application thread and 7 more will be created by TBB.

Related

Threadpool - CPU usage?

I am working on a Windows C++ application. We use the boost library. I have an operation in my application that can be parallelized to run on multiple threads. Number of threads depends each time on the operation parameters and can be big(say like 50 or 70). I dont want to spawn the maximum threads that I can, since that is a risk of the application being non-responsive to other operations(since the all the processor(s) could be occupied doing this). How can I make sure I dont create a situation I described? Would a threadpool help and if so how?

70 threads on modern hardware can be easily handled w/o any noticeable impact on system performance. Thread creation time, memory usage, scheduling and context switch overhead can be a problem but we don't know if it's a problem in your particular case.
If creating 70 threads is not an option, consider using OpenMP (supported by all major compilers) as it's a very simple and often very efficient solution:
#pragma omp parallel for
for(int i = 1; i < 100; ++i)
{
do_task(i);
}
It uses a thread pool under the hood.
If OpenMP is not acceptable for some reason(s), you can go with explicit thread pool. It can be a "home-made" thread pool (not recommended), or one from #sehe's answer, or one that is provided by OS (as #Hans Passant mentioned in his comment), or one from a 3rd-party library (e.g. Intel Threading Building Blocks).
Yes, thread pool can help with responsiveness, though typical thread pool implementation by default creates number of threads == number of logical CPU cores. This means all your cores can be busy doing your work and it's not necessarily a problem. Windows uses preemptive multithreading. This means it can handle number of threads much greater than number of CPUs and still being responsive.
Thread pool can help because it's not possible to simultaneously execute more tasks than number of logical CPU cores you have. Thread pool can be more efficient because of better use of caches and reduced number of context switches. Or because same threads can be used to execute your operation multiple times. To know for sure profile your performance.

Just create a thread pool, e.g. the one I posted here boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
Two more flavours here c++ work queues with blocking (one using Asio, one using just C++11)

You can use std::async with default launch policy. However, this is not the same as thread pool.
In OpenMP, you can set a fixed number of threads and then use OpenMP tasks. Unfortunately, there is no such option in C++11. The Standard says that the choice whether the function will be invoked asynchronously in a new thread or synchronously in a thread that calls wait or get on a corresponding std::future object can be deferred, however, then still a new thread must be created when asynchronous invocation is selected.

Is there a way to end idle threads in GNU OpenMP?

I use OpenMP for parallel sorting at start of my program. Once data is loaded and sorted, the program runs as a daemon and OpenMP is not used any more. Is there a way to turn off the idle threads created by OpenMP? omp_set_num_threads() doesn't affect the idle threads which have already been created for a task.

Please look up OMP_WAIT_POLICY, which is new in OpenMP 4 [https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html].
There are non-portable alternatives like GOMP_SPINCOUNT if your OpenMP implementation isn't recent enough. I recall from OpenMP specification discussions that at least Intel, IBM, Cray, and Oracle support their own implementation of this feature already.

I don't believe there is a way to trigger the threads' destruction. Modern OpenMP implementations tend to keep threads around in a pool to speed up starting future parallel sections.
In your case I would recommend a two program solution (one parallel to sort and one serial for the daemon). How you communicate the data between them is up to you. You could do something simple like writing it to a file and then reading it again. This may not be as slow as it sounds since a modern linux distribution might keep that file in memory in the file cache.
If you really want to be sure it stays in memory, you could launch the two processes simultaneously and allow them to share memory and allow the first parallel sort process to exit when it is done.

In theory, OpenMP has a implicit synchronization at the end of the "pragma" clauses. So, when the OpenMP parallel work ends, all the threads are deleted. You dont need to kill them or free them: OpenMP does that automatically.
Maybe "omp_get_num_threads()" is telling to you the actual configuration of the program, not the number of active threads. I mean: if you set the number of threads to 4, omp will tell you that the configuration is "4 threads", but this does not mean that there are actually 4 threads in process.

Could process running only on one processor have threads running on other processors?

Is it possible, in multiprocessor environment (PC) that one windows process is configured to run only on one processor (affinity mask = 1 or SetProcessAffinityMask(GetCurrentProcess(),1)), but its thread are spawned on other processors?
(Question came from discussion started in one company, regarding using synchronization objects (Events, Mutexes, Semaphores) and WinAPIs, like WaitForSignleObject, etc, especially SignalObjectAndWait for which MSDN states
"Note that the "signal" and "wait" are not guaranteed to be performed
as an atomic operation. Threads executing on other processors can
observe the signaled state of the first object before the thread
calling SignalObjectAndWait begins its wait on the second object"
Does it mean that for single processor it's guaranteed to be atomic?
P.S. Is there any differences for Windows Context Switching that there are multiple processors or single processor with more real cores?
P.P.S. Please be patient with this question if I didn't use exact and concrete terms - this are is still not very good known for me.

No.
The set of processor cores a thread can run on is the intersection of the process affinity mask and the thread affinity mask.
To get the behavior you describe, one would set the thread affinity mask for the main thread, and not mess with the process mask.
For your followup questions: If it isn't atomic, it isn't atomic. There are additional guarantees for threads sharing a core, because preemption follows certain rules, but they are very complex, since relative priority and dynamic priority are important factors in thread scheduling. Because of the complexity, it is best to use proper synchronization.
Notably, race conditions between threads of equal priority certainly still exist on a single core (or single core restricted) system, but they are far less frequent and therefore far more difficult to find and debug.

Is it possible, in multiprocessor environment (PC) that one windows process is configured to run only on one processor (affinity mask = 1 or SetProcessAffinityMask(GetCurrentProcess(),1)), but its thread are spawned on other processors?
If not set cpu affinity to only one core, one process could run on multiple cores?
What's the difference between processes and threads?
Thread could have processes or process could have threads?
Could process seen from a thread point of view or vice verse?
What is atomic notion?
when number 1 could seen as multidimensional unit?
Could we divide 1/0 (to zero)? When could we or couldn't?
Does it mean that for single processor it's guaranteed to be atomic?
One cpu: do you remember: run and stay resident? Good old time!
Then Unix: multiprocessing, multithreading, etc. :)
Note:
You couldn't ask a question without knowing answer to that question.
Try to ask something you don't know, that's impossible! You're asking because you have an answer. Look inside your question. Answer is evident. :)

OpenMP thread mapping to physical cores

So I've looked around online for some time to no avail. I'm new to using OpenMP and so not sure of the terminology here, but is there a way to figure out a specific machine's mapping from OMPThread (given by omp_get_thread_num();) and the physical cores on which the threads will run?
Also I was interested in how exactly OMP assigned threads, for example is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.

Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.

Most of the time Reed is correct, OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality we have found that there are many cases where having the threads assigned to cores increases the performance of OpenMP. Therefore if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to an OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec or not and if so how it could best be done.

WIN32: Yielding execution to another (given) thread

I am looking for a way to yield the remainder of the thread execution's scheduled time slice to a different thread. There is a SwitchToThread function in WINAPI, but it doesn't let the caller specify the thread it wants to switch to. I browsed MSDN for quite some time and haven't found anything that would offer just that.
For an operating-system-internals layman like me, it seems that yielding thread should be able to specify which thread does it want to pass the execution to. Is it possible or is it just my imagination?

The reason you can't yield processor time-slices to a designated thread is that Windows features a preemptive scheduling kernel which pretty much places the responsibility and authority of scheduling the processor time in the hands of the kernel and only the kernel.
As such threads don't have any control over when they run, if they run, and even less over which thread is switched to after their time slice is up.
However, there are a few way you may influence context switches:
by increasing the priority of a certain thread you may force the scheduler to schedule it more often in the detriment of other threads (obviously the reverse applies as well - you can lower the priority of other threads)
you can code your process to place threads in kernel wait mode when they don't have work to do in order to help the scheduler do it's job. When using proper kernel wait constructs such as Critical Sections, Mutexes, Semaphores, and Timers you effectively tell the kernel a certain thread doesn't need to be scheduled until a certain codition is met.
Note: There is rarely a reason you should tamper with task priorities so USE WITH CAUTION

You might use 'fibers' instead of 'threads': for example there's a Win32 API named SwitchToFiber which lets you specify the fiber to be scheduled.

Take a look at UMS (User-mode scheduling) threads in Windows 7
http://msdn.microsoft.com/en-us/library/dd627187(VS.85).aspx

The second thread can simply wait for the yielding thread either by calling WaitForSingleObject() on its handle or periodically polling GetExitCodeThread(). The other answers are correct about altering the operating system's scheduling mechanisms - it is better to design the threads properly in the first place.

This is not possible. Only the kernel can decide what code runs next though you can influence it by reducing the non-waiting threads it has to choose from to run next, and by setting thread priorities with SetThreadPriority.

You can use regular synchronization primitives like events, semaphores, etc. to serialize your two threads. This does not in any form prevent the kernel from scheduling other threads in between, or in parallel on another CPU core, or virtually simultaneously on the same core. This is due to preemtive multitasking nature of modern general purpose operating systems.

If you want to do your own scheduling under Windows, you can use fibers, which essentially are threads that you have to schedule yourself. However, given that you describe yourself as a layman to the OS internals world, that would probably be a bad idea, as fibers are something of an advanced feature.

Can I ask why you want to use SwitchToThread?
If for example it's some form of because thread x is computing some value that you want to wait for on thread Y, then I'd really suggest looking at the Parallel Pattern Library or the Asynchronous Agents Library in Visual Studio 2010 which allows you to do this either with message blocks (receive on an asynchronous value) or simply via tasks : wait for a set of tasks to complete and inline their execution while waiting...
//i.e. on an arbitrary thread
task_group* tasks;
tasks->run(... / some functor/)
a call to tasks->wait() will wait and inline any tasks running.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio