Is there a way to end idle threads in GNU OpenMP? - openmp

I use OpenMP for parallel sorting at start of my program. Once data is loaded and sorted, the program runs as a daemon and OpenMP is not used any more. Is there a way to turn off the idle threads created by OpenMP? omp_set_num_threads() doesn't affect the idle threads which have already been created for a task.

Please look up OMP_WAIT_POLICY, which was introduced in OpenMP 3.0 [https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html].
There are non-portable alternatives like GOMP_SPINCOUNT if your OpenMP implementation isn't recent enough. I recall from OpenMP specification discussions that at least Intel, IBM, Cray, and Oracle already support their own implementations of this feature.
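As a rough sketch of how this plays out in practice (the daemon loop below is just a stand-in for your own code): launch the program with OMP_WAIT_POLICY=passive set in the environment, so the pool threads block in the OS rather than spin once the parallel phase ends:

#include <omp.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Parallel startup phase; the OpenMP runtime creates its thread pool here.
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("sorting with %d threads\n", omp_get_num_threads());
        // ... the parallel sort of the loaded data would go here ...
    }

    // Daemon phase: with OMP_WAIT_POLICY=passive (or GOMP_SPINCOUNT=0 on
    // older libgomp) the idle pool threads sleep in the kernel instead of
    // busy-waiting, so they cost essentially no CPU time from here on.
    for (;;)
        sleep(1); // stand-in for the daemon's real event loop
}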

I don't believe there is a way to trigger the threads' destruction. Modern OpenMP implementations tend to keep threads around in a pool to speed up starting future parallel sections.
In your case I would recommend a two-program solution (one parallel program to sort and one serial program for the daemon). How you communicate the data between them is up to you. You could do something simple like writing it to a file and reading it back. This may not be as slow as it sounds, since a modern Linux system will likely keep that file in the page cache.
If you really want to be sure the data stays in memory, you could launch the two processes simultaneously, have them communicate through shared memory, and let the parallel sort process exit once it is done.
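For instance, a minimal POSIX shared-memory sketch of the sorter side, assuming a made-up region name "/sorted_data" (the daemon would shm_open the same name and mmap it read-only; link with -lrt on older glibc):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main() {
    const size_t len = 1 << 20;  // size of the sorted data (example value)
    int fd = shm_open("/sorted_data", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    if (ftruncate(fd, len) != 0) return 1;  // reserve the region
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    std::memcpy(p, "sorted...", 10);        // stand-in for the real data
    munmap(p, len);
    close(fd);  // the region persists until some process calls shm_unlink
}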

In theory, OpenMP has an implicit barrier at the end of its parallel regions. So when the OpenMP parallel work ends, all the threads are released; you don't need to kill them or free them, OpenMP does that automatically.
Maybe "omp_get_num_threads()" is telling you the program's configured thread count, not the number of currently active threads. I mean: if you set the number of threads to 4, OpenMP will report a configuration of "4 threads", but that does not mean 4 threads are actually running.

Related

Threadpool - CPU usage?

I am working on a Windows C++ application. We use the Boost library. I have an operation in my application that can be parallelized to run on multiple threads. The number of threads depends each time on the operation's parameters and can be large (say 50 or 70). I don't want to spawn the maximum number of threads I can, since that risks making the application unresponsive to other operations (all the processors could be occupied doing this). How can I make sure I don't create the situation I described? Would a thread pool help, and if so, how?
70 threads on modern hardware can be handled easily, without any noticeable impact on system performance. Thread creation time, memory usage, scheduling, and context-switch overhead can be a problem, but we don't know whether they are in your particular case.
If creating 70 threads is not an option, consider using OpenMP (supported by all major compilers) as it's a very simple and often very efficient solution:
#pragma omp parallel for
for (int i = 1; i < 100; ++i)
{
    do_task(i);
}
It uses a thread pool under the hood.
If OpenMP is not acceptable for some reason, you can go with an explicit thread pool. It can be a "home-made" thread pool (not recommended), one from @sehe's answer, one provided by the OS (as @Hans Passant mentioned in his comment), or one from a third-party library (e.g. Intel Threading Building Blocks).
Yes, a thread pool can help with responsiveness, though a typical thread pool implementation by default creates as many threads as there are logical CPU cores. This means all your cores can be busy doing your work, and that's not necessarily a problem: Windows uses preemptive multithreading, so it can handle many more threads than CPUs and still remain responsive.
A thread pool can help because it stops you from simultaneously executing more tasks than you have logical CPU cores. It can also be more efficient thanks to better cache use and fewer context switches, or because the same threads can be reused to execute your operation multiple times. To know for sure, profile your performance.
Just create a thread pool, e.g. the one I posted here: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
Two more flavours here: c++ work queues with blocking (one using Asio, one using just C++11)
You can use std::async with the default launch policy. However, this is not the same as a thread pool.
In OpenMP, you can set a fixed number of threads and then use OpenMP tasks. Unfortunately, there is no such option in C++11. The Standard says the choice of whether the function will be invoked asynchronously in a new thread, or synchronously in the thread that calls wait or get on the corresponding std::future object, can be deferred; even so, a new thread must still be created whenever asynchronous invocation is selected.
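For illustration, a minimal std::async sketch with the default launch policy (heavy_work is a made-up stand-in for a real computation):

#include <future>
#include <cstdio>

int heavy_work(int x) { return x * x; }  // stand-in for a real computation

int main() {
    // Default policy is std::launch::async | std::launch::deferred: the
    // implementation may run heavy_work on a new thread, or defer it and
    // run it synchronously when get() is called -- there is no pooling
    // guarantee in the Standard.
    std::future<int> f = std::async(heavy_work, 42);
    std::printf("%d\n", f.get());  // blocks until the result is ready
}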

Do Ruby threads run on multiple cores?

I've read that Ruby code (CRuby/YARV) only "runs" on a single processor core, but something is not clear yet:
I understand that the GIL prevents threads from running concurrently and that in recent Ruby versions threads are scheduled by the operating system.
Couldn't one thread possibly be "placed" on core 1 and another on core 2, even if they're not actually running at the same time?
I'm just trying to understand whether the OS scheduler actually puts all Ruby threads on a single core. Thanks!
Edit: Another answer mentions that C++ uses pthreads, that those are scheduled across cores, and that Ruby uses the same. I guess that's what I was looking for, but since most answers seem to equate "not running threads in parallel" with "never running on multiple cores", I just wanted to confirm.
First off, we have to clearly distinguish between "Ruby Threads" and "Ruby Threads as implemented by YARV". Ruby Threads make no guarantees about how they are scheduled. They might be scheduled concurrently, they might not. They might be scheduled on multiple CPUs, they might not. They might be implemented as native platform threads, they might be implemented as green threads, they might be implemented as something else.
YARV implements Ruby Threads as native platform threads (e.g. pthreads on POSIX and Windows threads on Windows). However, unlike other Ruby implementations which use native platform threads (e.g. JRuby, IronRuby, Rubinius), YARV has a Giant VM Lock (GVL) which prevents two threads from entering the YARV bytecode interpreter at the same time. This makes it effectively impossible to run Ruby code on multiple threads at the same time.
Note, however, that the GVL only protects the YARV interpreter and runtime. This means that, for example, multiple threads can execute C code at the same time, and at the same time as another thread executing Ruby code. It just means that no two threads can execute Ruby code at the same time on YARV.
Note also that in recent versions of YARV, the "Giant" VM Lock is becoming ever smaller. Sections of code are being moved out from under the lock, and the lock itself is being broken down into smaller, more fine-grained locks. That is a very long process, but it means that in the future more and more Ruby code will be able to run in parallel on YARV.
But all of this has nothing to do with how the platform schedules the threads. Many platforms have some sort of heuristics for thread affinity to CPU cores: e.g. they may try to schedule the same thread on the same core, under the assumption that its working set is still in that core's cache, or they may try to identify threads that operate on shared data and schedule them on the same CPU, and so on. Therefore, it is hard, if not impossible, to predict how and where a thread will be scheduled.
Many platforms also provide a way to influence this CPU affinity; e.g. on Linux and Windows, you can restrict a thread to one specific core or a set of specific cores. However, YARV does not do that by default. (In fact, on some platforms influencing CPU affinity requires elevated privileges, which would mean YARV would have to run with elevated privileges, and that is not a good idea.)
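For reference, a Linux-specific sketch of pinning the calling thread to core 0 with glibc's pthread_setaffinity_np (this is exactly the kind of thing YARV deliberately does not do by default):

#include <pthread.h>  // pthread_setaffinity_np needs _GNU_SOURCE; g++ defines it
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  // allow this thread to run on core 0 only
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    else
        std::printf("now pinned to core %d\n", sched_getcpu());
}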
So, in short: yes, depending on the platform, the hardware, and the environment, YARV threads may and probably will be scheduled on different cores. But, they won't be able to take advantage of that fact, i.e. they won't be able to run faster than on a single core (at least when running Ruby code).

MPI shared memory access

In a parallel MPI program on, for example, 100 processors:
Suppose there is a global counter that should be known by all MPI processes; each of them can add to this number, and the others should see the change instantly and add to the changed value.
Synchronization is not feasible and would introduce a lot of latency.
Would it be OK to open a shared memory region among all the processes and use it to read and change this number?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that, or is this not a good solution?
Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I say you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want to have 100 processes incrementing a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle this problem, it is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time that does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator making the call are in shared memory. Given that you have the hardware to support 100 processes in the same shared memory, you still have to write code to ensure synchronisation; MPI won't do it for you. If you do have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or in any implementation I am familiar with, that will prevent a data race on that counter.
Even shared-memory parallel programs (as opposed to MPI providing shared-memory-like parallel programs) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter, but a better approach would be to rethink your program's structure to avoid overly tight synchronisation between processes.
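For what it's worth, here is a minimal sketch (not a recommendation) of what synchronised access to such a counter can look like with MPI RMA: MPI_Fetch_and_op with MPI_SUM performs an atomic fetch-and-increment on a counter hosted by rank 0. Note that this is still synchronisation, and there is no "instant" propagation to the other ranks:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long counter = 0;  // rank 0 hosts the counter; other ranks expose nothing
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &counter : nullptr,
                   rank == 0 ? sizeof(long) : 0,
                   sizeof(long), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    long one = 1, old = 0;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    // Atomically add 1 to the counter on rank 0, returning the old value.
    MPI_Fetch_and_op(&one, &old, MPI_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    std::printf("rank %d saw counter value %ld\n", rank, old);

    MPI_Win_free(&win);
    MPI_Finalize();
}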

OpenMP thread mapping to physical cores

So I've looked around online for some time to no avail. I'm new to using OpenMP, so I'm not sure of the terminology here, but is there a way to figure out a specific machine's mapping between OpenMP threads (as given by omp_get_thread_num()) and the physical cores on which the threads will run?
I was also interested in how exactly OpenMP assigns threads: for example, is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
Most of the time Reed is correct: OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality, we have found many cases where binding threads to cores increases the performance of OpenMP. Therefore, if you look at most OpenMP implementations, you will find there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to each OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec, and if so, how it could best be done.
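As a concrete (Linux-specific) sketch: you can observe the current thread-to-core placement with sched_getcpu(), and then experiment with your implementation's binding variable, e.g. GOMP_CPU_AFFINITY for GCC's libgomp or KMP_AFFINITY for Intel's compiler (later OpenMP versions standardised OMP_PROC_BIND for this):

#include <omp.h>
#include <sched.h>  // sched_getcpu(), Linux-specific
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // Without binding, this placement can differ between runs,
        // and the OS may even migrate a thread mid-run.
        std::printf("OpenMP thread %d is currently on core %d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
}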

WIN32: Yielding execution to another (given) thread

I am looking for a way to yield the remainder of the thread execution's scheduled time slice to a different thread. There is a SwitchToThread function in WINAPI, but it doesn't let the caller specify the thread it wants to switch to. I browsed MSDN for quite some time and haven't found anything that would offer just that.
For an operating-system-internals layman like me, it seems that the yielding thread should be able to specify which thread it wants to pass execution to. Is that possible, or is it just my imagination?
The reason you can't yield processor time slices to a designated thread is that Windows features a preemptive scheduling kernel, which places the responsibility and authority for scheduling processor time in the hands of the kernel and only the kernel.
As such, threads don't have any control over when they run, whether they run, and even less over which thread is switched to after their time slice is up.
However, there are a few ways you may influence context switches:
by increasing the priority of a certain thread, you may force the scheduler to schedule it more often, to the detriment of other threads (obviously the reverse applies as well: you can lower the priority of other threads)
you can code your process to place threads in a kernel wait state when they have no work to do, in order to help the scheduler do its job. When using proper kernel wait constructs such as critical sections, mutexes, semaphores, and timers, you effectively tell the kernel that a certain thread doesn't need to be scheduled until a certain condition is met (see the sketch below).
Note: there is rarely a reason you should tamper with thread priorities, so USE WITH CAUTION
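A small sketch of both points under Win32 (the event name and the work are made up): the worker parks itself in WaitForSingleObject, so the kernel does not schedule it at all until another thread signals the event:

#include <windows.h>

HANDLE work_ready;  // auto-reset event (hypothetical name)

DWORD WINAPI worker(LPVOID) {
    // Kernel wait: this thread is not scheduled at all until the event
    // is signalled, so it consumes no CPU while idle.
    WaitForSingleObject(work_ready, INFINITE);
    // ... do one unit of work here ...
    return 0;
}

int main() {
    work_ready = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    HANDLE t = CreateThread(nullptr, 0, worker, nullptr, 0, nullptr);
    SetThreadPriority(t, THREAD_PRIORITY_BELOW_NORMAL); // priority tweak; use with caution
    SetEvent(work_ready);              // condition met: wake the worker
    WaitForSingleObject(t, INFINITE);  // wait for the worker to finish
    CloseHandle(t);
    CloseHandle(work_ready);
}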
You might use 'fibers' instead of 'threads': for example there's a Win32 API named SwitchToFiber which lets you specify the fiber to be scheduled.
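A minimal sketch of that approach: fibers are scheduled by your code rather than by the kernel, so the switch below happens exactly when and where you ask for it:

#include <windows.h>
#include <cstdio>

LPVOID main_fiber;  // set before any fiber switching happens

VOID CALLBACK worker(LPVOID) {
    std::printf("in the worker fiber\n");
    SwitchToFiber(main_fiber);  // hand control back explicitly
}

int main() {
    // The current thread must become a fiber before it can switch.
    main_fiber = ConvertThreadToFiber(nullptr);
    LPVOID w = CreateFiber(0, worker, nullptr);
    SwitchToFiber(w);           // run the worker fiber right now
    std::printf("back in the main fiber\n");
    DeleteFiber(w);
}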
Take a look at UMS (User-mode scheduling) threads in Windows 7
http://msdn.microsoft.com/en-us/library/dd627187(VS.85).aspx
The second thread can simply wait for the yielding thread, either by calling WaitForSingleObject() on its handle or by periodically polling GetExitCodeThread(). The other answers are correct about altering the operating system's scheduling mechanisms: it is better to design the threads properly in the first place.
This is not possible. Only the kernel can decide what code runs next, though you can influence it by reducing the set of non-waiting threads it has to choose from, and by setting thread priorities with SetThreadPriority.
You can use regular synchronization primitives like events, semaphores, etc. to serialize your two threads. This does not in any way prevent the kernel from scheduling other threads in between, in parallel on another CPU core, or virtually simultaneously on the same core. This is due to the preemptive multitasking nature of modern general-purpose operating systems.
If you want to do your own scheduling under Windows, you can use fibers, which essentially are threads that you have to schedule yourself. However, given that you describe yourself as a layman when it comes to OS internals, that would probably be a bad idea, as fibers are something of an advanced feature.
Can I ask why you want to use SwitchToThread?
If, for example, it's because thread X is computing some value that you want to wait for on thread Y, then I'd really suggest looking at the Parallel Patterns Library or the Asynchronous Agents Library in Visual Studio 2010, which let you do this either with message blocks (receive an asynchronous value) or simply via tasks: wait for a set of tasks to complete and inline their execution while waiting...
#include <ppl.h>  // Parallel Patterns Library, Visual Studio 2010+

// i.e. on an arbitrary thread
Concurrency::task_group tasks;
tasks.run([] { /* some functor */ });

A call to tasks.wait() will block and inline any tasks that are still running.
