More OpenMP threads running than set via omp_set_num_threads()?

I have a program that loops through a sequence of multiple for-loops (with other things in between). Some of the for-loops are parallelized using "#pragma omp parallel for". The number of threads is set via omp_set_num_threads() at the beginning.
It appears, however, that some of those for-loops make OpenMP start a second team of threads, and the process ends up with twice as many OpenMP threads as set with omp_set_num_threads().
What could cause this?

Related

Failed thread creation when parallelizing a branching recursive subroutine in Fortran with OpenMP

I am writing a recursive subroutine in Fortran that expands as a binary tree (i.e. the procedure calls itself twice until it reaches the end of a branch). The general algorithmic logic is:
'''
call my_subroutine(inputs, output)
  use input to generate possible new_input(:,1) and new_input(:,2)
  do i=1,2
    call my_subroutine(new_input(:,i), new_output(i))
  enddo
  output = best(new_output(1), new_output(2))
'''
In principle, this could be substantially accelerated through parallel computing, however when I use OpenMP to parallelize the loop, running the resulting executable aborts with the error:
libgomp: Thread creation failed: Resource temporarily unavailable
I'm guessing that the stack size is too large, but I haven't found a resolution or workaround. Are there ways I can use parallel computing to improve the performance of this kind of algorithm?
- Does OpenMP or gfortran have options to help avoid these issues?
- Would it help to parallelize only above or below a certain level in the tree?
- Would C or C++ be a better option for this application?
I am working on macOS Catalina. Stack size is hard capped at 65532.
My environment variables are:
OMP_NESTED=True
OMP_DYNAMIC=True
That sounds more like your code is creating too many threads due to very deep recursion. There are ways to mitigate this. For example, OpenMP 3.0 introduced the concept of maximum active levels, controlled by the max-active-levels-var ICV (internal control variable). You may set its value either by setting the OMP_MAX_ACTIVE_LEVELS environment variable or by calling omp_set_max_active_levels(). Once the level of nesting reaches the value of max-active-levels-var, parallel regions nested any deeper are deactivated, i.e., they execute sequentially without spawning new threads.
If your compiler does not support OpenMP 3.0, or if you want your code to be backward compatible with older compilers, you can do it manually by tracking the level of nesting and deactivating the parallel region yourself. For the latter, there is the if(b) clause, which, when applied to a parallel region, makes it active only when b evaluates to .true.. A sample parallel implementation of your code:
subroutine my_subroutine(inputs, output, level)
  use input to generate possible new_input(:,1) and new_input(:,2)
  !$omp parallel do schedule(static,1) if(level < max_levels)
  do i=1,2
    call my_subroutine(new_input(:,i), new_output(i), level+1)
  enddo
  !$omp end parallel do
  output = best(new_output(1), new_output(2))
end subroutine my_subroutine
The top level call to my_subroutine has to be with a level equal to 0.
No matter how exactly you implement it, you'll need to experiment with the value of the maximum level. The optimal value will depend on the number of CPUs/cores and the arithmetic intensity of the code and will vary from system to system.
A better alternative to the parallel do construct would be to use OpenMP tasks, again, with a cut-off at a certain level of nesting. The good thing about tasks is that you can fix the number of OpenMP threads in advance and the tasking runtime will take care of workload distribution.
subroutine my_subroutine(inputs, output, level)
  use input to generate possible new_input(:,1) and new_input(:,2)
  !$omp taskloop shared(new_input, new_output) final(level >= max_levels)
  do i=1,2
    call my_subroutine(new_input(:,i), new_output(i), level+1)
  end do
  !$omp end taskloop
  output = best(new_output(1), new_output(2))
end subroutine my_subroutine
Here, each iteration of the loop becomes a separate task. Once max_levels of nesting has been reached, the tasks become final, which means they will not be deferred (i.e., they will execute sequentially), and each nested task will be final too, effectively stopping parallel execution further down the recursion tree. Task loops are a convenience feature introduced in OpenMP 4.5. With earlier compilers, the following equivalent code will do:
subroutine my_subroutine(inputs, output, level)
  use input to generate possible new_input(:,1) and new_input(:,2)
  do i=1,2
    !$omp task shared(new_input, new_output) final(level >= max_levels)
    call my_subroutine(new_input(:,i), new_output(i), level+1)
    !$omp end task
  end do
  !$omp taskwait
  output = best(new_output(1), new_output(2))
end subroutine my_subroutine
There are no parallel constructs in the tasking code. Instead, you need to call my_subroutine from within a parallel region and the idiomatic way is to do it like this:
!$omp parallel
!$omp single
call my_subroutine(inputs, output, 0)
!$omp end single
!$omp end parallel
There is a fundamental difference between the nested parallel version and the one using tasks. In the former case, at each recursive level the current thread forks in two and each thread does one half of the computation in parallel. Limiting the level of active parallelism is needed here in order to prevent the runtime from spawning too many threads and exhausting the system resources. In the latter case, at each recursive level two new tasks are created and deferred for later, possibly parallel execution by the team of threads associated with the parallel region. The number of threads stays the same and the cut-off here limits the build-up of tasking overhead, which is way smaller than the overhead of spawning new parallel regions. Hence, the optimal value of max_levels for the tasking code will differ significantly from the optimal value for the nested parallel code.

OpenMP using lock

Why did the program not speed up? Why is it slower than the sequential version?
Would it be faster if I changed the lock to an OpenMP reduction?
The OpenMP code computes an average value, avgvalue.
You have multiple threads executing a single critical section. That is essentially as slow as serial code, since only one thread can be inside the critical section at a time, and on top of that you pay the overhead of creating the threads and of having them wait on one another before each can enter.
I think that a reduction would be faster, since OpenMP implementations optimize that construct.

What is the difference between a serial code and using the keyword critical while parallelizing a code in openmp?

If I have just one for loop to parallelize and if I use #pragma omp critical while parallelizing, will that make it equivalent to a serial code?
No.
The critical directive specifies that the code it covers is executed by one thread at a time, but it will (eventually) be executed by all threads that encounter it.
The single directive specifies that the code it covers will only be executed by one thread, but even that isn't exactly the same as compiling the code without OpenMP. OpenMP imposes some restrictions on what programming constructs can be used inside parallel regions (e.g., no jumping out of them). Furthermore, at run time you are likely to incur an overhead for firing up OpenMP even if you don't actually run any code in parallel.

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
  Do 10, i=-(b-1),b-1
    Do 20, j=-(b-1),b-1
      if (abs(i).le.l.and.abs(j).eq.d) then
        cycle
      endif
      v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
      if (k.eq.n-1) then
        vtest(i,j,1)=v(i,j)
      endif
      if (k.eq.n) then
        vtest(i,j,2)=v(i,j)
      endif
20  continue
10 continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i which means that they write to different elements of v, removing (almost) the data race. As you've written the program each thread gets a different set of values of k but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration is like tying an anchor to your program. Why not just write
do k=1,n-2
and deal with the updates to vtest after the loop?
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g., private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these, or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

OpenMP fork-join model

I am parallelizing several separate for-loops using OpenMP. While debugging in gdb, I found that multiple threads are created when execution reaches the first parallel region, and that they exit only at the end of the whole program. This is contrary to what I thought about the fork-join model of OpenMP, where the threads should join the master thread and terminate at the end of each parallel region rather than at the end of the whole program.
Am I wrong?
Thanks!
It is implementation-specific, but the runtime most likely keeps the worker threads in a thread pool: at the end of a parallel region they go idle rather than being destroyed, so they remain visible in the debugger.
