Failed thread creation when parallelizing a branching recursive subroutine in Fortran with OpenMP - performance

I am writing a recursive subroutine in Fortran that expands as a binary tree (i.e. the procedure calls itself twice until it reaches the end of a branch). The general algorithmic logic is:
subroutine my_subroutine(inputs, output)
   ! use input to generate possible new_input(:,1) and new_input(:,2)
   do i = 1, 2
      call my_subroutine(new_input(:,i), new_output(i))
   end do
   output = best(new_output(1), new_output(2))
end subroutine my_subroutine
In principle, this could be substantially accelerated through parallel computing; however, when I use OpenMP to parallelize the loop, running the resulting executable aborts with the error:
libgomp: Thread creation failed: Resource temporarily unavailable
I'm guessing that the stack size is too large, but I haven't found a resolution or workaround. Are there ways I can use parallel computing to improve the performance of this kind of algorithm?
- Does OpenMP or gfortran have options to help avoid these issues?
- Would it help to parallelize only above or below a certain level in the tree?
- Would C or C++ be a better option for this application?
I am working on macOS Catalina. Stack size is hard capped at 65532.
My environment variables are:
OMP_NESTED=True
OMP_DYNAMIC=True

That sounds more like your code is creating too many threads due to a very deep recursion. There are ways to mitigate it. For example, OpenMP has the concept of maximum active levels, controlled by the max-active-levels-var ICV (internal control variable). You can set its value either through the OMP_MAX_ACTIVE_LEVELS environment variable or by calling omp_set_max_active_levels(). Once the level of nesting reaches the value of max-active-levels-var, parallel regions nested any deeper are deactivated, i.e., they execute sequentially without spawning new threads.
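For instance, a minimal sketch of capping the nesting depth from inside the program (the value 3 is purely illustrative and needs tuning):

program cap_nesting
   use omp_lib
   implicit none

   ! Allow nested parallel regions, but deactivate any region nested
   ! more than three levels deep.
   call omp_set_nested(.true.)
   call omp_set_max_active_levels(3)

   ! ... call the recursive subroutine from here ...
end program cap_nesting

Setting OMP_MAX_ACTIVE_LEVELS=3 in the environment has the same effect as the omp_set_max_active_levels() call.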
If your compiler does not support it, or if you want your code to be backward compatible with older compilers, you can do it manually by tracking the level of nesting and deactivating the parallel region yourself. For the latter, there is the if(b) clause, which, when applied to the parallel region, makes it active only when b evaluates to .true.. A sample parallel implementation of your code:
subroutine my_subroutine(inputs, output, level)
   ! use input to generate possible new_input(:,1) and new_input(:,2)
   !$omp parallel do schedule(static,1) if(level < max_levels)
   do i = 1, 2
      call my_subroutine(new_input(:,i), new_output(i), level+1)
   end do
   !$omp end parallel do
   output = best(new_output(1), new_output(2))
end subroutine my_subroutine
The top-level call to my_subroutine has to be made with level equal to 0.
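A hypothetical driver for this variant might look like the following; max_levels is assumed to be visible inside my_subroutine (for instance through a module variable), and 4 is purely an illustrative value:

max_levels = 4                          ! cut-off depth; tune per system
call my_subroutine(inputs, output, 0)   ! recursion starts at level 0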
No matter how exactly you implement it, you'll need to experiment with the value of the maximum level. The optimal value will depend on the number of CPUs/cores and the arithmetic intensity of the code and will vary from system to system.
A better alternative to the parallel do construct would be to use OpenMP tasks, again, with a cut-off at a certain level of nesting. The good thing about tasks is that you can fix the number of OpenMP threads in advance and the tasking runtime will take care of workload distribution.
subroutine my_subroutine(inputs, output, level)
   ! use input to generate possible new_input(:,1) and new_input(:,2)
   !$omp taskloop shared(new_input, new_output) final(level >= max_levels)
   do i = 1, 2
      call my_subroutine(new_input(:,i), new_output(i), level+1)
   end do
   !$omp taskwait
   output = best(new_output(1), new_output(2))
end subroutine my_subroutine
Here, each iteration of the loop becomes a separate task. Once the nesting level reaches max_levels, the tasks become final, which means they will not be deferred (i.e., they will execute sequentially) and each nested task will be final too, effectively stopping parallel execution further down the recursion tree. Task loops are a convenience feature introduced in OpenMP 4.5. With earlier compilers, the following equivalent code will do:
subroutine my_subroutine(inputs, output, level)
   ! use input to generate possible new_input(:,1) and new_input(:,2)
   do i = 1, 2
      !$omp task shared(new_input, new_output) final(level >= max_levels)
      call my_subroutine(new_input(:,i), new_output(i), level+1)
      !$omp end task
   end do
   !$omp taskwait
   output = best(new_output(1), new_output(2))
end subroutine my_subroutine
There are no parallel constructs in the tasking code. Instead, you need to call my_subroutine from within a parallel region and the idiomatic way is to do it like this:
!$omp parallel
!$omp single
call my_subroutine(inputs, output, 0)
!$omp end single
!$omp end parallel
There is a fundamental difference between the nested parallel version and the one using tasks. In the former case, at each recursive level the current thread forks into two and each thread does one half of the computation in parallel. Limiting the level of active parallelism is needed here in order to prevent the runtime from spawning too many threads and exhausting the system's resources. In the latter case, at each recursive level two new tasks are created and deferred for later, possibly parallel, execution by the team of threads associated with the parallel region. The number of threads stays the same, and the cut-off here limits the build-up of tasking overhead, which is far smaller than the overhead of spawning nested parallel regions. Hence, the optimal value of max_levels for the tasking code will differ significantly from the optimal value for the nested parallel code.
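As a starting point for the nested-parallel variant (a heuristic of my own, not something prescribed by OpenMP), the cut-off can be derived from the available thread count, since a binary tree doubles the number of busy threads at every active level:

use omp_lib
integer :: max_levels

! With the if(level < max_levels) clause, up to 2**max_levels threads are
! busy at the deepest active level, so stop once that reaches the core count.
max_levels = ceiling(log(real(omp_get_max_threads())) / log(2.0))

For the tasking variant you can usually afford a noticeably deeper cut-off before the task-creation overhead starts to show.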

Related

OpenMP with new ecores/pcores

Is there any discussion of how OpenMP could work with the new 12th gen Intel E-cores and P-cores?
Is this going to be a nightmare for !$omp parallel do, where all threads are expected to have a similar workload?
I utilise !$OMP BARRIER to synchronise threads to assist in improving cache utilisation, but if E-cores are going to be slower (are they?) this approach would fail.
At present, using OpenMP with gfortran on Windows does not allow any core-locking management.
I am wondering what this new P-core/E-core approach could mean.
The OpenMP API does not have a specific feature to deal with this situation. If you know from your system that the P-cores are cores 0-7 and the E-cores are cores 8-15, then you can do the following to restrict your OpenMP threads to run only on the P-cores:
In the shell (bash-like):
export OMP_PLACES=0-7
export OMP_PROC_BIND=true
Then, in your code do something like this (actually no change :-)):
!$omp parallel do
do ...
...
end do
!$omp end parallel do
Or in C/C++ syntax:
#pragma omp parallel for
for(...) {...}
If you want to use all P- and E-cores with the same code, you will have to accept some sort of load imbalance, but you can still make good use of them.
In the shell (bash-like):
export OMP_PLACES=cores
export OMP_PROC_BIND=true
Then, in your Fortran code:
!$omp parallel do schedule(nonmonotonic:dynamic,chunksz)
do ...
...
end do
!$omp end parallel do
Or in C/C++ syntax:
#pragma omp parallel for schedule(nonmonotonic:dynamic,chunksz)
for(...) {...}
In that case, I would anticipate that a dynamic schedule with a chunk size of chunksz would be a good solution, so that the (faster) P-cores get more work than the E-cores.
If you use OpenMP tasks, then you might still want to pin the OpenMP threads to cores, but since OpenMP tasks are dynamically scheduled to idling OpenMP threads, you get automatic load balancing. As a rough rule of thumb you should make sure that you create 10x more tasks than you have OpenMP threads.
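As an illustration of that rule of thumb, a sketch using a task loop (do_work, n and ntasks are placeholders; use omp_lib is needed for omp_get_num_threads()):

! Create roughly ten tasks per thread, so idle threads (fast or slow
! cores alike) can always grab more work.
!$omp parallel
!$omp single
ntasks = 10 * omp_get_num_threads()
!$omp taskloop num_tasks(ntasks)
do i = 1, n
   call do_work(i)
end do
!$omp end taskloop
!$omp end single
!$omp end parallel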

What is the difference between a serial code and using the keyword critical while parallelizing a code in openmp?

If I have just one for loop to parallelize and if I use #pragma omp critical while parallelizing, will that make it equivalent to a serial code?
No.
The critical directive specifies that the code it covers is executed by one thread at a time, but it will (eventually) be executed by all threads that encounter it.
The single directive specifies that the code it covers will only be executed by one thread, but even that isn't exactly the same as compiling the code without OpenMP. OpenMP imposes some restrictions on what programming constructs can be used inside parallel regions (e.g. no jumping out of them). Furthermore, at run time you are likely to incur an overhead for firing up OpenMP even if you don't actually run any code in parallel.
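To make the distinction concrete, here is a sketch in Fortran (to match the rest of this page); counter is just an illustrative shared variable:

! critical: every thread runs the block, but one at a time.
!$omp parallel
!$omp critical
counter = counter + 1      ! incremented once per thread
!$omp end critical
!$omp end parallel

! single: exactly one thread runs the block; the others skip it
! and wait at the implicit barrier at the end of the construct.
!$omp parallel
!$omp single
counter = counter + 1      ! incremented exactly once
!$omp end single
!$omp end parallel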

OpenMP: scheduling performance

I'm working with OpenMP in Fortran and I have a question regarding scheduling: which of these two options will have better performance?
!$OMP PARALLEL DO PRIVATE(j) SCHEDULE(STATIC)
do j=1,l
   call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO

!$OMP PARALLEL DO PRIVATE(j)
do j=1,l
   call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO
There are three main schedule kinds defined by OpenMP: static, dynamic and guided.
Static: the iterations are split evenly between the threads before the loop starts executing;
Dynamic: chunks of iterations are handed out to threads at run time, as each thread finishes its previous chunk;
Guided: like dynamic, but the chunk size decreases with each successive allocation;
The default schedule is implementation defined (it is not specified in the standard). So, for your question, it may change nothing, depending on the compiler (if the default happens to be static). Here is what happens if it does change something:
Static scheduling is best for regular work, where each iteration of the loop takes about the same time. It minimises the overhead of distributing the work.
Dynamic scheduling is best for irregular work, where iterations may have very different execution times. It helps because one thread can process many small chunks while another is busy with a few long ones.
Guided scheduling improves the overall load balance by shrinking the chunk size as the loop progresses: handing out small chunks towards the end reduces the spread in finish times among the threads.
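As a sketch of how you would pick between them (uniform_work and irregular_work are hypothetical stand-ins for the loop body):

! Iterations of roughly equal cost: static keeps scheduling overhead low.
!$omp parallel do schedule(static)
do j = 1, l
   call uniform_work(j)
end do
!$omp end parallel do

! Iterations of widely varying cost: dynamic rebalances at run time.
!$omp parallel do schedule(dynamic, 4)
do j = 1, l
   call irregular_work(j)
end do
!$omp end parallel do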

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
      Do 500, k=1,n
         Do 10, i=-(b-1),b-1
            Do 20, j=-(b-1),b-1
               if (abs(i).le.l.and.abs(j).eq.d) then
                  cycle
               endif
               v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
               if (k.eq.n-1) then
                  vtest(i,j,1)=v(i,j)
               endif
               if (k.eq.n) then
                  vtest(i,j,2)=v(i,j)
               endif
 20         continue
 10      continue
 500  continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well, the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
   ....
end do
ensure that each thread gets a different set of values for i, which means that the threads write to different elements of v, removing (almost) the data race. As you've written the program, each thread gets a different set of values of k, but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program; why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop?
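A sketch of that rearrangement, reusing the variable names from the question (still serial, and still carrying the data race discussed above):

! Hot loop: no per-iteration tests.
do k = 1, n-2
   do i = -(b-1), b-1
      do j = -(b-1), b-1
         if (abs(i) <= l .and. abs(j) == d) cycle
         v(i,j) = 0.25*(v(i+1,j) + v(i-1,j) + v(i,j+1) + v(i,j-1))
      end do
   end do
end do

! Last two sweeps also record their results in vtest.
do k = n-1, n
   do i = -(b-1), b-1
      do j = -(b-1), b-1
         if (abs(i) <= l .and. abs(j) == d) cycle
         v(i,j) = 0.25*(v(i+1,j) + v(i-1,j) + v(i,j+1) + v(i,j-1))
         vtest(i,j,k-n+2) = v(i,j)   ! k = n-1 fills slot 1, k = n fills slot 2
      end do
   end do
end do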
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g. private or shared) of each variable, as in the sketch below; but this answer is getting a bit too long and I won't go into more detail on these. Or on using a schedule clause.
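For concreteness, a sketch of the separated version with explicit data-sharing clauses; the clause choices are my reading of the code above (b, l, d and n treated as shared, read-only values), not something the original poster specified:

!$omp parallel default(none) private(i, j, k) shared(v, vtest, b, l, d, n)
do k = 1, n
   !$omp do
   do i = -(b-1), b-1
      do j = -(b-1), b-1
         if (abs(i) <= l .and. abs(j) == d) cycle
         v(i,j) = 0.25*(v(i+1,j) + v(i-1,j) + v(i,j+1) + v(i,j-1))
         if (k == n-1) vtest(i,j,1) = v(i,j)
         if (k == n)   vtest(i,j,2) = v(i,j)
      end do
   end do
   !$omp end do
end do
!$omp end parallel

Each thread runs the whole k loop itself, but the implicit barrier at !$omp end do keeps the team in step on k.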
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

Splitting LAPACK calls with OpenMP

I am working on code-tuning a routine I have written, and one part of it performs two matrix multiplications which could be done simultaneously. Currently, I call DGEMM (from Intel's MKL library) for the first multiplication and then for the second. Each call uses all 12 cores on my machine. I want it to perform both DGEMM calls at the same time, using 6 cores each. I get the feeling that this is a simple matter, but I have not been able to find/understand how to achieve it. The main problem I have is that OpenMP must call the DGEMM routine from one thread, but each call must be able to use 6 threads. Which directive would work best for this situation? Would it require nested pragmas?
So, as a more general note, how can I divide the (in my case 12) cores into sets, and then run a routine from one thread of each set that uses all the threads in that set?
Thanks!
The closest thing that you can do is to have an OpenMP parallel region executing with a team of two threads and then call MKL from each thread. You have to enable nested parallelism in MKL (by disabling dynamic threads), fix the number of MKL threads to 6, and use Intel's compiler suite to compile your code. MKL itself is threaded using OpenMP, but it is Intel's OpenMP runtime. If you happen to use another compiler, e.g. GCC, its OpenMP runtime might prove incompatible with Intel's.
As you haven't specified the language, I provide two examples - one in Fortran and one in C/C++:
Fortran:
call mkl_set_num_threads(6)
call mkl_set_dynamic(0)
!$omp parallel sections num_threads(2)
!$omp section
call dgemm(...)
!$omp end section
!$omp section
call dgemm(...)
!$omp end section
!$omp end parallel sections
C/C++:
mkl_set_num_threads(6);
mkl_set_dynamic(0);
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
{
cblas_dgemm(...)
}
#pragma omp section
{
cblas_dgemm(...)
}
}
In general you cannot create subsets of threads for MKL (at least given my current understanding). Each DGEMM call would use the globally specified number of MKL threads. Note that MKL operations might tune for the cache size of the CPU, and performing two matrix multiplications in parallel might not be beneficial. You might benefit if you have a NUMA system with two hexacore CPUs, each with its own memory controller (which I suspect is your case), but then you have to take care of where data is placed and also enable binding (pinning) of threads to cores.
