I am writing a recursive subroutine in Fortran that expands as a binary tree (i.e. the procedure calls itself twice until it reaches the end of a branch). The general algorithmic logic is:
'''
call my_subroutine(inputs, output)
use input to generate possible new_input(:,1) and new_input(:,2)
do i=1,2
call my_subroutine(new_input(:,i), new_output(i))
enddo
output = best(new_output(1), new_output(2))
'''
In principle, this could be substantially accelerated through parallel computing. However, when I use OpenMP to parallelize the loop, running the resulting executable aborts with the error:
libgomp: Thread creation failed: Resource temporarily unavailable Thread creation failed: Resource temporarily unavailable
I'm guessing that the stack size is too large, but I haven't found a resolution or workaround. Are there ways I can use parallel computing to improve the performance of this kind of algorithm?
- Does OpenMP or gfortran have options to help avoid these issues?
- Would it help to parallelize only above or below a certain level in the tree?
- Would C or C++ be a better option for this application?
I am working on macOS Catalina. The stack size is hard-capped at 65532.
My environment variables are:
OMP_NESTED=True
OMP_DYNAMIC=True
That sounds more like your code is creating too many threads due to very deep recursion. There are ways to mitigate it. For example, OpenMP has the concept of maximum active levels, controlled by the max-active-levels-var ICV (internal control variable). You can set its value either through the OMP_MAX_ACTIVE_LEVELS environment variable or by calling omp_set_max_active_levels(). Once the level of nesting reaches the value of max-active-levels-var, parallel regions nested any deeper are deactivated, i.e., they execute sequentially without spawning new threads.
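For example, with a cut-off of 4 active levels (an arbitrary value that you will have to tune), you can either set OMP_MAX_ACTIVE_LEVELS=4 in the environment or set the ICV at run time. A minimal sketch of the latter:
program set_levels
  use omp_lib
  implicit none
  ! Parallel regions nested deeper than 4 active levels will run sequentially.
  call omp_set_max_active_levels(4)
  ! ... start the recursive computation here ...
end program set_levels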
If your OpenMP runtime does not provide the max-active-levels control, or if you want your code to be backward compatible with older compilers, you can implement the cut-off manually by tracking the level of nesting and deactivating the parallel region yourself. For the latter, there is the if(b) clause, which, when applied to a parallel region, makes it active only when b evaluates to .true.. A sample parallel implementation of your code:
recursive subroutine my_subroutine(inputs, output, level)
use input to generate possible new_input(:,1) and new_input(:,2)
!$omp parallel do schedule(static,1) if(level<max_levels)
do i=1,2
call my_subroutine(new_input(:,i), new_output(i), level+1)
enddo
!$omp end parallel do
output = best(new_output(1), new_output(2))
end subroutine my_subroutine
The top-level call to my_subroutine has to be made with level equal to 0.
No matter how exactly you implement it, you'll need to experiment with the value of the maximum level. The optimal value will depend on the number of CPUs/cores and the arithmetic intensity of the code and will vary from system to system.
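For instance, a sketch of how the top-level call for the manual cut-off variant might look (max_levels is assumed to be made visible to my_subroutine, e.g. as a module variable, and the value 4 is only a placeholder to tune):
max_levels = 4                          ! cut-off level, to be tuned per system
call omp_set_nested(.true.)             ! nesting must be enabled (your OMP_NESTED=True does the same)
call my_subroutine(inputs, output, 0)   ! recursion starts at level 0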
A better alternative to the parallel do construct would be to use OpenMP tasks, again, with a cut-off at a certain level of nesting. The good thing about tasks is that you can fix the number of OpenMP threads in advance and the tasking runtime will take care of workload distribution.
recursive subroutine my_subroutine(inputs, output, level)
use input to generate possible new_input(:,1) and new_input(:,2)
!$omp taskloop shared(new_input, new_output) final(level>=max_levels)
do i=1,2
call my_subroutine(new_input(:,i), new_output(i), level+1)
end do
!$omp taskwait
output = best(new_output(1), new_output(2))
end subroutine my_subroutine
Here, each iteration of the loop becomes a separate task. Once max_levels of nesting has been reached, the tasks become final, which means they will not be deferred (i.e., they will execute sequentially) and each nested task will be final too, effectively stopping parallel execution further down the recursion tree. Task loops are a convenience feature introduced in OpenMP 4.5. With earlier compilers, the following equivalent code will do:
recursive subroutine my_subroutine(inputs, output, level)
use input to generate possible new_input(:,1) and new_input(:,2)
do i=1,2
!$omp task shared(new_input, new_output) final(level>=max_levels)
call my_subroutine(new_input(:,i), new_output(i), level+1)
!$omp end task
end do
!$omp taskwait
output = best(new_output(1), new_output(2))
end subroutine my_subroutine
There are no parallel constructs in the tasking code. Instead, you need to call my_subroutine from within a parallel region and the idiomatic way is to do it like this:
!$omp parallel
!$omp single
call my_subroutine(inputs, output, 0)
!$omp end single
!$omp end parallel
There is a fundamental difference between the nested parallel version and the one using tasks. In the former case, at each recursive level the current thread forks in two and each thread does one half of the computation in parallel. Limiting the level of active parallelism is needed here in order to prevent the runtime from spawning too many threads and exhausting the system resources. In the latter case, at each recursive level two new tasks are created and deferred for later, possibly parallel execution by the team of threads associated with the parallel region. The number of threads stays the same and the cut-off here limits the build-up of tasking overhead, which is way smaller than the overhead of spawning new parallel regions. Hence, the optimal value of max_levels for the tasking code will differ significantly from the optimal value for the nested parallel code.
Why did my program not speed up, and instead become slower than the sequential version?
Will it be faster if I change the lock to an OpenMP reduction?
The OpenMP code in question computes avgvalue.
You have multiple threads all funnelling through a single critical section. That is basically no faster than serial code, since only one thread can execute the protected statement at a time, and you also add the overhead of creating the threads and of having them wait on one another before they can proceed.
Using a reduction should indeed be faster, since OpenMP implementations optimize it: each thread accumulates into a private partial result, and the partial results are combined only once at the end.
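To illustrate the difference, here is a generic sketch (not your code, which is not shown here) that computes avgvalue first with a critical section and then with a reduction:
program avg_demo
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: x(:)
  real(8) :: total, avgvalue
  integer :: i
  allocate(x(n))
  call random_number(x)

  ! Version 1: critical section -- correct, but effectively serial,
  ! because only one thread at a time may update total.
  total = 0.0d0
  !$omp parallel do
  do i = 1, n
     !$omp critical
     total = total + x(i)
     !$omp end critical
  end do
  !$omp end parallel do
  avgvalue = total / n

  ! Version 2: reduction -- each thread accumulates a private partial sum,
  ! and the partial sums are combined once at the end.
  total = 0.0d0
  !$omp parallel do reduction(+:total)
  do i = 1, n
     total = total + x(i)
  end do
  !$omp end parallel do
  avgvalue = total / n

  print *, 'average =', avgvalue
end program avg_demo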
If I have just one for loop to parallelize and if I use #pragma omp critical while parallelizing, will that make it equivalent to a serial code?
No.
The critical directive specifies that the code it covers is executed by one thread at a time, but it will (eventually) be executed by all threads that encounter it.
The single directive specifies that the code it covers will be executed by only one thread, but even that isn't exactly the same as compiling the code without OpenMP. OpenMP imposes some restrictions on what programming constructs can be used inside parallel regions (e.g., no jumping out of them). Furthermore, at run time you are likely to incur an overhead for firing up OpenMP even if you don't actually run any code in parallel.
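A small sketch makes the contrast visible (the printed text is purely illustrative):
program critical_vs_single
  use omp_lib
  implicit none
  !$omp parallel num_threads(4)
  ! Every thread reaches the critical block, but only one executes it at a
  ! time, so the line below is printed four times (once per thread).
  !$omp critical
  print *, 'critical block run by thread', omp_get_thread_num()
  !$omp end critical
  ! Exactly one (unspecified) thread executes the single block, so the line
  ! below is printed once; the other threads wait at the implicit barrier.
  !$omp single
  print *, 'single block run by thread', omp_get_thread_num()
  !$omp end single
  !$omp end parallel
end program critical_vs_single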
I'm working with OpenMP in Fortran.
I have a question regarding scheduling.
Of these two options, which one will give better performance?
!$OMP PARALLEL DO PRIVATE(j) SCHEDULE(STATIC)
do j=1,l
call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO
!$OMP PARALLEL DO PRIVATE(j)
do j=1,l
call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO
The schedule clause in OpenMP offers three main scheduling kinds: static, dynamic and guided.
Static: the iterations are divided into chunks and assigned to the threads before the loop starts executing;
Dynamic: chunks are handed out to threads at run time, as each thread finishes its previous chunk;
Guided: like dynamic, but the chunk size decreases with each successive allocation;
The default scheduling is implementation defined (not specified in the standard). So, for your question, depending on the compiler it may change nothing (if the default happens to be static). Here is what happens if it does change something:
Static scheduling is best for regular workloads, where each iteration of the loop takes roughly the same time. It minimizes the overhead of distributing the work among the threads.
Dynamic scheduling is best for irregular workloads, where iterations may have different execution times. This is useful because one thread can process many short iterations while another processes a few long ones (see the sketch below).
Guided scheduling improves the global load balancing by shrinking the chunk size over time: handing out small chunks towards the end of the parallel loop reduces the difference in finish times among your threads.
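As an illustration, dynamic scheduling of an irregular loop would look like this (work(j) is a hypothetical routine whose cost varies with j; it is not your DGEMM loop):
!$OMP PARALLEL DO PRIVATE(j) SCHEDULE(DYNAMIC,1)
do j=1,l
call work(j)   ! iteration cost varies with j, so idle threads pick up the next iteration
end do
!$OMP END PARALLEL DO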
I am working on code-tuning a routine I have written, and one part of it performs two matrix multiplications that could be done simultaneously. Currently, I call DGEMM (from Intel's MKL library) for the first multiplication and then for the second. Each call uses all 12 cores on my machine. I want it to perform both DGEMM routines at the same time, using 6 cores each. I get the feeling that this is a simple matter, but I have not been able to find/understand how to achieve it. The main problem I have is that OpenMP must issue each DGEMM call from one thread, yet each call needs to be able to use 6 threads. Which directive would work best for this situation? Would it require nested pragmas?
So, as a more general note: how can I divide the (in my case 12) cores into sets, each of which runs a routine launched from one thread but using all the threads in its set?
Thanks!
The closest thing that you can do is to have an OpenMP parallel region executing with a team of two threads and then call MKL from each thread. You have to enable nested parallelism in MKL (by disabling its dynamic threads), fix the number of MKL threads to 6, and use Intel's compiler suite to compile your code. MKL itself is threaded using OpenMP, but it is Intel's OpenMP runtime; if you happen to use another compiler, e.g. GCC, its OpenMP runtime might prove incompatible with Intel's.
As you haven't specified the language, I provide two examples - one in Fortran and one in C/C++:
Fortran:
call mkl_set_num_threads(6)
call mkl_set_dynamic(0)
!$omp parallel sections num_threads(2)
!$omp section
call dgemm(...)
!$omp end section
!$omp section
call dgemm(...)
!$omp end section
!$omp end parallel sections
C/C++:
mkl_set_num_threads(6);
mkl_set_dynamic(0);
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
{
cblas_dgemm(...);
}
#pragma omp section
{
cblas_dgemm(...);
}
}
In general you cannot create subsets of threads for MKL (at least given my current understanding); each DGEMM call will use the globally specified number of MKL threads. Note that MKL operations might tune themselves for the cache size of the CPU, so performing two matrix multiplications in parallel might not be beneficial. It might pay off if you have a NUMA system with two hexa-core CPUs, each with its own memory controller (which I suspect is your case), but then you have to take care of where the data is placed and also enable binding (pinning) of threads to cores.
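As a sanity check, here is a sketch (assuming Intel's compiler and MKL, as above; mkl_get_max_threads and omp_get_thread_num are standard MKL/OpenMP query routines) that verifies each of the two OpenMP threads would drive 6 MKL threads before you issue the real DGEMM calls:
program check_teams
  use omp_lib
  implicit none
  integer, external :: mkl_get_max_threads
  call mkl_set_dynamic(0)      ! keep MKL from lowering the thread count on its own
  call mkl_set_num_threads(6)
  !$omp parallel num_threads(2)
  print *, 'OpenMP thread', omp_get_thread_num(), &
           'would run DGEMM with', mkl_get_max_threads(), 'MKL threads'
  !$omp end parallel
end program check_teams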