Splitting LAPACK calls with OpenMP

I am working on code-tuning a routine I have written, and one part of it performs two matrix multiplications which could be done simultaneously. Currently, I call DGEMM (from Intel's MKL library) for the first multiplication and then for the second. Each call uses all 12 cores on my machine. I want it to perform both DGEMM calls at the same time, using 6 cores each. I get the feeling that this is a simple matter, but I have not been able to find/understand how to achieve it. The main problem I have is that OpenMP must launch each DGEMM call from one thread, yet each call must then be able to use 6 threads. Which directive would work best for this situation? Would it require nested pragmas?
As a more general question: how can I divide the cores (12 in my case) into sets, where each set runs a routine launched from a single thread that then uses all the cores in its set?
Thanks!

The closest thing you can do is to have an OpenMP parallel region executing with a team of two threads and then call MKL from each thread. You have to enable nested parallelism in MKL (by disabling dynamic threads), fix the number of MKL threads to 6, and use Intel's compiler suite to compile your code. MKL itself is threaded using OpenMP, but it is Intel's OpenMP runtime. If you happen to use another compiler, e.g. GCC, its OpenMP runtime might prove incompatible with Intel's.
As you haven't specified the language, I provide two examples - one in Fortran and one in C/C++:
Fortran:
call mkl_set_num_threads(6)
call mkl_set_dynamic(0)
!$omp parallel sections num_threads(2)
!$omp section
call dgemm(...)
!$omp end section
!$omp section
call dgemm(...)
!$omp end section
!$omp end parallel sections
C/C++:
mkl_set_num_threads(6);
mkl_set_dynamic(0);
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
{
cblas_dgemm(...);
}
#pragma omp section
{
cblas_dgemm(...);
}
}
In general you cannot create subsets of threads for MKL (at least given my current understanding). Each DGEMM call will use the globally specified number of MKL threads. Note that MKL operations might be tuned for the cache size of the CPU, so performing two matrix multiplications in parallel might not be beneficial. You might benefit if you have a NUMA system with two hexacore CPUs, each with its own memory controller (which I suspect is your case), but then you have to take care of where data is placed and also enable binding (pinning) of threads to cores.
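With an OpenMP 4.0-aware runtime you can request such binding through environment variables before launching the program. The values below are only illustrative (spread the two outer threads across the sockets, keep each MKL team close to its parent thread); whether MKL's inner threads honour the second binding level depends on the runtime, so verify the resulting placement on your own system:
export OMP_PLACES=cores
export OMP_PROC_BIND=spread,close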

Related

OpenMP with new ecores/pcores

Is there any discussion on how OpenMP could work with the new 12th gen Intel E-cores and P-cores?
Is this going to be a nightmare for !$omp parallel do, where all threads are expected to have a similar workload?
I utilise !$OMP BARRIER to synchronise threads to assist in improving cache utilisation, but if E-cores are going to be slower (question?) this approach would fail.
At present, using OpenMP with gfortran on Windows does not allow any core locking management.
I am wondering what this new P-core/E-core approach could mean.
The OpenMP API does not have a specific feature to deal with this situation. If you know from your system that the P-cores are cores 0-7 and the E-cores are 8-15, then you can do the following to restrict your OpenMP threads to run only on the P-cores:
In the shell (bash-like):
export OMP_PLACES=0-7
export OMP_PROC_BIND=true
Then, in your code do something like this (actually no change :-)):
!$omp parallel do
do ...
...
end do
!$omp end parallel do
Or in C/C++ syntax:
#pragma omp parallel for
for(...) {...}
If you want to span all P- and E-cores with the same code, you will have to accept some sort of load imbalance, but you can still make good use of them.
In the shell (bash-like):
export OMP_PLACES=cores
export OMP_PROC_BIND=true
Then, in your Fortran code:
!$omp parallel do schedule(nonmonotonic:dynamic,chunksz)
do ...
...
end do
!$omp end parallel do
Or in C/C++ syntax:
#pragma omp parallel for schedule(nonmonotonic:dynamic,chunksz)
for(...) {...}
In that case, I would anticipate that a dynamic schedule with a chunk size of chunksz would be a good solution, so that the (faster) P-cores pick up more work than the E-cores.
If you use OpenMP tasks, then you might still want to pin the OpenMP threads to cores, but since OpenMP tasks are dynamically scheduled to idling OpenMP threads, you get automatic load balancing. As a rough rule of thumb you should make sure that you create 10x more tasks than you have OpenMP threads.
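As an illustration of that rule of thumb, a taskloop can be told explicitly how many tasks to create. The sketch below is schematic only; process_chunk and n are placeholders rather than anything from the question, and it assumes use omp_lib is in scope:
!$omp parallel
!$omp single
!$omp taskloop num_tasks(10 * omp_get_num_threads())
do i = 1, n
call process_chunk(i)   ! placeholder for the real per-iteration work
end do
!$omp end taskloop
!$omp end single
!$omp end parallel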

Failed thread creation when parallelizing a branching recursive subroutine in Fortran with OpenMP

I am writing a recursive subroutine in Fortran that expands as a binary tree (i.e. the procedure calls itself twice until it reaches the end of a branch). The general algorithmic logic is:
call my_subroutine(inputs, output)
use input to generate possible new_input(:,1) and new_input(:,2)
do i=1,2
call my_subroutine(new_input(:,i), new_output(i))
enddo
output = best(new_output(1), new_output(2))
In principle, this could be substantially accelerated through parallel computing, however when I use OpenMP to parallelize the loop, running the resulting executable aborts with the error:
libgomp: Thread creation failed: Resource temporarily unavailable Thread creation failed: Resource temporarily unavailable
I'm guessing that the stack size is too large, but I haven't found a resolution or workaround. Are there ways I can use parallel computing to improve the performance of this kind of algorithm?
- Does OpenMP or gfortran have options to help avoid these issues?
- Would it help to parallelize only above or below a certain level in the tree?
- Would C or C++ be a better option for this application?
I am working on macOS Catalina. Stack size is hard capped at 65532.
My environment variables are:
OMP_NESTED=True
OMP_DYNAMIC=True
That sounds more like your code is creating too many threads due to very deep recursion. There are ways to mitigate this. For example, OpenMP 4.5 introduced the concept of maximum active levels, controlled by the max-active-levels-var ICV (internal control variable). You may set its value either by setting the OMP_MAX_ACTIVE_LEVELS environment variable or by calling omp_set_max_active_levels(). Once the level of nesting reaches the value of max-active-levels-var, parallel regions nested more deeply are deactivated, i.e., they execute sequentially without spawning new threads.
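For example, to cap nesting at two active levels (the value 2 is only a placeholder that you will have to tune), either set it in the shell:
export OMP_MAX_ACTIVE_LEVELS=2
or call the runtime routine early in your program (assuming use omp_lib):
call omp_set_max_active_levels(2)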
If your compiler does not support OpenMP 4.5, or if you want your code to be backward compatible with older compilers, then you can do it manually by tracking the level of nesting and deactivating the parallel region. For the latter, there is the if(b) clause that, when applied to the parallel construct, makes it active only when b evaluates to .true.. A sample parallel implementation of your code:
recursive subroutine my_subroutine(inputs, output, level)
use input to generate possible new_input(:,1) and new_input(:,2)
!$omp parallel do schedule(static,1) if(level<max_levels)
do i=1,2
call my_subroutine(new_input(:,i), new_output(i), level+1)
enddo
!$omp end parallel do
output = best(new_output(1), new_output(2))
end subroutine my_subroutine
The top level call to my_subroutine has to be with a level equal to 0.
No matter how exactly you implement it, you'll need to experiment with the value of the maximum level. The optimal value will depend on the number of CPUs/cores and the arithmetic intensity of the code and will vary from system to system.
A better alternative to the parallel do construct would be to use OpenMP tasks, again, with a cut-off at a certain level of nesting. The good thing about tasks is that you can fix the number of OpenMP threads in advance and the tasking runtime will take care of workload distribution.
recursive subroutine my_subroutine(inputs, output, level)
use input to generate possible new_input(:,1) and new_input(:,2)
!$omp taskloop shared(new_input, new_output) final(level>=max_levels)
do i=1,2
call my_subroutine(new_input(:,i), new_output(i), level+1)
end do
!$omp taskwait
output = best(new_output(1), new_output(2))
end subroutine my_subroutine
Here, each iteration of the loop becomes a separate task. If max_levels of nesting has been reached, the tasks become final, which means they will not be deferred (i.e., they will execute sequentially) and each nested task will be final too, effectively stopping parallel execution further down the recursion tree. Task loops are a convenience feature introduced in OpenMP 4.5. With earlier compilers, the following equivalent code will do:
recursive subroutine my_subroutine(inputs, output, level)
use input to generate possible new_input(:,1) and new_input(:,2)
do i=1,2
!$omp task shared(new_input, new_output) final(level>=max_levels)
call my_subroutine(new_input(:,i), new_output(i), level+1)
!$omp end task
end do
!$omp taskwait
output = best(new_output(1), new_output(2))
end subroutine my_subroutine
There are no parallel constructs in the tasking code. Instead, you need to call my_subroutine from within a parallel region and the idiomatic way is to do it like this:
!$omp parallel
!$omp single
call my_subroutine(inputs, output, 0)
!$omp end single
!$omp end parallel
There is a fundamental difference between the nested parallel version and the one using tasks. In the former case, at each recursive level the current thread forks in two and each thread does one half of the computation in parallel. Limiting the level of active parallelism is needed here in order to prevent the runtime from spawning too many threads and exhausting the system resources. In the latter case, at each recursive level two new tasks are created and deferred for later, possibly parallel execution by the team of threads associated with the parallel region. The number of threads stays the same and the cut-off here limits the build-up of tasking overhead, which is way smaller than the overhead of spawning new parallel regions. Hence, the optimal value of max_levels for the tasking code will differ significantly from the optimal value for the nested parallel code.

What is the difference between a serial code and using the keyword critical while parallelizing a code in openmp?

If I have just one for loop to parallelize and if I use #pragma omp critical while parallelizing, will that make it equivalent to a serial code?
No.
The critical directive specifies that the code it covers is executed by one thread at a time, but it will (eventually) be executed by all threads that encounter it.
The single directive specifies that the code it covers will only be executed by one thread, but even that isn't exactly the same as compiling the code without OpenMP. OpenMP imposes some restrictions on what programming constructs can be used inside parallel regions (e.g., no jumping out of them). Furthermore, at run time you are likely to incur an overhead for firing up OpenMP even if you don't actually run any code in parallel.
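A minimal Fortran sketch of the difference (log_message is a hypothetical routine, not something from the question): with critical every thread eventually executes the block, just one at a time, whereas with single only one thread executes it at all.
!$omp parallel
!$omp critical
call log_message()   ! executed once per thread, serialized
!$omp end critical
!$omp single
call log_message()   ! executed exactly once, by one thread only
!$omp end single
!$omp end parallel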

OpenMP: scheduling performance

I'm working with OpenMP in Fortran.
I have a question regarding scheduling.
Of these two options, which one will have better performance?
!$OMP PARALLEL DO PRIVATE(j) SCHEDULE(STATIC)
do j=1,l
call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO
!$OMP PARALLEL DO PRIVATE(j)
do j=1,l
call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO
OpenMP defines three main schedule kinds: static, dynamic and guided.
Static: the loop iterations are divided into equal chunks and assigned to threads before the loop starts;
Dynamic: chunks are handed out to threads at run time, as each thread finishes its previous chunk;
Guided: like dynamic, but the chunk size decreases with each successive allocation;
The default schedule is implementation defined (not specified in the standard). So, for your question, depending on the compiler, specifying SCHEDULE(STATIC) may change nothing (if the default is already static). Here is what happens when it does matter:
Static scheduling is best for regular workloads, where each iteration of the loop takes roughly the same time. It has the lowest overhead because no run-time coordination of work distribution is needed.
Dynamic scheduling is best for irregular workloads, where iterations may have different execution times. It is useful because one thread can process many small iterations while another processes fewer, longer ones.
Guided scheduling improves the global load balance by shrinking the chunk size as the loop progresses. Handing out small chunks towards the end of the parallel loop reduces the difference in finish times among your threads.
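For illustration, if the dgemm calls turned out to have noticeably different costs per iteration, you could request dynamic scheduling explicitly (the chunk size of 1 is just an example value to tune):
!$OMP PARALLEL DO PRIVATE(j) SCHEDULE(DYNAMIC,1)
do j=1,l
call dgemm("N","N",..)
end do
!$OMP END PARALLEL DO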

optimizing nbody on a GPU cluster with openacc

We are trying to provide a generic n-body algorithm for multiple nodes.
A node has 2 GPUs and 1 CPU.
We want to calculate the n-body problem only on the GPUs using OpenACC. After doing some research about OpenACC, I am unsure how to spread the calculation across multiple GPUs.
Is it possible to use 2 GPUs with only one thread and OpenACC?
If not, what would be a suitable approach: using OpenMP to drive both GPUs on one node and communicating with other nodes via MPI?
The OpenACC runtime library provides routines (acc_set_device_num(), acc_get_device_num()) to select which accelerator device will be targeted by a particular thread, but it is not convenient to use a single thread to drive multiple devices simultaneously. Instead, either OpenMP or MPI can be used.
For example (lifting from here) a basic framework for OpenMP might be:
#include <openacc.h>
#include <omp.h>
#pragma omp parallel num_threads(2)
{
int i = omp_get_thread_num();
acc_set_device_num( i, acc_device_nvidia );
#pragma acc data copy...
{
}
}
It can also be done with MPI, and/or you could use MPI to communicate between nodes, as is typical.
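A minimal Fortran sketch of the MPI variant, assuming one MPI rank per GPU, at least one visible GPU per node, and consecutive placement of ranks on each node (the rank-to-device mapping is an assumption you would adapt to your launcher):
program nbody_mpi_acc
  use mpi
  use openacc
  implicit none
  integer :: rank, ierr, ngpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Map each rank to one of the GPUs on its node (assumes ranks 0 and 1
  ! share the first node, ranks 2 and 3 the second, and so on).
  ngpus = acc_get_num_devices(acc_device_nvidia)
  call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)

  ! ... each rank now offloads its share of the bodies with
  ! !$acc data / !$acc parallel loop regions and exchanges results
  ! with the other ranks via MPI ...

  call MPI_Finalize(ierr)
end program nbody_mpi_acc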
