How is synchronization between two taskloop constructs done? Specifically, in the following pseudo code, if there are more threads available than the number of tasks of the first loop, I believe these free threads are spinning at the implicit barrier at the end of the single construct. Now are these free threads allowed to start executing the second loop concurrently making it unsafe to parallelize things this way (due to the inter-dependency on the array A)?
!$omp parallel
!$omp single
!$omp taskloop num_tasks(10)
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop
!do other stuff
!$omp taskloop
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop
!$omp end single
!$omp end parallel
I haven't been able to find a clear answer from the API specification: https://www.openmp.org/spec-html/5.0/openmpsu47.html#x71-2080002.10.2
The taskloop construct has an implicit taskgroup around it by default. With that in mind, what happens for your code is that the single construct picks one thread out of the parallel team (I'll call it the producer thread). The n-1 other threads are sent straight to the barrier at the end of the single construct, where they wait for work (the tasks) to arrive.
Now, because of the taskgroup, the producer thread kicks off the creation of the loop tasks, but then waits at the end of the taskloop construct for all the created tasks to finish:
!$omp parallel
!$omp single
!$omp taskloop num_tasks(10)
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop ! producer waits here for all loop tasks to finish
!do other stuff
!$omp taskloop
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop ! producer waits here for all loop tasks to finish
!$omp end single
!$omp end parallel
So, if the first taskloop creates fewer tasks than there are worker threads waiting in the barrier, some of those threads will idle.
If you want more overlap and if the "other stuff" is independent of the first taskloop, then you can do this:
!$omp parallel
!$omp single
!$omp taskgroup
!$omp taskloop num_tasks(10) nogroup
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop ! producer will not wait for the loop tasks to complete
!do other stuff
!$omp end taskgroup ! wait for the loop tasks (and their descendant tasks)
!$omp taskloop
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop
!$omp end single
!$omp end parallel
Alas, the OpenMP API as of version 5.1 does not support task dependences for the taskloop construct, so you cannot easily describe the dependency between the iterations of the first taskloop and the second. The OpenMP language committee is working on this right now, but I do not see it being implemented for version 5.2 of the OpenMP API; version 6.0 is more likely.
PS (EDIT): Since the second taskloop sits right before the end of the single construct, and thus right before a barrier, you can add nogroup there as well to spare the producer thread that extra bit of waiting.
I'm trying to write code that maps each OpenMP thread to a single GPU. I have found very few case studies or codes on this. Since I'm not from a computer science background, my programming skills are limited.
This is what the basic idea looks like, and this is the code developed so far:
CALL OMP_SET_NUM_THREADS(2)
!$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
do while ( num.gt.iteration)
id = omp_get_thread_num()
call acc_set_device_num(id+1, acc_device_nvidia)
!!$acc kernels
!error=0.0_rk
!!$omp do
!$acc kernels
!!$omp do
do j=2,nj-1
!!$acc kernels
do i=2,ni-1
T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+ T_o(i,j+1)+T_o(i,j-1) )
enddo
!!$acc end kernels
enddo
!!$omp end do
!$acc end kernels
!!$acc update host(T,T_o)
error=0.0_rk
do j=2,nj-1
do i=2,ni-1
error = max( abs(T(i,j) - T_o(i,j)), error)
T_o(i,j) = T(i,j)
enddo
enddo
!!$acc end kernels
!!$acc update host(T,T_o,error)
iteration = iteration+1
print*,iteration , error
!print*,id
enddo
!$omp end parallel
There are a number of issues here.
First, you can't put an OpenMP (or OpenACC) parallel loop on a do while. A do while loop has an indeterminate number of iterations, which creates a dependency: whether you exit the loop depends on the previous iteration. You need to use a DO loop where the number of iterations is known upon entry into the loop.
Second, even if you convert this to a DO loop, you'd get a race condition when run in parallel: each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. Plus, the result in T_o is used as input to the next iteration, which creates a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.
For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o)" before the iteration loop and "!$acc end data" after it, so that the data is created on the device only once. As you have it now, the data would be implicitly created and copied each time through the iteration loop, causing unnecessary data movement. Also add a kernels region around the max-error reduction loop so it is offloaded as well.
In general, I prefer MPI+OpenACC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent and you get a one-to-one mapping of MPI rank to device. Not that OpenMP can't work, but you then often need to decompose the domain manually, and managing multiple device memories and keeping them in sync can be tricky. Plus, with MPI your code can also go across nodes rather than being limited to a single node.
I want to distribute subroutines to different tasks with OpenMP.
In my code I implemented this:
!$omp parallel
!$omp single
do thread = 1, omp_get_num_threads()
!$omp task
write(*,*) "Task,", thread, "is computing"
call find_pairs(me, thread, points)
call count_neighbors(me, thread, neighbors(:, thread))
!$omp end task
end do
!$omp end single
!$omp end parallel
The subroutines find_pairs and count_neighbors do some calculations.
I set the number of threads in my program before with:
nr_threads = 4
call omp_set_num_threads(nr_threads)
Compiling this with GNU Fortran (Ubuntu 8.3.0-6ubuntu1) 8.3.0 and running it gives me only one thread, running at nearly 100% when monitored with top. Nevertheless, it prints the expected output:
Task, 1 is computing
Task, 2 is computing
Task, 3 is computing
Task, 4 is computing
I compile it using:
gfortran -fopenmp main.f90 -o program
What I want is to distribute different calls of the subroutines according to
the number of OpenMP threads, working in parallel.
From what I understand, a single thread is created, which then creates the different tasks.
What exactly is "implicit synchronization" in OpenMP and how can you spot one? My teacher said that
#pragma omp parallel
printf("Hello 1\n");
has an implicit sync. Why? And how do you see it?
Synchronisation is an important issue in parallel processing and in OpenMP. In general, parallel processing is asynchronous: you know that several threads are working on a problem, but you have no way to know their actual state, the iteration they are working on, etc. A synchronisation allows you to regain control over thread execution.
There are two kinds of synchronisation in OpenMP: explicit and implicit. An explicit synchronisation is done with a specific OpenMP construct that creates a barrier: #pragma omp barrier. A barrier is a parallel construct that threads can only pass once all of them have reached it. So after the barrier, you know exactly the state of all threads and, more importantly, how much work they have done.
Implicit synchronisation is done in two situations:
at the end of a parallel region. OpenMP relies on a fork-join model. When the program starts, a single thread (the master thread) is created. When you create a parallel section with #pragma omp parallel, several threads are created (fork). These threads work concurrently and at the end of the parallel section are destroyed (join). So at the end of a parallel section, you have a synchronisation and you know precisely the status of all threads (they have finished their work). This is what happens in the example you give: the parallel section only contains the printf(), and at the end, the program waits for all threads to terminate before continuing.
at the end of some OpenMP constructs, like #pragma omp for or #pragma omp sections, there is an implicit barrier. No thread can continue until all the threads have reached the barrier. This is important for knowing exactly what work has been done by the different threads.
For instance, consider the following code.
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<N; i++)
A[i]=f(i); // compute values for A
#pragma omp for
for(int j=0; j<N/2; j++)
B[j]=A[j]+A[j+N/2];// use the previously computed vector A
} // end of parallel section
As all the threads work asynchronously, you do not know which threads have finished creating their part of vector A. Without a synchronisation, there is a risk that a thread rapidly finishes its part of the first for loop, enters the second for loop, and accesses elements of vector A while the threads that are supposed to compute them are still in the first loop and have not yet computed the corresponding values of A[i].
This is the reason why OpenMP compilers add an implicit barrier to synchronize all the threads: you are certain that all threads have finished their work and that all values of A have been computed when the second for loop starts.
But in some situations, no synchronisation is required. For instance, consider the following code:
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<N; i++)
A[i]=f(i); // compute values for A
#pragma omp for
for(int j=0; j<N/2; j++)
B[j]=g(j);// compute values for B
} // end of parallel section
Obviously the two loops are completely independent: it does not matter whether A is fully computed before the second for loop starts. So the synchronisation contributes nothing to the program's correctness, and the synchronisation barrier has two major drawbacks:
If function f() has very different running times, some threads may have finished their work while others are still computing. The synchronisation forces the former to wait, and this idleness does not exploit the available parallelism.
Synchronisations are expensive. A simple way to implement a barrier is to increment a global counter when reaching the barrier and to wait until the counter equals the number of threads, omp_get_num_threads(). To avoid races between threads, the increment of the global counter must be done with an atomic read-modify-write, which requires a large number of cycles, and the wait for the proper counter value is typically done with a spin lock that wastes processor cycles.
So there is a construct to suppress implicit synchronisations, and the best way to write the previous loop is:
#pragma omp parallel
{
#pragma omp for nowait // nowait suppresses implicit synchronisations.
for(int i=0; i<N; i++)
A[i]=f(i); // compute values for A
#pragma omp for
for(int j=0; j<N/2; j++)
B[j]=g(j);// compute values for B
} // end of parallel section
This way, as soon as a thread has finished its work in the first loop, it immediately starts processing the second for loop and, depending on the actual program, this may significantly reduce execution time.
In trying to optimise some code I find that using OpenMP linearly increases the time it takes to run. The representative section of code that I am trying to speed up is as follow:
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = REAL(cr)
CALL SYSTEM_CLOCK(c1)
DO k=1,ntotal
CALL OMP_INIT_LOCK(locks(k))
END DO
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k)
DO k=1,niac
i = pair_i(k)
j = pair_j(k)
dvx(:,k) = vx(:,i)-vx(:,j)
CALL omp_set_lock(locks(i))
CALL DGER(dim,dim,-1.d0, (disp_nmh(:,j)-disp_nmh(:,i)),1, &
(dwdx_nor(dim+1:2*dim,k)*V_0(j)),1, particle_data(i)%def_grad,dim)
CALL DGER(dim,dim,-1.d0, (-dvx(:,k)),1, &
(dwdx_nor(dim+1:2*dim,k)*V_0(j)) ,1, particle_data(i)%vel_grad(1:dim,1:dim),dim)
CALL omp_unset_lock(locks(i))
CALL omp_set_lock(locks(j))
CALL DGER(dim,dim,-1.d0, (dvx(:,k)),1, &
(dwdx_nor(3*dim+1:4*dim,k)*V_0(i)) ,1, particle_data(j)%vel_grad(1:dim,1:dim),dim)
CALL DGER(dim,dim,-1.d0, (disp_nmh(:,i)-disp_nmh(:,j)),1, &
(dwdx_nor(3*dim+1:4*dim,k)*V_0(i)),1, particle_data(j)%def_grad,dim)
CALL omp_unset_lock(locks(j))
END DO
!$OMP END PARALLEL DO
CALL SYSTEM_CLOCK(c2)
t_el = t_el + (c2-c1)/rate
WRITE(*,*) "Wall time elapsed: ", t_el
Note that for the simulation I am testing k=14000 which I thought was a reasonable candidate for running in parallel. So far as I know I have to use the locks to ensure that threads which are given the same value of "i" (but a different value of "j") cannot access the same index of the arrays which are being written to at the same time. I cannot figure out if the version of BLAS (sudo apt-get install libblas-dev liblapack-dev) which I use is thread safe. I ran a simulation with 8 cores and got the same result as without OpenMP so I am guessing that it could be. BLAS is used, in this case, to calculate and sum the outer product of many 3x3 matrices.
Is the implementation of OpenMP above the best way to speed up this code? I know very little about OpenMP but my guesses are that:
the memory being all over the place ("i" is sequential but "j" is not)
the overhead in starting and closing down all the threads
the constant locking and unlocking
and maybe the small loop size (although I thought 14000 would be sufficient)
are significantly outweighing the performance benefits. Is this correct? Or can the code above be modified to get some performance gain?
EDIT
I should probably add that the code above is part of a time integration loop. Hopefully this explains why the elapsed time is summed.
OK, I hope this was not asked before, because this is a little tricky to find on the search.
I have looked over the F95 manual, but still find this vague:
For the simple case of:
DO i=0,99
<some functionality>
END DO
I'm trying to figure out what is the difference between:
!$OMP DO PRIVATE(i)
DO i=0,99
<some functionality>
END DO
!$OMP END DO
And:
!$OMP PARALLEL DO PRIVATE(i)
DO i=0,99
<some functionality>
END DO
!$OMP END PARALLEL DO
(Just to point out the difference: the first one has OMP DO but no PARALLEL directive AT ALL. The second just has the PARALLEL directive added)
Thanks!
The !$OMP DO PRIVATE(i) instructs the compiler how to divide the work between the threads, but it does not start any threads. It performs the worksharing only if it is (even indirectly) inside an !$OMP PARALLEL region; otherwise it does nothing.
!$OMP PARALLEL DO PRIVATE(i)
!$OMP END PARALLEL DO
does the same as
!$OMP PARALLEL PRIVATE(i)
!$OMP DO
!$OMP END DO
!$OMP END PARALLEL
So it both starts the threads and distributes the work between them.
If you had just
!$OMP PARALLEL PRIVATE(i)
!$OMP END PARALLEL
all threads would do all the work inside the parallel region.
If the OpenMP do directive is encountered outside a parallel region it is executed in serial by one thread -- it behaves as if it were not parallelised at all. Of course, that's because it isn't.
The first of your snippets isn't parallelised, the second is.
I'm not sure what you mean by the F95 manual nor why you would look there for information about OpenMP.