Distribute functions to multiple tasks - parallel-processing

I want to distribute subroutines to different tasks with OpenMP.
In my code I implemented this:
!$omp parallel
!$omp single
do thread = 1, omp_get_num_threads()
!$omp task
write(*,*) "Task,", thread, "is computing"
call find_pairs(me, thread, points)
call count_neighbors(me, thread, neighbors(:, thread))
!$omp end task
end do
!$omp end single
!$omp end parallel
The subroutines find_pairs and count_neighbors do some calculations.
I set the number of threads in my program before with:
nr_threads = 4
call omp_set_num_threads(nr_threads)
Compiling this with GNU Fortran (Ubuntu 8.3.0-6ubuntu1) 8.3.0 and running it, I see only one thread, running at nearly 100% when monitoring with top. Nevertheless, it prints the expected output:
Task, 1 is computing
Task, 2 is computing
Task, 3 is computing
Task, 4 is computing
I compile it using:
gfortran -fopenmp main.f90 -o program
What I want is to distribute the different subroutine calls across the available OpenMP threads so that they run in parallel.
From what I understand, a single thread is created, and that thread creates the different tasks.
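For reference, here is a self-contained sketch of this pattern; the firstprivate(thread) clause and the do_work stub are illustrative additions, not part of the original code:
program task_demo
   use omp_lib
   implicit none
   integer :: thread, nr_threads

   nr_threads = 4
   call omp_set_num_threads(nr_threads)

   !$omp parallel
   !$omp single
   do thread = 1, omp_get_num_threads()
      !$omp task firstprivate(thread)
      ! each task gets its own copy of thread (clause added for illustration)
      write(*,*) "Task,", thread, "is computing"
      call do_work(thread)   ! stand-in for find_pairs/count_neighbors
      !$omp end task
   end do
   !$omp end single
   !$omp end parallel

contains

   subroutine do_work(t)
      integer, intent(in) :: t
      ! placeholder for the real computation
   end subroutine do_work

end program task_demo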

Related

Multi-threaded multi GPU computation using openMP and openACC

I'm trying to write code that maps each OpenMP thread to a single GPU. I found very few case studies or codes on this. Since I'm not from a computer science background, my programming skills are limited.
This is how the basic idea looks, and this is the code developed so far:
CALL OMP_SET_NUM_THREADS(2)
!$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
do while ( num.gt.iteration)
id = omp_get_thread_num()
call acc_set_device_num(id+1, acc_device_nvidia)
!!$acc kernels
!error=0.0_rk
!!$omp do
!$acc kernels
!!$omp do
do j=2,nj-1
!!$acc kernels
do i=2,ni-1
T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+ T_o(i,j+1)+T_o(i,j-1) )
enddo
!!$acc end kernels
enddo
!!$omp end do
!$acc end kernels
!!$acc update host(T,T_o)
error=0.0_rk
do j=2,nj-1
do i=2,ni-1
error = max( abs(T(i,j) - T_o(i,j)), error)
T_o(i,j) = T(i,j)
enddo
enddo
!!$acc end kernels
!!$acc update host(T,T_o,error)
iteration = iteration+1
print*,iteration , error
!print*,id
enddo
!$omp end parallel
There are a number of issues here.
First, you can't put an OpenMP (or OpenACC) parallel loop on a DO WHILE. A DO WHILE has an indeterminate number of iterations and therefore creates a dependency, in that exiting the loop depends on the previous iteration. You need to use a DO loop whose number of iterations is known upon entry into the loop.
Second, even if you convert this to a DO loop, you'd get a race condition if it were run in parallel. Each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. Plus, T_o is used as input to the next iteration, creating a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.
For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o)" before the iteration loop and "!$acc end data" after it, so that the data is created on the device only once. As you have it now, the data would be implicitly created and copied each time through the iteration loop, causing unnecessary data movement. Also add a kernels region around the max-error reduction loop so that it is offloaded as well.
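A minimal self-contained sketch of that suggestion; the array sizes, boundary values, and the fixed iteration count max_iter (replacing the do while) are assumptions for illustration:
program jacobi_acc
   implicit none
   integer, parameter :: rk = selected_real_kind(12)
   integer, parameter :: ni = 256, nj = 256, max_iter = 1000
   integer :: i, j, iteration
   real(rk) :: error
   real(rk) :: T(ni, nj), T_o(ni, nj)

   T = 0.0_rk
   T_o = 0.0_rk
   T_o(1, :) = 1.0_rk   ! an arbitrary boundary condition for the sketch

   ! arrays live on the device for the whole iteration loop
!$acc data copy(T, T_o)
   do iteration = 1, max_iter   ! DO loop with a known trip count
!$acc kernels
      do j = 2, nj-1
         do i = 2, ni-1
            T(i,j) = 0.25_rk*(T_o(i+1,j) + T_o(i-1,j) + T_o(i,j+1) + T_o(i,j-1))
         end do
      end do
!$acc end kernels
      error = 0.0_rk
      ! offload the max-error reduction loop as well
!$acc kernels
      do j = 2, nj-1
         do i = 2, ni-1
            error = max(abs(T(i,j) - T_o(i,j)), error)
            T_o(i,j) = T(i,j)
         end do
      end do
!$acc end kernels
      print *, iteration, error
   end do
!$acc end data
end program jacobi_acc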
In general, I prefer using MPI+OpenACC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent, and you then have a one-to-one mapping of MPI ranks to devices. Not that OpenMP can't work, but you then often need to decompose the domain manually, and trying to manage multiple device memories and keep them in sync can be tricky. Plus, with MPI your code can also go across nodes rather than being limited to a single node.
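As an illustration of that one-to-one mapping, a minimal sketch; the 1-based device numbering mirrors the question's acc_set_device_num(id+1, ...) call and may differ for your runtime:
program rank_to_gpu
   use mpi
   use openacc
   implicit none
   integer :: ierr, rank, ndev

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ndev = acc_get_num_devices(acc_device_nvidia)
   if (ndev > 0) then
      ! one rank per device; modulo wraps around if there are more ranks than GPUs
      call acc_set_device_num(mod(rank, ndev) + 1, acc_device_nvidia)
   end if

   ! ... each rank now offloads its own subdomain to its own GPU ...

   call MPI_Finalize(ierr)
end program rank_to_gpu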

OpenMP taskloop: synchronization between two consecutive taskloop constructs

How is synchronization between two taskloop constructs done? Specifically, in the following pseudo code, if there are more threads available than the number of tasks of the first loop, I believe these free threads spin at the implicit barrier at the end of the single construct. Now, are these free threads allowed to start executing the second loop concurrently, making it unsafe to parallelize things this way (due to the dependency on the array A)?
!$omp parallel
!$omp single
!$omp taskloop num_tasks(10)
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop
!do other stuff
!$omp taskloop
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop
!$omp end single
!$omp end parallel
I haven't been able to find a clear answer from the API specification: https://www.openmp.org/spec-html/5.0/openmpsu47.html#x71-2080002.10.2
The taskloop construct by default has an implicit taskgroup around it. With that in mind, what happens for your code is that the single construct picks any one thread out of the available threads of the parallel team (I'll call that the producer thread). The n-1 other threads are then sent straight to the barrier of the single construct and wait there for work (the tasks) to arrive.
Now, with the taskgroup, the producer thread kicks off the creation of the loop tasks, but then waits at the end of the taskloop construct for all the created tasks to finish:
!$omp parallel
!$omp single
!$omp taskloop num_tasks(10)
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop ! producer waits here for all loop tasks to finish
!do other stuff
!$omp taskloop
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop ! producer waits here for all loop tasks to finish
!$omp end single
!$omp end parallel
So, if you have less parallelism (i.e., fewer tasks created by the first taskloop) than the n-1 worker threads waiting in the barrier, then some of these threads will idle.
If you want more overlap and if the "other stuff" is independent of the first taskloop, then you can do this:
!$omp parallel
!$omp single
!$omp taskgroup
!$omp taskloop num_tasks(10) nogroup
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop ! producer will not wait for the loop tasks to complete
!do other stuff
!$omp end taskgroup ! wait for the loop tasks (and their descendant tasks)
!$omp taskloop
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop
!$omp end single
!$omp end parallel
Alas, the OpenMP API as of version 5.1 does not support task dependences for the taskloop construct, so you cannot easily describe the dependency between the loop iterations of the first taskloop and the second taskloop. The OpenMP language committee is working on this right now, but I do not see this being implemented for the OpenMP API version 5.2, but rather for version 6.0.
PS (EDIT): Since the second taskloop sits right before the end of the single construct, and thus right before a barrier, you can easily add nogroup there as well to avoid that extra bit of waiting by the producer thread.
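Applied to the previous snippet, that PS amounts to the following; nogroup on the second taskloop is safe only because the barrier at the end of the single region still waits for all outstanding tasks:
!$omp parallel
!$omp single
!$omp taskgroup
!$omp taskloop num_tasks(10) nogroup
DO i=1, 10
A(i) = foo()
END DO
!$omp end taskloop ! producer will not wait for the loop tasks to complete
!do other stuff
!$omp end taskgroup ! wait for the loop tasks (and their descendant tasks)
!$omp taskloop nogroup
DO j=1, 10
B(j) = A(j)
END DO
!$omp end taskloop ! no taskgroup wait; the barrier at the end of single covers it
!$omp end single
!$omp end parallel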

Memory error when using OpenMP with Fortran, running FFTW

I am testing FFTW in a Fortran program because I need to use it. Since I am working with huge matrices, my first idea is to use OpenMP. When my matrix has dimensions 500 x 500 x 500, the following error happens:
Operating system error:
Program aborted. Backtrace:
Cannot allocate memory
Allocation would exceed memory limit
I compiled the code using the following: gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp
PROGRAM test_fftw
USE omp_lib
USE, intrinsic:: iso_c_binding
IMPLICIT NONE
INCLUDE 'fftw3.f'
INTEGER::i, DD=500
DOUBLE COMPLEX:: OUTPUT_FFTW(3,3,3)
DOUBLE COMPLEX, ALLOCATABLE:: A3D(:,:,:), FINAL_OUTPUT(:,:,:)
integer*8:: plan
integer::iret, nthreads
INTEGER:: indiceX, indiceY, indiceZ, window=2
!! TESTING 3D FFTW with OPENMP
ALLOCATE(A3D(DD,DD,DD))
ALLOCATE(FINAL_OUTPUT(DD-2,DD-2,DD-2))
write(*,*) '---------------'
write(*,*) '------------TEST 3D FFTW WITH OPENMP----------'
A3D = reshape((/(i, i=1,DD*DD*DD)/),shape(A3D))
CALL dfftw_init_threads(iret)
CALL dfftw_plan_with_nthreads(nthreads)
CALL dfftw_plan_dft_3d(plan, 3,3,3, OUTPUT_FFTW, OUTPUT_FFTW, FFTW_FORWARD, FFTW_ESTIMATE)
FINAL_OUTPUT=0.
!$OMP PARALLEL DO DEFAULT(SHARED) SHARED(A3D,plan,window) &
!$OMP PRIVATE(indiceX, indiceY, indiceZ, OUTPUT_FFTW, FINAL_OUTPUT)
DO indiceZ=1,10!500-window
write(*,*) 'INDICE Z=', indiceZ
DO indiceY=1,10!500-window
DO indiceX=1,10!500-window
CALL dfftw_execute_dft(plan, A3D(indiceX:indiceX+window,indiceY:indiceY+window, indiceZ:indiceZ+window), OUTPUT_FFTW)
FINAL_OUTPUT(indiceX,indiceY,indiceZ)=SUM(ABS(OUTPUT_FFTW))
ENDDO
ENDDO
ENDDO
!$OMP END PARALLEL DO
call dfftw_destroy_plan(plan)
CALL dfftw_cleanup_threads()
DEALLOCATE(A3D,FINAL_OUTPUT)
END PROGRAM test_fftw
Notice this error occurs when I merely use a huge matrix (A3D), without even running the loop over all the values of this matrix (to run over all values, the limits of the three nested loops would have to be 500-window).
I tried to solve this (tips here and here) with -mcmodel=medium in the compilation, without success.
I had success when I compiled with gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp -fmax-stack-var-size=65536
So, I don't understand:
1) Why is there a memory allocation problem if the huge matrix is a shared variable?
2) Will the solution I found still work if I have more huge matrix variables, for example three more 500 x 500 x 500 matrices to store calculation results?
3) In the tips I found, people said that using allocatable arrays/matrices would solve it, but I was already using them without any difference. Is there anything else I need to do for this?
Two double complex arrays with 500 x 500 x 500 elements require 4 gigabytes of memory. It is likely that the amount of available memory in your computer is not sufficient.
If you only work with small windows, you might consider not using the whole array at the whole time, but only parts of it. Or distribute the computation across multiple computers using MPI.
Or just use a computer with bigger RAM.
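As a quick sanity check of that arithmetic, a tiny standalone snippet (not part of the original program):
program mem_check
   implicit none
   integer, parameter :: dd = 500
   integer(kind=8) :: bytes_per_array

   ! a double complex element occupies 16 bytes
   bytes_per_array = int(dd, kind=8)**3 * 16_8
   print *, 'one 500^3 double complex array:', real(bytes_per_array) / 1.0e9, 'GB'
   print *, 'two such arrays:', 2.0 * real(bytes_per_array) / 1.0e9, 'GB'
end program mem_check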

parallel running a mpi subroutine

I got a MPI program written by other people.
Basic structure is like this
program basis
initialize MPI
do n=1,12
call mpi_job(n)
end do
finalize MPI
contains
subroutine mpi_job(n) !this is mpi subroutine
.....
end subroutine
end program
What I want to do now is to make the do loop a parallel do loop. So if I got a 24 core machine, I can run this program with 12 mpi_job running simultaneously and each mpi_job uses 2 threads. There are several reasons to do this, for example, the performance of mpi_job may not scale well with number of cores. To sum up, I want to make one level of MPI parallelization into two levels of parallelization.
I find myself constantly encountering this problem when working with other people. The question is: what is the easiest and most efficient way to modify the program?
So if I got a 24 core machine, I can run this program with 12 mpi_job running simultaneously and each mpi_job uses 2 threads.
I wouldn't do that. I recommend mapping MPI processes to NUMA nodes and then spawning k threads where there are k cores per NUMA node.
There are several reasons to do this, for example, the performance of mpi_job may not scale well with number of cores.
That's an entirely different issue. What aspect of mpi_job won't scale well? Is it memory bound? Does it require too much communication?
You should use sub-communicators:
Compute job_nr = floor(global_rank / ranks_per_job).
Use MPI_COMM_SPLIT with job_nr as the color. This creates a local communicator for each job.
Pass the resulting communicator to mpi_job. All communication should then use that communicator and the rank local to that communicator.
Of course, this all implies that there are no dependencies between the different calls to mpi_job, or that you map those onto the appropriate global/world communicator.
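A minimal sketch of that recipe; ranks_per_job = 2 and the extra communicator argument to mpi_job are assumptions made for the illustration:
program basis_split
   use mpi
   implicit none
   integer :: ierr, world_rank, ranks_per_job, job_nr, job_comm, job_rank

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

   ranks_per_job = 2                      ! e.g. 24 ranks -> 12 jobs of 2 ranks each
   job_nr = world_rank / ranks_per_job    ! integer division acts as floor here

   ! ranks sharing a job_nr (the "color") end up in the same sub-communicator
   call MPI_Comm_split(MPI_COMM_WORLD, job_nr, world_rank, job_comm, ierr)
   call MPI_Comm_rank(job_comm, job_rank, ierr)

   call mpi_job(job_nr, job_comm)         ! pass the sub-communicator down

   call MPI_Comm_free(job_comm, ierr)
   call MPI_Finalize(ierr)

contains

   subroutine mpi_job(n, comm)
      integer, intent(in) :: n, comm
      ! all communication for job n should use comm and ranks local to comm
   end subroutine mpi_job

end program basis_split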
There is some confusion here over the basics of what you are trying to do. Your skeleton code will not run 12 MPI jobs at the same time; each MPI process that you create will run 12 jobs sequentially.
What you want to do is run 12 MPI processes, each of which calls mpi_job a single time. Within mpi_job, you can then create 2 threads using OpenMP.
Process and thread placement is outside the scope of the MPI and OpenMP standards. For example, ensuring that the processes are spread evenly across your multicore machine (e.g. each of the 12 even cores 0, 2, ... out of 24) and that the OpenMP threads run on even and odd pairs of cores would require you to look up the man pages for your MPI and OpenMP implementations. You may be able to place processes using arguments to mpiexec; thread placement may be controlled by environment variables, e.g. KMP_AFFINITY for Intel OpenMP.
Placement aside, here is a code that I think does what you want (I make no comment on whether it is the most efficient thing to do). I am using GNU compilers here.
user@laptop$ mpif90 -fopenmp -o basis basis.f90
user@laptop$ export OMP_NUM_THREADS=2
user@laptop$ mpiexec -n 12 ./basis
Running 12 MPI jobs at the same time
MPI job 2 , thread no. 1 reporting for duty
MPI job 11 , thread no. 1 reporting for duty
MPI job 11 , thread no. 0 reporting for duty
MPI job 8 , thread no. 0 reporting for duty
MPI job 0 , thread no. 1 reporting for duty
MPI job 0 , thread no. 0 reporting for duty
MPI job 2 , thread no. 0 reporting for duty
MPI job 8 , thread no. 1 reporting for duty
MPI job 4 , thread no. 1 reporting for duty
MPI job 4 , thread no. 0 reporting for duty
MPI job 10 , thread no. 1 reporting for duty
MPI job 10 , thread no. 0 reporting for duty
MPI job 3 , thread no. 1 reporting for duty
MPI job 3 , thread no. 0 reporting for duty
MPI job 1 , thread no. 0 reporting for duty
MPI job 1 , thread no. 1 reporting for duty
MPI job 5 , thread no. 0 reporting for duty
MPI job 5 , thread no. 1 reporting for duty
MPI job 9 , thread no. 1 reporting for duty
MPI job 9 , thread no. 0 reporting for duty
MPI job 7 , thread no. 0 reporting for duty
MPI job 7 , thread no. 1 reporting for duty
MPI job 6 , thread no. 1 reporting for duty
MPI job 6 , thread no. 0 reporting for duty
Here's the code:
program basis
use mpi
implicit none
integer :: ierr, size, rank
integer :: comm = MPI_COMM_WORLD
call MPI_Init(ierr)
call MPI_Comm_size(comm, size, ierr)
call MPI_Comm_rank(comm, rank, ierr)
if (rank == 0) then
write(*,*) 'Running ', size, ' MPI jobs at the same time'
end if
call mpi_job(rank)
call MPI_Finalize(ierr)
contains
subroutine mpi_job(n) !this is mpi subroutine
use omp_lib
implicit none
integer :: n, ithread
!$omp parallel default(none) private(ithread) shared(n)
ithread = omp_get_thread_num()
write(*,*) 'MPI job ', n, ', thread no. ', ithread, ' reporting for duty'
!$omp end parallel
end subroutine mpi_job
end program basis

OMP: What is the difference between OMP PARALLEL DO and OMP DO (Without parallel directive at all)

OK, I hope this was not asked before, because this is a little tricky to find on the search.
I have looked over the F95 manual, but still find this vague:
For the simple case of:
DO i=0,99
<some functionality>
END DO
I'm trying to figure out what is the difference between:
!$OMP DO PRIVATE(i)
DO i=0,99
<some functionality>
END DO
!$OMP END DO
And:
!$OMP PARALLEL DO PRIVATE(i)
DO i=0,99
<some functionality>
END DO
!$OMP END PARALLEL DO
(Just to point out the difference: the first one has OMP DO but no PARALLEL directive AT ALL. The second just has the PARALLEL directive added)
Thanks!
The !$OMP DO PRIVATE(i) directive instructs the compiler how to divide the work between the threads, but it does not start any threads. It performs worksharing only if it is (even indirectly) inside a !$OMP PARALLEL region; otherwise it does not do anything.
!$OMP PARALLEL DO PRIVATE(i)
!$OMP END PARALLEL DO
does the same as
!$OMP PARALLEL PRIVATE(i)
!$OMP DO
!$OMP END DO
!$OMP END PARALLEL
So it both starts the threads and distributes the work between them.
If you had just
!$OMP PARALLEL PRIVATE(i)
!$OMP END PARALLEL
all threads would do all the work inside the parallel region.
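To make the three cases concrete, here is a small sketch (the loop body and program name are made up for the illustration):
program do_vs_parallel_do
   use omp_lib
   implicit none
   integer :: i

   ! Case 1: an orphaned !$OMP DO outside any parallel region.
   ! The binding team has a single thread, so the loop simply runs serially.
   !$omp do
   do i = 1, 4
      write(*,*) 'orphaned do: thread', omp_get_thread_num(), 'iteration', i
   end do
   !$omp end do

   ! Case 2: PARALLEL DO starts the threads and divides the iterations among them.
   !$omp parallel do private(i)
   do i = 1, 4
      write(*,*) 'parallel do: thread', omp_get_thread_num(), 'iteration', i
   end do
   !$omp end parallel do

   ! Case 3: a bare PARALLEL region; every thread executes the whole loop.
   !$omp parallel private(i)
   do i = 1, 4
      write(*,*) 'bare parallel: thread', omp_get_thread_num(), 'iteration', i
   end do
   !$omp end parallel
end program do_vs_parallel_do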
If the OpenMP do directive is encountered outside a parallel region, it is executed in serial by one thread; it behaves as if it were not parallelised at all. Of course, that's because it isn't.
The first of your snippets isn't parallelised, the second is.
I'm not sure what you mean by the F95 manual nor why you would look there for information about OpenMP.

Resources