Parallelization of an OpenMP nested do loop

I have a nested do loop in an OpenMP Fortran 77 code that I am unable to parallelize (the code gives a segmentation fault when it is run). A very similar nested do loop in a different subroutine of the same code runs in parallel with no issues.
Here is the nested do loop that I am having problems with:
      do n=1,num_p
C$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(l,i1,i2,j1,j2,k1,k2
C$OMP& ,i,j,k,i_t,j_t,i_ddf,j_ddf,ddf_dum)
        do l=1,n_l(n)
          call del_fn(l,n)
          i1=p_iw(l,n)
          i2=p_ie(l,n)
          j1=p_js(l,n)
          j2=p_jn(l,n)
          k1=p_kb(l,n)
          k2=p_kt(l,n)
          do i=i1,i2
            i_ddf=i-i1+1
            if(i .lt. 1) then
              i_t=nx+i
            elseif (i .gt. nx) then
              i_t=i-nx
            else
              i_t=i
            endif
            do j=j1,j2
              j_ddf=j-j1+1
              if(j .lt. 1) then
                j_t=ny+j
              elseif(j .gt. ny) then
                j_t=j-ny
              else
                j_t=j
              endif
              do k=k1,k2
                ddf(l,n,i_ddf,j_ddf,k-k1+1) = ddf_dum(i_t,j_t,k)
              enddo
            enddo
          enddo
        enddo
C$OMP END PARALLEL DO
      enddo
I have narrowed the problem down to ddf_dum(i_t,j_t,k): when this term is turned off (say, replaced by 0.d0), the code runs fine.
On the other hand, below is the very similar nested do loop that runs in parallel with no issues. Can anyone please identify what I am missing here?
      do n=1,1
C$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(l,i1,i2,j1,j2,k1,k2
C$OMP& ,i,j,k,i_f,j_f,i_ddf,j_ddf)
        do l=1,n_l(n)
          i1=p_iw(l,n)
          i2=p_ie(l,n)
          j1=p_js(l,n)
          j2=p_jn(l,n)
          k1=p_kb(l,n)
          k2=p_kt(l,n)
          u_forcing(l,n)= (u_p(l,n)-up_tilde(l,n))/dt
          v_forcing(l,n)= (v_p(l,n)-vp_tilde(l,n))/dt
          w_forcing(l,n)= (w_p(l,n)-wp_tilde(l,n))/dt
          do i=i1,i2
            i_ddf=i-i1+1
            if(i .lt. 1) then
              i_f=nx+i
            elseif (i .gt. nx) then
              i_f=i-nx
            else
              i_f=i
            endif
            do j=j1,j2
              j_ddf=j-j1+1
              if(j .lt. 1) then
                j_f=ny+j
              elseif(j .gt. ny) then
                j_f=j-ny
              else
                j_f=j
              endif
              do k=k1,k2
                forcing_x(i_f,j_f,k)=forcing_x(i_f,j_f,k)+u_forcing(l,n)
     &            *ddf_n(l,n,i_ddf,j_ddf,k-k1+1)*dv_l(l,n)
                forcing_y(i_f,j_f,k)=forcing_y(i_f,j_f,k)+v_forcing(l,n)
     &            *ddf_n(l,n,i_ddf,j_ddf,k-k1+1)*dv_l(l,n)
                forcing_z(i_f,j_f,k)=forcing_z(i_f,j_f,k)+w_forcing(l,n)
     &            *ddf_n(l,n,i_ddf,j_ddf,k-k1+1)*dv_l(l,n)
              enddo
            enddo
          enddo
        enddo
C$OMP END PARALLEL DO
      enddo

As you noted, your problem is ddf_dum. It should be a shared variable, not a private one, because it is only ever read and never written inside the loop. You are getting a segfault because every thread other than the master is accessing its own uninitialized private copy of the array.
A good rule of thumb that would have let you find this mistake yourself: any variable that appears only on the right-hand side of the assignments in your parallel region should be shared.
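In other words, the fix is simply to drop ddf_dum from the PRIVATE clause so that every thread reads the one shared, already-initialized array. A minimal sketch of the corrected directive (the loop body stays exactly as posted):
C$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(l,i1,i2,j1,j2,k1,k2
C$OMP& ,i,j,k,i_t,j_t,i_ddf,j_ddf)
        do l=1,n_l(n)
C         ... body unchanged: del_fn call, index setup, and the i/j/k loops ...
        enddo
C$OMP END PARALLEL DO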

Related

OpenMPI IPC performance is worse than reading/writing to file

I am trying out various ways of IPC to do the following:
Master starts.
Master starts a slave.
Master passes an array to slave.
Slave processes the array.
Slave sends the array back to master.
I have tried using OpenMPI to solve this by having the parent process spawn a child which in turn does the aforementioned processing. However, I have also tried what I thought would be the worst possible way to do this: letting the master write the data to a file and having the slave read from and write back to that file. The result is stunning.
Below are the two ways in which I achieve this. The first way is the "file" way, the second one uses OpenMPI.
Master.f90
program master
    implicit none
    integer*4, dimension (10000) :: matrix
    integer :: length, i, exitstatus, cmdstatus
    logical :: waistatus

    ! put integers in matrix and output data into a file
    open(1, file='matrixdata.dat', status='new')
    length = 10000
    do i=1,length
        matrix(i) = i
        write(1,*) matrix(i)
    end do
    close(1)

    call execute_command_line("./slave.out", wait = .true., exitstat=exitstatus)

    if(exitstatus .eq. 0) then
        ! open and read the file changed by subroutine slave
        open(1, file= 'matrixdata.dat', status='old')
        do i = 1, length
            read(1,*) matrix(i)
        end do
        close(1)
    endif
end program master
Slave.f90
program slave
    implicit none
    integer*4, dimension (10000) :: matrix
    integer :: length, i

    ! Open and read the file made by master into a matrix
    open (1, file= 'matrixdata.dat', status = 'old')
    length = 10000
    do i = 1, length
        read(1,*) matrix(i)
    end do
    close(1)

    ! Square all numbers and write over the file with new data
    open(1, file= 'matrixdata.dat', status = 'old')
    do i=1,length
        matrix(i) = matrix(i)**2
        write(1,*) matrix(i)
    end do
    close(1)
end program slave
* OpenMPI *
Master.f90
program master
    use mpi
    implicit none

    integer :: ierr, num_procs, my_id, intercomm, i, siz, array(10000000), s_tag, s_dest, siffra

    CALL MPI_INIT(ierr)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)

    siz = 10000

    !print *, "S.Rank =", my_id
    !print *, "S.Size =", num_procs

    if (.not. (ierr .eq. 0)) then
        print*, "S.Unable to initialize MPI!"
        stop
    endif

    do i=1,size(array)
        array(i) = 2
    enddo

    if (my_id .eq. 0) then
        call MPI_Comm_spawn("./slave.out", MPI_ARGV_NULL, 1, MPI_INFO_NULL, my_id, &
            & MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE, ierr)

        s_dest = 0 !rank of destination (integer)
        s_tag = 1  !message tag (integer)

        call MPI_Send(array(1), siz, MPI_INTEGER, s_dest, s_tag, intercomm, ierr)
        call MPI_Recv(array(1), siz, MPI_INTEGER, s_dest, s_tag, intercomm, MPI_STATUS_IGNORE, ierr)

        !do i=1,10
        !    print *, "S.Array(",i,"): ", array(i)
        !enddo
    endif

    call MPI_Finalize(ierr)
end program master
Slave.f90
program name
    use mpi
    implicit none

    ! type declaration statements
    integer :: ierr, parent, my_id, n_procs, i, siz, array(10000000), ctag, csource, intercomm, siffra
    logical :: flag

    siz = 10000

    ! executable statements
    call MPI_Init(ierr)
    call MPI_Initialized(flag, ierr)
    call MPI_Comm_get_parent(parent, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_id, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, n_procs, ierr)

    csource = 0 !rank of source
    ctag = 1    !message tag

    call MPI_Recv(array(1), siz, MPI_INTEGER, csource, ctag, parent, MPI_STATUS_IGNORE, ierr)

    !do i=1,10
    !    print *, "C.Array(",i,"): ", array(i)
    !enddo

    do i=1,size(array)
        array(i) = array(i)**2
    enddo

    !do i=1,10
    !    print *, "C.Array(",i,"): ", array(i)
    !enddo

    call MPI_Send(array(1), siz, MPI_INTEGER, csource, ctag, parent, ierr)

    call MPI_Finalize(ierr)
end program name
Now, the interesting part is that, using the time program, I have measured that the "file" version of the program executes in 19.8 ms, while the OpenMPI version takes 60 ms. Why? Is there really so much overhead in OpenMPI that it is faster to read/write to a file if you're working with less than 400 KiB?
I tried increasing the array to 10^5 integers: the file version executes in 114 ms, OpenMPI in 53 ms. When increasing to 10^6 integers, file: 1103 ms, OpenMPI: 77 ms.
Is the overhead really that much?
Fundamentally, it doesn't make sense to use distributed processing for problem sizes that fit in cache (except in some trivially parallel cases). The typical usage scenario is for data transfers much larger than the LLC. Even your biggest case (10^6 4-byte integers, roughly 4 MB) fits in modern caches.
Firstly, for the method of writing to disk, you have to be aware of the influence of the page cache in your operating system. If your master and slave processes are on the same machine, the operating system just sees 'do a write' and then 'do a read'. If, in the interim, nothing pollutes the page cache, then it will fetch the data from RAM as opposed to the disk. A better experiment would be to flush the page cache between the write and the read (this is possible, at least on Linux, via a shell command). In effect, you are performing shared-memory processing if you're grabbing the data from the page cache.
Also, you are using time on the command line, so you're incorporating the time it takes for MPI to initialize and establish its communication interfaces with a few function calls. This is not a good benchmark, because the interface used by the disk I/O method has already been initialized by the operating system. Also, for such a small problem size, the initialization of MPI is nontrivial compared to the runtime of the body of the program. The proper way to do this is to do the timing in the code itself.
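For example, here is a minimal sketch against the master program above (only the two timing variables and the MPI_Wtime calls are new; everything else is your existing code), timing just the send/receive pair so that MPI_Init and MPI_Comm_spawn are excluded:
    double precision :: t_start, t_end
    ! time only the round trip of the array, not MPI start-up
    t_start = MPI_Wtime()
    call MPI_Send(array(1), siz, MPI_INTEGER, s_dest, s_tag, intercomm, ierr)
    call MPI_Recv(array(1), siz, MPI_INTEGER, s_dest, s_tag, intercomm, MPI_STATUS_IGNORE, ierr)
    t_end = MPI_Wtime()
    print *, "S.Round trip time (s): ", t_end - t_start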
For both methods, you should expect linear scaling biased by the overhead of the method. In fact, you should see a few regimes as the data size surpasses the LLC and then the page cache. The best way to check this is to repeat your runs with ARRAY_SIZE = 2^n for n = 12, 13, ..., 24 and look at the resulting curve.

Parallel Fortran Program Crash

I have some code that has been optimized and parallelized with OpenMP.
The code works fine with small datasets (tested with 4,000 datapoints) but fails with larger datasets (tested with 12,000 datapoints). I compiled with gfortran (gcc 4.9.3) on Windows 7 under Cygwin.
The program reads data into arrays and calculates some statistics using parameters from an input parameter file:
real,allocatable :: x(:),y(:),z(:),vr(:,:),azm(:),atol(:),
+ bandwh(:),dip(:),dtol(:),bandwd(:)
real*8,allocatable :: sills(:),dis(:),gam(:),hm(:),
+ tm(:),hv(:),tv(:),np(:)
integer,allocatable :: ivtail(:),ivhead(:),ivtype(:)
character*12,allocatable :: names(:)
Here are the user-defined parameters used to run the program:
real EPSLON,VERSION
real xlag,xltol,tmin,tmax
integer nd,nlag,ndir,nvarg,isill,test
character outfl*512
This section is related to OpenMP. It contains the definitions of the reduction variables used by the OpenMP code.
parameter(PI=3.14159265)
real uvxazm(100),uvyazm(100),uvzdec(100),
+ uvhdec(100),csatol(100),csdtol(100)
logical omni
integer threadId,numThreads
real*8,allocatable :: reducedVariables(:,:,:)
integer extractValue
integer i,j,id,ii,il,it,iv,jj
real dx,dy,dz,dxs,dys,dzs,hs,h
integer lagbeg,lagend,ilag
real band,dcazm,dcdec,dxy,gamma,vrh,vrhpr,vrt,vrtpr
real xi,yi,zi
integer liminf,limsup,iinf,isup,k
real xlaginv
The main parallel loop is called here:
c$omp parallel default(firstprivate)
c$omp& shared(x,y,z,reducedVariables,vr)
#ifdef _OPENMP
threadId = int(OMP_get_thread_num())+1
print *,'Thread ',threadId
#else
threadId = 1
#endif
This section recombines the variables:
#ifdef _OPENMP
c$omp barrier
reducedVariables(1,:,threadId)=dis(:)
reducedVariables(2,:,threadId)=gam(:)
reducedVariables(3,:,threadId)=np(:)
reducedVariables(4,:,threadId)=hm(:)
reducedVariables(5,:,threadId)=tm(:)
reducedVariables(6,:,threadId)=hv(:)
reducedVariables(7,:,threadId)=tv(:)
#endif
c$omp end parallel
#ifdef _OPENMP
      dis(:)=0.0
      gam(:)=0.0
      np(:)=0.0
      hm(:)=0.0
      tm(:)=0.0
      hv(:)=0.0
      tv(:)=0.0
      do ii=1,numThreads
        do jj=1,mxdlv
          dis(jj) = dis(jj) + reducedVariables(1,jj,ii)
          gam(jj) = gam(jj) + reducedVariables(2,jj,ii)
          np(jj)  = np(jj)  + reducedVariables(3,jj,ii)
          hm(jj)  = hm(jj)  + reducedVariables(4,jj,ii)
          tm(jj)  = tm(jj)  + reducedVariables(5,jj,ii)
          hv(jj)  = hv(jj)  + reducedVariables(6,jj,ii)
          tv(jj)  = tv(jj)  + reducedVariables(7,jj,ii)
        end do
      end do
#endif
I've debugged with gdb and get the following error:
At line 894 of file gamv.fpp
Fortran runtime error: Index '13' of dimension 1 of array 'np' above upper bound of 12
[Thread 9336.0x1494 exited with code 2]
[Thread 9336.0x22f4 exited with code 2]
[Inferior 1 (process 9336) exited with code 02]
The offending section of code:
      if(it.eq.1.or.it.eq.5.or.it.ge.9) then
        do il=lagbeg,lagend
          ii = (id-1)*nvarg*(nlag+2)+(iv-1)*(nlag+2)+il
          np(ii)  = np(ii) + 1.
          dis(ii) = dis(ii) + dble(h)
          tm(ii)  = tm(ii) + dble(vrt)
          hm(ii)  = hm(ii) + dble(vrh)
          gam(ii) = gam(ii) + dble((vrh-vrt)*(vrh-vrt))
        end do
I don't see anywhere that array np was defined as having an upper bound of 12.
Is this an issue with using dynamic arrays with OpenMP?

Waiting for a task to be completed on remote processor in Julia

In a parallel application mimicking distributed inference, I would like to have an "initialization step" where all the "slaves" receive some initial information from the "master" then start their task.
At the moment I have a working implementation based on the sendTo function (the code was found here on stack overflow) but I don't think it guarantees that the worker won't start its task before it has received the initial objects.
Here's a rough MWE
function sendTo(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
a = 5
addprocs(4)
[sendTo(worker,a=a+randn()) for worker in workers()]
@everywhere begin
    println(a)
end
The above "works" but how can I be sure that the commands in the #everywhere block does not get executed before the worker has received the definition of a?
Rmk: for the context I'm working in, I would like to keep two distinct blocks: one that spreads the data and one that does stuff on it.
Other rmk: apologies if this is trivial; I'm quite new to dealing with parallelism (and quite new to Julia, too).
You can just fetch the result for every process. See the example in the docs:
function sendTo(p::Int; args...)
    r = []
    for (nm, val) in args
        s = @spawnat(p, eval(Main, Expr(:(=), nm, val)))
        push!(r, s)   # keep each future so the caller can wait on it
    end
    return r
end
#...
# fetching every future blocks until the corresponding assignment has run on the worker
[map(fetch, r) for r in [sendTo(worker, a = a + randn()) for worker in workers()]]

Fortran issue with undefined statements in subroutine

I have a bunch of errors coming up saying that my variables are not variables and that my statements are undefined. This subroutine is needed to compile my larger program, so I just copy-pasted it at the end of my program. Compiling them together as separate files did not work either.
How would I go about fixing those "undefined" issues? Is it because of the way I copy-pasted my subroutine at the end of my program?
(I compiled with g77 and with gfortran; the same thing happens with both.)
geomalb.f:1088.6:
TEMP(J)= TLINAL(J)
1
Error: Unclassifiable statement at (1)
geomalb.f:1089.6:
DEN(J)= DLINAL(J)
1
Error: Unclassifiable statement at (1)
geomalb.f:1090.6:
PRESS(J)=PLINAL(J)
1
Error: Unclassifiable statement at (1)
geomalb.f:1110.6:
XMU(J)=28.0134*XN2(J)+2.158*H2(J)+16.0426*CH4(J)+39.948*AR(J)
1
Error: Unclassifiable statement at (1)
geomalb.f:1132.6:
1001 IF (IPRINT .LT. 0) RETURN
1
Error: Bad continuation line at (1)
geomalb.f:1152.72:
ENDDO
1
Error: Statement label in ENDDO at (1) doesn't match DO label
geomalb.f:1130.72:
IF (DT .LT. 0.001) GO TO 1001
1
Error: Label 1001 referenced at (1) is never defined
geomalb.f:72.27:
CALL ATMSETUP(NLEVEL,Z,RHCH4,FH2,FARGON,TEMP,PRESS,DEN,XMU,
1
Warning: Rank mismatch in argument 'z' at (1) (scalar and rank-1)
Here is the subroutine part of the program:
SUBROUTINE ATMSETUP(NLEVEL,Z,RHCH4,FH2,FARGON,TEMP,PRESS,DEN,XMU,
& CH4,H2,XN2,AR,IPRINT)
PARAMETER (NMAX=201)
DIMENSION CH4(1),H2(1),XN2(1),AR(1)
DIMENSION TLINAL(NMAX),DLINAL(NMAX),PLINAL(NMAX)
CALL LINDAL(NLEVEL,Z,TLINAL,DLINAL,PLINAL)
DO J=1,NLEVEL
TEMP(J)= TLINAL(J)
DEN(J)= DLINAL(J)
PRESS(J)=PLINAL(J)
ENDDO
DO 1000 ITS =1,20
CH4(NLEVEL)=PCH4(TEMP(NLEVEL))*RHCH4/PRESS(NLEVEL)
DO 134 J=NLEVEL-1,1,-1
CH4SAT=PCH4(TEMP(J))/PRESS(J)
CH4(J)=AMIN1(CH4SAT,CH4(NLEVEL),CH4(J+1))
134 CONTINUE
DO 20 J=1,NLEVEL
H2(J)=FH2
IF (FARGON .LT. 0.) THEN
AR(J)=(-FARGON-28.0134+25.8554*H2(J)+11.9708*CH4(J))/11.9346
ELSE
IF (FARGON .EQ. 0.) THEN
AR(J)=0.0
ELSE
AR(J)=FARGON
ENDIF
ENDIF
XN2(J)=1.0 - H2(J) - CH4(J) -AR(J)
XMU(J)=28.0134*XN2(J)+2.158*H2(J)+16.0426*CH4(J)+39.948*AR(J)
20 CONTINUE
SUMT=PLINAL(1)*6.02E23/10.
SUMB=SUMT
TLAST=TEMP(NLEVEL)
DO J=2,NLEVEL
DENF=294.1/(XN2(J)*294.1 + CH4(J)*410. + H2(J)*136. + AR(J)*277.8)
DEN(J) = DLINAL(J)*DENF
ADEN=(DEN(J)-DEN(J-1))/ALOG(DEN(J)/DEN(J-1))
SUMT=SUMT+(EFFG(Z(J))*ADEN)*( Z(J-1)-Z(J))*XMU(J)
ADEN=(DLINAL(J)-DLINAL(J-1))/ALOG(DLINAL(J)/DLINAL(J-1))
SUMB=SUMB+(EFFG(Z(J))*ADEN)*( Z(J-1)-Z(J))*28.01340
PRESS(J)=PLINAL(J)*SUMT/SUMB
TEMP(J) =TLINAL(J)*(SUMT/SUMB)*(1./DENF)
ENDDO
30 CONTINUE
DT= ABS(TEMP(NLEVEL)-TLAST)
IF (DT .LT. 0.001) GO TO 1001
1000 CONTINUE
1001 IF (IPRINT .LT. 0) RETURN
WRITE (6,139)RHCH4,FH2,FARGON,DT
DO 135 J=1,NLEVEL-1
WRITE(6,140)J,Z(J),PRESS(J),DEN(J),TEMP(J),
& CH4(J)*PRESS(J)/PCH4(TEMP(J))
& ,CH4(J)*100.,XN2(J)*100.,H2(J)*100.,AR(J)*100.,XMU(J)
& ,(TEMP(J+1)-TEMP(J))/(Z(J+1)-Z(J))
135 CONTINUE
J=NLEVEL
WRITE(6,140)J,Z(J),PRESS(J),DEN(J),TEMP(J),
& CH4(J)*PRESS(J)/PCH4(TEMP(J))
& ,CH4(J)*100.,XN2(J)*100.,H2(J)*100.,AR(J)*100.,XMU(J)
139 FORMAT(///' BACKGROUNG ATMOSPHERE AT LEVELS'/
& ' SURFACE HUMIDITY OF CH4:',F5.3,' H2 MIXING RATIO:',F6.4,
& ' ARGON SETTING:',F8.4/' FINAL CONVERGENCE ON TEMP:',F10.5
& ' LINDAL ET AL SCALING'/
&' LVL ALTITUDE P(BARS) DEN(CM-3) TEMP RH-CH4'
& , ' %CH4 %N2 %H2 %AR MU DT/DZ' )
140 FORMAT(1X,I3,F8.3,1P2E10.3,0PF7.2,F5.2,2F6.2,2F5.2,4F6.2)
RETURN
ENDDO
END
You will need a
DIMENSION DEN(1), PRESS(1), TEMP(1)
statement in the subroutine. Otherwise the subroutine does not "know" that these variables are to be treated as arrays.
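For reference, a sketch of how the top of the subroutine could look with that fix in place. The extra XMU and Z entries are my assumption, based on the XMU error and the rank-mismatch warning for Z in the listing above; the assumed-size (1) declarations simply follow the convention of the existing DIMENSION line:
      SUBROUTINE ATMSETUP(NLEVEL,Z,RHCH4,FH2,FARGON,TEMP,PRESS,DEN,XMU,
     & CH4,H2,XN2,AR,IPRINT)
      PARAMETER (NMAX=201)
      DIMENSION CH4(1),H2(1),XN2(1),AR(1)
C     added so the compiler treats these dummy arguments as arrays
      DIMENSION TEMP(1),PRESS(1),DEN(1),XMU(1),Z(1)
      DIMENSION TLINAL(NMAX),DLINAL(NMAX),PLINAL(NMAX)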

Application exits prematurely with OpenMP with the error: Fatal User Error 1002: Not all work-sharing constructs executed by all threads

I added OpenMP code to some serial code in a simulator application. When I run a program that uses this application, the program exits unexpectedly with the output "The thread 'Win32 Thread' (0x1828) has exited with code 1 (0x1)". This happens in the parallel region where I added the OpenMP code.
Here's a code sample:
#pragma omp parallel for private (curr_proc_info, current_writer, method_h) shared (exceptionOccured) schedule(dynamic, 1)
for (i = 0 ; i < method_process_num ; i++)
{
    current_writer = 0;

    // we need to add protection before we can dequeue a method from the methods queue,
    #pragma omp critical(dequeueMethod)
    method_h = pop_runnable_method(curr_proc_info, current_writer);

    if(method_h !=0 && exceptionOccured == false){
        try {
            method_h->semantics();
        }
        catch( const sc_report& ex ) {
            ::std::cout << "\n" << ex.what() << ::std::endl;
            m_error = true;
            exceptionOccured = true; // we cannot jump outside the loop, so instead of return we use a flag and return somewhere else
        }
    }
}
The scheduling was static before I made it dynamic. After I changed it to dynamic with a chunk size of 1, the application proceeded a little further before it exited. Can this be an indication of what is happening inside the parallel region?
thanks
As I read it (and I'm more of a Fortran programmer than a C/C++ one), your private variable curr_proc_info is not given a value before it first appears in the call to pop_runnable_method, but private variables are undefined on entry to the parallel region.
I also think your sharing of exceptionOccured is a little fishy, since it suggests that an exception on any thread should be noticed by every thread, not just the thread in which it is raised. Of course, that may be your intent.
Cheers
Mark
