OpenMP slower than serial Fortran90 code

In my Fortran90 code I have an outer loop containing several nested loops. To speed up my code, I tried to apply OpenMP to the outer loop, but I have a very strange problem: when I use more than 1 thread, the program runs slower than with OpenMP on 1 thread, which in turn is slower than the original serial program (in terms of wall clock time; I tried with 1, 2, 3, and 4 threads). In all cases, however, I get the right result.
I conducted several tests on my code and eventually noticed that the problem lies in one subroutine: if I comment out the call to that routine, my parallel program works as expected, i.e. the more threads, the lower the wall clock time.
Now, that routine takes four vectors as input, ks1, ks2, ket1, and ket2, and computes the union of ks1 and ks2, obtaining kstot. It then creates two new vectors, ket1tot and ket2tot, where ket1tot(i) equals ket1(j) if kstot(i) equals ks1(j), and ket1tot(i)=0 otherwise; the same holds for ket2tot and ks2. For example, ks1=(1,2) and ks2=(2,3) give kstot=(1,2,3), ket1tot=(ket1(1),ket1(2),0), and ket2tot=(0,ket2(1),ket2(2)).
Then, combining the values stored in kstot, ket1tot, and ket2tot, I compute which lines of a vector (matFC) contain the values I need, and by multiplying those values I obtain the final result (FCtot).
So I turned that routine into a simple stand-alone program, adding some initial lines to mimic the real program. Specifically:
1) I added a loop (on ii) that mimics the outer loop of the real program I am trying to parallelize;
2) I made each thread work on a different file (so I should not have a false sharing problem);
3) I added another loop (on k) that mimics calling the routine several times.
Here is the code (the part that constitutes the original subroutine that gives me problems is indicated in the text):
program evaluatefc
#ifdef _OPENMP
   use omp_lib
#endif
   implicit none
   integer::i,ii,j,jj,jjj,k,sizeks1,sizeks2,sizec,sizekstot,NR,NR1,maxnq
   integer::line,ierr,fileunit,mythread,nfreqtot
   real*8::FCtot,time1,time2
   integer,allocatable,dimension(:)::ks1,ket1,ks2,ket2
   integer,dimension(:),allocatable::c,kstot,ket1tot,ket2tot
   real*8,allocatable,dimension(:)::matFC
   character*15,allocatable,dimension(:)::matfileFC
   character::fileFC*15
   real*4::tstarting,tending
   ! This program was originally a subroutine
   ! that takes as input 4 vectors: ks1, ks2, ket1, ket2
   !---------------------------------------------------------------------------
   ! I initialize some values that in the original subroutine were computed
   ! by the main program
   allocate(matfileFC(3),stat=ierr)
   matfileFC(1)='filea.dat'
   matfileFC(2)='fileb.dat'
   matfileFC(3)='filec.dat'
   sizeks1=2
   sizeks2=2
   maxnq=11
   allocate(ks1(sizeks1),stat=ierr)
   allocate(ket1(sizeks1),stat=ierr)
   allocate(ks2(sizeks2),stat=ierr)
   allocate(ket2(sizeks2),stat=ierr)
   nfreqtot=42
   NR1=nfreqtot*(maxnq**2)+nfreqtot
   NR=nfreqtot*(maxnq**2)
   allocate(matFC(NR),stat=ierr)
   ! Call two intrinsics to evaluate CPU and wall clock time
   call cpu_time(time1)
   tstarting=secnds(0.0)
!$OMP PARALLEL DO &
!$OMP DEFAULT(NONE) &
!$OMP FIRSTPRIVATE(sizeks1,sizeks2,maxnq,matfileFC,NR,NR1) &
!$OMP PRIVATE(i,ii,j,jj,k,ierr,mythread,fileunit,c,sizec,line,sizekstot) &
!$OMP PRIVATE(jjj,ket1,ks1,ket1tot,kstot,ket2,ks2,ket2tot,FCtot,matFC,fileFC)
   do ii=1,3
#ifdef _OPENMP
      mythread=OMP_GET_THREAD_NUM()
#else
      mythread=10
#endif
      fileFC=matfileFC(ii)
      ! Read some lines of a given file.
      fileunit=50+mythread
      open(unit=fileunit,file=fileFC,status='old',form='formatted')
      read(fileunit,*) ! Skip the first line
      jjj=0
      do jj=1,NR1-1
         if(mod(jj,(maxnq**2+1)).eq.0) then
            read(fileunit,*)
         else
            jjj=jjj+1
            read(fileunit,*)j,k,i,matFC(jjj)
            ! I actually need only the fourth value of the line to be stored
         endif
      enddo
      close(fileunit)
      do k=1,10000000
         ! Again I initialize the abovementioned values that in the actual
         ! subroutine are computed by the main program
         ks1(1)=mod(k,30)+1
         ks1(2)=mod(k,30)+2
         ks2(1)=mod(k,17)+1
         ks2(2)=mod(k,17)+3
         ket1(1)=mod(k,2)
         ket1(2)=mod(k,3)
         ket2(1)=mod(k,5)
         ket2(2)=mod(k,7)
         ! From here down is the body of the original subroutine
         sizec=sizeks1+sizeks2
         allocate(c(sizec),stat=ierr)
         do i=1,sizeks1
            c(i)=ks1(i)
         enddo
         do i=sizeks1+1,sizec
            c(i)=ks2(i-sizeks1)
         enddo
         sizekstot=sizec
         do i=1,sizeks1
            do j=1,sizeks2
               if(ks1(i).eq.ks2(j)) then
                  sizekstot=sizekstot-1
               endif
            enddo
         enddo
         allocate(kstot(sizekstot),stat=ierr)
         jjj=1
         i=1
         jj=0
         do i=1,sizec-1
            jjj=jjj+1
            do j=jjj,sizec
               if(c(i).eq.c(j)) then
                  exit
               elseif(c(i).ne.c(j).and.j.eq.sizec) then
                  jj=jj+1
                  kstot(jj)=c(i)
               endif
            enddo
         enddo
         kstot(sizekstot)=c(sizec)
         allocate(ket1tot(sizekstot),stat=ierr)
         do i=1,sizekstot
            ket1tot(i)=0
         enddo
         allocate(ket2tot(sizekstot),stat=ierr)
         do i=1,sizekstot
            ket2tot(i)=0
         enddo
         do i=1,sizekstot
            do j=1,sizeks1
               if(kstot(i).eq.ks1(j))then
                  ket1tot(i)=ket1(j)
               endif
            enddo
         enddo
         do i=1,sizekstot
            do j=1,sizeks2
               if(kstot(i).eq.ks2(j))then
                  ket2tot(i)=ket2(j)
               endif
            enddo
         enddo
         FCtot=1
         do i=1,sizekstot
            line=(kstot(i)-1)*(maxnq)**2+ket1tot(i)*(maxnq)+ket2tot(i)+1
            FCtot=matFC(line)*FCtot
         enddo
         deallocate(c,stat=ierr)
         deallocate(kstot,stat=ierr)
         deallocate(ket1tot,stat=ierr)
         deallocate(ket2tot,stat=ierr)
      enddo
   enddo
!$OMP END PARALLEL DO
   call cpu_time(time2)
   tending=secnds(tstarting)
   write(*,*)
   write(*,*)'CPU time is:'
   write(*,*)time2-time1
   write(*,*)
   write(*,*)'Wall clock time is:'
   write(*,*)tending
end program
Still, I get the same problem: the wall clock time using 4 threads is bigger than using 1 thread.
For example, I get (in seconds):
type        Wall time   CPU time
1 thread    20.37       20.37
4 threads   31.26       91.61
serial      19.64       19.64
I am aware that calls to the OpenMP library introduce an overhead, and in fact the 1-thread OpenMP program is slower than the serial one. But I cannot understand why the 4-thread OpenMP code is slower.
I am using the Intel Fortran compiler 2013 on Linux.
Any suggestions?
Thank you for any time you can dedicate to this problem.

Ok, I fixed my own problem.
Thank you all for your suggestions, in particular @Jorge Bellón and @High Performance Mark.
As their comments said, the problem was indeed the high number of allocations/deallocations. If I move the allocations out of the loops, or at least put them right after the first loop, I get the expected OpenMP behaviour, i.e. the more threads, the lower the wall clock time.
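As a minimal sketch of the fix (my illustration, not necessarily the exact final code): since sizec never changes and sizekstot can never exceed sizec, the four scratch vectors can be allocated once per ii iteration at their maximum possible size and simply reused, with sizekstot kept as the logical size inside the k loop:
do ii=1,3
   ! ... file reading as before ...
   sizec=sizeks1+sizeks2
   allocate(c(sizec),stat=ierr)
   allocate(kstot(sizec),stat=ierr)     ! maximum size; sizekstot stays the logical size
   allocate(ket1tot(sizec),stat=ierr)
   allocate(ket2tot(sizec),stat=ierr)
   do k=1,10000000
      ! ... same body as before, with all allocate/deallocate calls removed ...
   enddo
   deallocate(c,kstot,ket1tot,ket2tot,stat=ierr)
enddo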
For the example above the wall clock time using 4 threads is now about 7 seconds.
Thank you all for your help.

Related

Runtime error claiming a negative or zero argument to the logarithm function in Box-Muller algorithm

The following code is part of a Fortran 90 program that I wrote in the Plato IDE. It is just the Box-Muller algorithm to generate Gaussian random numbers.
Program brownstep2_single_stage
   Integer:: i,j,m,n,countsucc !,a
   Real:: dt,D,epsa,r1,r2,w,fptsum,fptdef1,fptdef2
   Real,Dimension(0:100002) :: fx !gt
   !T=1000.0 and n*dt=T
   dt=0.001
   m=100000
   n=100000
   D=1.0
   !a=7
   w=2
   epsa=0.00001
   fx(0)=6.0
   !gt(0)=0
   fptsum=0
   countsucc=0
   Call random_seed()
   Do i=0,m
      !Call random_seed(a)
      Do j=0,n
         Do while (w>=1.0.and.w<0.0)
            Call random_number(r1)
            Call random_number(r2)
            !r=rand()
            r1=2.0*r1-1
            r2=2.0*r2-1
            w=r1*r1+r2*r2
         End do
         w=sqrt((-2.0*log(w))/w)
         r1=r1*w
         r2=r2*w
         If(mod(j,2)==0) then
            w=r1
         Else if(mod(j,2)==1) then
            w=r2
         End if
         fx(j+1)=fx(j)+w*sqrt(2.0*D*dt)
         If(fx(j+1)<epsa) then
            fptsum=fptsum+(j+1)*dt
            countsucc=countsucc+1
            exit
         End if
         print *,i,j
      End do
   End do
   fptdef1=fptsum/m
   fptdef2=fptsum/countsucc
   print *,'The value of fpt by 1st definition is:',fptdef1
   print *,'The value of fpt by 2nd definition is:',fptdef2
   print *,'The number of successful events is:',countsucc
   print *,'The total number of events is:',m
End program brownstep2_single_stage
During compilation, it shows no error, but when run, it shows the following runtime error, claiming a negative or zero argument to the logarithm function.
Runtime error from program:e:\my files\sample2brownstep_gauss.exe
Run-time Error
Error: Negative or zero argument to logarithm routine
BROWNSTEP2_SINGLE_STAGE - in file sample2brownstep_gauss.f90 at line 31 [+02cc]
What should I do to avoid this?
The code above still has a problem: the condition of the do while loop, (w>=1.0 .and. w<0.0), can never be true, so the loop body that generates r1, r2, and w is never executed and a stale value of w is used. Use an 'infinite' do loop with an exit statement instead. This ensures that at least one attempt at generating w is always made. This would be better:
do
   Call random_number(r1)
   Call random_number(r2)
   r1=2.0*r1-1
   r2=2.0*r2-1
   w=r1*r1+r2*r2
   if (w .lt. 1.0) exit
End do
w=sqrt((-2.0*log(w))/w)
r1=r1*w
r2=r2*w

The most efficient way to read an unformatted file

I am post-processing 100,000 data files in Fortran. The data were generated on an HPC system using MPI I/O. So far I have only come up with the following way to read the raw data, which is not efficient. Is it possible to read a whole slice ut_yz(:,J,K) at a time instead of reading element by element? Thanks.
The old code is as follows, and its efficiency is not high.
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED',&
     ACCESS='DIRECT', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
   DO J=1,ny
      DO I=1,nxt
         READ(10,REC=COUNT) ut_yz(I,J,K)
         COUNT = COUNT + 1
      ENDDO
   ENDDO
ENDDO
CLOSE(10)
The desired code is:
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
   DO J=1,ny
      READ(10,REC=COUNT) TEMP(:)
      COUNT = COUNT + 153
      ut_yz(:,J,K)=TEMP(:)
   ENDDO
ENDDO
CLOSE(10)
However, it always fails. Can anyone comment on this? Thanks.
A direct-access read transfers a single record, if I am not mistaken. Thus, in your new code version you need to increase the record length accordingly (and keep ACCESS='DIRECT', which rec= requires):
inquire(iolength=rl) ut_yz(:,1,1)
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', access='DIRECT', &
     recl=rl, status='OLD', action='READ')
count = 1
do k=1,nz
   do j=1,ny
      read(10, rec=count) ut_yz(:,j,k)
      count = count + 1
   end do
end do
close(10)
Of course, in this example you could also read the complete array at once, which should be the fastest option:
inquire(iolength=rl) ut_yz
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', access='DIRECT', &
     recl=rl, status='OLD', action='READ')
read(10, rec=1) ut_yz
close(10)
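One further option, not part of the original answer: since the files were written with MPI I/O, they contain raw bytes without Fortran record markers, so Fortran 2003 stream access should also be able to read the whole array in a single statement. A sketch under that assumption:
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', access='STREAM', &
     status='OLD', action='READ')
read(10) ut_yz   ! one read of the raw bytes, no record structure needed
close(10)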

How to use reduction on an array in Fortran?

I'm just starting to learn OpenMP and I have the following...
do 100 k=1,lines
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:co(k),si(k))
   do 110 i=1,ION_COUNT
      dotprod=(rx(k)*x(i)+ry(k)*y(i)...)
      co(k)=co(k)+COS(dotprod)
      si(k)=si(k)+SIN(dotprod)
110 continue
!$OMP END PARALLEL DO
I've figured out (I think) that I need a reduction on co(k) and si(k) to add them up correctly, but as far as I can tell you can't have an array element like that within the reduction clause. How can I go about doing this?
If I understand correctly, using temporary variables would work:
do 100 k=1,lines
   co_tmp = 0.0
   si_tmp = 0.0
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:co_tmp,si_tmp)
   do 110 i=1,ION_COUNT
      dotprod=(rx(k)*x(i)+ry(k)*y(i)...)
      co_tmp=co_tmp+COS(dotprod)
      si_tmp=si_tmp+SIN(dotprod)
110 continue
!$OMP END PARALLEL DO
   co(k) = co_tmp
   si(k) = si_tmp
100 continue
You could also use an associate block (Fortran 2003) or pointers (Fortran 90) to avoid the temporary variables. In any case I would use end do; since you are using free-form source, you have at least Fortran 90.
do k=1,lines
   c => co(k)
   s => si(k)   ! use associate in Fortran 2003 here
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:c,s)
   do i=1,ION_COUNT
      dotprod = (rx(k)*x(i) + ry(k)*y(i) ...)
      c = c + COS(dotprod)
      s = s + SIN(dotprod)
   end do
!$OMP END PARALLEL DO
end do
Also think about moving the creation of threads out of the loop.
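A hedged sketch of that last suggestion (my illustration, not part of the original answer): create one parallel region around the k loop and do only the worksharing inside it. It assumes co_tmp and si_tmp are shared real scalars; the implicit barriers at END DO and END SINGLE keep the threads in step.
co_tmp = 0.0
si_tmp = 0.0
!$OMP PARALLEL PRIVATE(dotprod,k)
do k=1,lines                           ! every thread runs the k loop
   !$OMP DO REDUCTION(+:co_tmp,si_tmp) ! iterations of i are shared out
   do i=1,ION_COUNT
      dotprod=(rx(k)*x(i)+ry(k)*y(i)...)
      co_tmp=co_tmp+COS(dotprod)
      si_tmp=si_tmp+SIN(dotprod)
   end do
   !$OMP END DO                        ! implicit barrier: reduction complete
   !$OMP SINGLE
   co(k)=co_tmp                        ! one thread stores the result
   si(k)=si_tmp
   co_tmp=0.0                          ! and resets the partial sums
   si_tmp=0.0
   !$OMP END SINGLE                    ! implicit barrier before the next k
end do
!$OMP END PARALLEL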

Reading columns from a data file in Fortran

I wrote the following block to read from an external data file:
open(unit=338,file='bounnodes.dat',form='formatted')
DO I=1,NQBOUN
   DO J=1,NUMBOUNNODES(I)
      read(338,2001) NODEBOUN(i,j)
      write(6,*) 'BOUNDARY NODES', NODEBOUN(i,j)
   ENDDO
ENDDO
2001 FORMAT(32I5)
As far as I understood, this should read a 2 x 32 array from bounnodes.dat.
However, I get an end-of-file error during the read, and only the first column is printed.
I tried to read a 32 x 2 array using the same code: it reads the 32 elements of the first column but outputs 0s for the next column.
Can you please explain what is happening? Is my formatting wrong?
Every read statement in Fortran advances to the next record by default; this means a new line in normal text files. Try this:
DO I=1,NQBOUN
   DO J=1,NUMBOUNNODES(I)
      read(338,2001,advance='no') NODEBOUN(i,j)
      write(*,*) 'BOUNDARY NODES', NODEBOUN(i,j)
   ENDDO
   read(338,*)
ENDDO
where NQBOUN is the number of rows and NUMBOUNNODES(I) is the number of columns in a given row. (I always have trouble remembering which one is 32x2 and which is 2x32.)
You can make it even shorter using an implied do:
DO I=1,NQBOUN
   read(338,2001) ( NODEBOUN(i,j), j=1,NUMBOUNNODES(I) )
   write(*,*) ( 'BOUNDARY NODES', NODEBOUN(i,j), j=1,NUMBOUNNODES(I) )
ENDDO
or even
DO I=1,NQBOUN
   read(338,2001) NODEBOUN(i,:)
   write(*,*) 'BOUNDARY NODES', NODEBOUN(i,1:NUMBOUNNODES(I))
ENDDO
All of these use Fortran 90 features.

OpenMP runtime fluctuations

I am currently testing OpenMP in a big loop in my Fortran code. The code is part of a simulation module which is called from a VB.NET user interface; this interface also does the timing measurements. So I start a simulation, and at the end the software shows me how long it took (I mention this only to show that I don't use wtime or cpu_time for the timing measurements).
Now, when I repeatedly start a simulation with my parallelized loop, I always get different simulation times, ranging in one example from 1 min 30 sec to almost 3 min! The results are always correct.
I tried different schedules for the loop (static, guided, dynamic), I tried to calculate the chunks assigned to each thread manually (do i=1,N -> do i=i_start,i_end), and I tried to change the number of threads taking part in the calculation of the loop, all with no change in the situation. When I remove the OpenMP directives from the code this does not occur, so they must be the reason for this behavior.
My machine is a quad-core Intel Xeon(R) CPU X3470 @ 2.93 GHz running Windows 7. I tried to run the program with multithreading both enabled and disabled (in the BIOS); this also didn't change anything.
Do you have any ideas what could go wrong? A web search showed that similar behavior has occurred in the test environments of other programmers as well, but a solution/reason was never mentioned. Thanks in advance for your thoughts.
Martin
EDIT
Here's the code:
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP PRIVATE(n,k,nk,i,j,l,List,Vx,Vz,cS,AE1,RootCh,Ec1,Ec2,Ec3,FcE,GcE,VxE,VzE,SMuL1,SMuL2) &
!$OMP PRIVATE(W1,W2,W3,Wx,Wz,S,i1,j1,AcE,j2,ic,iB,iBound,i2) &
!$OMP FIRSTPRIVATE(NumSEL) REDUCTION(-:Cum0,Cum1) REDUCTION(+:CumR)
DO n=1, NumEl
   ! Loop on subelements
   DO k=1, Elements(n)%nCorners-2
      nk = (k-1) * 3
      NumSEL=NumSEL+1
      !
      i=Elements(n)%KX(1)
      j=Elements(n)%KX(k+1)
      l=Elements(n)%KX(k+2)
      List(1)=i
      List(2)=j
      List(3)=l
      !
      IF(Level == NLevel) THEN
         Vx(1)=Nodes(i)%VxO
         Vx(2)=Nodes(j)%VxO
         Vx(3)=Nodes(l)%VxO
         Vz(1)=Nodes(i)%VzO
         Vz(2)=Nodes(j)%VzO
         Vz(3)=Nodes(l)%VzO
      ELSE
         Vx(1)=Nodes(i)%VxN
         Vx(2)=Nodes(j)%VxN
         Vx(3)=Nodes(l)%VxN
         Vz(1)=Nodes(i)%VzN
         Vz(2)=Nodes(j)%VzN
         Vz(3)=Nodes(l)%VzN
      END IF
      !
      cS=cBound(sol,5)
      cS=(MIN(cS,Nodes(i)%Conc(sol))+MIN(cS,Nodes(j)%Conc(sol))+MIN(cS,Nodes(l)%Conc(sol)))/3.0D0
      AE1=Elements(n)%xMul(k)*Elements(n)%Area(k)*dt*Eps
      RootCh=AE1*cS*(Nodes(i)%Sink+Nodes(j)%Sink+Nodes(l)%Sink)/3.0D0
      Cum0=Cum0-AE1*(Nodes(i)%Gc1+Nodes(j)%Gc1+Nodes(l)%Gc1)/3.0D0
      Cum1=Cum1-AE1*(Nodes(i)%Fc1+Nodes(j)%Fc1+Nodes(l)%Fc1)/3.0D0
      CumR=CumR+RootCh
      Ec1=(Nodes(i)%Dispxx+Nodes(j)%Dispxx+Nodes(l)%Dispxx)/3.0D0
      Ec2=(Nodes(i)%Dispxz+Nodes(j)%Dispxz+Nodes(l)%Dispxz)/3.0D0
      Ec3=(Nodes(i)%Dispzz+Nodes(j)%Dispzz+Nodes(l)%Dispzz)/3.0D0
      !
      IF (Level == NLevel) AcE=(Nodes(i)%Ac+Nodes(j)%Ac+Nodes(l)%Ac)/3.0D0
      !
      FcE=(Nodes(i)%Fc+Nodes(j)%Fc+Nodes(l)%Fc)/3.0D0
      GcE=(Nodes(i)%Gc+Nodes(j)%Gc+Nodes(l)%Gc)/3.0D0
      VxE=(Vx(1)+Vx(2)+Vx(3))/3.0D0
      VzE=(Vz(1)+Vz(2)+Vz(3))/3.0D0
      SMul1=-Elements(n)%AMul(k)
      SMul2=Elements(n)%Area(k)/20.0D0*Elements(n)%XMul(k)
      !
      If (lUpw) THEN
         !W1=WeTab(1,NumSEl)
         !W2=WeTab(2,NumSEl)
         !W3=WeTab(3,NumSEl)
         W1=WeTab(1,(n-1)*(Elements(n)%nCorners-2)+k)
         W2=WeTab(2,(n-1)*(Elements(n)%nCorners-2)+k)
         W3=WeTab(3,(n-1)*(Elements(n)%nCorners-2)+k)
         Wx(1)=2.0D0*Vx(1)*(W2-W3)+Vx(2)*(W2-2.0D0*W3)+Vx(3)*(2.0D0*W2-W3)
         Wx(2)=Vx(1)*(2.0D0*W3-W1)+2.0D0*Vx(2)*(W3-W1)+Vx(3)*(W3-2.0D0*W1)
         Wx(3)=Vx(1)*(W1-2.0D0*W2)+Vx(2)*(2.0D0*W1-W2)+2.0D0*Vx(3)*(W1-W2)
         Wz(1)=2.0D0*Vz(1)*(W2-W3)+Vz(2)*(W2-2.0D0*W3)+Vz(3)*(2.0D0*W2-W3)
         Wz(2)=Vz(1)*(2.0D0*W3-W1)+2.0D0*Vz(2)*(W3-W1)+Vz(3)*(W3-2.0D0*W1)
         Wz(3)=Vz(1)*(W1-2.0D0*W2)+Vz(2)*(2.0D0*W1-W2)+2.0D0*Vz(3)*(W1-W2)
      END IF
      !
      DO j1=1, 3
         i1=List(j1)
         !$OMP ATOMIC
         Nodes(i1)%F=Nodes(i1)%F+Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
         IF (Level==NLevel) then
            !$OMP ATOMIC
            Nodes(i1)%DS=Nodes(i1)%DS+Elements(n)%GMul(k)*(Ace+Nodes(i1)%Ac/3.0D0)
         end if
         iBound=0
         IF (Nodes(i)%Kode/=0) THEN
            BP_Loop : DO id=1, NumBP
               IF((KXB(id)==i1) .AND. (KodCB(id) > 0)) THEN
                  iBound=1
                  EXIT BP_Loop
               END IF
            END DO BP_Loop
         END IF
         !
         DO j2=1, 3
            i2=List(j2)
            S(j1,j2)=SMul1*(Ec1*Elements(n)%dz(nk+j1)*Elements(n)%dz(nk+j2)+ &
                     Ec3*Elements(n)%dx(nk+j1)*Elements(n)%dx(nk+j2)+ &
                     Ec2*(Elements(n)%dz(nk+j1)*Elements(n)%dx(nk+j2)+ &
                     Elements(n)%dx(nk+j1)*Elements(n)%dz(nk+j2)))
            S(j1,j2)=S(j1,j2)-(Elements(n)%dz(nk+j2)/8.0D0*(VxE+Vx(j1)/3.0D0)+ &
                     Elements(n)%dx(nk+j2)/8.0D0*(VzE+Vz(j1)/3.0D0))*Elements(n)%xMul(k)
            IF(lUpw) S(j1,j2)=S(j1,j2)-Elements(n)%xMul(k)* &
                     (Elements(n)%dz(nk+j2)/40.0D0*Wx(j1)+ &
                     Elements(n)%dx(nk+j2)/40.0D0*Wz(j1))
            ic=1
            IF (i1==i2) ic=2
            S(j1,j2)=S(j1,j2)+SMul2*ic*(FcE+(Nodes(i1)%Fc+Nodes(i2)%Fc)/3.0D0)
            IF (iBound==1) then
               if(j2.eq.1) then
                  !$OMP ATOMIC
                  Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)-Eps*Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
               else
                  !$OMP ATOMIC
                  Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)
               end if
            end if
            IF (Level/=NLevel) THEN
               !$OMP ATOMIC
               B(i1)=B(i1)-alf*S(j1,j2)*Nodes(i2)%Conc(sol)
            ELSE
               IF (lOrt) THEN
                  CALL rFIND(i1,i2,kk,NumNP,MBandD,IAD,IADN)
                  iB=kk
               ELSE
                  iB=MBand+i2-i1
               END IF
               !$OMP ATOMIC
               A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
            END IF
         END DO
      END DO
   END DO
END DO
!$OMP END PARALLEL DO
If you want to check the performance of the program, I would suggest doing the timings inside the program with the OpenMP timing functions; see an OpenMP reference sheet.
So you need to do something like:
USE omp_lib
DOUBLE PRECISION :: t1, t_final
t1 = omp_get_wtime()
! Big loop
t_final = omp_get_wtime() - t1
I sometimes find these to reflect the actual parallelization timings better. Do you use these?
As FFox says, it can simply be due to the ATOMIC statements, which delay in different ways on each run. Remember that the threads are created at run time, so the layout of the threads may not be the same for each run.
With such a loop I would try to see if you could gain speed by splitting it up. Of course this is not needed if the speedup is already around 2 for 2 threads. Just an idea.
