How to use reduction on an array in Fortran? - syntax

I'm just starting to learn OpenMP and I have the following...
do 100 k=1,lines
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:co(k),si(k))
do 110,i=1,ION_COUNT
dotprod=(rx(k)*x(i)+ry(k)*y(i)...)
co(k)=co(k)+COS(dotprod)
si(k)=si(k)+SIN(dotprod)
110 continue
!$OMP END PARALLEL DO
I've figured out (I think) that I need to do a reduction on co(k) and si(k) if I want to add them correctly, but as far as I can tell you can't put array elements like that in the reduction clause. How can I go about doing this?

If I understand correctly, using temporary variables would work:
do 100 k=1,lines
co_tmp = 0.0
si_tmp = 0.0
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:co_tmp,si_tmp)
do 110,i=1,ION_COUNT
dotprod=(rx(k)*x(i)+ry(k)*y(i)...)
co_tmp=co_tmp+COS(dotprod)
si_tmp=si_tmp+SIN(dotprod)
110 continue
!$OMP END PARALLEL DO
co(k) = co_tmp
si(k) = si_tmp
100 continue

You could also use an associate block (Fortran 2003) or pointers (Fortran 90) to avoid the temporary variables. In any case I would use end do rather than labelled continue statements: since you are using free-form source, you have at least Fortran 90.
do k=1,lines
c => co(k)
s => si(k) !use associate in Fortran 2003 here
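! (for these pointer assignments to compile, c and s must be declared with the pointer attribute and co, si with the target attribute)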
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:c,s)
do i=1,ION_COUNT
dotprod = (rx(k)*x(i) + ry(k)*y(i) ...)
c = c + COS(dotprod)
s = s + SIN(dotprod)
end do
!$OMP END PARALLEL DO
end do
Also think about moving the creation of threads out of the loop.
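One way to do that is to put the whole k loop inside a single parallel region and work-share only the inner loop. Here is a minimal sketch based on the temporary-variable version above (co_tmp and si_tmp are left shared so the inner DO can reduce into them; the SINGLE constructs and their implicit barriers keep the threads in step):
!$OMP PARALLEL PRIVATE(k, dotprod)
do k=1,lines
  !$OMP SINGLE
  co_tmp = 0.0
  si_tmp = 0.0
  !$OMP END SINGLE          ! barrier: temporaries are zeroed before the sum starts
  !$OMP DO REDUCTION(+:co_tmp,si_tmp)
  do i=1,ION_COUNT
    dotprod = rx(k)*x(i) + ry(k)*y(i)   ! ... remaining terms as in your code
    co_tmp = co_tmp + COS(dotprod)
    si_tmp = si_tmp + SIN(dotprod)
  end do
  !$OMP END DO              ! barrier: the reduction is complete here
  !$OMP SINGLE
  co(k) = co_tmp
  si(k) = si_tmp
  !$OMP END SINGLE          ! barrier: copy-out finishes before the next k resets the temporaries
end do
!$OMP END PARALLEL
This way the thread team is created once instead of lines times; whether it pays off depends on how large ION_COUNT is compared to the barrier overhead per k iteration.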

Related

Matlab ROS slow publishing + subscribing

In my experience Matlab performs publish/subscribe operations with ROS slowly for some reason. I work with components defined as object classes, as shown below in a test class I made; normally objects of comparable structure are used to control mobile robots.
To quantify performance, I timed the operations involved and got the following results:
1x publishing a message + 1x simple subscriber callback : 3.7ms
Simply counting in a callback (per count): 2.1318e-03 ms
Creating a new message with msg1 = rosmessage(obj.publisher) adds 3.6-4.3ms per iteration
Pinging myself indicated communication latency of 0.05 ms
The time required for a simple publish plus the start of a subscriber callback seems oddly long.
I want to have multiple system components as objects in my workspace such that they respond to ROS topic updates or on timer events. The pc used for testing is not a monster but should not be garbage either.
Do you also think the times shown are unnecessarily large? They barely allow publishing a single topic at 200 Hz without doing anything else. Normally I have multiple lower-frequency topics (e.g. 20 Hz), but the total time consumed becomes significant.
Do you know any practices to make the system operate quicker?
What do you think of the OOP style of making control system components in general?
classdef subpubspeedMonitor < handle
% Use: call in matlab console, after initializing ros:
%
% SPM1 = subpubspeedMonitor()
%
% This will create an object which starts a set repetitive task upon creation
% and finally destructs itself after posting results in console.
properties
node
subscriber
publisher
timestart
messagetotal
end
methods
function obj = subpubspeedMonitor()
obj.node = ros.Node('subspeedmonitor1');
obj.subscriber = ros.Subscriber(obj.node,'topic1','sensor_msgs/NavSatFix',{@obj.rosSubCallback});
obj.publisher = ros.Publisher(obj.node,'topic1','sensor_msgs/NavSatFix');
obj.timestart = tic;
obj.messagetotal = 0;
msg1 = rosmessage(obj.publisher);
% Choose to evaluate subscriber + publisher loop or just counting
if 1
send(obj.publisher,msg1);
else
countAndDisplay(obj)
end
end
%% Test method one: repetitive publishing and subscribing
function rosSubCallback(obj,~,msg_) % ~3.7 ms per loop for a simple publish+subscribe action
% Latency to self is 0.05ms on average, according to "pinging" in terminal
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); % this line adds 4.3000ms per loop
msg_.Longitude = 51; % this line adds 0.25000 ms per loop
send(obj.publisher,msg_)
else
% Display some results
timepassed = toc(obj.timestart);
time_per_pubsub = timepassed/obj.messagetotal
delete(obj);
end
end
%% Test method two: simply counting
function countAndDisplay(obj) % this costs 2.1318e-03 ms(!) per loop
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); %adds 3.6ms per loop
%i = 1% adds 5.7532e-03 ms per loop
%msg1 = rosmessage("std_msgs/Bool"); %adds 1.5ms per loop
countAndDisplay(obj);
else
% Display some results
timepassed = toc(obj.timestart);
time_per_count_FCN = timepassed/obj.messagetotal
delete(obj);
end
end
%% Destructor
function delete(obj)
delete(obj.subscriber)
delete(obj.publisher)
delete(obj.node)
end
end
end

Distributed Julia: parallel map (pmap) with a timeout / time limit for each map task to complete

My project involves computing in parallel a map using Julia's Distributed's pmap function.
Mapping a given element could take a few seconds, or it could take essentially forever. I want a timeout or time limit for an individual map task/computation to complete.
If a map task finishes in time, great, return the result of the computation. If the task doesn't complete by the time limit, stop computation when the time limit has been reached, and return some value or message indicating a timeout occurred.
A minimal example follows. First are imported modules, and then worker processes are launched:
num_procs = 1
using Distributed
if num_procs > 1
# The main process already counts as one process for `pmap`, so only num_procs-1 workers are added:
addprocs(num_procs-1)
end
Next, the mapping task is defined for all the worker processes. The mapping task should timeout after 1 second:
@everywhere import Random
@everywhere begin
"""
Compute stuff for `wait_time` seconds, and return `wait_time`.
If `timeout` seconds elapses, stop computation and return something else.
"""
function waitForTimeUnlessTimeout(wait_time, timeout=1)
# < Insert some sort of timeout code? >
# This block of code simulates a long computation.
# (pretend the computation time is unknown)
x = 0
t0 = time()  # start time of the simulated computation
while time()-t0 < wait_time
x += Random.rand() - 0.5
end
# computation completed before time limit. Return wait_time.
round(wait_time, digits=2)
end
end
The function that executes the parallel map (pmap) is defined on the main process. Each map task randomly takes up to 2 seconds to complete, but should time out after 1 second.
function myParallelMapping(num_tasks = 20, max_runtime=2)
# random task runtimes between 0 and max_runtime
runtimes = Random.rand(num_tasks) * max_runtime
# return the parallel computation of the mapping tasks
pmap((runtime)->waitForTimeUnlessTimeout(runtime), runtimes)
end
print(myParallelMapping())
How should this time-limited parallel map be implemented?
You could put something like this inside your pmap body
pmap(runtimes) do runtime
t0 = time()
task = @async waitForTimeUnlessTimeout(runtime)
while !istaskdone(task) && time()-t0 < time_limit
sleep(1)
end
istaskdone(task) && (return fetch(task))
error("time over")
end
Also note that (runtime)->waitForTimeUnlessTimeout(runtime) is the same as just waitForTimeUnlessTimeout.
Following @Fredrik Bagge's very helpful answer, here is the full working example implementation with some extra explanation.
num_procs = 8
using Distributed
if num_procs > 1
addprocs(num_procs-1)
end
@everywhere import Random
@everywhere begin
function waitForTime(wait_time)
# This code block simulates a long computation.
# Pretend the computation time is unknown.
t0 = time()
x = 0
while time()-t0 < wait_time
x += Random.rand() - 0.5
yield() # CRITICAL to release computation to check if task is done.
# If you comment out yield(), you will see that the timeout doesn't work!
end
return round(wait_time, digits=2)
end
end
function myParallelMapping(num_tasks = 16, max_runtime=2, time_limit=1)
# random task runtimes between 0 and max_runtime
runtimes = Random.rand(num_tasks) * max_runtime
# parallel compute the mapping tasks. See "do block" in
# the Julia documentation, it's just syntactic sugar.
return pmap(runtimes) do runtime
t0 = time()
task = @async waitForTime(runtime)
while !istaskdone(task) && time()-t0 < time_limit
# releases computation to waitForTime
sleep(0.1)
# nothing past here will run until waitForTime calls yield()
# *and* 0.1 seconds have passed.
end
# equal to if istaskdone(task); return fetch(task); end
istaskdone(task) && (return fetch(task))
return "TimeOut"
# `return error("TimeOut")` halts pmap unless pmap is
# given an error handler argument. See pmap documentation.
end
end
The output is
julia> print(myParallelMapping())
Any["TimeOut", "TimeOut", 0.33, 0.35, 0.56, 0.41, 0.08, 0.14, 0.72,
"TimeOut", "TimeOut", "TimeOut", 0.52, "TimeOut", 0.33, "TimeOut"]
Note that there are two tasks per process in this example. The original task (the "time checker") checks every 0.1 seconds whether the other task has completed its computation. The other task (created with @async) computes something, periodically calling yield() to release control to the time checker; if it never calls yield(), the time check cannot occur.

OpenMP slower than serial Fortran90 code

In my Fortran 90 code I have an outer loop with several nested loops. In order to speed up my code, I tried to use OpenMP on the outer loop, but I have a very strange problem: with more than one thread the program runs slower than the OpenMP build with a single thread, which in turn is slower than the original serial program (in terms of wall-clock time; I tried with 1, 2, 3, and 4 threads). In all cases, however, I get the right result.
I ran several tests on my code and eventually noticed that the problem is in one subroutine, because if I comment out the call to that routine my parallel program works as expected, i.e. the more threads, the lower the wall-clock time.
Now, that routine takes four vectors as input, "ks1", "ks2", "ket1", "ket2", and performs the union of "ks1" and "ks2", obtaining "kstot". It then creates two new vectors, "ket1tot" and "ket2tot", where ket1tot(i) equals ket1(j) if kstot(i) equals ks1(j), and ket1tot(i)=0 otherwise. The same holds for ket2tot.
Then, combining the values stored in the vectors "kstot", "ket1tot" and "ket2tot", I compute which lines of a vector (matFC) contain the values I need and, by multiplying those values, I obtain the final result (FCtot).
So I turned that routine into a simple program, adding some initial lines in order to mimic the real program. I mean I added:
1) a loop (on ii) that mimics the outer loop of the real program I am trying to parallelize;
2) the fact that each thread works on a different file (so I should not have a false-sharing problem);
3) another loop (on k) that mimics the fact that the routine is called several times.
Here is the code (the part that constitutes the original subroutine that gives me problems is indicated in the text):
program evaluatefc
#ifdef _OPENMP
use omp_lib
#endif
implicit none
integer::i,ii,j,jj,jjj,k,sizeks1,sizeks2,sizec,sizekstot,NR,NR1,maxnq
integer::line,ierr,fileunit,mythread,nfreqtot
real*8::FCtot,time1,time2
integer,allocatable,dimension(:)::ks1,ket1,ks2,ket2
integer,dimension(:),allocatable::c,kstot,ket1tot,ket2tot
real*8,allocatable,dimension(:)::matFC
character*15,allocatable,dimension(:)::matfileFC
character::fileFC*15
real*4::tstarting,tending
! This program was originally a subroutine
! that takes in input 4 vectors, ks1, ks2, ket1, ket2
!---------------------------------------------------------------------------
! I initialize some values that in the original subroutine were computed by
!the main program
allocate(matfileFC(3),stat=ierr)
matfileFC(1)='filea.dat'
matfileFC(2)='fileb.dat'
matfileFC(3)='filec.dat'
sizeks1=2
sizeks2=2
maxnq=11
allocate(ks1(sizeks1),stat=ierr)
allocate(ket1(sizeks1),stat=ierr)
allocate(ks2(sizeks2),stat=ierr)
allocate(ket2(sizeks2),stat=ierr)
nfreqtot=42
NR1=nfreqtot*(maxnq**2)+nfreqtot
NR=nfreqtot*(maxnq**2)
allocate(matFC(NR),stat=ierr)
!Call two intrinsic to evaluate CPU and wall clock time
call cpu_time(time1)
tstarting=secnds(0.0)
!$OMP PARALLEL DO &
!$OMP DEFAULT(NONE) &
!$OMP firstprivate(sizeks1,sizeks2,maxnq,matfileFC,NR,NR1) &
!$OMP PRIVATE(i,ii,j,jj,k,ierr,mythread,fileunit,c,sizec,line,sizekstot) &
!$OMP PRIVATE(jjj,ket1,ks1,ket1tot,kstot,ket2,ks2,ket2tot,FCtot,matFC,fileFC)
do ii=1,3
#ifdef _OPENMP
mythread=OMP_GET_THREAD_NUM()
#else
mythread=10
#endif
fileFC=matfileFC(ii)
! Read some lines of a given file.
fileunit=50+mythread
open(unit=fileunit,name=fileFC,status='old',form='formatted')
read(fileunit,*)!Do not read first line
jjj=0
do jj=1,NR1-1
if(mod(jj,(maxnq**2+1)).eq.0) then
read(fileunit,*)
else
jjj=jjj+1
read(fileunit,*)j,k,i,matFC(jjj)
! I actually need only the fourth valor of the line to be stored
endif
enddo
close(fileunit)
do k=1,10000000
! Again I initialize the abovementioned values that in the actual
! subroutine are computed by the main program
ks1(1)=mod(k,30)+1
ks1(2)=mod(k,30)+2
ks2(1)=mod(k,17)+1
ks2(2)=mod(k,17)+3
ket1(1)=mod(k,2)
ket1(2)=mod(k,3)
ket2(1)=mod(k,5)
ket2(2)=mod(k,7)
sizec=sizeks1+sizeks2
allocate(c(sizec),stat=ierr)
do i=1,sizeks1
c(i)=ks1(i)
enddo
do i=sizeks1+1,sizec
c(i)=ks2(i-sizeks1)
enddo
sizekstot=sizec
do i=1,sizeks1
do j=1,sizeks2
if(ks1(i).eq.ks2(j)) then
sizekstot=sizekstot-1
endif
enddo
enddo
allocate(kstot(sizekstot),stat=ierr)
jjj=1
i=1
jj=0
do i=1,sizec-1
jjj=jjj+1
do j=jjj,sizec
if(c(i).eq.c(j)) then
exit
elseif(c(i).ne.c(j).and.j.eq.sizec) then
jj=jj+1
kstot(jj)=c(i)
endif
enddo
enddo
kstot(sizekstot)=c(sizec)
allocate(ket1tot(sizekstot),stat=ierr)
do i=1,sizekstot
ket1tot(i)=0
enddo
allocate(ket2tot(sizekstot),stat=ierr)
do i=1,sizekstot
ket2tot(i)=0
enddo
do i=1,sizekstot
do j=1,sizeks1
if(kstot(i).eq.ks1(j))then
ket1tot(i)=ket1(j)
endif
enddo
enddo
do i=1,sizekstot
do j=1,sizeks2
if(kstot(i).eq.ks2(j))then
ket2tot(i)=ket2(j)
endif
enddo
enddo
FCtot=1
do i=1,sizekstot
line=(kstot(i)-1)*(maxnq)**2+ket1tot(i)*(maxnq)+ket2tot(i)+1
FCtot=matFC(line)*FCtot
enddo
deallocate(c,stat=ierr)
deallocate(kstot,stat=ierr)
deallocate(ket1tot,stat=ierr)
deallocate(ket2tot,stat=ierr)
enddo
enddo
!$OMP END PARALLEL DO
call cpu_time(time2)
tending=secnds(tstarting)
write(*,*)
write(*,*)'CPU time is:'
write(*,*)time2-time1
write(*,*)
write(*,*)'Wall clock time is:'
write(*,*)tending
end program
Still, I get the same problem, i.e. the wall-clock time using 4 threads is larger than using 1 thread.
For example I get (times in seconds):
type        wall time   CPU time
1 thread      20.37       20.37
4 threads     31.26       91.61
serial        19.64       19.64
I am aware that using the OpenMP runtime introduces an overhead, and in fact the 1-thread OpenMP program is slower than the serial one. But I cannot understand why the 4-thread OpenMP code is slower.
I use Intel fortran compiler 2013 on Linux.
Any suggestions?
Thank you for any time you can dedicate to this problem.
OK, I fixed my own problem.
Thank you all for your suggestions, in particular @Jorge Bellón and @High Performance Mark.
As their comments said, the problem was actually the high number of allocations/deallocations. If I move the allocations out of the loops, or at least put them right after the first loop, I get the "normal" OpenMP behaviour, i.e. the more threads, the lower the wall-clock time.
For the example above the wall clock time using 4 threads is now about 7 seconds.
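For illustration, here is a sketch of the kind of change involved (my reading of the fix, not the exact code: since sizekstot can never exceed sizec = sizeks1 + sizeks2, sizec is used as an upper bound so all the work arrays can be allocated once per file and reused, with only the first sizekstot entries touched in each k iteration):
do ii=1,3
! ... file reading as before ...
sizec=sizeks1+sizeks2
allocate(c(sizec),stat=ierr)
allocate(kstot(sizec),stat=ierr)    ! sizekstot <= sizec, so sizec is a safe upper bound
allocate(ket1tot(sizec),stat=ierr)
allocate(ket2tot(sizec),stat=ierr)
do k=1,10000000
! ... same body as before, but without any allocate/deallocate ...
enddo
deallocate(c,kstot,ket1tot,ket2tot,stat=ierr)
enddo
Since c, kstot, ket1tot and ket2tot are PRIVATE in the OpenMP clause list, each thread still gets its own copies; they are just allocated once per ii iteration instead of once per k iteration.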
Thank you all for your help.

The most efficient way to read a unformatted file

I am currently post-processing 100,000 files using Fortran. The data were generated on an HPC system using MPI I/O. So far I have only come up with the following way of reading the raw data, which is not efficient. Is it possible to read a whole slice ut_yz(:,J,K) in one go instead of reading the values one by one? Thanks.
The old code is as follows, and its efficiency is not high.
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED',&
ACCESS='DIRECT', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
DO J=1,ny
DO I=1,nxt
READ(10,REC=COUNT) ut_yz(I,J,K)
COUNT = COUNT + 1
ENDDO
ENDDO
ENDDO
CLOSE(10)
The desired version is
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
DO J=1,ny
READ(10,REC=COUNT) TEMP(:)
COUNT = COUNT + 153
ut_yz(:,J,K)=TEMP(:)
ENDDO
ENDDO
CLOSE(10)
However, it always fails. Can anyone make a comment on this? Thanks.
A direct-access read reads a single record, if I am not mistaken. Thus, in your new code version you need to increase the record length accordingly:
inquire(iolength=rl) ut_yz(:,1,1)
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', recl=rl, status='OLD', action='READ')
count = 1
do k=1,nz
do j=1,ny
read(10, rec=count) ut_yz(:,j,k)
count = count + 1
end do
end do
close(10)
Of course, in this example you could also read the complete array at once, which should be the fastest option:
inquire(iolength=rl) ut_yz
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', recl=rl, status='OLD', action='READ')
read(10, rec=1) ut_yz
close(10)
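If the file really is a plain contiguous dump of the array with no Fortran record markers (which is typically what MPI I/O produces), stream access (Fortran 2003) is another option worth trying; this is only a sketch under that assumption, with the array starting at the first byte of the file:
open(10, file=trim(filename)//'.dat', form='unformatted', access='stream', &
     status='old', action='read')
read(10) ut_yz    ! one request for the whole array, no record bookkeeping
close(10)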

OpenMP runtime fluctuations

I am currently testing OpenMP in a big loop in my Fortran code. The code is part of a simulation module which is called from a VB.NET user interface; this interface also does the timing measurements. So I start a simulation, and at the end the software shows me how long it took (I only mention this to show that I don't use wtime or cpu_time for the timing measurements).
Now when I repeatedly start a simulation with my parallelized loop, I always get different simulation times, ranging in one example from 1 min 30 s to almost 3 min! The results are always correct.
I tried different schedules for the loop (static, guided, dynamic), I tried to calculate the chunks assigned to each thread manually (do i=1,N -> do i=i_start,i_end), and I tried to change the number of threads taking part in the calculation of the loop, all with no change in the situation. When I remove the OpenMP directives from the code this does not occur, so they must be the reason for this behavior.
My machine is a quad-core Intel Xeon(R) CPU X3470 @ 2.93 GHz running Windows 7. I tried to run the program with multithreading both enabled and disabled (in the BIOS); however, this didn't change anything either.
Do you have any ideas what could be going wrong? A web search for a situation like this showed that similar behavior occurred in other programmers' test environments as well, but a solution / reason was never mentioned. Thanks in advance for your thoughts.
Martin
EDIT
Here's the code:
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP PRIVATE(n,k,nk,i,j,l,List,Vx,Vz,cS,AE1,RootCh,Ec1,Ec2,Ec3,FcE,GcE,VxE,VzE,SMuL1,SMuL2) &
!$OMP PRIVATE(W1,W2,W3,Wx,Wz,S,i1,j1,AcE,j2,ic,iB,iBound,i2) &
!$OMP FIRSTPRIVATE(NumSEL) REDUCTION(-:Cum0,Cum1) REDUCTION(+:CumR)
DO n=1, NumEl
! Loop on subelements
DO k=1, Elements(n)%nCorners-2
nk = (k-1) * 3
NumSEL=NumSEL+1
!
i=Elements(n)%KX(1)
j=Elements(n)%KX(k+1)
l=Elements(n)%KX(k+2)
List(1)=i
List(2)=j
List(3)=l
!
IF(Level == NLevel) THEN
Vx(1)=Nodes(i)%VxO
Vx(2)=Nodes(j)%VxO
Vx(3)=Nodes(l)%VxO
Vz(1)=Nodes(i)%VzO
Vz(2)=Nodes(j)%VzO
Vz(3)=Nodes(l)%VzO
ELSE
Vx(1)=Nodes(i)%VxN
Vx(2)=Nodes(j)%VxN
Vx(3)=Nodes(l)%VxN
Vz(1)=Nodes(i)%VzN
Vz(2)=Nodes(j)%VzN
Vz(3)=Nodes(l)%VzN
END IF
!
cS=cBound(sol,5)
cS=(MIN(cS,Nodes(i)%Conc(sol))+MIN(cS,Nodes(j)%Conc(sol))+MIN(cS,Nodes(l)%Conc(sol)))/3.0D0
AE1=Elements(n)%xMul(k)*Elements(n)%Area(k)*dt*Eps
RootCh=AE1*cS*(Nodes(i)%Sink+Nodes(j)%Sink+Nodes(l)%Sink)/3.0D0
Cum0=Cum0-AE1*(Nodes(i)%Gc1+Nodes(j)%Gc1+Nodes(l)%Gc1)/3.0D0
Cum1=Cum1-AE1*(Nodes(i)%Fc1+Nodes(j)%Fc1+Nodes(l)%Fc1)/3.0D0
CumR=CumR+RootCh
Ec1=(Nodes(i)%Dispxx+Nodes(j)%Dispxx+Nodes(l)%Dispxx)/3.0D0
Ec2=(Nodes(i)%Dispxz+Nodes(j)%Dispxz+Nodes(l)%Dispxz)/3.0D0
Ec3=(Nodes(i)%Dispzz+Nodes(j)%Dispzz+Nodes(l)%Dispzz)/3.0D0
!
IF (Level == NLevel) AcE=(Nodes(i)%Ac+Nodes(j)%Ac+Nodes(l)%Ac)/3.0D0
!
FcE=(Nodes(i)%Fc+Nodes(j)%Fc+Nodes(l)%Fc)/3.0D0
GcE=(Nodes(i)%Gc+Nodes(j)%Gc+Nodes(l)%Gc)/3.0D0
VxE=(Vx(1)+Vx(2)+Vx(3))/3.0D0
VzE=(Vz(1)+Vz(2)+Vz(3))/3.0D0
SMul1=-Elements(n)%AMul(k)
SMul2=Elements(n)%Area(k)/20.0D0*Elements(n)%XMul(k)
!
If (lUpw) THEN
!W1=WeTab(1,NumSEl)
!W2=WeTab(2,NumSEl)
!W3=WeTab(3,NumSEl)
W1=WeTab(1,(n-1)*(Elements(n)%nCorners-2)+k)
W2=WeTab(2,(n-1)*(Elements(n)%nCorners-2)+k)
W3=WeTab(3,(n-1)*(Elements(n)%nCorners-2)+k)
Wx(1)=2.0D0*Vx(1)*(W2-W3)+Vx(2)*(W2-2.0D0*W3)+Vx(3)*(2.0D0*W2-W3)
Wx(2)=Vx(1)*(2.0D0*W3-W1)+2.0D0*Vx(2)*(W3-W1)+Vx(3)*(W3-2.0D0*W1)
Wx(3)=Vx(1)*(W1-2.0D0*W2)+Vx(2)*(2.0D0*W1-W2)+2.0D0*Vx(3)*(W1-W2)
Wz(1)=2.0D0*Vz(1)*(W2-W3)+Vz(2)*(W2-2.0D0*W3)+Vz(3)*(2.0D0*W2-W3)
Wz(2)=Vz(1)*(2.0D0*W3-W1)+2.0D0*Vz(2)*(W3-W1)+Vz(3)*(W3-2.0D0*W1)
Wz(3)=Vz(1)*(W1-2.0D0*W2)+Vz(2)*(2.0D0*W1-W2)+2.0D0*Vz(3)*(W1-W2)
END IF
!
DO j1=1, 3
i1=List(j1)
!$OMP ATOMIC
Nodes(i1)%F=Nodes(i1)%F+Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
IF (Level==NLevel) then
!$OMP ATOMIC
Nodes(i1)%DS=Nodes(i1)%DS+Elements(n)%GMul(k)*(Ace+Nodes(i1)%Ac/3.0D0)
end if
iBound=0
IF (Nodes(i)%Kode/=0) THEN
BP_Loop : DO id=1, NumBP
IF((KXB(id)==i1) .AND. (KodCB(id) > 0)) THEN
iBound=1
EXIT BP_Loop
END IF
END DO BP_Loop
END IF
!
DO j2=1, 3
i2=List(j2)
S(j1,j2)=SMul1*(Ec1*Elements(n)%dz(nk+j1)*Elements(n)%dz(nk+j2)+ &
Ec3*Elements(n)%dx(nk+j1)*Elements(n)%dx(nk+j2)+ &
Ec2*(Elements(n)%dz(nk+j1)*Elements(n)%dx(nk+j2)+ &
Elements(n)%dx(nk+j1)*Elements(n)%dz(nk+j2)))
S(j1,j2)=S(j1,j2)-(Elements(n)%dz(nk+j2)/8.0D0*(VxE+Vx(j1)/3.0D0)+ &
Elements(n)%dx(nk+j2)/8.0D0*(VzE+Vz(j1)/3.0D0))*Elements(n)%xMul(k)
IF(lUpw) S(j1,j2)=S(j1,j2)-Elements(n)%xMul(k)* &
(Elements(n)%dz(nk+j2)/40.0D0*Wx(j1)+ &
Elements(n)%dx(nk+j2)/40.0D0*Wz(j1))
ic=1
IF (i1==i2) ic=2
S(j1,j2)=S(j1,j2)+SMul2*ic*(FcE+(Nodes(i1)%Fc+Nodes(i2)%Fc)/3.0D0)
IF (iBound==1) then
if(j2.eq.1) then
!$OMP ATOMIC
Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)-Eps*Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
else
!$OMP ATOMIC
Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)
end if
end if
IF (Level/=NLevel) THEN
!$OMP ATOMIC
B(i1)=B(i1)-alf*S(j1,j2)*Nodes(i2)%Conc(sol)
ELSE
IF (lOrt) THEN
CALL rFIND(i1,i2,kk,NumNP,MBandD,IAD,IADN)
iB=kk
ELSE
iB=MBand+i2-i1
END IF
!$OMP ATOMIC
A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
END IF
END DO
END DO
END DO
END DO
!$OMP END PARALLEL DO
If you want to check the performance of the program, I would suggest doing the timings inside the program with the OpenMP timing functions (see the OpenMP reference sheet).
So you need to do something like:
use omp_lib
double precision :: t1, t_final
t1 = omp_get_wtime()
! Big loop
t_final = omp_get_wtime() - t1
I sometimes find these to reflect the actual parallelization timings better. Do you use those?
As FFox says, it can simply be due to the ATOMIC statements, which delay differently on each run. Remember that the threads are created at run time, so the layout of the threads may not be the same for each run.
With such a loop I would try to see whether you could gain speed by splitting it up. Of course this is not needed if the speedup is already around 2 for 2 threads. Just an idea.
