I don't get any speedup when I try to do the following in the subroutine:
!$ call omp_set_num_threads(threadno)
call system_clock(x1)
!$OMP PARALLEL DO private(i), reduction(+:total)
do i = 1,m
   total = 0.d0
   call result(a,l,b,qm,q,en) ! here l is input for subroutine and en is output
   qm(:,i) = q
   qtv(i) = qt
   mean = sum(q)/size(q)
   do i2 = 1,k
      total = total + ((mean-q(i2))**2)/(a+b)
   end do
   qvv(i1) = total
end do
call system_clock(x2)
print *, x2-x1
!$OMP END PARALLEL DO
Comments on the OpenMP part:
total should not be reset inside the loop but before the !$OMP directive.
i2 and mean should be private.
If q does not change between iterations of the loop, sum(q)/size(q) should be computed outside it.
The missing private attributes can lead to memory access conflicts (and thus slowdowns).
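To make the private/reduction idea concrete outside Fortran, here is a minimal Python sketch (my own illustration, not your code): each thread accumulates into its own private slot, and the partial sums are combined once at the end, which is what reduction(+:total) arranges for you in OpenMP.

```python
import threading

def parallel_sum(data, nthreads=4):
    """Sum `data` using per-thread partial totals, mimicking
    OpenMP's private variables plus reduction(+:total)."""
    partials = [0.0] * nthreads          # one private accumulator per thread
    chunk = (len(data) + nthreads - 1) // nthreads

    def worker(tid):
        # each thread reduces into its own slot: no shared-state conflicts
        start = tid * chunk
        for x in data[start:start + chunk]:
            partials[tid] += x

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)                 # the final reduction, done once

print(parallel_sum([1.0] * 1000))        # 1000.0
```

If all threads instead updated one shared total without a reduction, they would contend for the same memory location, which is exactly the kind of conflict that erases the speedup.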
I guess that the code you show is close to, but not exactly, the one you compile. It would be useful to have code that actually compiles in order to provide better help.
Cheers,
Pierre
EDIT: for timing OpenMP code, you should use omp_get_wtime (see https://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html), which returns the wall-clock time (https://en.wikipedia.org/wiki/Wall-clock_time). The module for OpenMP routines is loaded with use omp_lib.
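To illustrate the distinction omp_get_wtime is about, here is a small Python sketch (an analogy only, not OpenMP): wall-clock time keeps advancing while a program merely waits, whereas CPU time does not.

```python
import time

# Wall-clock time advances while a thread sleeps or waits on other threads;
# CPU time only advances while the process is actually computing.
wall0 = time.perf_counter()
cpu0 = time.process_time()

time.sleep(0.2)        # waiting, not computing

wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0

print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")  # wall is about 0.2s, cpu near 0
```

This is why a CPU-time clock can under-report (or, with many threads, over-report) how long a parallel region actually took from the user's point of view.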
As of lately I have been reading and playing around with OpenMP parallel do's in Fortran 95. However, I still have not figured out how the parallel do would be used in a code like the one beneath:
I=1
DO WHILE (I<100)
   A=2*I
   B=3*I
   C=A+B
   SUM(I)=C
   I=I+1
END DO
Simply putting !$OMP PARALLEL DO before the do loop and !$OMP END PARALLEL DO after it doesn't seem to work. I have read a couple of things about private and shared variables, but I think each successive iteration of the code above is completely independent. Any help would be greatly appreciated.
The parallel do construct doesn't work with do while loops. You need to change the do while loop to a standard DO loop. This is from the OpenMP 4.0 standard on the parallel do construct at https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf, page 59:
• The associated do-loops must be structured blocks.
• Only an iteration of the innermost associated loop may be curtailed by a CYCLE statement.
• No statement in the associated loops other than the DO statements can cause a branch out of the loops.
• The do-loop iteration variable must be of type integer.
• The do-loop cannot be a DO WHILE or a DO loop without loop control.
The following example may help illustrate an approach for what you have outlined.
It shows the use of !$OMP directives and also identifies the thread used for each iteration of the loop.
I changed SUM to SUMI so that SUM remains available as an intrinsic function.
Hopefully you can build on this.
use omp_lib
real sumi(99), a, b, c
integer thread_used(0:9), I
nThreads = omp_get_max_threads()
thread_used = 0
!$OMP PARALLEL DO &
!$OMP SHARED (SUMI,thread_used) &
!$OMP PRIVATE (i,a,b,c,iThread)
DO I = 1,99
   iThread = omp_get_thread_num()
   thread_used(iThread) = thread_used(iThread) + 1
   A = 2*I
   B = 3*I
   C = A+B
   SUMI(I) = C
END DO
!$OMP END PARALLEL DO
write (*,*) sum(SUMI)
do i = 0, nThreads-1
   write (*,*) i, thread_used(i)
end do
end
If I want to calculate four things in Julia
invQa = ChebyExp(g->1/Q(g),0,1,5)
a1Inf = ChebyExp(g->Q(g),1,10,5)
invQb = ChebyExp(g->1/Qd(g),0,1,5)
Qb1Inf = ChebyExp(g->Qd(g),1,10,5)
How can I measure the time? How many seconds do I have to wait for the four things above to be done? Do I put tic() at the beginning and toc() at the end?
I tried @elapsed, but got no results.
The basic way is to use
@time begin
    # code
end
But note that you should never benchmark in global scope.
A package that can help you benchmark your code is BenchmarkTools.jl which you should check out as well.
You could do something like this (I guess that g is an input parameter):
function cheby_test(g::Your_Type)
    invQa = ChebyExp(g->1/Q(g),0,1,5)
    a1Inf = ChebyExp(g->Q(g),1,10,5)
    invQb = ChebyExp(g->1/Qd(g),0,1,5)
    Qb1Inf = ChebyExp(g->Qd(g),1,10,5)
end

function test()
    g::Your_Type = small_quick #
    cheby_test(g) #= the function is compiled here and
                     we want to exclude compile time from the test =#
    g = real_data()
    @time cheby_test(g) # here you measure time for real data
end
test()
I propose not to call @time in global scope if you want proper allocation info from the time macro.
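The warm-up idea is not Julia-specific. As a rough Python sketch (the function names here are made up for illustration), you can call the function once untimed before measuring, so that first-call costs such as compilation or caching are excluded:

```python
import time

def timed(f, *args):
    """Call f once untimed (warm-up), then time a second call."""
    f(*args)                       # warm-up: absorbs first-call costs
    t0 = time.perf_counter()
    result = f(*args)
    return result, time.perf_counter() - t0

def work(n):
    # stand-in for an expensive computation
    return sum(i * i for i in range(n))

result, elapsed = timed(work, 100_000)
print(result, f"{elapsed:.4f}s")
```

The same structure as the test() function above: one throwaway call, then the measured call on the real input.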
I am new to OpenMP and find it a little hard to understand how locks in OpenMP really work. Here is example code, written in Fortran 90, that performs an LU factorization. Can anyone explain how the locks work in this code?
program lu
   implicit none
   integer, parameter :: DP=kind(0.0D0), n=20
   !-- Variables
   integer :: i,j,k,nthr,thrid,chunk=1
   real(kind=DP), dimension(:,:), allocatable :: A,B,L,U
   real(kind=DP) :: timer,error,walltime
   integer(kind=8), dimension(n) :: lck
   integer :: omp_get_thread_num,omp_get_max_threads

   nthr = omp_get_max_threads()
   allocate(A(n,n))
   allocate(B(n,n))
   allocate(L(n,n))
   allocate(U(n,n))

   !-- Set up locks for each column
   do i=1,n
      call omp_init_lock(lck(i))
   end do

   timer = walltime()
   !$OMP PARALLEL PRIVATE(i,j,k,thrid)
   thrid = omp_get_thread_num()
   !-- Initiate matrix
   !$OMP DO SCHEDULE(STATIC,chunk)
   do j=1,n
      do i=1,n
         A(i,j) = 1.0/(i+j)
      end do
      call omp_set_lock(lck(j))
   end do
   !$OMP END DO
   !-- First column of L
   if (thrid==0) then
      do i=2,n
         A(i,1) = A(i,1)/A(1,1)
      end do
      call omp_unset_lock(lck(1))
   end if
   !-- LU-factorization
   do k=1,n
      call omp_set_lock(lck(k))
      call omp_unset_lock(lck(k))
      !$OMP DO SCHEDULE(STATIC,chunk)
      do j=1,n
         if (j>k) then
            do i=k+1,n
               A(i,j) = A(i,j)-A(i,k)*A(k,j)
            end do
            if (j==k+1) then
               do i=k+2,n
                  A(i,k+1) = A(i,k+1)/A(k+1,k+1)
               end do
               call omp_unset_lock(lck(k+1))
            end if
         end if
      end do
      !$OMP END DO NOWAIT
   end do
   !$OMP END PARALLEL
   timer = walltime()-timer

   write(*,*) 'n = ',n,' time = ',timer,' nthr = ',nthr

   ! CHECK CORRECTNESS
   do j=1,n
      L(j,j) = 1
      U(j,j) = A(j,j)
      do i=j+1,n
         L(i,j) = A(i,j)
         U(i,j) = 0
      end do
      do i=1,j-1
         U(i,j) = A(i,j)
         L(i,j) = 0
      end do
   end do
   B = 0
   do j=1,n
      do k=1,n
         do i=1,n
            B(i,j) = B(i,j)+L(i,k)*U(k,j)
         end do
      end do
   end do
   error = 0.0
   do j=1,n
      do i=1,n
         error = error+abs(1.0/(i+j)-B(i,j))
      end do
   end do
   write(*,*) 'ERROR: ',error
end program lu
Another file, listed below, contains the walltime function. It should be compiled together with the main file.
function walltime()
   integer, parameter :: DP = kind(0.0D0)
   real(DP) walltime
   integer :: count,count_rate,count_max
   call system_clock(count,count_rate,count_max)
   walltime = real(count,DP)/real(count_rate,DP)
end function walltime
DISCLAIMER: I don't have experience with the lock mechanism and took a look at the standard to learn how this works. I might be wrong...
At first, some problems with your code: it won't compile with a recent version of gfortran. You have to move the function walltime to the contains section of your program, and you should USE omp_lib, which defines all the necessary routines (and then remove the resulting duplicate declarations). Additionally, you have to declare your locks in the standard way:
integer(kind=OMP_LOCK_KIND), dimension(n) :: lck
Now to your question: the calls to OMP_INIT_LOCK initialize your lck array to the unlocked state. All threads share these locks. Then the parallel section is started.
In the first loop, the matrix is initialized to something similar to a Hilbert matrix, and each lock is set.
The second block is executed only by the first thread, and the first lock is released. Still nothing interesting. The following loop is entered by all threads, and every thread waits on the k-th lock, because omp_set_lock blocks until the lock is acquired. The omp_unset_lock that immediately follows lets all the other threads through. Because the 1st lock has already been released, all threads enter the inner loop right away, and eventually one of them releases the next lock. By the time that thread releases it, the other threads may already be waiting on that lock.
In principle, this algorithm provides a form of synchronization which makes sure that the data required by the (k+1)-th iteration has already been calculated when that iteration is entered.
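The set-then-immediately-unset idiom can be sketched with Python's threading.Lock (an analogy only, not OpenMP): a lock that starts out held turns acquire into a "wait until ready" signal, and releasing it right away lets any other waiters pass as well.

```python
import threading

lck = threading.Lock()
lck.acquire()            # start locked: "the column is not ready yet"
order = []

def producer():
    order.append("computed column")
    lck.release()        # signal: the data is ready

def consumer():
    lck.acquire()        # blocks until the producer releases the lock
    lck.release()        # immediately unset, so other waiters pass too
    order.append("used column")

t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t1.join(); t2.join()
print(order)             # ['computed column', 'used column']
```

This mirrors the omp_set_lock/omp_unset_lock pair at the top of the k loop: the lock is never used to protect data, only to delay threads until a producer has finished.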
I am trying to test the effect of parfor compared to for in MATLAB, so I built a simple function that estimates π.
Here is the function with the parfor:
function [calc_pi,epsilon] = calcPi(max)
    format long;
    in = 0;
    tic
    parfor k = 1:max
        x = rand();
        y = rand();
        if sqrt(x^2 + y^2) < 1
            in = in + 1;
        end
    end
    toc
    calc_pi = 4*in/max;
    epsilon = abs(pi - calc_pi);
end
I run it with parfor and got this output:
>> [calc,err] = calcPi(1000000000)
Elapsed time is 92.2923 seconds.
calc =
3.141638468000000
err =
4.581441020690136e-05
>>
with the for loop I got:
>> [calc,err] = calcPi(1000000000)
Elapsed time is 121.3432 seconds.
calc =
3.141645132000000
err =
5.247841020672439e-05
I have two questions:
Why do both take about the same amount of time? (Unlike what is shown here)
I would like to add an argument to the function that indicates whether to
use for or parfor, with the minimal change in code:
i.e. :
if (use_par):
parfor k=1:10
else
for k=1:10
end
<--rest of code here-->
How can I write it with the minimal amount of code ?
The main requirement of parfor is that the loop iterations are independent. Here they are clearly not, as each iteration can update the variable in.
The good news is that you may be able to solve this by using a sliced variable in(k) instead.
One way to use one loop or the other without extra code is to put everything you do in a function or script, for example doeverything.m, then write:
if use_par
    parfor k = 1:10
        doeverything
    end
else
    for k = 1:10
        doeverything
    end
end
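To make the "independent iterations" point concrete, here is a rough Python analogy (not MATLAB): each worker keeps a private hit counter and its own random generator, and the counts are combined only once at the end, so no iteration depends on any other.

```python
import math
import random
import threading

def estimate_pi(total_samples, nworkers=4):
    counts = [0] * nworkers                   # one private counter per worker
    per_worker = total_samples // nworkers

    def worker(wid):
        rng = random.Random(wid)              # private RNG: no shared state
        hits = 0
        for _ in range(per_worker):
            x, y = rng.random(), rng.random()
            if x * x + y * y < 1:
                hits += 1
        counts[wid] = hits                    # each worker writes only its slot

    workers = [threading.Thread(target=worker, args=(w,)) for w in range(nworkers)]
    for w in workers: w.start()
    for w in workers: w.join()
    return 4 * sum(counts) / (per_worker * nworkers)

print(estimate_pi(100_000))   # roughly 3.14
```

The per-worker counts play the role of the sliced in(k): workers never touch each other's data, and the reduction happens once, after all of them finish.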
Ok, here's a basic for loop:
local a = {"first","second","third","fourth"}
for i=1,#a do
   print(i.."th iteration")
   a = {"first"}
end
As it is now, the loop executes all 4 iterations.
Shouldn't the loop limit be calculated on the go? If it were evaluated dynamically, #a would be 1 at the end of the first iteration and the loop would stop.
Surely that would make more sense?
Or is there any particular reason why that is not the case?
The main reason why numeric for loop limits are computed only once is almost certainly performance.
With the current behavior, you can place arbitrarily complex expressions in the loop limits without a performance penalty, including function calls. For example:
local prod = 1
for i = computeStartLoop(), computeEndLoop(), computeStep() do
   prod = prod * i
end
The above code would be really slow if computeEndLoop and computeStep had to be called at each iteration.
If the standard Lua interpreter and most notably LuaJIT are so fast compared to other scripting languages, it is because a number of Lua features have been designed with performance in mind.
In the rare cases where the single evaluation behavior is undesirable, it is easy to replace the for loop with a generic loop using while end or repeat until.
local prod = 1
local i = computeStartLoop()
while i <= computeEndLoop() do
   prod = prod * i
   i = i + computeStep()
end
The length is computed once, at the time the for loop is initialized; it is not re-computed each time through the loop. A for loop iterates from a starting value to an ending value. If you want the loop to terminate early when the array is re-assigned, you can write your own looping code:
local a = {"first", "second", "third", "fourth"}

function process_array (fn)
   local inner_fn
   inner_fn = function (ii)
      if ii <= #a then
         fn(ii)
         inner_fn(ii + 1)
      end
   end
   inner_fn(1)
end

process_array(function (ii)
   print(ii.."th iteration: "..a[ii])
   a = {"first"}
end)
Performance is a good answer, but I think it also makes the code easier to understand and less error-prone. And that way you can (almost) be sure that a for loop always terminates.
Think about what would happen if you wrote this instead:
local a = {"first","second","third","fourth"}
for i=1,#a do
   print(i.."th iteration")
   if i > 1 then a = {"first"} end
end
How do you understand for i=1,#a? Is it an equality comparison (stop when i == #a) or an inequality comparison (stop when i > #a)? What would be the result in each case?
You should see the Lua for loop as iteration over a sequence, like the Python idiom using (x)range:
a = ["first", "second", "third", "fourth"]
for i in range(1, len(a) + 1):
    print(str(i) + "th iteration")
    a = ["first"]
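A quick runnable check (my own example) confirms that len(a) in the Python version above is evaluated only once, matching Lua's behavior: rebinding a inside the loop does not change the number of iterations.

```python
a = ["first", "second", "third", "fourth"]
iterations = 0
for i in range(1, len(a) + 1):   # len(a) is read once, when range() is built
    iterations += 1
    a = ["first"]                # rebinding a does not shrink the loop
print(iterations)                # 4
```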
If you want to evaluate the condition every time you just use while:
local a = {"first","second","third","fourth"}
local i = 1
while i <= #a do
   print(i.."th iteration")
   a = {"first"}
   i = i + 1
end