OpenMP computation time does not improve proportionally with the number of threads - performance

I have a Fortran program using OpenMP, simplified as below:
program test
use omp_lib
use function_module
use data_module
implicit none
real*8 :: theta,f
call readdata()
call func_subroutine(theta,f)
end program
subroutine func_subroutine(theta,f)
use omp_lib
use data_module
implicit none
integer :: i
real*8 :: theta,f,f_ind
real*8 :: t1,t2,t_vec(20),tsum
!$omp parallel do shared(data_variables,t_vec) private(i,t1,t2,f_ind)
do i=1,20
! read individual data
! count time
t1=omp_get_wtime()
call func(theta,f_ind)
t2=omp_get_wtime()
t_vec(i)=t2-t1
end do
!$omp end parallel do
tsum=sum(t_vec)
end subroutine
In short, I calculate the function value for individual 1 (f_ind) 20 times, and tsum only counts the time spent calculating f_ind.
With 5 threads, tsum=7.2s, and the program takes 33s (without considering the time spent on the subroutine readdata()).
However, with 10 threads, tsum=9.5s, and the program takes 20s.
And with 20 threads, tsum=12s, and the program takes 12s.
The computation time does not improve in proportion to the number of threads I am using. I am very confused. Is there something wrong?
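Note that tsum and the program's runtime measure different things: t_vec(i) records how long iteration i took on whichever thread ran it, so tsum adds up busy time across all the threads, while the program's runtime is elapsed wall-clock time. A minimal sketch that measures both side by side (it reuses func and f_ind from the snippet above; everything else is illustrative):
subroutine time_both(theta)
use omp_lib
implicit none
real*8, intent(in) :: theta
integer :: i
real*8 :: f_ind, t1, t2, t_vec(20), t_start, t_end
t_start = omp_get_wtime()
!$omp parallel do shared(t_vec) private(i,t1,t2,f_ind)
do i = 1, 20
  t1 = omp_get_wtime()
  call func(theta, f_ind)   ! same call as in the question
  t2 = omp_get_wtime()
  t_vec(i) = t2 - t1        ! time spent by whichever thread ran iteration i
end do
!$omp end parallel do
t_end = omp_get_wtime()
print *, 'tsum (summed per-iteration times):', sum(t_vec)
print *, 'elapsed wall-clock time:          ', t_end - t_start
end subroutine time_both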

Related

Can reading a variable be a data race in OpenMP?

Why does this OpenMP Fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously. I thought this was not acceptable?
program example
implicit none
integer i
integer, parameter :: n = 100000
double precision :: num
double precision, dimension(n) :: out
num = 1.123456789123456789123456d-5
out = 0.d0
!$OMP PARALLEL
!$OMP DO
do i=1,n
out(i) = num
enddo
!$OMP END DO
!$OMP END PARALLEL
do i=1,n
if (out(i).ne.num) print*,'Problem with ',i
enddo
end program
Thanks so much for any insights.
Can reading a variable be a data race in OpenMP?
Any race is between two things happening, so a read can be part of a race. However, for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
Write, Read: as with case 2 (Read,Write), the result seen by the read is affected by the order.
Write, Write: here we have a race too, since we assume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
Variables are shared by default in OpenMP, so they are accessible from all the threads. Furthermore, you're not writing to num, so even if all the threads were accessing the same memory (which here they probably aren't) there would be no issue.
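As a small illustration of the cases above (a sketch, not taken from the original posts): reading the shared num in every thread is the read/read case and race-free, whereas accumulating into a shared scalar would be a write/write race unless a reduction clause is used:
program race_demo
implicit none
integer, parameter :: n = 100000
integer :: i
double precision :: num, total
double precision, dimension(n) :: out
num = 1.123456789d-5
total = 0.d0
!$OMP PARALLEL DO                      ! num is shared and only read: no race
do i = 1, n
  out(i) = num
end do
!$OMP END PARALLEL DO
!$OMP PARALLEL DO REDUCTION(+:total)   ! without REDUCTION this would race on total
do i = 1, n
  total = total + out(i)
end do
!$OMP END PARALLEL DO
print *, total
end program race_demo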

Consecutive calls to same subroutine in fortran with significantly slower performance

TL;DR
My program calls a subroutine twice. The subroutine performs basically identical operations both times, but takes significantly longer to run the second time. The only difference between both calls is that data produced in the first call is used as an input in the second call. This data is stored in allocatable arrays declared in a module before the first call to the subroutine.
Full Story
The following shows the relevant portions of my code for the problem:
program Economy
!! Declarations !!
use Modern_mod, only: Modern
use Globals, only: Na, Ny, Ne, Vimp, Xmp, Pmp, FCp, Vimu,& ! Globals is a module containing vbles.
& Xmu, Pmu, FCu
implicit none
real(kind=nag_wp) :: param(4)
!! Execution !!
! First call to modern !
param = (/1.0d0,2.0d0,3.0d0,4.0d0/)
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
call Modern(param,Vimp,FCp,Pmp,Xmp)
! Second call to modern !
param = (/5.0d0,6.0d0,7.0d0,8.0d0/)
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
call Modern(param,Vimu,FCu,Pmu,Xmu)
end program Economy
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module Modern_mod
implicit none
contains
subroutine Modern(param,Vim,FCm,Pm,Xm)
!! Declarations !!
! Modules !
use Globals, only: Na, Ne, Ny
use FindVmp_mod, only: FindVmp
use FindVmu_mod, only: FindVmu
implicit none
! Declaring other variables !
real(kind=nag_wp), intent(in) :: param(4) ! param-Parameters specific to one of the modern sectors;
real(kind=nag_wp), intent(out), allocatable :: FCm(:,:), Xm(:,:),& ! FCm-Firm choices; Xm-Policy fun;
& Pm(:,:), Vim(:,:) ! Pm-Price of a share; Vim-Start of period value function;
real(kind=nag_wp), allocatable :: Vm1(:,:), Vim1(:,:), Pm1(:,:),& ! Vm1-Vm next guess; Pm1-Next share price guess;
& Vm(:,:) ! Vm-End of period value function; Vim1-Next Vim guess;
!! Execution !!
! Allocating and initializing functions !
allocate(Vim(Na,Ne),FCm(4,Na*Ne),Vm(Ny,Ne),Pm(Ny,Ne),Xm(Ny,Ne))
allocate(Vim1(Na,Ne),Vm1(Ny,Ne),Pm1(Ny,Ne))
! Initializing arrays !
Vm = ...
Vim = ...
...
! Doing calculations !
if(param(1) .eq. 1.0d0) then
call FindVmp(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value function guess for productive guys
else
call FindVmu(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value function guess for unproductive guys
end if
end subroutine Modern
end module Modern_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmp_mod
implicit none
contains
subroutine FindVmp(FCm,Vim0,P0,Vm,P,x)
!! Declarations !!
use Globals, only: Ny, Ne ! Needed for the allocation below
use VmFp_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmp
implicit none
real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim-Initial guess for beginning of period value function;
! FCm-Firm choices;
!! Execution !!
! Allocate matrices !
allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
! Compute results !
call Calculations(FCm,Vim0,P0,Vm,P,x)
end subroutine FindVmp
end module FindVmp_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmu_mod
implicit none
contains
subroutine FindVmu(FCm,Vim0,P0,Vm,P,x)
!! Declarations !!
use Globals, only: Ny, Ne, Vmp, Pmp ! Ny, Ne needed for the allocation below
use VmFu_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmu
implicit none
real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim-Initial guess for beginning of period value function;
! FCm-Firm choices;
!! Execution !!
! Allocate matrices !
allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
! Compute results !
call Calculations(FCm,Vim0,P0,Vm,P,x,Vmp,Pmp) ! Using the values of Vmp and Pmp computed in the first call to Modern
end subroutine FindVmu
end module FindVmu_mod
Each run, Modern is fed different arrays of the same size and type (*p and *u respectively) which are declared in the module Globals. Modern similarly calls one of two very similar subroutines FindVm?, feeding them the corresponding arrays. FindVmp and FindVmu compute almost identical operations, only that the latter uses the values of Vimp, Pmp (computed in FindVmp) as inputs.
I've been trying to figure out why the second call to Modern takes up to an order of magnitude longer to complete compared to the first one.
My first guess was that maybe by allocating Vimp and Pmp at the beginning of the program, and allocating a bunch of other arrays afterwards, each reference to the former arrays might be costly because their memory addresses were far away from the arrays currently being computed in FindVmu (for reference, Na = 101, Ny = 91, Ne = 9). But then I read that allocated arrays are stored in the heap, and that data in the heap isn't stacked (no pun intended) so that this was not necessarily the origin of my problem. As a matter of fact, I've tried allocating all matrices at different points and in different orders in Modern, but I get roughly the same execution times.
In the same spirit, I've tried to vary how I declare some of the arrays in different subroutines (e.g. making some automatic instead of allocatable and using the compiler (ifort18) option to force them on the stack) and although I do get overall performance variations throughout the code, the relative performance of the two calls to Modern does not change.
Finally, I read in this thread that the more arrays you have in memory, the slower your code generally becomes. Although the explanation does make sense to me, this is the first time I experience such a significant performance loss throughout a fortran program. If this were actually the problem I'm facing, I would have expected to run into it in any number of my previous projects. Is this nonetheless a plausible cause of what is happening here?
And I'm basically out of ideas...
Bonus Question
While we're at it, I've found that leaving out the following lines in Economy surprisingly (to me) does not lead to a segfault:
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
In words: if I don't manually allocate the arrays declared in Globals, it seems like the program does so automatically once I pass them to Modern. Is this standard behavior or was I just lucky when I was initially not allocating them myself?
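For reference, a stripped-down sketch of the mechanism the code above relies on (the names fill and v are illustrative): an allocatable, intent(out) dummy argument may be associated with an unallocated actual argument, it is deallocated on entry if it happens to be allocated, and an allocation made inside the subroutine is visible to the caller, which is why the explicit allocate statements in Economy can be left out:
program alloc_dummy_demo
implicit none
real, allocatable :: v(:)            ! deliberately never allocated here
call fill(v, 5)                      ! allocation happens inside fill
print *, allocated(v), size(v)       ! prints: T 5
contains
subroutine fill(x, n)
real, allocatable, intent(out) :: x(:)  ! intent(out): deallocated on entry if needed
integer, intent(in) :: n
allocate(x(n))                          ! the caller sees this allocation
x = 0.0
end subroutine fill
end program alloc_dummy_demo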

openmp slows down near the end of execution

I converted a program to a parallel version, but ran into a problem. The program is large, and I tried to parallelize it at the outermost level. It is written in Fortran, and the following is the outline of my code.
program main
use omp_lib
implicit none
declare some variables...
call omp_set_num_threads(n)
!$OMP parallel do private(a,b,c,d)
do i=1,5000000
call sub1(a,b,c)
b=c+d;c=b+d;...
call sub2(b,c,d)
if (logical_expression) cycle
call sub3()
call sub4()
enddo
end program
There are some conditions under which the outermost loop will cycle, and these conditions mostly occur when i is small. When I printed out which iteration i was being performed, I found that near the end of the execution the values of i were consecutive and large, and the program ran as slowly as the sequential version. Does anyone know why this happens and how to solve it?
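One plausible explanation, sketched here under the assumption that the cheap iterations are the ones that cycle early: with the default static schedule each thread gets one contiguous chunk of i values, so the threads holding the small (cheap) values finish quickly and the thread holding the last chunk of large (expensive) values is left working alone, which looks like serial execution at the end. A dynamic schedule hands out chunks as threads become free; the chunk size 1000 below is only an illustrative starting point, and the loop body mirrors the placeholders from the question:
!$OMP parallel do private(a,b,c,d) schedule(dynamic,1000)
do i=1,5000000
  call sub1(a,b,c)
  b=c+d; c=b+d
  call sub2(b,c,d)
  if (logical_expression) cycle
  call sub3()
  call sub4()
enddo
!$OMP end parallel do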

parallel Fortran program will sleep at a certain time

Recently I had to change a serial program written in Fortran to a parallel version to get results faster, but I ran into some problems.
I'm using Ubuntu and the gfortran compiler, with OpenMP as the parallel API. In the serial version I use many modules to share data; in the OpenMP version I give those variables the threadprivate attribute, and some of them are also allocatable. In the serial version I allocate the arrays before the do loop, but if I do the same in the OpenMP version the program aborts with an invalid memory reference, even though the variables are threadprivate. So instead I allocate and deallocate the variables inside the loop, and I put that do loop in a parallel region. It runs without error, but there is another problem: after about 800 min of CPU time, the ps -ux command shows the process status change from Rl to Sl. I looked up the meaning of S, and it stands for
Interruptible sleep (waiting for an event to complete)
Why does this happen? Is it because I frequently allocate and free memory? The following is the sample code:
module variables
real, dimension(:), allocatable, save :: a
real, dimension(:,:), allocatable, save :: b
!$omp threadprivate(a,b)
integer, parameter :: n=100
contains
subroutine alloc_var
integer :: status
allocate(a(100),stat=status)
allocate(b(100,100),stat=status)
end subroutine
subroutine free_var
integer :: status
deallocate(a,stat=status)
deallocate(b,stat=status)
end subroutine
end module
Some other subroutines use the variables a and b:
subroutine cal_sth
use variables, only: a
...
end subroutine
For the serial version, the main program is:
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i, j
call alloc_var
do j=1, count1
...
other expresion ...
do i=1, count2
call cal_sth
end do
end do
call free_var
end program
For the parallel version:
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i,j
!$omp parallel do private(i,j)
do j=1, count1
...
other expression ...
do i=1, count2
call alloc_var
call cal_sth
if (logical expression) then
call free_var
cycle
end if
call free_var
end do
end do
end program
Either split the combined parallel do directive and rewrite the parallel loop like this:
!$omp parallel
call alloc_var
!$omp do
do i=1, count
call cal_sth
end do
!$omp end do
call free_var
!$omp end parallel
or use dedicated parallel regions as per Gilles' comment:
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i
!$omp parallel
call alloc_var
!$omp end parallel
...
!$omp parallel do
do i=1, count
call cal_sth
end do
!$omp end parallel do
...
! other OpenMP regions
...
!$omp parallel
call free_var
!$omp end parallel
end program
With your updated code, I think you have two different paths to explore for improving performance:
The memory allocation: As previously mentioned, the calls to alloc_var and free_var only need to be made in a parallel region, but not necessarily inside the do loop. By splitting the parallel do into a parallel followed by a do, you get room to call alloc_var before entering the loop and free_var after exiting it. The potential early exit from the inner loop, which might require releasing and re-allocating the memory, is not by itself a constraint preventing you from doing this (see the code below for an example of how this can be done).
The scheduling: the early termination of some of your inner iterations might translate into load imbalance between threads. This could explain the waiting times you experience. Explicitly setting the schedule to dynamic might reduce this effect and improve performance. This will need some experimentation to find the best scheduling policy, but dynamic seems a good starting point.
So here is how your code could look once these two ideas are implemented:
program main
implicit none
external :: cal_sth
use variables, only alloc_var, free_var
integer :: i,j
!$omp parallel
call alloc_var
!$omp do schedule(dynamic) private(i,j)
do j=1, count1
...
other expression ...
do i=1, count2
call cal_sth
if (logical expression) then
! uncomment these only if needed for some reason
!call free_var
!call alloc_var
cycle
end if
end do
end do
!$omp end do
call free_var
!$omp end parallel
end program

What causes the runtime difference in this trivial fortran code?

I observed a very curious effect in this trivial program
module Moo
contains
subroutine main()
integer :: res
real :: start, finish
integer :: i
call cpu_time(start)
do i = 1, 1000000000
call Squared(5, res)
enddo
call cpu_time(finish)
print '("Time = ",f6.3," seconds.")',finish-start
end subroutine
subroutine Squared(v, res)
integer, intent(in) :: v
integer, intent(out) :: res
res = v*v
end subroutine
! subroutine main2()
! integer :: res
! real :: start, finish
! integer :: i
!
! call cpu_time(start)
!
! do i = 1, 1000000000
! res = v*v
! enddo
! call cpu_time(finish)
!
! print '("Time = ",f6.3," seconds.")',finish-start
! end subroutine
end module
program foo
use Moo
call main()
! call main2()
end program
The compiler is gfortran 4.6.2 on a Mac. If I compile with -O0 and run the program, the timing is 4.36 seconds. If I uncomment the subroutine main2(), but not its call, the timing becomes 4.15 seconds on average. If I also uncomment the call to main2(), the first timing becomes 3.80 and the second 1.86 (understandable, since main2 has no function call).
I compared the assembler produced in the second and third cases (routine uncommented; call commented and uncommented) and they are exactly the same, save for the actual invocation of the main2 routine.
How can the code get this performance increase from a call to a routine that only happens later, with basically no difference in the resulting code?
The first thing I noticed was that your program is way too short for proper benchmarking. How many runs did you average over? What is the standard deviation? I added a nested do loop to your code to make it longer:
do i = 1, 1000000000
do j=1,10
call Squared(5, res)
enddo
enddo
I looked at only case 1 and case 2 (main2 commented and uncommented) because case 3 is different and irrelevant for this comparison. I would expect a slight increase in runtime in case 2, because of needing to load a larger executable into memory, even though that part is not used in the program.
So I did timing (3 runs each) for cases 1 and 2, for three compilers:
pgf90 10.6-0 64-bit target on x86-64 Linux -tp istanbul-64
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.0.2.137 Build 20110112
GNU Fortran (GCC) 4.1.2 20080704 (Red Hat 4.1.2-51)
on AMD Opteron(tm) Processor 6134
The output of my script is:
exp 1 with pgf90:
Time = 30.619 seconds.
Time = 30.620 seconds.
Time = 30.686 seconds.
exp 2 with pgf90:
Time = 30.606 seconds.
Time = 30.693 seconds.
Time = 30.635 seconds.
exp 1 with ifort:
Time = 77.412 seconds.
Time = 77.381 seconds.
Time = 77.395 seconds.
exp 2 with ifort:
Time = 77.834 seconds.
Time = 77.853 seconds.
Time = 77.825 seconds.
exp 1 with gfortran:
Time = 68.713 seconds.
Time = 68.659 seconds.
Time = 68.650 seconds.
exp 2 with gfortran:
Time = 71.923 seconds.
Time = 74.857 seconds.
Time = 72.126 seconds.
Notice the time difference between case 1 and case 2 is largest for gfortran, and smallest for pgf90.
EDIT: As Stefano Borini pointed out, I overlooked the fact that only the loop is being benchmarked via the call to cpu_time, so executable load time is out of the equation. The answer by AShelley suggests a possible reason for the difference. For longer runtimes the difference between the two cases becomes minimal. Still, I observe a significant difference in the case of gfortran (see above).
I think #IRO-bot has the right answer, but I would like to point out that code placement can influence timing, even for identical assembly.
I have 2 embedded applications running on identical processors. Each has the same hand-coded assembly routine to provide the tightest possible busy-loop (for inserting sub-microsecond delays). I was recently surprised to learn that in one app the loop took 50% longer than in the other. Both generated the exact same assembly.
It turns out that in one executable, the starting address of the loop body allowed it to fall entirely within the processor's sole instruction cache line. On the slower one, the same function started at an address which caused it to span two lines. The extra fetch required dominated the timing of such a tight loop.
So it is possible to find instances where adding unexecuted code will affect code timing, due to a change in the instruction caching sequence.
