What causes the runtime difference in this trivial fortran code? - performance

I observed a very curious effect in this trivial program
module Moo
contains
subroutine main()
integer :: res
real :: start, finish
integer :: i
call cpu_time(start)
do i = 1, 1000000000
call Squared(5, res)
enddo
call cpu_time(finish)
print '("Time = ",f6.3," seconds.")',finish-start
end subroutine
subroutine Squared(v, res)
integer, intent(in) :: v
integer, intent(out) :: res
res = v*v
end subroutine
! subroutine main2()
! integer :: res
! real :: start, finish
! integer :: i
!
! call cpu_time(start)
!
! do i = 1, 1000000000
! res = v*v
! enddo
! call cpu_time(finish)
!
! print '("Time = ",f6.3," seconds.")',finish-start
! end subroutine
end module
program foo
use Moo
call main()
! call main2()
end program
Compiler is gfortran 4.6.2 on mac. If I compile with -O0 and run the program, the timing is 4.36 seconds. If I uncomment the subroutine main2(), but not its call, the timing becomes 4.15 seconds on average. If I also uncomment the call main2() the first timing becomes 3.80 and the second 1.86 (understandable, I have no function call).
I compared the assembler produced in the second and third cases (routine uncommented; call commented and uncommented) and they are exactly the same, save for the actual invocation of the main2 routine.
How can the code get this performance increase from a call to a routine that is going to happen in the future, and basically no difference in the resulting code?

First thing I noticed was that your program is way too short for proper benchmarking. How many runs do you use to average? What is the standard deviation? I added a nested do loop to your code to make it longer:
do i = 1, 1000000000
do j=1,10
call Squared(5, res)
enddo
enddo
I looked at only case 1 and case 2 (main2 commented and uncommented) because case 3 is different and irrelevant for this comparison. I would expect a slight increase in runtime in case 2, because of needing to load a larger executable into memory, even though that part is not used in the program.
So I did timing (3 runs each) for cases 1 and 2, for three compilers:
pgf90 10.6-0 64-bit target on x86-64 Linux -tp istanbul-64
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.0.2.137 Build 20110112
GNU Fortran (GCC) 4.1.2 20080704 (Red Hat 4.1.2-51)
on AMD Opteron(tm) Processor 6134
The output of my script is:
exp 1 with pgf90:
Time = 30.619 seconds.
Time = 30.620 seconds.
Time = 30.686 seconds.
exp 2 with pgf90:
Time = 30.606 seconds.
Time = 30.693 seconds.
Time = 30.635 seconds.
exp 1 with ifort:
Time = 77.412 seconds.
Time = 77.381 seconds.
Time = 77.395 seconds.
exp 2 with ifort:
Time = 77.834 seconds.
Time = 77.853 seconds.
Time = 77.825 seconds.
exp 1 with gfortran:
Time = 68.713 seconds.
Time = 68.659 seconds.
Time = 68.650 seconds.
exp 2 with gfortran:
Time = 71.923 seconds.
Time = 74.857 seconds.
Time = 72.126 seconds.
Notice the time difference between case 1 and case 2 is largest for gfortran, and smallest for pgf90.
EDIT: After Stefano Borini pointed out that I overlooked the fact that only the looping is being benchmarked using call to cpu_time, executable load-time is out of the equation. Answer by AShelley suggests a possible reason for this. For longer runtimes, the difference between the 2 cases becomes minimal. Still - I observe a significant difference in case of gfortran (see above)

I think #IRO-bot has the right answer, but I would like to point out that code placement can influence timing, even for identical assembly.
I have 2 embedded applications running on identical processors. Each has the same hand-code assembly routine to provide the tightest possible busy-loop (for inserting sub-microsecond delays). I was recently suprised to learn that in one app, the loop took 50%! longer than the other one. Both generated the exact same assembly.
It turns out that in one executable, the starting address of the loop body allowed it to fall entirely within the processor's sole instruction cache line. On the slower one, the same function started at an address which caused it to span two lines. The extra fetch required dominated the timing of such a tight loop.
So it is possible to find instances where adding unexecuted code will affect code timing, due a change in the instruction caching sequence.

Related

openmp computation time is not improved proportionally to thread number

I have a fortran program using openmp which is simplified as below:
program test
use omp_lib
use function_module
use data_module
implicit none
real*8 :: theta,f
call readdata()
call func_subroutine(theta,f)
end program
subroutine func_subroutine(theta,f)
use omp_lib
use data_module
implicit none
integer :: i
real*8 :: theta,f
real*8 :: t1,t2,t_vec(20),tsum
!$omp parallel do shared(data_variables,t_vec) private(i,t1,t2)
do i=1,20
! read individual data
! count time
t1=omp_get_wtime()
call func(theta,f_ind)
t2=omp_get_wtime()
t_vec(i)=t2-t1
end do
!$end omp parallel do
tsum=sum(t_vec)
end subroutine
In short, I calculate the function value for individual 1 (f_ind) for 20 times. And tsum only counts the time of calculating f_ind.
With 5 threads, tsum=7.2s, and the program takes 33s (without considering the time spent on the subroutine readdata()).
However, with 10 threads, tsum=9.5s, and the program takes 20s.
And with 20 threads, tsum=12s, and the program takes 12s.
The computation time is not improved proportionally to thread number I am using. I am very confused. Is there something wrong?

Speed up random number generation in MATLAB

Is there any way to generate pseudo-random numbers to less precision and thus speed the process up?
Another thing is that I know it saves time if random numbers are generated all at once (e.g. rand(100,1000)), instead of one by one. Could someone explain why this is true?
If you have a CUDA-capable GPU, you can do random number generation on it, as it's supposed to be much faster... Specifically Philox4x32-10:
parallel.gpu.rng(0, 'Philox4x32-10');
R = gpuArray.rand(sZ,'single'); % run this for more info: doc('gpuArray/rand')
MATLAB actually implements more than one random number generator. They differ significantly in terms of execution time and in terms of "randomness" (I think, but I didn't verify). However, I understand from your question that speed is more important for you.
% 'twister' is the default in MATLAB Versions 7.4 and later
tic();
for i=1:1000000
rand('twister');
end
toc();
%Elapsed time is 2.912960 seconds.
% 'state' is the default in MATLAB versions 5 through 7.3
tic();
for i=1:1000000
rand('state');
end
toc();
% Elapsed time is 2.162040 seconds.
% 'seed' is the default in MATLAB version 4
tic();
for i=1:1000000
rand('seed');
end
toc();
% Elapsed time is 0.758830 seconds.
Important note: I ran the script above with an rather old version of MATLAB (v.7.6, a.k.a. R2008a). In newer versions, the syntax rand(generator) is discouraged . Instead, you should use the function rng(seed, generator) (online documentation). As a side effect, rng(seed, generator) gives you even more random number generators to choose from. Check the documentation for details.
Regarding the second question: Whatever generator you pick, generating many random numbers at once will always be faster than generating many single random numbers. This is because MATLAB's internals are heavily optimized for parallel processing.
tic();
for i=1:100000
rand();
end
toc();
% Elapsed time is 0.024388 seconds.
tic();
rand(100, 1000);
toc();
% Elapsed time is 0.000680 seconds.
Since R2015a the rng function for configuring and seeding the global generator has a 'simdTwister' option that uses a faster "SIMD-oriented Fast Mersenne Twister" algorithm:
rng(1,'twister');
R = rand(1e4); % Warmup for timing
tic
R = rand(1e4);
toc
rng(1,'simdTwister');
R = rand(1e4); % Warmup for timing
tic
R = rand(1e4);
toc
This will probably be the fastest builtin generator for your system (excepting the possibility of GPU-based generators). On my computer it's a little more than twice as fast as the default Mersenne Twister algorithm for large arrays.

MPI Fortran WTIME not working well

I am coding using Fortran MPI and I need to get the run time of the program. Therefore I tried to use the WTIME() function but I am getting some strange results.
Part of the code is like this:
program heat_transfer_1D_parallel
implicit none
include 'mpif.h'
integer myid,np,rc,ierror,status(MPI_STATUS_SIZE)
integer :: N,N_loc,i,k,e !e = number extra points (filled with 0s)
real :: time,tmax,start,finish,dt,dx,xmax,xmin,T_in1,T_in2,T_out1,T_out2,calc_T,t1,t2
real,allocatable,dimension(:) :: T,T_prev,T_loc,T_loc_prev
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD,np,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierror)
...
t1 = MPI_WTIME()
time = 0.
do while (time.le.tmax)
...
end do
...
call MPI_BARRIER(MPI_COMM_WORLD,ierror)
t2 = MPI_WTIME()
call MPI_FINALIZE(ierror)
if(myid.eq.0) then
write(*,"(8E15.7)") T(1:N-e)
write(*,*)t2
write(*,*)t1
end if
And the output value for t1 and t2 is the same and a very big: 1.4240656E+09
Any ideas why? Thank you so much.
From the documentation: Return value: Time in seconds since an arbitrary time in the past. They didn't specify how far back ;-) Only t2-t1 is meaningful here...
Also, the return value of MPI_Wtime() is double precision! t1 and t2 are declared as single precision floats.
Building on the answer from Alexander Vogt, I'd like to add that many Unix implementations of MPI_WTIME use gettimeofday(2) (or similar) to retrieve the system time and then convert the returned struct timeval into a floating-point value. Timekeeping in Unix is done by tracking the number of seconds elapsed since the Epoch (00:00 UTC on 01.01.1970). While writing this, the value is 1424165330.897136 seconds and counting.
With many Fortran compilers, REAL defaults to single-precision floating point representation that can only hold 7.22 decimal digits, while you need at least 9 (more if subsecond precision is needed). The high-precision time value above becomes 1.42416538E9 when stored in a REAL variable. The next nearest value that can be represented by the type is 1.4241655E9. Therefore you cannot measure time periods shorter than (1.4241655 - 1.42416538).109 or 120 seconds.
Use double precision.

Are gfortan whole array expressions, enabled?

I'm new to fortran and to gfortran. I learned that whole expression arrays are calculated in parallel, but I see that calculations only take place in just one core of my computer.
I use the following code:
program prueba_matrices
implicit none
integer, parameter :: num = 5000
double precision, dimension(1:num,1:num) :: A, B, C
double precision, dimension (num*num) :: temp
integer :: i
temp = (/ (i/2.0, i=1,num*num) /)
A = reshape(temp, (/ num, num/) )
B = reshape(temp, (/ num, num/) )
C = matmul(A , B)
end program prueba_matrices
I complie like this:
gfortran prueba_matrices.f03 -o prueba_gfortran
And, watching the graphs produced in real time by gnome-system-monitor, I can see that there is only one core working. If I substitute the line with the calculation
C = matmul(A , B)
for
C = A * B
It yields the same behaviour.
What am I doing wrong?
GFortran/GCC does have some automatic parallelization features, see http://gcc.gnu.org/wiki/AutoParInGCC . They are frequently not that good, so they are not enabled at any of the -ON optimization levels, you have to select it specifically with -ftree-parallelize-loops=N, where N is the number of threads you want to use. Note however that in your example above a loop like "A*B" is likely constrainet by memory bandwidth (for sufficiently large arrays), and thus adding cores might not help that much. Furthermore, the MATMUL intrinsic leads to an implementation in the gfortran runtime library, which is not compiled with the autopar options (unless you have specifically built it that way).
What could help your example code above more is to actually enable any optimization at all. With -O3 Gfortran automatically enables vectorization, which can be seen as a way to parallelize loops as well, although not over several cpu cores.
If you want your call to matmult from gfortran to be multithreaded, easiest is to simply link to external BLAS package that has been compiled with multithreading support. Candidates include OpenBlas (née Goto Blas), ATLAS, or commercial packages like Intel's MKL, AMD's ACML, or Apple's accelerate framework.
So for instance, for this simple example:
program timematmult
real, allocatable, dimension(:,:) :: A, B, C
integer, parameter :: N = 2048
allocate( A(N,N) )
allocate( B(N,N) )
allocate( C(N,N) )
call random_seed
call random_number(A)
call random_number(B)
C = matmul(A,B)
print *, C(1,1)
deallocate(C)
deallocate(B)
deallocate(A)
end program timematmult
With the base matmul:
$ gfortran -o matmult matmult.f90
$ time ./matmult
514.38751
real 0m6.518s
user 0m6.374s
sys 0m0.021s
and with the multithreaded gotoblas library:
$ gfortran -o matmult matmult.f90 -fexternal-blas -lgoto2
$ time ./matmult
514.38696
real 0m0.564s
user 0m2.202s
sys 0m0.964s
Note in particular here that the real time is less than the user time, indicating multiple cores are being used.
I think that a key sentence in the course that you cited is "With array assignment there is no implied order of the individual assignments, they are performed, conceptually, in parallel." The key word is "conceptually". It isn't saying that whole array expressions are actually executed in parallel; you shouldn't expect more than one core to be used. For that, you need to use OpenMP or MPI (outside of Fortran itself) or the coarrays of Fortran 2008.
EDIT: Fortran didn't have, as part of the language, actual parallel execution until the coarrays of Fortran 2008. Some compilers might provide parallelization otherwise and some language features make it easier for compilers to implement parallel execution (optionally). The sentence that I cited from the web article better states reality than the portion you cite. Whole-array expressions were not intended to require parallel execution; they are a syntactical convenience to the programmer, making the language higher level, so that array operations can be expressed in single statements, without writing do loops. In any case, no article on the web is definitive. Your observation of the lack of parallel executions shows which statement is correct. It does not contradict the Fortran language.

does gFortran's cpu_time() return user time, system time, or the sum of both?

I need to do some timing to compare the performance of some Fortran Vs C code.
In C I can get both user time and system time independently.
When using gFortran's cpu_time() what does it represent?
With in IBM's fortran compiler one can choose what to output by setting an environment variable (see CPU_TIME() )
I found no reference to something similar in gFortran's documentation.
So, does anybody know if gFortran's cpu_time() returns user time, system time, or the sum of both?
Gfortran CPU_TIME returns the sum of the user and system time.
On MINGW it uses GetProcessTimes(), on other platforms getrusage() or if getrusage() is not available, times().
See
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgfortran/intrinsics/cpu_time.c;h=619f8d25246409e0f32c96299db724213aa62b45;hb=refs/heads/master
and
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgfortran/intrinsics/time_1.h;h=12d79ebc12fecf52baa0895c7ab8accc41dab500;hb=refs/heads/master
FWIW, if you wish to measure the wallclock time rather than CPU time, please use the SYSTEM_CLOCK intrinsic instead of CPU_TIME.
my guess: the total of user and system time, otherwise it would be mentioned? Probably depends on the OS anyway, maybe not all of them make the distinction. As far a s I know, CPU time is the time which the OS assigns to your process, be it in user mode or in kernel mode executed on behalf of the process.
Is it important for you to have that distinction?
For performance comparison, I would probably go for wall-time anyway, and use CPU time to guess how much I/O it is doing by subtracting it from the wall-time.
If you need wallclock time, you may use date_and_time, http://gcc.gnu.org/onlinedocs/gcc-4.0.2/gfortran/DATE_005fAND_005fTIME.html
I'm not sure how standard it is, but in my experience it works on at least four different platforms, including exotic Cray designs.
One gotcha here is to take care of the midnight, like this:
character*8 :: date
character*10 :: time
character*5 :: zone
integer :: tvalues(8)
real*8 :: time_prev, time_curr, time_elapsed, time_limit
integer :: hr_curr, hr_prev
! set the clock
call date_and_time(date, time, zone, tvalues)
time_curr = tvalues(5)*3600.d0 + tvalues(6)*60.d0 + tvalues(7) ! seconds
hr_curr = tvalues(5)
time_prev=0.d0; time_elapsed = 0.d0; hr_prev = 0
!... do something...
time_prev = time_curr; hr_prev = hr_curr
call date_and_time(date, time, zone, tvalues)
time_curr = tvalues(5)*3600.d0 + tvalues(6)*60.d0 + tvalues(7) ! seconds
hr_curr = tvalues(5)
dt = time_curr - time_prev
if( hr_curr < hr_prev )dt = dt + 24*3600.d0 ! across the midnight
time_elapsed = time_elapsed + dt
#Emanual Ey - In continuation to your comment on #steabert's post - (what follows goes for Intel's; I don't know whether something differs on other compilers). User cpu time + system cpu time should equal cpu time. Elapsed, real, or "wall clock" time should be greater than total charged cpu time. To measure wallclock time, it is best to put the time command, before and after the tricky part. Ugh, I'm gonna make this more complicated than it should be. Could you read the part on Timing your application on Intel's manual page (you'll have to find the "Timing your application" in the index). Should clear up a few things.
As I said before, that goes for Intel's. I don't have access to gfortran's compiler.

Resources