Are gfortran whole-array expressions enabled? - parallel-processing

I'm new to Fortran and to gfortran. I learned that whole-array expressions are calculated in parallel, but I see that the calculations take place on just one core of my computer.
I use the following code:
program prueba_matrices
implicit none
integer, parameter :: num = 5000
double precision, dimension(1:num,1:num) :: A, B, C
double precision, dimension (num*num) :: temp
integer :: i
temp = (/ (i/2.0, i=1,num*num) /)
A = reshape(temp, (/ num, num/) )
B = reshape(temp, (/ num, num/) )
C = matmul(A , B)
end program prueba_matrices
I compile like this:
gfortran prueba_matrices.f03 -o prueba_gfortran
And, watching the graphs produced in real time by gnome-system-monitor, I can see that there is only one core working. If I substitute the line with the calculation
C = matmul(A , B)
for
C = A * B
It yields the same behaviour.
What am I doing wrong?

GFortran/GCC does have some automatic parallelization features; see http://gcc.gnu.org/wiki/AutoParInGCC . They are frequently not that good, so they are not enabled at any of the -O optimization levels; you have to select them specifically with -ftree-parallelize-loops=N, where N is the number of threads you want to use. Note, however, that in your example above a loop like "A*B" is likely constrained by memory bandwidth (for sufficiently large arrays), so adding cores might not help much. Furthermore, the MATMUL intrinsic is implemented in the gfortran runtime library, which is not compiled with the autopar options (unless you have specifically built it that way).
What could help your example code above more is to actually enable any optimization at all. With -O3 gfortran automatically enables vectorization, which can be seen as a way to parallelize loops as well, although not over several CPU cores.
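To try the flags mentioned above, here is a minimal sketch of an element-wise product written as an explicit loop. The program name, file name, array size, and the thread count of 4 are assumptions for illustration; whether gfortran actually splits the loop across threads depends on the compiler version and how it was built.

```fortran
! Sketch: element-wise product that the optimizer can vectorize
! and that the auto-parallelizer may split across threads.
! Hypothetical build lines (the file name is an assumption):
!   gfortran -O3 elementwise.f90 -o elementwise
!   gfortran -O3 -ftree-parallelize-loops=4 elementwise.f90 -o elementwise
program elementwise
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Fortran arrays are column-major, so the inner loop runs over the first index.
  do j = 1, n
     do i = 1, n
        c(i,j) = a(i,j) * b(i,j)
     end do
  end do

  ! Use the result so the loop cannot be optimized away.
  print *, sum(c)
end program elementwise
```

Comparing `time ./elementwise` for both builds shows whether the extra threads help; for a memory-bound loop like this one the gain may be small.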

If you want your call to matmul from gfortran to be multithreaded, the easiest way is to link to an external BLAS package that has been compiled with multithreading support. Candidates include OpenBLAS (née GotoBLAS), ATLAS, or commercial packages like Intel's MKL, AMD's ACML, or Apple's Accelerate framework.
So for instance, for this simple example:
program timematmult
real, allocatable, dimension(:,:) :: A, B, C
integer, parameter :: N = 2048
allocate( A(N,N) )
allocate( B(N,N) )
allocate( C(N,N) )
call random_seed
call random_number(A)
call random_number(B)
C = matmul(A,B)
print *, C(1,1)
deallocate(C)
deallocate(B)
deallocate(A)
end program timematmult
With the base matmul:
$ gfortran -o matmult matmult.f90
$ time ./matmult
514.38751
real 0m6.518s
user 0m6.374s
sys 0m0.021s
and with the multithreaded gotoblas library:
$ gfortran -o matmult matmult.f90 -fexternal-blas -lgoto2
$ time ./matmult
514.38696
real 0m0.564s
user 0m2.202s
sys 0m0.964s
Note in particular here that the real time is less than the user time, indicating multiple cores are being used.

I think that a key sentence in the course that you cited is "With array assignment there is no implied order of the individual assignments, they are performed, conceptually, in parallel." The key word is "conceptually". It isn't saying that whole array expressions are actually executed in parallel; you shouldn't expect more than one core to be used. For that, you need to use OpenMP or MPI (outside of Fortran itself) or the coarrays of Fortran 2008.
EDIT: Fortran didn't have, as part of the language, actual parallel execution until the coarrays of Fortran 2008. Some compilers might provide parallelization otherwise, and some language features make it easier for compilers to implement parallel execution (optionally). The sentence that I cited from the web article better states reality than the portion you cite. Whole-array expressions were not intended to require parallel execution; they are a syntactical convenience to the programmer, making the language higher level, so that array operations can be expressed in single statements, without writing do loops. In any case, no article on the web is definitive. Your observation of the lack of parallel execution shows which statement is correct. It does not contradict the Fortran language.
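For the OpenMP route, a minimal sketch might look like the following. The program name, file name, and array size are assumptions; the `!$` lines are only compiled when OpenMP is enabled, so the same source also builds as a serial program.

```fortran
! Sketch: explicit multi-threading of an element-wise product with OpenMP.
! Hypothetical build line: gfortran -O2 -fopenmp omp_product.f90 -o omp_product
program omp_product
!$ use omp_lib, only: omp_get_max_threads
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Printed only when compiled with OpenMP support.
!$ print *, 'threads available: ', omp_get_max_threads()

  ! The columns (j) are divided among the threads; i is private to each thread.
  !$omp parallel do private(i)
  do j = 1, n
     do i = 1, n
        c(i,j) = a(i,j) * b(i,j)
     end do
  end do
  !$omp end parallel do

  print *, sum(c)
end program omp_product
```

The number of threads can be set at run time with the OMP_NUM_THREADS environment variable.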

Related

random_number subroutine run time comparison ifort vs gfortran

I wrote this code:
program random
implicit none
integer :: i, j, limit
real(8) :: r, max_val
real(8) :: start, finish
max_val = 0.d0
limit = 10000
call CPU_TIME(start)
do i=1, limit
do j=1, limit
call random_number(r)
max_val = max(max_val, r)
end do
end do
call CPU_TIME(finish)
print *, max_val
print '("Time = ",f6.3," seconds.")',finish-start
end program random
And I compiled it with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7 using:
ifort *.f90 -O3 -no-vec -o intel.out
gfortran *.f90 -O3 -fno-tree-vectorize -o gnu.out
and the outputs are:
gnu:
0.9999999155521957
Time = 0.928 seconds.
intel:
0.999999968800691 (same for every run btw)
Time = 1.989 seconds.
When I run a few times, the run time of each is pretty much the same.
Why is gfortran faster than ifort and how can I make ifort run as fast as gfortran?
Different compilers have their own libraries with implementations of the intrinsic functions and subroutines. They will differ in performance and may also differ in their results. Gfortran uses the glibc library for many general intrinsics and the libgfortran library for many Fortran-specific ones. The Intel compiler comes with its own runtime-library suite.
Notably, the Fortran standard gives no guarantees about the quality of the pseudo-random generator used for random_number(). Even if it did, the actual implementation in code could always differ and hence the actual performance.
There are many external pseudo-random number generator libraries available. Some are faster, some slower. Some are more robust, some fail certain randomness tests (sometimes that matters, sometimes it does not). Some give more random bits in a single call, some give fewer. If you need some particular properties for the generator in all your compilers, you might be better off with an external library.
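As an illustration of what a compiler-independent generator can look like, here is a minimal sketch of Marsaglia's xorshift64 (shift triple 13, 7, 17) written with bit intrinsics only. The module and procedure names are hypothetical, and this small generator has not been vetted for statistical quality here; for serious work a tested external library is the safer choice.

```fortran
! Sketch: a self-contained xorshift64 generator (names are hypothetical).
module xorshift_mod
  use iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: xs_seed, xs_uniform
  integer(int64), save :: state = 88172645463325252_int64
contains
  subroutine xs_seed(s)
    integer(int64), intent(in) :: s
    ! A zero state would stay zero forever, so fall back to the default.
    if (s /= 0_int64) then
       state = s
    else
       state = 88172645463325252_int64
    end if
  end subroutine xs_seed

  function xs_uniform() result(u)
    real(real64) :: u
    ! ishft is a logical shift: positive counts shift left, negative right.
    state = ieor(state, ishft(state, 13))
    state = ieor(state, ishft(state, -7))
    state = ieor(state, ishft(state, 17))
    ! Keep the top 53 bits and scale them into [0, 1).
    u = real(ishft(state, -11), real64) * 2.0_real64**(-53)
  end function xs_uniform
end module xorshift_mod

program xorshift_demo
  use iso_fortran_env, only: int64, real64
  use xorshift_mod, only: xs_seed, xs_uniform
  implicit none
  integer :: i
  real(real64) :: r, max_val

  call xs_seed(20240101_int64)   ! hypothetical seed value
  max_val = 0.0_real64
  do i = 1, 1000000
     r = xs_uniform()
     max_val = max(max_val, r)
  end do
  print *, max_val
end program xorshift_demo
```

Because it uses only integer bit operations on a fixed 64-bit kind, the sequence should not depend on which compiler's runtime library is linked.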

Intrinsic dot_product slower than a*a+b*b+c*c?

Recently I tested the runtime difference of explicit summation and intrinsic functions to calculate a dot product. Surprisingly the naïve explicit writing was faster.
program test
real*8 , dimension(3) :: idmat
real*8 :: dummy(3)
idmat=0
dummy=0
do i=1,3
idmat(i)=1
enddo
do j=1,10**10
! dummy(mod(j,3)+1)=dot_product(idmat,idmat)
dummy(mod(j,3)+1)=idmat(1)*idmat(1)+idmat(2)*idmat(2)+idmat(3)*idmat(3)
enddo
print*, dummy
end program test
Here is what confuses me:
1. No -O3 Optimization
If I use: gfortran test.f90 -o test ; time ./test
I find a runtime of 6.297 s using the function dot_product (commented out above) and 4.486 s using the manual explicit writing.
How does that make sense?
2. Including -O3 Optimization
If I use: gfortran test.f90 -O3 -o test ; time ./test
I find a runtime of 1.808 s and 1.803 s respectively. So both are actually of the same speed.
3. What I actually expect
...is the intrinsic function to be faster, as it could:
compute the 3 products in parallel
add the 3 products
where the explicit form has to sequentially:
compute product 1
compute product 2
compute product 3
add the 3 products
Do I have to create a new parallel dot_product function to be faster? Or is there an additional option for the gfortran compiler which I don't know?
Please note: I read across the internet about SIMD, auto-vectorization and parallelisation in modern Fortran. Although I learned something my question wasn't answered anywhere.
It makes no sense even looking at the non-optimized numbers. The optimized numbers are the same, so everything is fine.
"...is the intrinsic function to be faster, as it could: compute the 3 products in parallel"
There will be nothing done in parallel unless you enable specific parallel optimizations. These optimizations will be as easy to do for the loop as for the intrinsic and often even much easier for the loop.
Well, at least for the normal sense of parallel using threads or similar. What can be done in parallel is to use the vector instructions and to schedule the instructions to overlap in the CPU pipeline. That can be done by the optimizing compiler and is likely done for both versions when you use -O3. You should not expect this to happen when no optimizations are enabled.
The use of the "parallel" instructions (SIMD) can sometimes be improved by using compiler directives like !$omp simd or !DEC$ VECTOR.
"Do I have to create a new parallel dot_product function to be faster?"
Yes, normally you do. For example using OpenMP. Or you could:
"Or is there an additional option for the gfortran compiler which I don't know?"
Yes, the automatic parallelization https://gcc.gnu.org/wiki/AutoParInGCC , for example -floop-parallelize-all -ftree-parallelize-loops=4
Note that it will not make those individual multiplications in parallel, it will make the i loop parallel.
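As a small illustration of the !$omp simd directive mentioned above, the following sketch applies it to a hand-written dot product. The program name, file name, and vector length are assumptions; -fopenmp-simd enables only the SIMD directives without creating threads.

```fortran
! Sketch: hand-written dot product with an OpenMP SIMD hint.
! Hypothetical build line: gfortran -O3 -fopenmp-simd simd_dot.f90 -o simd_dot
program simd_dot
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  real(dp) :: x(n), y(n), s
  integer :: i

  call random_number(x)
  call random_number(y)

  s = 0.0_dp
  ! The reduction clause lets the compiler keep partial sums in vector lanes.
  !$omp simd reduction(+:s)
  do i = 1, n
     s = s + x(i) * y(i)
  end do

  ! The two values may differ in the last digits because the
  ! order of the additions is not necessarily the same.
  print *, s, dot_product(x, y)
end program simd_dot
```

For a length-3 vector as in the question there is little for the vector units to do; the hint matters more for long loops.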

random_number() gives vastly different behavior between GNU and PGI fortran compilers

Here's a simple fortran program I was using to understand the behavior of the fortran intrinsic uniform random number generator.
program test_prog
implicit none
integer, allocatable :: seed(:)
real(8), dimension(2) :: unif_rand
integer :: nseed ! minimum number of random seed value
integer :: i,n
call random_seed( size=nseed )
nseed = 100
allocate( seed(nseed) )
write(*,*) "nseed: ",nseed
do n = 1,5
seed(:) = n**10
call random_seed( put=seed )
call random_number(harvest=unif_rand)
write(*,1000) seed(nseed),unif_rand(1),unif_rand(2)
write(*,*) ""
1000 format(i12," ",f12.8," ",f12.8)
enddo
end program test_prog
When I compile with gfortran I get sensible results:
1 0.76322100 0.72975598
1024 0.30901699 0.80380552
59049 0.05916934 0.69849271
1048576 0.59972035 0.71558547
9765625 0.79167428 0.37621382
But when I compile with pgf90 I get very different results:
1 0.00000024 0.00000024
1024 0.00024414 0.00024414
59049 0.01407838 0.01407838
1048576 0.25000003 0.25000003
9765625 0.32830648 0.32830648
With small seed values the PGI results are always very close to zero, so it seems the PGI compiler does something to make the random values such that they are scaled by the seed value. This is very problematic for my current project because I need it to give consistent results for different compilers.
Some google searches haven't turned up any explanation, so I'm wondering if anyone here can explain why these results are so different?
Or does anyone know of a trick to make the PGI compiler results more in line with the GNU compiler results?
Or does anyone know of some code for a decent random number generator available online that I could implement as an alternative to the intrinsic routines?
Firstly, as discussed here several times before, the Fortran standard gives NO requirement or guarantee about the quality of the intrinsic random number generator. One has to use external libraries to have any quality guarantee.
Secondly, the seed in a PRNG is there to give the source of initial entropy. Different generators cope with different seeds with varying success, but in poor generators you need the bits in the seed to also be distributed relatively "randomly". With too much coherence in the seed bits, the results may be terrible.
As to which external library to use - recommendations are off topic, but there are plenty of them and you will even find articles with comparisons of the quality of individual methods. Many recommendations were already given on this very site in the comments or answers under similar questions. Just search for random and fortran.
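To illustrate the point about seed bits, here is a minimal sketch that fills the seed array with varied values instead of one repeated small number. It uses the MINSTD Lehmer recurrence (multiplier 48271, modulus 2^31 - 1) in 64-bit arithmetic so that nothing overflows; the program name and the base seed 12345 are assumptions, and a 32-bit default integer is assumed for the seed array. This does not make the output identical across compilers, since each compiler still uses its own generator.

```fortran
! Sketch: seeding random_number with well-mixed seed values.
program seed_demo
  use iso_fortran_env, only: int64
  implicit none
  integer, allocatable :: seed(:)
  integer(int64) :: state
  integer :: nseed, i
  double precision :: u(2)

  ! Ask the runtime how many integers the seed needs.
  call random_seed(size=nseed)
  allocate(seed(nseed))

  ! Hypothetical base seed; must lie in 1 .. 2147483646.
  state = 12345_int64
  do i = 1, nseed
     ! The product stays below 2**47, so it fits in a 64-bit integer.
     state = mod(state * 48271_int64, 2147483647_int64)
     seed(i) = int(state)
  end do

  call random_seed(put=seed)
  call random_number(u)
  print *, u
end program seed_demo
```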

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is a high-end Intel i7 Haswell processor that supports 64-bit data types, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and F are 64-bit data types,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means the calculation of eqn. 2 should take the same number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled in by a random generator at runtime.
Now my question is,
Considering I am using GCC or GCC-7 with -O3, how far can the compiler optimize the expression? For example, if A_64 becomes all zeroes (which can happen with probability 2^{-64}), then we don't need to calculate the first part of eqn. 2, and O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a C compiler to optimize in such a way? We have to remember that the values are filled in at runtime and the boolean expressions have around 120 variables.
Is it possible for a processor to do such an optimization (as in question 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined. Is it possible for a processor to pull an operation out of the pipeline if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a C compiler to optimize in such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). I can't give you a hard guarantee, though; if you want to be sure, you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom; if an input is accidentally zero, that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions: they still have to go through (most of) the front-end of the processor, even if it is just to find out that they didn't need to be executed after all.

How can I ensure that my Fortran FORALL construct is being parallelized?

I've been given a 2D matrix representing temperature points on the surface of a metal plate. The edges of the matrix (plate) are held constant at 20 degrees C and there is a constant heat source of 100 degrees C at one pre-defined point. All other grid points are initially set to 50 degrees C.
My goal is to take all interior grid points and compute its steady-state temperature by iteratively averaging over the surrounding four grid points (i+1, i-1, j+1, j-1) until I reach convergence (a change of less than 0.02 degrees C between iterations).
As far as I know, the order in which I iterate over the grid points is irrelevant.
To me, this sounds like a fine time to invoke the Fortran FORALL construct and explore the joys of parallelization.
How can I ensure that the code is indeed being parallelized?
For example, I can compile this on my single-core PowerBook G4 and I would expect no improvement in speed due to parallelization. But if I compile on a Dual Core AMD Opteron, I would assume that the FORALL construct can be exploited.
Alternatively, is there a way to measure the effective parallelization of a program?
Update
In response to M.S.B's question, this is with gfortran version 4.4.0. Does gfortran support automatic multi-threading?
It's remarkable that the FORALL construct has been rendered obsolete by, I suppose, what is then auto-vectorization.
Perhaps this is best for a separate question, but how does auto-vectorization work? Is the compiler able to detect that only pure functions or subroutines are being used in a loop?
FORALL is an assignment construct, not a looping construct. The semantics of FORALL state that the expression on the right hand side (RHS) of each assignment within the FORALL is evaluated completely before it is assigned to the left hand side (LHS). This has to be done no matter how complex the operations on the RHS, including cases where the RHS and the LHS overlap.
Most compilers punt on optimizing FORALL, both because it is difficult to optimize and because it is not commonly used. The easiest implementation is to simply allocate a temporary for the RHS, evaluate the expression and store it in the temporary, then copy the result into the LHS. Allocation and deallocation of this temporary is likely to make your code run quite slowly. It is very difficult for a compiler to automatically determine when the RHS can be evaluated without a temporary; most compilers don't make any attempt to do so. Nested DO loops turn out to be much easier to analyze and optimize.
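For the plate problem in the question, a nested DO version might look like the sketch below. The grid size, the location of the heat source, and the iteration cap are assumptions, since the question does not give them; the OpenMP directive is optional and is ignored unless the code is built with -fopenmp.

```fortran
! Sketch: Jacobi iteration for the heated plate using nested DO loops.
! Hypothetical build line: gfortran -O2 -fopenmp jacobi_plate.f90 -o jacobi_plate
program jacobi_plate
  implicit none
  integer, parameter :: n = 100             ! grid size (assumption)
  integer, parameter :: si = 50, sj = 50    ! heat source location (assumption)
  integer, parameter :: max_iter = 100000   ! safety cap (assumption)
  real :: t(n,n), tnew(n,n), change
  integer :: i, j, iter

  t = 50.0                 ! interior starts at 50 degrees C
  t(1,:) = 20.0            ! edges are held at 20 degrees C
  t(n,:) = 20.0
  t(:,1) = 20.0
  t(:,n) = 20.0
  t(si,sj) = 100.0         ! constant heat source
  tnew = t
  iter = 0

  do
     ! Each new value depends only on the old grid, so the order is irrelevant.
     !$omp parallel do private(i)
     do j = 2, n-1
        do i = 2, n-1
           tnew(i,j) = 0.25 * (t(i+1,j) + t(i-1,j) + t(i,j+1) + t(i,j-1))
        end do
     end do
     !$omp end parallel do

     tnew(si,sj) = 100.0   ! keep the source fixed
     change = maxval(abs(tnew - t))
     t = tnew
     iter = iter + 1
     if (change < 0.02 .or. iter >= max_iter) exit
  end do

  print *, 'iterations: ', iter, ' last change: ', change
end program jacobi_plate
```

Writing into a second array plays the role of the temporary that FORALL semantics require, but here it is allocated once rather than on every sweep.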
With some compilers, you may be able to parallelize evaluation of the RHS by enclosing the FORALL with the OpenMP "workshare" directive and compiling with whatever flags are necessary to enable OpenMP, like so:
!$omp parallel workshare
FORALL (i=,j=,...)
<assignment>
END FORALL
!$omp end parallel workshare
gfortran -fopenmp blah.f90 -o blah
Note that a compliant OpenMP implementation (including at least older versions of gfortran) is not required to evaluate the RHS in parallel; it is acceptable for an implementation to evaluate the RHS as though it is enclosed in an OpenMP "single" directive. Note also that the "workshare" likely will not eliminate the temporary allocated by the RHS. This was the case with an old version of the IBM Fortran compiler on Mac OS X, for instance.
If you use the Intel Fortran Compiler, you can use a command line switch to turn on or increase the compiler's verbosity level for parallelization/vectorization. This way, during compilation/linking you will be shown something like:
FORALL loop at line X in file Y has been vectorized
I admit that it has been a few years since the last time I used it, so the compiler message might actually look very different, but that's the basic idea.
The best way is to measure the clock time of the calculation. Try it with and without parallel code. If the clock time decreases, then your parallel code is working. The Fortran intrinsic system_clock, called before and after the code block, will give you the clock time. The intrinsic cpu_time will give you the CPU time, which might go up when code is run multi-threaded due to overhead.
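A minimal sketch of that measurement, with matmul standing in for the code block being timed (the program name and matrix size are assumptions):

```fortran
! Sketch: wall-clock time versus CPU time around a block of work.
program timing_demo
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 1000
  real, allocatable :: a(:,:), b(:,:), c(:,:)
  integer(int64) :: count0, count1, rate
  real :: cpu0, cpu1

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  call system_clock(count0, rate)   ! wall-clock ticks and ticks per second
  call cpu_time(cpu0)               ! processor time in seconds

  c = matmul(a, b)                  ! the work being measured

  call cpu_time(cpu1)
  call system_clock(count1)

  print *, c(1,1)
  print '(a,f8.3,a)', 'wall time: ', real(count1 - count0) / real(rate), ' s'
  print '(a,f8.3,a)', 'cpu time:  ', cpu1 - cpu0, ' s'
end program timing_demo
```

If the code really runs on several cores, the wall time should drop while the CPU time stays the same or rises.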
The lore is that FORALL is not as useful as was thought when it was introduced into the language -- that it is more of an initialization construct. Compilers are equally adept at optimizing regular loops.
Fortran compilers vary in their abilities to implement true parallel processing without it being explicitly specified, e.g., with OpenMP or MPI. What compiler are you using?
To get automatic multi-threading, I've used ifort. Manually, I've used OpenMP. With both of these, you can compile your program with and without the parallelization and measure the difference.

Resources