BLAS subroutines dgemm, dgemv and ddot don't work with scalars?

I have a Fortran subroutine which uses the BLAS subroutines dgemm, dgemv and ddot, which calculate matrix * matrix, matrix * vector and vector * vector. I have m * m matrices and m * 1 vectors. In some cases m=1. It seems that those subroutines don't work well in those cases. They don't give errors, but there seems to be some numerical instability in the results. So I have to write something like:
if(m>1) then
vtuni(i,t) = yt(i,t) - ct(i,t) - ddot(m, zt(i,1:m,(t-1)*tvar(3)+1), 1, arec, 1)
else
vtuni(i,t) = yt(i,t) - ct(i,t) - zt(i,1,(t-1)*tvar(3)+1)*arec(1)
end if
So my actual question is: am I right that those BLAS subroutines don't work properly when m=1, or is there just something wrong in my code? Can the compiler affect this? I'm using gfortran.

BLAS routines are supposed to behave correctly with objects of size 1. I don't think this can depend on the compiler, but it could possibly depend on the BLAS implementation you're relying on (though I'd consider that a bug in the implementation). The reference (read: not target-optimised) implementation of BLAS, which can be found on Netlib, handles that case fine.
I've done some testing on both arrays of size 1 and size-1 slices of a larger array (as in your own code), and both work fine:
$ cat a.f90
implicit none
double precision :: u(1), v(1)
double precision, external :: ddot
u(:) = 2
v(:) = 3
print *, ddot(1, u, 1, v, 1)
end
$ gfortran a.f90 -lblas && ./a.out
6.0000000000000000
$ cat b.f90
implicit none
double precision, allocatable :: u(:,:,:), v(:)
double precision, external :: ddot
integer :: i, j
allocate(u(3,1,3),v(1))
u(:,:,:) = 2
v(:) = 3
i = 2
j = 2
print *, ddot(1, u(i,1:1,j), 1, v, 1)
end
$ gfortran b.f90 -lblas && ./a.out
6.0000000000000000
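Things I'd consider to debug this problem further:
Check that your ddot declaration is correct: it should be declared as a double precision function (for example double precision, external :: ddot, as in the tests above). Without such a declaration and without implicit none, implicit typing makes ddot a default real function, and the value you get back will generally be wrong.
Substitute the reference BLAS for your optimised one, to check whether it changes anything (you can just compile and link in the ddot.f file I linked to earlier in my answer)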
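To see why a size of 1 should not be a special case, here is a minimal sketch in C of what a reference-style dot product does. It is an illustration only, not the Netlib source: the name ref_ddot is hypothetical, and the sketch assumes positive increments. For n = 1 the loop runs once and the result is just the single product.

```c
#include <stddef.h>

/* Simplified dot product in the spirit of a reference BLAS ddot.
 * Assumptions: increments are positive; this is an illustration,
 * not the Netlib implementation. */
double ref_ddot(int n, const double *x, int incx, const double *y, int incy)
{
    double acc = 0.0;
    if (n <= 0) {
        return acc;                 /* empty vectors give 0 */
    }
    for (int i = 0; i < n; ++i) {
        /* element i of each vector, honouring its stride */
        acc += x[(size_t)i * (size_t)incx] * y[(size_t)i * (size_t)incy];
    }
    return acc;                     /* for n = 1: 0.0 + x[0]*y[0] */
}
```

Calling ref_ddot(1, u, 1, v, 1) with u[0] = 2 and v[0] = 3 corresponds to the a.f90 test above. If an optimised BLAS disagrees with a loop like this for m = 1, the declaration of ddot or the library itself is the first thing to suspect.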

Related

Broken GFortran double precision in numerical multiplications and divisions? [duplicate]

I run gfortran code for multiplication and division, and it gives me different results compared to C++, C, etc. All use double precision. I believe that the double precision I use in Fortran is not correct, or it is broken. I've checked the numbers with a calculator, and Fortran seems to produce spurious digits near the end of the decimals. Below is the gfortran code:
PROGRAM problem
!-----------------------------------------------------------------------
integer, parameter :: dp = selected_real_kind(15,307)
real(dp) :: answer,w
open(1, file = 'problem.txt')
w=0.99
answer=w
do i = 10, 0, -1
answer = answer*w
write (*,"(E18.9)",advance="yes") answer
!-print results to text file
write (1,"(E18.9)",advance="yes") answer
end do
write(*,*) "Done."
close(1)
END PROGRAM problem
gfortran results are,
0.980100019E+00
0.970299028E+00
0.960596047E+00
0.950990096E+00
0.941480204E+00
0.932065411E+00
0.922744766E+00
0.913517327E+00
0.904382162E+00
0.895338349E+00
0.886384974E+00
my calculator shows,
0.9801
0.970299
0.96059601
0.950990049
Am I missing something in variable type declaration or is it intrinsic in gfortran?
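Although you've declared w to be double precision, you've initialised it with 0.99, which is a default-real (single-precision) constant. The value is rounded to single precision (roughly 0.99000001) before it is stored in w, which is why your results deviate from the exact ones after about seven significant digits. To initialise w with a double-precision constant, write w = 0.99_dp.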
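Since you mention comparing against C and C++, the same effect can be reproduced there. The sketch below is an illustration under the assumption that a bare 0.99 in Fortran behaves like the C literal 0.99f, and 0.99_dp like the C literal 0.99; the function names are hypothetical.

```c
/* Mirrors the loop in the question: start from w and multiply by w
 * `steps` more times, with all arithmetic done in double precision. */
double repeated_product(double w, int steps)
{
    double answer = w;
    for (int i = 0; i < steps; ++i) {
        answer *= w;
    }
    return answer;
}

/* 0.99f is a single-precision literal; widening it to double keeps
 * the rounding error that was introduced in single precision. */
double first_step_from_single_literal(void)
{
    return repeated_product(0.99f, 1);
}

/* 0.99 is already a double-precision literal in C. */
double first_step_from_double_literal(void)
{
    return repeated_product(0.99, 1);
}
```

Printed to nine decimals, the first function gives a value close to the 0.980100019 that gfortran reports, while the second gives a value close to the calculator's 0.9801. The arithmetic is double precision in both cases; only the constant differs.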

Generating different sequence of random numbers in Fortran 95 for every run [duplicate]

I am trying to compile the following simple code using Gfortran 4.7 from mac-ports (OS-X):
program main
implicit none
integer :: n = 1, clock, i
integer, dimension(1) :: iseed
! initialize the random number generator
call random_seed(size = n)
call system_clock(COUNT=clock)
iseed = clock + 37 * (/ (i - 1, i = 1, n) /)
! iseed = clock
! iseed = abs( mod((clock*181)*((1-83)*359), 104729) )
call random_seed(PUT = iseed)
end program main
and have this error:
gfortran-mp-4.7 tmp.f90
tmp.f90:17.23:
call random_seed(PUT = iseed)
1
Error: Size of 'put' argument of 'random_seed' intrinsic at (1) too small (1/12)
I don't use Fortran at all (I am a C++ guy), so I would really appreciate it if someone could help me get this working.
P.S. I found a couple of forum posts on a similar issue; the currently uncommented solution is similar to the one mentioned in this GCC bug report.
The one with abs is mentioned in this Stack Overflow post (I added it without the PID since I don't run in parallel anyway).
UPDATE:
the following works:
program main
implicit none
integer :: n = 12, clock, i
integer, dimension(:), allocatable :: iseed
! initialize the random number generator
allocate(iseed(n))
call random_seed(size = n)
call system_clock(COUNT=clock)
iseed = clock + 37 * [(i, i = 0,n-1)]
call random_seed(PUT = iseed)
end program main
To amplify somewhat on @Yossarian's comment, this
call random_seed(size = n)
returns, in n, the size of the rank 1 integer array that you have to use if you want to initialise the RNG. I'd suggest making iseed allocatable by changing its declaration to:
integer, dimension(:), allocatable :: iseed
then, after getting a value for n, allocate it:
allocate(iseed(n))
populate it with your favourite values, then put it.
You might be able to allocate and populate it in one statement like this:
allocate(iseed(n), source = clock + 37 * [(i, i = 0,n-1)])
I write might because this depends on how up to date your compiler is.
EDIT, after OP comment
No, you have not quite understood what I suggested.
Get a value for n by executing
call random_seed(size = n)
don't initialise n to 12.
Then allocate the array and populate it, either in one statement (using sourced allocation) or an allocate statement followed by an assignment.
In
allocate(iseed(n))
call random_seed(size = n)
the sequence of operations is incorrect. This sets iseed to have 12 elements (which is the value of n when the first statement is executed), and then sets n to the size of the array required by the RNG. So long as that is 12 you won't see any problems, but as soon as you port your code to another compiler, possibly even another version of the same compiler, you risk running into an RNG which requires an integer array of a different size. There is no need to hardwire a value into your code, so don't.

Does gfortran take advantage of DO CONCURRENT?

I'm currently using gfortran 4.9.2 and I was wondering if the compiler actually knows how to take advantage of the DO CONCURRENT construct (Fortran 2008). I know that the compiler "supports" it, but it is not clear what that entails. For example, if automatic parallelization is turned on (with some number of threads specified), does the compiler know how to parallelize a do concurrent loop?
Edit: As mentioned in the comment, this previous question on SO is very similar to mine, but it is from 2012, and only very recent versions of gfortran have implemented the newest features of modern Fortran, so I thought it was worth asking about the current state of the compiler in 2015.
Rather than explicitly enabling some new functionality, DO CONCURRENT in gfortran seems to put restrictions on the programmer in order to implicitly allow parallelization of the loop when required (using the option -ftree-parallelize-loops=NPROC).
While a DO loop can contain any function call, the content of DO CONCURRENT is restricted to PURE functions (i.e., having no side effects). So when one attempts to use, e.g., RANDOM_NUMBER (which is not PURE as it needs to maintain the state of the generator) in DO CONCURRENT, gfortran will protest:
prog.f90:25:29:
25 | call random_number(x)
| 1
Error: Subroutine call to intrinsic ‘random_number’ in DO CONCURRENT block at (1) is not PURE
Otherwise, DO CONCURRENT behaves as normal DO. It only enforces use of parallelizable code, so that -ftree-parallelize-loops=NPROC succeeds. For instance, with gfortran 9.1 and -fopenmp -Ofast -ftree-parallelize-loops=4, both the standard DO and the F08 DO CONCURRENT loops in the following program run in 4 threads and with virtually identical timing:
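program test_do
use omp_lib, only: omp_get_wtime
implicit none
integer, parameter :: n = 1000000, m = 10000
real, allocatable :: q(:)
integer :: i, j
! omp_get_wtime returns double precision; a default real would lose timer resolution
double precision :: t0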
allocate(q(n))
t0 = omp_get_wtime()
do i = 1, n
q(i) = i
do j = 1, m
q(i) = 0.5 * (q(i) + i / q(i))
end do
end do
print *, omp_get_wtime() - t0
t0 = omp_get_wtime()
do concurrent (i = 1:n)
q(i) = i
do j = 1, m
q(i) = 0.5 * (q(i) + i / q(i))
end do
end do
print *, omp_get_wtime() - t0
end program test_do

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices with different weights and then adds them together to form a single matrix. I have identified that this process is the bottleneck in my program (this weighting will be made many times for a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
tic;
B = squeeze(sum(bsxfun(@times,w,A),1));
toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
double precision, dimension(n), intent(in) :: w
double precision, dimension(n,m,m), intent(in) :: A
double precision, dimension(m,m) :: B
integer :: i
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unwrap the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
The square brackets are a newer form of (/ /), the array constructor, which builds a rank-1 array (a vector). The argument of sum is an array of n elements, and sum adds all of them up. This is precisely what your unwrapped code does (and is not exactly equal to the do loop you have).
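I tried to refine Kyle Vanos' solution.
Therefore I decided to use sum and Fortran's vector capabilities.
I don't know whether the results are correct, because I only looked at the timings! Note in particular that an array constructor always yields a rank-1 array, so the sum(..., dim=1) calls in versions 3 and 4 reduce everything to a single scalar; they would need a reshape to produce per-element sums, so treat their timings with caution.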
Version 1: (for comparison)
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
B(:,j)=sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mix up the indices to get faster execution times. The third solution looks really strange because the index that numbers the matrices is the middle one, but this is necessary for memory-order reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
In conclusion, I would say that you can get a massive speedup if you are free to change the order of the matrix indices.
I would not hide any looping, as this is usually slower. You can write it explicitly; then you'll see that the inner loop access is over the last index, making it inefficient. So you should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
w_tmp = w(i)
do j = 1,m
do k = 1,m
B(k,j) = B(k,j) + w_tmp*A(k,j,i)
end do
end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
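For readers coming from C, the same idea can be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: C arrays are row-major, so the Fortran layout A(m,m,n) corresponds to a flat array in which matrix i occupies the contiguous block starting at A + i*m*m, and the function name weight_matrices_axpy is hypothetical. Each pass of the outer loop is one axpy-style update (B = w[i]*A_i + B) over m*m consecutive elements.

```c
#include <stddef.h>

/* Weighted sum of n matrices of size m x m.
 * Assumed layout: matrix i is stored contiguously at A + i*m*m,
 * so the inner loop walks consecutive memory. */
void weight_matrices_axpy(size_t n, size_t m,
                          const double *w, const double *A, double *B)
{
    const size_t len = m * m;

    for (size_t e = 0; e < len; ++e) {
        B[e] = 0.0;                       /* B = 0 */
    }
    for (size_t i = 0; i < n; ++i) {
        const double wi = w[i];           /* keep the weight in a local */
        const double *Ai = A + i * len;   /* start of matrix i */
        for (size_t e = 0; e < len; ++e) {
            B[e] += wi * Ai[e];           /* axpy: B = wi*Ai + B */
        }
    }
}
```

With this layout the inner loop has unit stride, which is what makes both the hand-written loop and a library axpy call effective; with the matrix index varying fastest, the same loop would jump through memory with a stride of n elements.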
EDIT: you might also want to use some parallelization? (I don't know if Matlab takes advantage of that)
EDIT2: In Kyle's answer, the inner loop runs over the different values of w, which is more efficient than reloading B n times, as w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
do k = 1,n
B(j,i) = B(j,i) + w(k)*A(k,j,i)
end do
end do
end do
This explicit looping performs about 10% better than Kyle's code, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s, with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, but I am glad to contribute, as I played with most of the posted solutions.
Adding a local unroll for the weights loop (from Steabert's answer) gives me a small speed-up compared to the complete unroll version (from 10% to 80%, depending on the size of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
real(8), intent(in) :: w(n)
real(8), intent(in) :: A(n,m,m)
real(8) :: B(m,m)
real(8) :: Btemp(4)
integer :: i, j, k, l, ndiv, nmod, roll
!==================================================
roll = 4
ndiv = n / roll
nmod = mod( n, roll )
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
k = 1
do l = 1,ndiv
Btemp(1) = w(k )*A(k ,j,i)
Btemp(2) = w(k+1)*A(k+1,j,i)
Btemp(3) = w(k+2)*A(k+2,j,i)
Btemp(4) = w(k+3)*A(k+3,j,i)
k = k + roll
B(j,i) = B(j,i) + sum( Btemp )
end do
do l = 1,nmod !---- process the rest of the loop
B(j,i) = B(j,i) + w(k)*A(k,j,i)
k = k + 1
enddo
end do
end do
end function

Auto vectorization on double and ffast-math

Why is it mandatory to use -ffast-math with g++ to achieve the vectorization of loops using doubles? I don't like -ffast-math because I don't want to lose precision.
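You don’t necessarily lose precision with -ffast-math. It mainly affects the handling of NaN, Inf, signed zeros and errno, and it allows the compiler to reorder operations and apply algebraic simplifications (for example replacing a division by a multiplication with the reciprocal), which can change how results are rounded.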
If you have a specific piece of code where you do not want GCC to reorder or simplify computations, you can mark variables as being used using an asm statement.
For instance, the following code performs a rounding operation on f. However, the two f += g and f -= g operations are likely to get optimised away by gcc:
static double moo(double f, double g)
{
g *= 4503599627370496.0; // 2 ** 52
f += g;
f -= g;
return f;
}
On x86_64, you can use this asm statement to instruct GCC not to perform that optimisation:
static double moo(double f, double g)
{
g *= 4503599627370496.0; // 2 ** 52
f += g;
__asm__("" : "+x" (f));
f -= g;
return f;
}
You will need to adapt this for each architecture, unfortunately. On PowerPC, use +f instead of +x.
Very likely because vectorization means that you may have different results, or may mean that you miss floating point signals/exceptions.
If you're compiling for 32-bit x86, gcc and g++ default to using the x87 unit for floating-point math; on 64-bit they default to SSE. The x87 can and will produce different values for the same computation, so g++ is unlikely to consider vectorizing when it can't guarantee that you will get the same results, unless you use -ffast-math or some of the flags it turns on.
Basically it comes down to this: the floating-point environment for vectorized code may not be the same as the one for non-vectorized code, sometimes in ways that are important. If the differences don't matter to you, something like
-fno-math-errno -fno-trapping-math -fno-signaling-nans -fno-rounding-math
may be enough, but first look up those options and make sure that they won't affect your program's correctness. -ffinite-math-only may also help.
Because -ffast-math enables operand reordering, which allows much more code to be vectorized.
For example to calculate this
sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + … a[99]
the compiler is required to do the additions sequentially without -ffast-math, because floating-point addition is not associative: it is commutative, but regrouping the operations can change the rounded result.
Is floating point addition commutative and associative?
Is floating point addition commutative in C++?
Are floating point operations in C associative?
Is Floating point addition and multiplication associative?
That's the same reason why compilers can't optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) without -ffast-math
That means no vectorization available unless you have very efficient horizontal vector adds.
However if -ffast-math is enabled, the expression can be calculated like this (Look at A7. Auto-Vectorization)
sum0 = a[0] + a[4] + a[ 8] + … a[96]
sum1 = a[1] + a[5] + a[ 9] + … a[97]
sum2 = a[2] + a[6] + a[10] + … a[98]
sum3 = a[3] + a[7] + a[11] + … a[99]
sum’ = sum0 + sum1 + sum2 + sum3
Now the compiler can vectorize it easily by adding each column in parallel and then do a horizontal add at the end
Does sum’ == sum? Only if (a[0]+a[4]+…) + (a[1]+a[5]+…) + (a[2]+a[6]+…) + (a[3]+a[7]+…) == a[0] + a[1] + a[2] + … This holds under associativity, which floats don’t adhere to, all of the time. Specifying /fp:fast lets the compiler transform your code to run faster – up to 4 times faster, for this simple calculation.
Do You Prefer Fast or Precise? - A7. Auto-Vectorization
It may be enabled by the -fassociative-math flag in gcc
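To make the regrouping concrete, here is a small sketch in C that performs the two summation orders by hand. It is an illustration only: the function names are hypothetical, and it assumes IEEE 754 double arithmetic without extended intermediate precision (as with SSE on x86_64) and no fast-math flags.

```c
#include <stddef.h>

/* Strict left-to-right sum, the order the compiler must keep
 * when it is not allowed to reassociate. */
double sum_sequential(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

/* Four partial sums over strided elements, combined at the end,
 * which is the shape a 4-wide vectorized reduction takes. */
double sum_four_lanes(const double *a, size_t n)
{
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        lane[0] += a[i];
        lane[1] += a[i + 1];
        lane[2] += a[i + 2];
        lane[3] += a[i + 3];
    }
    double s = lane[0] + lane[1] + lane[2] + lane[3];  /* horizontal add */
    for (; i < n; ++i) {
        s += a[i];                                     /* leftover elements */
    }
    return s;
}
```

For the input {1e16, 1, 1, 1, -1e16, 1, 1, 1}, whose exact sum is 6, the sequential order loses the three ones that are added to 1e16 and returns 3, while the four-lane order cancels the large values first and returns 6. Neither order is more accurate in general; the point is only that the compiler cannot switch between them without permission such as -fassociative-math.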
Further readings
Semantics of Floating Point Math in GCC
What does gcc's ffast-math actually do?
To enable auto-vectorization with gcc, ffast-math is not actually necessary. See https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
To enable vectorization of floating point reductions use -ffast-math or -fassociative-math.
Using -fassociative-math should be sufficient.
This has been the case since 2007, see https://gcc.gnu.org/projects/tree-ssa/vectorization.html#oldnews
-fassociative-math can be used instead of -ffast-math to enable vectorization of reductions of floats (2007-09-04).
