I read this post Puzzling performance difference between ifort and gfortran from 2012 and tried to run the code by myself:
PROGRAM PERFECT_SQUARE
  IMPLICIT NONE
  INTEGER*8 :: N, M, NTOT
  LOGICAL :: IS_SQUARE

  N = 4
  WRITE (*, *) IS_SQUARE(N)

  NTOT = 0
  DO N = 1, 1000000000
    IF (IS_SQUARE(N)) THEN
      NTOT = NTOT + 1
    END IF
  END DO
  WRITE (*, *) NTOT ! should find 31622 squares
END PROGRAM
LOGICAL FUNCTION IS_SQUARE(N)
  IMPLICIT NONE
  INTEGER*8 :: N, M

  ! check if negative
  IF (N .LT. 0) THEN
    IS_SQUARE = .FALSE.
    RETURN
  END IF
  ! check if ending 4 bits belong to (0,1,4,9)
  M = IAND(int(N, kind(8)), int(15, kind(8)))
  IF (.NOT. (M .EQ. 0 .OR. M .EQ. 1 .OR. M .EQ. 4 .OR. M .EQ. 9)) THEN
    IS_SQUARE = .FALSE.
    RETURN
  END IF
  ! try to find the nearest integer to sqrt(n)
  M = DINT(SQRT(DBLE(N)))
  IF (M**2 .NE. N) THEN
    IS_SQUARE = .FALSE.
    RETURN
  END IF
  IS_SQUARE = .TRUE.
  RETURN
END FUNCTION
(I changed a few lines so that the code would compile.)
I compiled with ifort 19.1.3.304 and gfortran 10.2.0.
The ifort executable ran much slower than the gfortran one, and contained far more instructions. According to the post, ifort 12.1.2.273 should have fixed the problem, but it doesn't seem to be fixed.
If I turn off vectorization in ifort, I get much better results, but they are still worse than gfortran's.
If I remove the lines

N=4
WRITE(*,*) IS_SQUARE(N)

I get much worse results from gfortran than with these lines present.
I could keep listing odd phenomena I have noticed, but mainly I want to know how this simple code can cause such a big difference in run time between the two compilers, and how I can avoid cases like this.
I compiled with -O3 on both and tried adding -no-vec to ifort. I'm open to other compilation flags.
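One thing I have not tried yet is comparing the two compilers' vectorization reports. Assuming I have the flag spellings right for these versions (they vary between releases), that would be something like:

ifort example.f90 -O3 -qopt-report=2 -qopt-report-phase=vec
gfortran example.f90 -O3 -fopt-info-vec-optimized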
UPDATE:
ifort example.f90 -O3 -o example_ifort
gfortran example.f90 -O3 -o example_gnu
The run time for example_ifort is 3 seconds, while example_gnu takes 1 second.
Related
First of all, my English is not good; I'm sorry.
As far as I know, Fortran stores arrays in column-major order. My old Fortran code has not been optimized in a long time, so I am trying to change the index order in my Fortran 90 code for better speed. The code works almost entirely on 3-dimensional arrays indexed (i, j, k), and almost every do-loop runs over i and j. The sizes of i and j are about 2000-3000, while k is just 2 (it holds x and y). My old code's index order is (i, k, j), for example:
do j = 1, 1500
  do i = 1, 1024
    AA(i, 1, j) = ...
    AA(i, 2, j) = ...
  end do
end do
There are a lot of loops like this in my code, so I changed the index order, trying (i, j, k), (k, i, j), and (i, k, j). I thought (k, i, j) would be the best choice in Fortran (column-major), but it is not: all three cases [(i, j, k), (k, i, j), (i, k, j)] take almost the same time (1961 s, 1955 s, 1692 s). My program is long, and the number of iterations (32000) is enough for the comparison.
Below is my compile command:

ifort -O3 -xHost -ipo -qopenmp -fp-model strict -mcmodel=medium

I don't understand the results above. Please help me, and thanks for reading.
Additionally, below is one of my loops. The array section L_X(i, :, j) is my target; the : runs over 1 and 2.
!$OMP Parallel DO private(j,i,ii,Tan,NormT)
do j = 1, LinkPlusBndry
  if (Kmax(j) > 2) then
    i = 1; Tan = L_X(i+1,:,j) - L_X(i,:,j); NormT = sqrt(Tan(1)**2 + Tan(2)**2)
    if (NormT < min_dist) then
      L_X(2:Kmax(j)-1,:,j) = L_X(3:Kmax(j),:,j)
      Kmax(j) = Kmax(j) - 1
    elseif (NormT > max_dist) then
      do i = Kmax(j)+1, 3, -1; L_X(i,:,j) = L_X(i-1,:,j); end do
      L_X(2,:,j) = (L_X(1,:,j) + L_X(3,:,j))/2.0_dp
      Kmax(j) = Kmax(j) + 1
    end if
    do i = 2, M-1
      if (i > (Kmax(j)-2)) exit
      Tan = L_X(i+1,:,j) - L_X(i,:,j); NormT = sqrt(Tan(1)**2 + Tan(2)**2)
      if (NormT < min_dist) then
        L_X(i,:,j) = (L_X(i,:,j) + L_X(i+1,:,j))/2.0_dp
        L_X(i+1:Kmax(j)-1,:,j) = L_X(i+2:Kmax(j),:,j)
        Kmax(j) = Kmax(j) - 1
      elseif (NormT > max_dist) then
        do ii = Kmax(j)+1, i+2, -1; L_X(ii,:,j) = L_X(ii-1,:,j); end do
        L_X(i+1,:,j) = (L_X(i,:,j) + L_X(i+2,:,j))/2.0_dp
        Kmax(j) = Kmax(j) + 1
      end if
    end do
    i = Kmax(j) - 1
    if (i > 1) then
      Tan = L_X(i+1,:,j) - L_X(i,:,j); NormT = sqrt(Tan(1)**2 + Tan(2)**2)
      if (NormT < min_dist) then
        L_X(Kmax(j)-1,:,j) = L_X(Kmax(j),:,j)
        Kmax(j) = Kmax(j) - 1
      elseif (NormT > max_dist) then
        L_X(Kmax(j)+1,:,j) = L_X(Kmax(j),:,j)
        L_X(Kmax(j),:,j) = (L_X(Kmax(j)-1,:,j) + L_X(Kmax(j)+1,:,j))/2.0_dp
        Kmax(j) = Kmax(j) + 1
      end if
    end if
  elseif (Kmax(j) == 2) then
    i = 1; Tan = L_X(i+1,:,j) - L_X(i,:,j); NormT = sqrt(Tan(1)**2 + Tan(2)**2)
    if (NormT > max_dist) then
      do i = Kmax(j)+1, 3, -1; L_X(i,:,j) = L_X(i-1,:,j); end do
      L_X(2,:,j) = (L_X(1,:,j) + L_X(3,:,j))/2.0_dp
      Kmax(j) = Kmax(j) + 1
    end if
  end if
  do i = Kmax(j)+1, M; L_X(i,:,j) = L_X(Kmax(j),:,j); end do
end do
!$OMP End Parallel DO
I would not worry so much about loop ordering here. ifort at -O3 is an aggressive loop optimizer, and it's possible that reordering your 3-D arrays will have little to no effect.
As for your thinking that (k, i, j) is the best order: in general it would be, but your k has only 2 elements and your i has 1024. Assuming single-precision reals (4 bytes), that 2-D slice of your 3-D array fits in 8 KB, so once the loop starts your data likely sits entirely in the CPU cache and the index ordering becomes irrelevant. You need much larger dimensions for the effect you are considering to show up.
As for the performance differences you do see, they are likely down to how well the compiler optimizes each variant.
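To illustrate, here is a minimal sketch (not your code: the size n = 400 and the simple increment are made up for the demonstration, and an aggressive optimizer may interchange the loops and hide the effect, so compile it without heavy optimization). Once the array is much larger than the cache, putting the first index innermost should be measurably faster than putting the last index innermost:

program order_test
  implicit none
  integer, parameter :: n = 400            ! one real array of ~256 MB
  real, allocatable :: a(:,:,:)
  integer :: i, j, k, c0, c1, rate

  allocate(a(n,n,n))
  a = 0.0

  ! contiguous access: first index varies fastest (column-major friendly)
  call system_clock(c0, rate)
  do k = 1, n
    do j = 1, n
      do i = 1, n
        a(i,j,k) = a(i,j,k) + 1.0
      end do
    end do
  end do
  call system_clock(c1)
  print *, 'i innermost (contiguous): ', real(c1-c0)/real(rate), ' s'

  ! strided access: last index varies fastest (stride of n*n elements)
  call system_clock(c0)
  do i = 1, n
    do j = 1, n
      do k = 1, n
        a(i,j,k) = a(i,j,k) + 1.0
      end do
    end do
  end do
  call system_clock(c1)
  print *, 'k innermost (strided):    ', real(c1-c0)/real(rate), ' s'
end program order_test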
I have some basic pseudocode as follows,
PROGRAM PSEUDOEXAMPLE
  IMPLICIT NONE
  INTEGER :: i, j
  REAL(8) :: xi, yi, zi, xj, yj, zj, separation
  REAL(8) :: Array_I(10,3), Array_J(10,3)

  ! load some data into the arrays (placeholder values)
  CALL RANDOM_NUMBER(Array_I)
  CALL RANDOM_NUMBER(Array_J)

  DO i = 1, 10
    xi = Array_I(i,1)
    yi = Array_I(i,2)
    zi = Array_I(i,3)
    DO j = 1, 10
      xj = Array_J(j,1)
      yj = Array_J(j,2)
      zj = Array_J(j,3)
      separation = ((xi - xj)**2 + (yi - yj)**2 + (zi - zj)**2)**0.5
    END DO
  END DO
END PROGRAM PSEUDOEXAMPLE
I have timed a single i-step at ~0.3 seconds. What are the best ways to reduce this time? I can see that removing the square root could be effective. I am using gfortran as my compiler.
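For example, is something like this the right direction? This is only a sketch of my own idea: the cutoff value and the pair count are invented just to give the loop an observable result, since my snippet above discards separation. If the separation is only ever compared against a cutoff, the square root can be avoided entirely by comparing squared distances:

program no_sqrt_demo
  implicit none
  integer :: i, j, count_in
  real(8) :: xi, yi, zi, sep2, cutoff2
  real(8) :: Array_I(10,3), Array_J(10,3)

  call random_number(Array_I)
  call random_number(Array_J)

  cutoff2 = 0.5d0**2          ! square the cutoff once, outside the loops
  count_in = 0
  do i = 1, 10
    xi = Array_I(i,1)
    yi = Array_I(i,2)
    zi = Array_I(i,3)
    do j = 1, 10
      ! squared separation: no **0.5 in the hot loop
      sep2 = (xi - Array_J(j,1))**2 + (yi - Array_J(j,2))**2 &
           + (zi - Array_J(j,3))**2
      if (sep2 < cutoff2) count_in = count_in + 1
    end do
  end do
  print *, count_in, 'pairs within the cutoff'
end program no_sqrt_demo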
The following function finds a number n such that 1^3 + 2^3 + ... + (n-1)^3 + n^3 = m. Is there any chance this function can be optimized for speed?
findNb :: Integer -> Integer
findNb m = findNb' 1 0
  where
    findNb' n m' =
      if m' == m then n - 1
      else if m' < m then findNb' (n + 1) (m' + n^3)
      else -1
I know there is a faster solution using a math formula. The reason I'm asking is that similar implementations in JavaScript and C# seem far faster than the Haskell one. I'm just curious whether it can be optimized. Thanks.
EDIT 1: Added more evidence on the run time.
Haskell Version:
With main = print (findNb2 152000000000000000000000):
Compiled with -O2 and profiling: ghc -o testo2.exe -O2 -prof -fprof-auto -rtsopts pileofcube.hs. Here is the total time from the profiling report:
total time = 0.19 secs (190 milliseconds) (190 ticks @ 1000 us, 1 processor)
Compiled with -O2 but no profiling: ghc -o testo22.exe -O2 pileofcube.hs. Running it with Measure-Command {./testo22.exe} in PowerShell gives:
Milliseconds : 157
JavaScript Version:
Code:
function findNb(m) {
  let n = 0;
  let sum = 0;
  while (sum < m) {
    n++;
    sum += Math.pow(n, 3);
  }
  return sum === m ? n : -1;
}

var d1 = new Date();
findNb(152000000000000000000000);
console.log(new Date() - d1);
Result: 45 milliseconds running in Chrome on the same machine
EDIT 2: Added a C# version.
As @Berji and @Bakuriu commented, comparing against the JavaScript version above is not fair, since it uses double-precision floating-point numbers under the hood and cannot even give the correct answer. So I implemented it in C#; here are the code and the result:
static void Main(string[] args)
{
    BigInteger m = BigInteger.Parse("152000000000000000000000");
    var s = new Stopwatch();
    s.Start();
    long n = 0;
    BigInteger sum = 0;
    while (sum < m)
    {
        n++;
        sum += BigInteger.Pow(n, 3);
    }
    Console.WriteLine(sum == m ? n : -1);
    s.Stop();
    Console.WriteLine($"Escaped Time: {s.ElapsedMilliseconds} milliseconds.");
}
Result: Escaped Time: 457 milliseconds.
Conclusion
The Haskell version is faster than the C# one...
I was wrong at the start because, with my poor JavaScript knowledge, I didn't realize that JavaScript uses double-precision floating-point numbers under the hood.
At this point it seems the question no longer makes sense...
Haskell too can use Double to get the wrong answer in less time:
% time ./so
./so 0.03s user 0.00s system 95% cpu 0.038 total
And JavaScript too can get the correct result by npm-installing big-integer and using bigInt everywhere instead of Double:
% node so.js
^C
node so.js 35.62s user 0.30s system 93% cpu 38.259 total
... or maybe it isn't as trivial as that.
EDIT: I realized afterward that this is not what the author of the question wanted. I'll keep it here in case someone wants to know the formula in question, but otherwise please disregard it.
There is indeed a formula that lets you compute this in constant time (rather than in n iterations). Since I couldn't remember the exact formula from school, I did a bit of searching, and here it is: https://proofwiki.org/wiki/Sum_of_Sequence_of_Cubes.
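For reference, the identity is

$$\sum_{k=1}^{n} k^3 = \left(\frac{n(n+1)}{2}\right)^2 = \frac{n^2 (n+1)^2}{4}.$$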
In Haskell code, that would translate to

findNb n = n ^ 2 * (n + 1) ^ 2 `div` 4

(using integer division with div, since / is not defined for Integer), which I believe should be much faster.
Not sure if this wording of the algorithm is faster, but try this?

findNb :: Integer -> Integer
findNb m = fromIntegral $ length $ takeWhile (<= m) $ scanl1 (+) [n^3 | n <- [1..]]

(The fromIntegral converts the Int returned by length to Integer so the code type-checks; this version also has different semantics in the undefined case.)
I have the following code for testing the Intel MKL DAXPY routine.
program test
  implicit none
  integer, parameter :: n = 50000000
  integer, parameter :: nloop = 100
  real(8), dimension(:), allocatable :: a, b
  integer :: start_t, end_t, rate, i

  allocate(a(n))
  allocate(b(n))

  a = 1.0d0
  b = 2.0d0
  call system_clock(start_t, rate)
  do i = 1, nloop
    call sumArray(a, b, a, 3.0d0, n)
  end do
  call system_clock(end_t)
  print *, sum(a)
  print *, "sumArray time: ", real(end_t-start_t)/real(rate)

  a = 1.0d0
  b = 2.0d0
  call system_clock(start_t, rate)
  do i = 1, nloop
    call daxpy(n, 3.0d0, b, 1, a, 1)
  end do
  call system_clock(end_t)
  print *, sum(a)
  print *, "daxpy time: ", real(end_t-start_t)/real(rate)

  a = 1.0d0
  b = 2.0d0
  call system_clock(start_t, rate)
  do i = 1, nloop
    a = a + 3.0d0*b
  end do
  call system_clock(end_t)
  print *, sum(a)
  print *, "a + 3*b time: ", real(end_t-start_t)/real(rate)
end program test
subroutine sumArray(x, y, z, alfa, n)
  implicit none
  integer :: n, i
  real(8) :: x(n), y(n), z(n), alfa
  !$OMP PARALLEL DO
  do i = 1, n
    z(i) = x(i) + alfa*y(i)
  end do
  !$OMP END PARALLEL DO
end subroutine sumArray
Here, sumArray is a handwritten OpenMP subroutine that does essentially what DAXPY does.
When I compile the code with ifort test.f90 -o test -O3 -openmp -mkl, the results are approximately:
sumArray time: 5.7 sec
daxpy time: 5.7 sec
a + 3*b time: 1.9 sec
However, when I compile it with ifort test.f90 -o test -O3 -openmp -mkl -ipo, the result for a + 3*b changes dramatically:
sumArray time: 5.7 sec
daxpy time: 5.7 sec
a + 3*b time: 9.3 sec
So, firstly: why is the naive array sum better than MKL? And what does -ipo have to do with the slowdown of the naive sum? Also, what bothers me is that when I eliminate the loops, that is, when I time each operation just once, the times are just like the first case divided by 1000 (around 5.7 ms for sumArray and daxpy, 9.3 ms for a + 3*b), regardless of using -ipo. My guess is that something about the naive sum being inside a loop allows the compiler to optimize further, but the -ipo flag messes up this optimization. Note: I know that -ipo is useless here since it is a single file.
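To make that guess concrete, here is one way to picture it (purely hypothetical; I have not checked the generated assembly, and the variables refer to the test program above). Since ifort defaults to a relaxed floating-point model, it would in principle be allowed to collapse the repeated update into a single fused pass:

! The timed loop
!   do i = 1, nloop
!     a = a + 3.0d0*b
!   end do
! is mathematically (though not bitwise, in floating point)
! equivalent to a single pass over the arrays:
a = a + real(nloop, 8)*3.0d0*b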
I realize this question has been asked before, but not in the context of IO. Is there any reason to believe that:
!compiler can tell that it should write the whole array at once?
!but perhaps compiler allocates/frees temporary array?
write(UNIT) (/( arr(i), i=1,N )/)
would be any more efficient than:
! compiler does lots of IO here?
do i=1,N
  write(UNIT) arr(i)
enddo
for a file which is opened as:
open(unit=UNIT,access='STREAM',file=fname,status='UNKNOWN')
There is a possibility that this will be used with compiler options that turn off buffered writing as well ...
As suggested by @HighPerformanceMark, here's a simple benchmark I set up:
Using gfortran:
program main
  implicit none
  include 'mpif.h'
  integer, parameter :: N = 1000000
  integer :: unit = 22
  integer :: i, repeat
  real*8 :: arr(N), t1
  external test1, test2, test3

  repeat = 15
  call MPI_INIT(i)
  arr = 0
  call timeit(test1, repeat, arr, N, t1)
  print *, t1/repeat
  call timeit(test2, repeat, arr, N, t1)
  print *, t1/repeat
  call timeit(test3, repeat, arr, N, t1)
  print *, t1/repeat
  call MPI_Finalize(i)
end
subroutine timeit(sub, repeat, arr, size, time)
  include 'mpif.h'
  external sub
  integer :: repeat, size, i
  real*8 :: time, t1, arr(size)

  time = 0
  do i = 1, repeat
    open(unit=10, access='STREAM', file='test1', status='UNKNOWN')
    t1 = mpi_wtime()
    call sub(10, arr, size)
    time = time + (mpi_wtime() - t1)
    close(10)
  enddo
  return
end
subroutine test1(ou, a, N)
  integer :: N, ou, i
  real*8 :: a(N)
  do i = 1, N
    write(ou) a(i)
  enddo
  return
end

subroutine test2(ou, a, N)
  integer :: N, ou, i
  real*8 :: a(N)
  write(ou) (a(i), i=1,N)
  return
end

subroutine test3(ou, a, N)
  integer :: N, ou
  real*8 :: a(N)
  write(ou) a(1:N)
  return
end
My results are as follows (unbuffered):
temp $ GFORTRAN_UNBUFFERED_ALL=1 mpirun -np 1 ./test
6.2392100652058922
3.3046503861745200
9.76902325948079409E-002
and (buffered):
temp $ GFORTRAN_UNBUFFERED_ALL=0 mpirun -np 1 ./test
2.7789104779561362
0.15584923426310221
9.82964992523193415E-002
I compiled and ran the above benchmark code using both gfortran (4.7.2 20120921) and ifort (13.0.0.079 Build 20120731). My results are as follows:
gfortran
            UNBUFFERED                  BUFFERED
test1:  1.2614487171173097          0.20308602650960286
test2:  1.0525423844655355          3.4633986155192059E-002
test3:  5.9630711873372398E-003     6.0543696085611975E-003

ifort
            UNBUFFERED                  BUFFERED
test1:  1.33864809672038            0.171342913309733
test2:  6.001885732014974E-003      6.095488866170247E-003
test3:  5.962880452473959E-003      6.007925669352213E-003
It would appear that the explicit loop in test1 is by far the slowest in both cases (without any optimisation flags set). Furthermore, with the Intel compiler there is no significant difference in execution time between write(ou) (a(i), i=1,N) (case 2) and write(ou) a(1:N) (case 3, identical to simply write(ou) a in this case).
By the way, for this single-threaded process you can also just use the Fortran 95 intrinsic subroutine cpu_time, which sums over all threads and returns a time in seconds. Otherwise there is also system_clock, which can return the number of elapsed clock ticks and the clock rate as integers, possibly to higher precision.
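A minimal sketch of the two timers side by side (the work being timed is left as a placeholder):

program timing_demo
  implicit none
  real :: t0, t1
  integer :: c0, c1, rate

  call cpu_time(t0)
  call system_clock(c0, rate)       ! 'rate' receives ticks per second

  ! ... code to be timed goes here ...

  call system_clock(c1)
  call cpu_time(t1)

  print *, 'cpu_time elapsed:     ', t1 - t0, 's'
  print *, 'system_clock elapsed: ', real(c1 - c0)/real(rate), 's'
end program timing_demo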