array initialization run time comparison ifort vs gfortran - performance

I would like to compare array initialization run times for ifort vs gfortran, using these compilation lines with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7:
ifort array-initialize.f90 -O3 -init=arrays,zero,minus_huge,snan -g -o intel-array.out
gfortran array-initialize.f90 -O3 -finit-local-zero -finit-integer=-2147483647 -finit-real=snan -finit-logical=True -finit-derived -g -o gnu-array.out
array-initialize.f90:
program array_initialize
  implicit none
  integer :: i, j, limit
  real :: my_max
  real :: start, finish

  my_max = -1.0
  limit = 10000

  call cpu_time(start)
  do j = 1, limit
    do i = 1, limit
      my_max = max(my_max, initializer(i, j))
    end do
  end do
  call cpu_time(finish)

  print *, my_max
  print '("Time = ", f6.3," seconds.")', finish-start

contains

  function initializer(i, j)
    implicit none
    real :: initializer
    real :: arr(2)
    integer :: i, j
    arr(1) = -1.0/(2*i+j+1)
    arr(2) = -1.0/(2*j+i+1)
    initializer = max(arr(1), arr(2))
  end function

end program array_initialize
Run times for this code:
gnu - 0.096 sec
intel - 0.392 sec
When I remove the init flags:
gnu - 0.098 sec
intel - 0.057 sec
When I replace the array with two scalar variables (a reconstruction is sketched below):
gnu - 0.099 sec
intel - 0.065 sec
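For reference, the two-variable variant of the initializer is not shown in the post; a reconstruction (my sketch, not the OP's exact code) would look roughly like this:
function initializer(i, j)
  implicit none
  real :: initializer
  real :: a1, a2  ! two scalars instead of the local array arr(2)
  integer :: i, j
  a1 = -1.0/(2*i+j+1)
  a2 = -1.0/(2*j+i+1)
  initializer = max(a1, a2)
end function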
What is happening here? Does gnu not initialize its arrays? Does intel initialize arrays very slowly?

OOPS.
I disabled vectorization using -no-vec on ifort and -fno-tree-vectorize on gfortran, and now the run times are the same, about 0.39 sec (just like the original intel time).
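For reference, the exact no-vectorization compile lines are not shown; they would be the lines above with the vectorization switches added, roughly:
ifort array-initialize.f90 -O3 -init=arrays,zero,minus_huge,snan -no-vec -g -o intel-array.out
gfortran array-initialize.f90 -O3 -finit-local-zero -finit-integer=-2147483647 -finit-real=snan -finit-logical=True -finit-derived -fno-tree-vectorize -g -o gnu-array.out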

Related

In parallel computing, why is the execution time longer when using all threads (4) than when using only half of them (2)?

E.g., I'm using this code (CPU: 4 cores, one thread per core):
program main
  use omp_lib
  implicit none
  integer, parameter:: ma=100, n=10000, mb= 100
  integer:: istart, iend
  real, dimension (ma,n) :: a
  real, dimension (n,mb) :: b
  real, dimension (ma,mb) :: c = 0.
  integer:: i,j,k, threads=2, ppt, thread_num
  integer:: toc, tic, rate
  real:: time_parallel, time

  call random_number (a)
  call random_number (b)

  !/////////////////////// 1- PARALLEL PRIVATE ///////////////////////
  CALL system_clock(count_rate=rate)
  call system_clock(tic)
  ppt = ma/threads
  !$ call omp_set_num_threads(threads)
  !$omp parallel default(shared) private(istart, iend, &
  !$omp thread_num, i)
  !$ thread_num = omp_get_thread_num()
  !$ istart = thread_num*ppt +1
  !$ iend = min(ma, thread_num*ppt + ppt)
  do i= istart,iend
    do j= 1,mb
      do k= 1,n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do
  !$omp end parallel
  print*, 'Result in parallel mode'
  !$ print*, c(85:90,40)
  call system_clock(toc)
  time_parallel = real(toc-tic)/real(rate)

  !/////////////////////// 2-normal execution ///////////////////////
  c = 0
  CALL system_clock(count_rate=rate)
  call system_clock(tic)
  do i= 1,ma
    do j= 1,mb
      do k= 1,n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do
  call system_clock(toc)
  time = real(toc-tic)/real(rate)

  print*, 'Result in serial mode'
  print*, c(85:90,40)
  print*, '------------------------------------------------'
  print*, 'Threads: ', threads, '| Time Parallel Private', time_parallel, 's '
  print*, ' Time Normal ', time, 's'
  !----------------------------------------------------------------
end program main
I get the following results:
First execution:
Result in parallel mode
2477.89478 2528.50391 2511.84204 2528.12061 2500.79517
2510.69971
Result in serial mode
2477.89478 2528.50391 2511.84204 2528.12061 2500.79517
2510.69971
Threads: 2 | Time Parallel Private 0.379999995 s
Time Normal 0.603999972 s
Second execution:
Result in parallel mode
2492.20679 2496.56152 2500.58203 2516.51685 2516.43604
2530.71313
Result in serial mode
2492.20679 2496.56152 2500.58203 2516.51685 2516.43604
2530.71313
------------------------------------------------
Threads: 4 | Time Parallel Private 1.11500001 s
Time Normal 0.486000001 s
It was compiled using:
gfortran -Wall -fopenmp -g -O2 -o prog.exe prueba.f90
./prog.exe
If you have N cores and use N threads, then some of your threads will get switched out for other processes and threads. So it is preferable to use fewer threads than the number of available cores.
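One way to apply this advice is to query the runtime for the available processors and deliberately request fewer threads. A minimal sketch (not from the original post; note that omp_get_num_procs usually reports logical processors, i.e. hardware threads):
program thread_cap
  use omp_lib
  implicit none
  integer :: nprocs
  ! Logical processors visible to the OpenMP runtime.
  nprocs = omp_get_num_procs()
  ! Leave headroom so the OS and other processes do not force context switches.
  call omp_set_num_threads(max(1, nprocs - 1))
  !$omp parallel
  !$omp master
  print *, 'processors:', nprocs, '  threads in use:', omp_get_num_threads()
  !$omp end master
  !$omp end parallel
end program thread_cap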

Discrepancy in results between OpenMP/OpenACC implementation and gcc/PGI compilers

I have a larger Fortran program that I am trying to convert so that the computationally intensive part will run on an NVIDIA GPU using OpenMP and/or OpenACC. During development I had some trouble understanding how variables declared in a module can be used within subroutines that are executed on the GPU (and some of them also on the CPU). Therefore, I created a small example and worked on that by experimenting and adding the corresponding OpenMP and OpenACC directives. I have included the three files that comprise my example at the end of this message.
Just as I thought that I had understood things and that my example program works, I noticed the following:
I compile the program with gcc 10.2 using the OpenMP directives:
gfortran -O3 -fopenmp -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link
The results are as expected, i.e. all elements of array XMO are 1, of DCP are 2, of IS1 are 3 and of IS2 are 24.
I compile the program with PGI compiler 19.10 community edition using the OpenACC directives:
pgfortran -O4 -acc -ta=tesla,cc35 -Minfo=all,mp,accel -Mcuda=cuda10.0 test_link.f90 common_vars.f90 parameters.f90 -o test_link
The results are the same as above.
I compile the program with gcc 10.2 using the OpenACC directives:
gfortran -O3 -fopenacc -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link
The results for arrays XMO, DCP and IS1 are correct, but all elements of IS2 are 0. It is easy to verify that variable NR must have the value 0 on the device to produce this result.
My understanding is that the OpenMP and OpenACC versions of my example are equivalent, but I cannot figure out why the OpenACC version works only with the PGI compiler and not with gcc.
If possible, please provide solutions that do not require changes in the code but only in the directives. As I mentioned, my original code is much larger, contains many more module variables and calls many more subroutines in the code to be executed on the GPU. Changes in that code would be much more difficult to make, and obviously I would prefer to do that only if really necessary.
Thank you in advance!
The files of my example follow.
File parameters.f90
MODULE PARAMETERS
  IMPLICIT NONE
  INTEGER, PARAMETER :: MAX_SOURCE_POSITIONS = 100
END MODULE PARAMETERS
File common_vars.f90
MODULE COMMON_VARS
  USE PARAMETERS
  IMPLICIT NONE
  !$OMP DECLARE TARGET TO(NR)
  INTEGER :: NR
  !$ACC DECLARE COPYIN(NR)
END MODULE COMMON_VARS
File test_link.f90
SUBROUTINE TEST()
  USE COMMON_VARS
  IMPLICIT NONE
  !$OMP DECLARE TARGET
  !$ACC ROUTINE SEQ
  INTEGER I
  I = NR
END SUBROUTINE TEST
PROGRAM TEST_LINK
  USE COMMON_VARS
  USE PARAMETERS
  IMPLICIT NONE
  INTERFACE
    SUBROUTINE TEST()
      !$OMP DECLARE TARGET
      !$ACC ROUTINE SEQ
    END SUBROUTINE TEST
  END INTERFACE
  REAL :: XMO(MAX_SOURCE_POSITIONS), DCP(MAX_SOURCE_POSITIONS)
  INTEGER :: IS1(MAX_SOURCE_POSITIONS), IS2(MAX_SOURCE_POSITIONS)
  INTEGER :: X, Y, Z, MAX_X, MAX_Y, MAX_Z, ISOUR

  MAX_X = 3
  MAX_Y = 4
  MAX_Z = 5
  NR = 6

  !$OMP TARGET UPDATE TO(NR)
  !$OMP TARGET MAP(TOFROM:IS1,IS2,DCP,XMO)
  !$OMP TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(3)
  !$ACC UPDATE DEVICE(NR)
  !$ACC PARALLEL LOOP GANG WORKER COLLAPSE(3) INDEPENDENT &
  !$ACC COPY(IS1,IS2,DCP,XMO)
  DO X = 1, MAX_X
    DO Y = 1, MAX_Y
      DO Z = 1, MAX_Z
        ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z
        XMO(ISOUR) = 1.0
        DCP(ISOUR) = 2.0
        IS1(ISOUR) = 3
        IS2(ISOUR) = 4 * NR
        CALL TEST()
      ENDDO ! End of z loop
    ENDDO ! End of y loop
  ENDDO ! End of x loop
  !$ACC END PARALLEL LOOP
  !$OMP END TEAMS DISTRIBUTE PARALLEL DO
  !$OMP END TARGET

  DO X = 1, MAX_X
    DO Y = 1, MAX_Y
      DO Z = 1, MAX_Z
        ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z
        WRITE(*, *) 'ISOUR = ', ISOUR, 'XMO = ', XMO(ISOUR), 'DCP = ', DCP(ISOUR), 'IS1 = ', IS1(ISOUR), 'IS2 = ', IS2(ISOUR)
      ENDDO ! End of z loop
    ENDDO ! End of y loop
  ENDDO ! End of x loop
END PROGRAM TEST_LINK

slow-down when using OpenMP and calling a subroutine in a loop

Here I present a simple Fortran code using OpenMP that calculates a summation of arrays multiple times. My computer has 6 cores with 12 threads and 16 GB of memory.
There are two versions of this code. The first version has only one file, test.f90, and the summation is implemented in this file. The code is presented as follows:
program main
  implicit none
  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt
  real*8,allocatable,dimension(:,:,:)::theta, e

  allocate(theta(2000,50,5))
  allocate(e(2000,50,5))

  call system_clock(count_rate=rate)
  call system_clock(count=begin)
  !$omp parallel do
  do cnt = 1, 8
    do i = 1, 1001
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do
  end do
  !$omp end parallel do
  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

  deallocate(theta)
  deallocate(e)
end program main
This version has no problem with OpenMP and we can see the acceleration.
The second version is modified so that the implementation of the summation is written in a subroutine. There are two files, test.f90 and sub.f90, which are presented as follows:
! test.f90
program main
  use sub
  implicit none
  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt

  call system_clock(count_rate=rate)
  call system_clock(count=begin)
  !$omp parallel do
  do cnt = 1, 8
    call summation()
  end do
  !$omp end parallel do
  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate
end program main
and
! sub.f90
module sub
  implicit none
contains
  subroutine summation()
    implicit none
    real*8,allocatable,dimension(:,:,:)::theta, e
    integer i, j

    allocate(theta(2000,50,5))
    allocate(e(2000,50,5))
    theta = 0.d0
    e = 0.d0
    do i = 1, 101
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do
    deallocate(theta)
    deallocate(e)
  end subroutine summation
end module sub
I also wrote a Makefile, as follows:
FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp
FFLAGS = -c
LFLAGS =

result: sub.o test.o
	$(LN) $(LFLAGS) -o result test.o sub.o
test.o: test.f90
	$(FC) $(FFLAGS) -o test.o test.f90
sub.o: sub.f90
	$(FC) $(FFLAGS) -o sub.o sub.f90
clean:
	rm result *.o* *.mod *.e*
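For a gfortran build of the same two files, the equivalent commands would be roughly as follows (my sketch; -mcmodel=large is usually unnecessary here):
gfortran -O2 -fopenmp -c sub.f90
gfortran -O2 -fopenmp -c test.f90
gfortran -O2 -fopenmp -o result test.o sub.o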
(We can use gfortran instead, as sketched above.) However, when I run this version, there is a dramatic slow-down when using OpenMP, and it is even much slower than the single-threaded one (no OpenMP). So, what happened here and how can I fix it?

Why does a large matrix pass through several subroutines as fast as a smaller matrix?

What exactly is happening to my matrix? How is Fortran handling it?
What's attached is a snippet of code inspired by a larger project that simulates light transport in eye tissue. It passes some large matrices through subroutines and then randomly puts values in them.
My Goal: to see how passing such a large matrix through several subroutines impacts performance.
My Reference: the exact same code, except that the dimension of the matrix of interest is now [5,5] (it was previously [250,200]); the assumed one-line change is sketched after the module listing below.
My Question: why is there no significant difference in results?
MY RESULTS
MATRIX A_rz dimension [250,200]
real 0m6.661s
user 0m6.638s
sys 0m0.012s
MATRIX A_rz dimension [5,5]
real 0m6.508s
user 0m6.489s
sys 0m0.011s
bMatMOD.f90:
module bMatMOD
  implicit none

  type :: INPUT
    integer :: nLayers = 1
    integer :: nPhotons = 50000000
    real, dimension (2) :: dZR = (/0.0004, 0.001/)
    integer, dimension(3) :: nZRA = (/250,200,30/)
    real, dimension (1) :: d = (/0.03/)
  end type INPUT

  type :: OUTPUT
    real, allocatable :: Rd_ra(:,:)
    real, allocatable :: A_rz(:,:)
    real, allocatable :: Tt_ra(:,:)
  end type OUTPUT

contains

  subroutine initOUTPUTS (in_INPUT,out_OUTPUT)
    type (INPUT), intent (in) :: in_INPUT
    type (OUTPUT),intent (out) :: out_OUTPUT
    allocate (out_OUTPUT%A_rz(in_INPUT%nZRA(2),in_INPUT%nZRA(1)))
    allocate (out_OUTPUT%Rd_ra(in_INPUT%nZRA(2),in_INPUT%nZRA(3)))
    allocate (out_OUTPUT%Tt_ra(in_INPUT%nZRA(2),in_INPUT%nZRA(3)))
    out_OUTPUT%A_rz = 0.0
    out_OUTPUT%Rd_ra = 0.0
    out_OUTPUT%Tt_ra = 0.0
    return
  end subroutine initOUTPUTS

end module bMatMOD
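The [5,5] reference module (bMatMOD-v1.f90) is not shown; presumably it differs from bMatMOD.f90 only in the nZRA initializer, roughly:
! assumed change in bMatMOD-v1.f90: A_rz becomes 5x5 instead of 250x200
integer, dimension(3) :: nZRA = (/5,5,30/)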
bMatRoutines.f90:
subroutine A (o)
  use bMatMOD
  type (OUTPUT) :: o
  real :: rnd1, rnd2
  rnd1 = rand()
  rnd2 = rand()
  call B(o,rnd1,rnd2)
  return
end subroutine A

subroutine B (o,x,y)
  use bMatMOD
  type (OUTPUT) :: o
  real, intent (in) :: x
  real, intent (in) :: y
  integer, dimension(2) :: temp
  integer :: i, j
  temp = SHAPE(o%A_rz)
  i = INT(temp(1)*y)
  j = INT(temp(2)*x)
  if ( i .eq. 0) then
    i = 1
  endif
  if (i .eq. temp(1)) then
    i = i - 1
  endif
  if (j .eq. 0) then
    j = 1
  endif
  if (j .eq. temp(2)) then
    j = j - 1
  endif
  o%A_rz(i,j) = o%A_rz(i,j) + x + y
  return
end subroutine B
bMatmcml.f90:
program bMatmcml
  use bMatMOD
  implicit none
  type (INPUT) :: u
  type (OUTPUT) :: o
  integer :: i

  call initOUTPUTS(u,o)
  call srand(0)
  do i = 1,u%nPhotons,1
    call A(o)
  enddo
end program bMatmcml
bMat.sh:
rm -f *.o *~ *.exe
echo "MATRIX A_rz dimension [250,200]"
gfortran bMatMOD.f90 bMatRoutines.f90 bMatmcml.f90 -g -Wall -Werror -O3 -ffast-math -o bMat.exe
time ./bMat.exe
echo "MATRIX A_rz dimension [5,5]"
gfortran bMatMOD-v1.f90 bMatRoutines.f90 bMatmcml.f90 -g -Wall -Werror -O3 -ffast-math -o bMat-v1.exe
time ./bMat.exe

Performance problem with Euler problem and recursion on Int64 types

I'm currently learning Haskell using the Project Euler problems as my playground.
I was astounded by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've overlooked something, or if this is the kind of performance penalty one has to expect when using Haskell.
The following program is inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation and I make sure that the update of the accumulator variable keeping track of the arc length is strict. Yet it takes almost one and a half minutes to complete (compiled with ghc using the -O flag).
import Data.Int

arcLength :: Int64 -> Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' x y norm2 acc
    | x > y = acc
    | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
    | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
    public static void main(String args[]) {
        long n = 1 << 30;
        long x = 0;
        long y = n-1;
        long acc = 0;
        long norm2 = 0;
        long time = System.currentTimeMillis();
        while(x <= y) {
            if (norm2 < 0) {
                norm2 += 2*x + 1;
                x++;
            } else if (norm2 > 2*(n-1)) {
                norm2 += 2 - 2*(x+y);
                x--;
                y--;
            } else {
                norm2 += 2*x + 1;
                x++;
                acc++;
            }
        }
        time = System.currentTimeMillis() - time;
        System.err.println(acc);
        System.err.println(time);
    }
}
EDIT: After the discussions in the comments I made some modifications to the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried six different versions: with Int64 or Int, and with bangs before either norm2 alone or both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int64 norm2 acc)
arctest.exe: out of memory
(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)
(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
Main.$warcLength' =
\ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]
So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
So this looks very much like every operation on the Int64 gets turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
843298604
6630
So about 6.6 seconds for the Java solution. Next is ghc with some optimization:
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
843298604
./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3
% time ./arc [13:16]
843298604
./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the Haskell solution took under 6 seconds.
It would be interesting to know what compiler version you are using.
There's a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go x y acc norm2
      | x <= y = case () of { _
          | norm2 < 0 -> go (x+1) y acc (norm2 + 2*x + 1)
          | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
          | otherwise -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
        }
      | otherwise = acc

main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
main_$s$wgo
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
\ (sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
sc2_sQc
(+# sc3_sQd 1);
True ->
main_$s$wgo
(-#
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
sc1_sQb
(-# sc2_sQc 1)
(-# sc3_sQd 1)
};
True ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
sc1_sQb
sc2_sQc
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.0.3):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
Main_mainzuzdszdwgo_info:
cmpq %rdi, %r8
jg .L8
.L3:
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
.L5:
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
.L9:
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
.L8:
movq %rsi, %rbx
jmp *0(%rbp)
.L4:
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
The normal optimization flag for performance-concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.
With -O2 I get performance competitive with Java:
tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.948s
user 0m4.896s
sys 0m0.000s
And Java is about 1 second faster (20%):
tommd@Mavlo:Test$ time java ArcLength
843298604
3880
real 0m3.961s
user 0m3.936s
sys 0m0.024s
But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:
tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m5.973s
user 0m5.968s
sys 0m0.000s
But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:
$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.099s
user 0m4.088s
sys 0m0.000s
There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:
tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m3.982s
user 0m3.972s
sys 0m0.000s
Hopefully the NCG will be improved to match via-c for this case before they remove this backend.
dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"
I've played with the code a little and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):
{-# LANGUAGE BangPatterns #-}

arcLength :: Int -> Int
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' :: Int -> Int -> Int -> Int -> Int
  arcLength' !x !y !norm2 !acc
    | x > y = acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
    | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
    | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)

main = print $ arcLength (2^30)
Compiling and running:
$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...
$ time ./tmp1
843298604
real 0m3.553s
user 0m3.539s
sys 0m0.006s
