Fortran: 32-bit / 64-bit performance portability

I've been starting to use Fortran (95) for some numerical code (generating python modules). Here is a simple example:
subroutine bincount (x, c, n, m)
    implicit none
    integer, intent(in) :: n, m
    integer, dimension(0:n-1), intent(in) :: x
    integer, dimension(0:m-1), intent(out) :: c
    integer :: i
    c = 0
    do i = 0, n-1
        c(x(i)) = c(x(i)) + 1
    end do
end
I've found that this performs very well in 32-bit, but when compiled as x86_64 it is about 5x slower (MacBook Pro Core 2 Duo, Snow Leopard, gfortran 4.2.3 from r.research.att.com). I finally realised this might be due to using a 32-bit integer type instead of the native type, and indeed when I replace it with integer*8, the 64-bit performance is only 25% worse than the 32-bit one.
Why is using a 32 bit integer so much slower on a 64 bit machine? Are there any implicit casts going on with the indexing that I might not be aware of?
Is it always the case that 64 bit will be slower than 32 bit for this type of code (I was surprised at this) - or is there a chance I could get the 64 bit compiled version running the same speed or faster?
(main question) Is there any way to declare an (integer) variable to be the 'native' type, i.e. 32-bit when compiled 32-bit and 64-bit when compiled 64-bit, in modern Fortran? Without this it seems impossible to write portable Fortran code that won't be much slower depending on how it's compiled, and I think this means I will have to stop using Fortran for my project. I have looked at kind and selected_kind but have not been able to find anything that does this.
[Edit: the large performance hit was from the f2py wrapper copying the array to cast it from 64 bit int to 32 bit int, so nothing inherent to the fortran.]

The answer to your 'main question' is to select the correct compiler option so that the default integer is declared with 32 or 64 bits. I never use gfortran (I prefer g95, or even better a paid-for compiler), so I Googled it and it seems that -fdefault-integer-8 is the option you need.
Like you I'm surprised that the 64 bit version is slower than the 32 bit version. I don't have anything illuminating on that point.

I have also tried using a 64-bit machine to run WATFOR-77, but it was completely impossible. I got a gfortran compiler for my 64-bit machine, tried some options found on Google, and was later given the option of using gcc-mp-4.3 and gfortran 4.3, which was still slow.
I'd advise you to use a 32-bit machine, which is Fortran-compatible, to run your programs, or downgrade your 64-bit machine to 32-bit so your programs run faster and accurately.
Let's keep researching so we can get a 64-bit machine running compatibly with WATFOR-77 and subroutine programs.

While I haven't done careful studies, I haven't seen such large speed differences.
I suggest trying a newer version of gfortran. Version 4.2 is an early release (gfortran started with 4.0) and is considered obsolete. 4.3 and 4.4 are much improved and have more features; 4.4 is the current non-beta version. An easy way to obtain them on a Mac is via MacPorts: the gcc43 and gcc44 packages include gfortran. The compilers are installed as gcc-mp-4.3, gfortran-mp-4.3, etc., so as not to conflict with other versions. Or you can try the latest build of 4.5 from the gfortran wiki page.
Intel Fortran is sometimes significantly faster than gfortran.

Related

Detect 32- or 64-bit machine (understanding `32 << (^uint(0) >> 63)`)

It is said that the 32 << (^uint(0) >> 63) expression can be used to detect whether the machine is 32 or 64 bits.
How so?
UPDATE:
The question was closed because of
How can I determine the size of words in bits (32 or 64) on the architecture?
However, that answer has two problems:
On a 32-bit architecture, the result of the first step yields 0, and no matter how you shift it, the result of the second step will always be 0 (so the given answer is wrong).
Moreover, as per The Go Programming Language, constants are evaluated at compile time, so const BitsPerWord will be a fixed value, just like runtime.GOARCH, which gives the architecture of the compiled program; it cannot be used to detect the architecture of whatever machine the program actually runs on.
UPDATE:
I found that the most reliable and portable way is to check with this under Linux:
$ getconf LONG_BIT
64
It doesn't depend on any programming language's implementation, and it can be used in shell scripts too.
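For the same reason the Go constant only reflects the compile target, a compiled C program can only report the word size it was built for. As a hypothetical illustration (my own sketch, not from the original question or answers), the equivalent check from C looks like this:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative sketch: sizeof(void *) reflects the pointer width the
       binary was compiled for, so this prints 32 or 64 depending on the
       build target, not on the OS it happens to run on. */
    printf("%d\n", (int)(sizeof(void *) * CHAR_BIT));
    return 0;
}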

Is there a reason why arbitrary precision arithmetic (such as BigInt in JavaScript) is implemented in binary?

From this question, it seems Google Chrome and Node.js both chose to implement arbitrary precision arithmetic in binary. Is there a good reason to do that?
If we can add, subtract, multiply, or divide one decimal digit at a time, doing 7 + 8 = 15 and carrying to the next digit, isn't that faster than doing it bit by bit, where 7 + 8 needs four single-bit additions?
V8 developer here. Binary is a good choice because hardware is binary [*]. That doesn't mean that operations happen one bit at a time. In V8, a BigInt's "digits" are uintptr_t values, i.e. register-sized (32 bit on a 32-bit machine, 64 bit on a 64-bit machine) unsigned integers. See our blog post for an overview, and the source for all the gory details. FWIW, many other implementations (e.g. GMP, OpenJDK, Go, Dart) have made the same basic choice.
[*] Some hardware architectures have instructions for "binary coded decimal" arithmetic, which is similar to what you're describing, but this approach is (1) generally considered less efficient, and (2) not available on all architectures that we want V8 to run on.
One possible answer: the work is done by adding two 32- or 64-bit integers together, so it is faster than doing it one decimal digit at a time.
Likewise for multiplication: two 64-bit integers can be multiplied, probably in a single machine instruction, and all the digits of the result obtained at once.
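To make the "register-sized digits" idea concrete, here is a minimal sketch in C (hypothetical, not V8's or GMP's actual code): each limb of a big integer is a full uint32_t, and a 64-bit accumulator carries between limbs, so each loop iteration handles 32 bits at once.

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: adds two n-limb big integers stored
   least-significant limb first. Returns the final carry (0 or 1). */
uint32_t bignum_add(uint32_t *sum, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;   /* low 32 bits become the result limb */
        carry = t >> 32;        /* anything above becomes the next carry */
    }
    return (uint32_t)carry;
}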

Turn value into 2^value

What is the fastest way of turning some value (stored in register) into 2 to the power of that value in assembly language? I think that some bitwise operations can be used. For example:
Value: 8
Result: 256 (2^8)
So, short answer: what you're looking for is a left shift.
In C and many other languages, your particular wish would be served by 1 << 8.
You could do it in x86 assembler with shl, but there's really no sane reason to do so, since pretty much any compiler you come across is going to compile the code into the native shift instruction anyway.
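As a small self-contained sketch of the idiom (my own example, not from the answer): note that in C, shifting by an amount greater than or equal to the width of the type is undefined behavior, so the exponent must stay below 64 here.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: computes 2 to the power of value, valid for value in [0, 63]. */
static uint64_t pow2(unsigned value)
{
    return UINT64_C(1) << value;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)pow2(8));  /* prints 256 */
    return 0;
}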

How to make my Haskell program faster? Comparison with C

I'm working on an implementation of one of the SHA-3 candidates, JH. I'm at the point where the algorithm passes all KATs (Known Answer Tests) provided by NIST, and I have also made it an instance of the Crypto-API. Thus I have begun looking into its performance. But I'm quite new to Haskell and don't really know what to look for when profiling.
At the moment my code is consistently slower than the reference implementation written in C, by a factor of 10 for all input lengths (C code found here: http://www3.ntu.edu.sg/home/wuhj/research/jh/jh_bitslice_ref64.h).
My Haskell code is found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/JHInternal.hs.
Now I don't expect you to wade through all my code, rather I would just want some tips on a couple of functions. I have run some performance tests and this is (part of) the performance file generated by GHC:
Tue Oct 25 19:01 2011 Time and Allocation Profiling Report (Final)
main +RTS -sstderr -p -hc -RTS jh e False
total time = 6.56 secs (328 ticks @ 20 ms)
total alloc = 4,086,951,472 bytes (excludes profiling overheads)
COST CENTRE        MODULE                  %time  %alloc
roundFunction      Data.Digest.JHInternal   28.4    37.4
word128Shift       Data.BigWord.Word128     14.9    19.7
blockMap           Data.Digest.JHInternal   11.9    12.9
getBytes           Data.Serialize.Get        6.7     2.4
unGet              Data.Serialize.Get        5.5     1.3
sbox               Data.Digest.JHInternal    4.0     7.4
getWord64be        Data.Serialize.Get        3.7     1.6
e8                 Data.Digest.JHInternal    3.7     0.0
swap4              Data.Digest.JHInternal    3.0     0.7
swap16             Data.Digest.JHInternal    3.0     0.7
swap8              Data.Digest.JHInternal    1.8     0.7
swap32             Data.Digest.JHInternal    1.8     0.7
parseBlock         Data.Digest.JHInternal    1.8     1.2
swap2              Data.Digest.JHInternal    1.5     0.7
swap1              Data.Digest.JHInternal    1.5     0.7
linearTransform    Data.Digest.JHInternal    1.5     8.6
shiftl_w64         Data.Serialize.Get        1.2     1.1
Detailed breakdown omitted ...
Now quickly about the JH algorithm:
It's a hash algorithm which consists of a compression function F8, which is repeated as long as there are input blocks (of length 512 bits) remaining. This is just how the SHA functions operate. The F8 function consists of the E8 function, which applies a round function 42 times. The round function itself consists of three parts:
an sbox, a linear transformation, and a permutation (called swap in my code).
Thus it's reasonable that most of the time is spent in the round function. Still I would like to know how those parts could be improved. For instance: the blockMap function is just a utility function, mapping a function over the elements in a 4-tuple. So why is it performing so badly? Any suggestions would be welcome, and not just on single functions, i.e. are there structural changes you would have done in order to improve the performance?
I have tried looking at the Core output, but unfortunately that's way over my head.
I attach some of the heap profiles at the end as well in case that could be of interest.
EDIT :
I forgot to mention my setup and build. I run it on an x86_64 Arch Linux machine, GHC 7.0.3-2 (I think), with compile options:
ghc --make -O2 -funbox-strict-fields
Unfortunately there seems to be a bug on the Linux platform when compiling via C or LLVM, giving me the error:
Error: .size expression for XXXX does not evaluate to a constant
so I have not been able to see the effect of that.
Switch to unboxed Vectors (from Array, used for constants)
Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significant in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants, as these values will be explicit in the code, resulting in a more cache-friendly binary.
Unpack your Word128 values.
Define your own addition for Word128, don't lift Integer. See LargeWord for an example of how this can be done.
rem not mod
Compile with optimization (-O2) and try llvm (-fllvm)
EDIT: And cabalize your git repo along with a benchmark so we can help you easier ;-). Good work on including a crypto-api instance.
The lower graph shows that a lot of memory is occupied by lists. Unless there are more lurking in other modules, they can only come from e8. Maybe you'll have to bite the bullet and make that a loop instead of a fold, but for starters, since Block1024 is a pair, the foldl' doesn't do much evaluation on the fly (unless the strictness analyser has become significantly better). Try making that stricter, data Block1024 = B1024 !Block512 !Block512, perhaps it also needs {-# UNPACK #-} pragmas. In roundFunction, use rem instead of mod (this will only have minor impact, but it's a bit faster) and make the let bindings strict. In the swapN functions, you might get better performance giving the constants in the form W x y rather than as 128-bit hex numbers.
I can't guarantee those changes will help, but that's what looks most promising after a short glance.
Ok, so I thought I would chime in with an update of what I have done and the results obtained thus far. Changes made:
Switched from Array to UnboxedArray (made Word128 an instance type)
Used UnboxedArray + fold in e8 instead of lists and (prelude) fold
Used unsafeIndex instead of !
Changed the type of Block1024 to a real datatype (similar to Block512), and unpacked its arguments
Updated GHC to version 7.2.1 on Arch Linux, thus fixing the problem with compiling via C or LLVM
Switched mod to rem in some places, but NOT in roundFunction. When I do it there, compilation suddenly takes an awfully long time, and the run time becomes 10 times slower! Does anyone know why that may be? It only happens with GHC 7.2.1, not GHC 7.0.3.
I compile with the following options:
ghc-7.2.1 --make -O2 -funbox-strict-fields main.hs ./Tests/testframe.hs -fvia-C -optc-O2
And the results? Roughly a 50% reduction in time. On an input of ~107 MB, the code now takes 3 minutes, compared to the previous 6-7 minutes. The C version takes 42 seconds.
Things I tried, but which didn't result in better performance:
Unrolled the e8 function like this:
e8 !h = go h 0
  where go !x !n
          | n == 42   = x
          | otherwise = go h' (n + 1)
          where !h' = roundFunction x n
Tried breaking up the swapN functions to use the underlying Word64' directly:
swap1 (W xh xl) =
    shiftL (W (xh .&. 0x5555555555555555) (xl .&. 0x5555555555555555)) 1
    .|.
    shiftR (W (xh .&. 0xaaaaaaaaaaaaaaaa) (xl .&. 0xaaaaaaaaaaaaaaaa)) 1
Tried using the LLVM backend
All of these attempts gave worse performance than what I currently have. I don't know if that's because I'm doing it wrong (especially the unrolling of e8) or because they're just worse options.
Still I have some new questions with these new tweaks.
Suddenly I have gotten a peculiar bump in memory usage. Take a look at the following heap profiles:
Why has this happened? Is it because of the UnboxedArray? And what does SYSTEM mean?
When I compile via C I get the following warning:
Warning: The -fvia-C flag does nothing; it will be removed in a future GHC release
Is this true? Why then, do I see better performance using it, rather than not?
It looks like you did a fair amount of tweaking already; I'm curious what the performance is like without explicit strictness annotations (BangPatterns) and the various compiler pragmas (UNPACK, INLINE)... Also, a dumb question: what optimization flags are you using?
Anyway, two suggestions which may be completely awful:
Use unboxed primitive types where you can (e.g. replace Data.Word.Word64 with GHC.Word.Word64#, make sure word128Shift is using Int#, etc.) to avoid heap allocation. This is, of course, non-portable.
Try Data.Sequence instead of []
At any rate, rather than looking at the Core output, try looking at the intermediate C files (*.hc) instead. It can be hard to wade through, but sometimes makes it obvious where the compiler wasn't quite as sharp as you'd hoped.

Difference in integer size on a 64-bit system (confusion compared with my old 32-bit PC system)

A few months ago I got myself a laptop with an Intel i7-2630QM CPU running 64-bit Windows. While practising my programming skills on this system, I encountered some differences in integer size which make me think they're probably due to my new 64-bit system.
Let's take a look at a code.
The C Code :
#include <stdio.h>
int main(void)
{
    int num = 20;
    printf("%d %lld\n", num, num);
    return 0;
}
The Questions:
1.) I remember that before getting this new laptop, when I was still using my old 32-bit system, running this code would print the integer 20 with some random number next to it, due to the %lld specifier.
2.) But this no longer happens on my new laptop; it prints both integers correctly, even if I change the variable num to type short.
3.) Is there some new integer promotion on a 64-bit system that promotes int to long long when it is used as an argument? Or can a short be promoted to long long, which is also 64 bits, when passed as an argument?
4.) Besides that, I'm quite confused about one thing: on a 16-bit system int is 16 bits, and it is 32 bits on a 32-bit system. So why doesn't it become 64 bits on a 64-bit system?
==================================================================================
Addendum:
1.) I chose "console program (64-bit)" as my project type in the IDE on my new laptop, but "console program" on my old 32-bit PC.
2.) I've checked the size of int under the "console program (64-bit)" project using the sizeof operator and it returns 32 bits, while short still remains 16 bits. The only change is the long type: it's 64 bits, and long long still remains its usual 64-bit size.
You are seeing this side-effect because the calling convention is different for x64 code. The function arguments in 32-bit x86 code are passed on the stack. The printf() function will read a word from the stack that isn't part of the activation frame. The odds that it contains a value of 0 are extremely low.
In x64 code, the first 4 arguments for a function are passed through CPU registers, not the stack. The odds that the high word of the 64-bit register happens to be zero are quite good, left there by a previous 64-bit operation that worked with small numbers, but that is certainly not guaranteed.
Trying to reason out the defined behavior of undefined behavior is otherwise not useful, other than as a way of guessing how the language is implemented for the particular core in your machine. There are better resources for that: learning the machine code that's applicable to your compiler is an excellent shortcut, together with a decent debugger that shows you how your C code got translated into machine code. Machine code has no undefined behavior.
I do not have access to a Windows 64-bit compiler right now, but my guess is the following.
Your question is not about integer promotion, but regarding how parameters are passed from the function caller to the called function. This is beyond the C specification, but it is interesting to know.
In 32-bit, all parameters are divided into 32-bit blocks as all registers can hold 32 bits. So in this case we have the following stack layout:
[ 32-bit format string pointer ][ num as 32-bit ][ num as 32-bit ] junk...
In 64-bit, all parameters are divided into 64-bit blocks as all registers can hold 64 bits. So the stack will contain the following:
[ 64-bit format string pointer ][ num as 64-bit ][ num as 64-bit ] junk...
The upper 32 bits of the 64-bit registers holding 32-bit values are conveniently set to zero.
So when printf is reading a 64-bit number, it will load the equivalent of two 32-bit registers on a 32-bit platform but only one 64-bit register, with high bits cleared, on a 64-bit platform.
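For completeness, a minimal sketch (my addition, not from the original answers) of how to make the call well-defined: match the argument types to the conversion specifiers.

#include <stdio.h>

int main(void)
{
    int num = 20;
    /* Illustrative fix: either cast the argument to match %lld,
       or simply use %d for an int. */
    printf("%d %lld\n", num, (long long)num);
    printf("%d %d\n", num, num);
    return 0;
}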
(1 and 2) As already stated, the behaviour in this situation is undefined, so the compiler is allowed to behave differently for any reason or indeed no reason at all.
(3) The compiler is allowed to define int as 64-bit, in which case no promotion would be necessary because all the variables in question would be the same size. But it almost certainly doesn't.
(4) On most or all 64-bit compilers, int is 32-bits. This is because int has been 32 bits for so long that programmers have come to expect it and changing it would break existing code. As far as I know this isn't officially part of the standard, but it's one of those de-facto standards that are even harder to change. :-)
Everything you are describing is specific to whatever spec your compiler is using and the platform you are on (with the exception that long is guaranteed to be at least the same size as int):
Wikipedia entries:
long long
int
The C99 standard seeks to end this ambiguity by adding specific types: int32_t, uint64_t, etc. There's also a POSIX spec that defines u_int32_t, etc.
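As a quick illustration (my own sketch, not part of the original answer), the fixed-width types pair with the format macros from <inttypes.h> so the printf specifiers always match the argument widths:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    int32_t a = 20;     /* exactly 32 bits on every conforming platform */
    uint64_t b = 20;    /* exactly 64 bits on every conforming platform */
    printf("%" PRId32 " %" PRIu64 "\n", a, b);
    return 0;
}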
Edit: I missed the question about printf(), sorry. As @nos points out in the comments on your question, passing something other than a long long to %lld results in undefined behavior. This means there is no rhyme or reason as to what it will do; unicorns spontaneously appearing would not be out of the question.
Oh - and on every compiler and OS I know, int is 32 bit. Changing that has the potential to break things that depend on it being 32 bit.
