Why is this Monte Carlo Haskell program so slow? - performance

update:
My compilation command is ghc -O2 Montecarlo.hs. My random version is random-1.1, ghc version is 8.6.4, and my system is macOS Big Sur 11.1 (Intel chip). The command I used to test the speed is time ./Montecarlo 10000000, and the result it returns is real 0m17.463s, user 0m17.176s, sys 0m0.162s.
The following is a Haskell program that uses Monte Carlo to calculate pi. However, with an input of 10 million, the program runs for about 20 seconds, while a C program written with the same logic takes only 0.206 seconds. Why is this, and how can I speed it up? Thank you all.
This is the Haskell version:
import System.Random
import Data.List
import System.Environment

montecarloCircle :: Int -> Double
montecarloCircle x
    = 4 * fromIntegral
            (foldl' (\x y -> if y <= 1 then x+1 else x) 0
              $ zipWith (\x y -> (x**2 + y**2))
                  (take x $ randomRs (-1.0, 1) (mkStdGen 1) :: [Double])
                  (take x $ randomRs (-1.0, 1) (mkStdGen 2) :: [Double]) )
      / fromIntegral x

main = do
    num <- getArgs
    let n = read (num !! 0)
    print $ montecarloCircle n
This is the C version:
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>

#define N 10000000
#define int_t long // the type of N and M

// Rand from 0.0 to 1.0
double rand01()
{
    return rand() * 1.0 / RAND_MAX;
}

int main()
{
    srand((unsigned)time(NULL));
    double x, y;
    int_t M = 0;
    for (int_t i = 0; i < N; i++)
    {
        x = rand01();
        y = rand01();
        if (x*x + y*y < 1) M++;
    }
    double pi = (double)4 * M / N;
    printf("%lf\n", pi);
}

It is sort of expected that on numerical applications, Haskell code tends to be, say, 2X to 5X slower than equivalent code written in classic imperative languages such as C/C++/Fortran. Haskell code being 100 times slower, however, is very unexpected!
Sanity check: reproducing the result
Using a somewhat oldish notebook with an Intel Celeron N2840 64 bits cpu, running Linux kernel 5.3, GLIBC 2.28, GCC 8.3, GHC 8.2.2, GHC random-1.1, we get the timings:
C code, gcc -O3: 950 msec
Haskell code, ghc -O3: 104 sec
So indeed, with these configurations, the Haskell code runs about 100 times slower than the C code.
Why is that ?
A first remark is that the arithmetic for computing π on top of the random number generation looks quite simple, hence the runtimes are probably dominated by the generation of the 2×10 million pseudo-random numbers. Furthermore, we have every reason to expect that Haskell and C are using completely unrelated random number generation algorithms.
So, rather than comparing Haskell with C, we could be comparing the random number generation algorithms these 2 languages happen to be using, as well as their respective implementations.
In C, the language standard does not specify which algorithm rand() is supposed to use, but implementations tend to pick old, simple, and fast algorithms.
On the other hand, in Haskell, we have traditionally seen a lot of complaints about the poor efficiency of StdGen, like here back in 2006:
https://gitlab.haskell.org/ghc/ghc/-/issues/427
where one of the leading Haskell luminaries mentions that StdGen could possibly be made 30 times faster.
Fortunately, help is already on the way. This recent blog post explains how the new Haskell random-1.2 package solves the problem of StdGen's lack of speed, using a completely different algorithm known as splitmix.
The field of pseudo-random number generation (PRNG) is a rather active one. Algorithms routinely get obsoleted by newer and better ones. For the sake of perspective, here is a relatively recent (2018) review paper on the subject.
Moving to more recent, better Haskell components:
Using another, slightly more powerful machine with an Intel Core i5-4440 3GHz 64 bits cpu, running Linux kernel 5.10, GLIBC 2.32, GCC 10.2, and critically Haskell package random-1.2:
C code, gcc -O3: 164 msec
Haskell code, ghc -O3: 985 msec
So Haskell is now “just” 6 times slower instead of 100 times.
And we still have to address the injustice of Haskell having to compute x**2 + y**2 while C gets x*x + y*y, a detail which did not really matter before with random-1.1. Making that change brings us down to 379 msec! So we are back into the usual 2X-5X ballpark for Haskell-to-C speed comparisons.
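As a side illustration of that detail: in Haskell, (**) is the general floating-point power function, while x*x is a single multiplication. A minimal standalone comparison (the names square1/square2 are ours, purely illustrative):

square1, square2 :: Double -> Double
square1 x = x ** 2   -- goes through the general floating-point powering routine
square2 x = x * x    -- a single multiply

main :: IO ()
main = print (square1 3.0, square2 3.0)   -- same value here; the difference is cost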
Note that if we ask the Haskell executable to run with statistics on, we get the following output:
$ time q66441802.x +RTS -s -RTS 10000000
3.1415616
925,771,512 bytes allocated in the heap
...
Alloc rate 2,488,684,937 bytes per MUT second
Productivity 99.3% of total user, 99.3% of total elapsed
real 0m0,379s
user 0m0,373s
sys 0m0,004s
so Haskell is found to allocate close to one gigabyte of memory along the way, which helps to understand the speed difference.
A code fix:
We note that the C code uses a single random series while the Haskell code uses two, with two calls to mkStdGen using the numbers 1 and 2 as seeds. This is not only unfair but also incorrect. Applied mathematicians who design PRNG algorithms take great care to ensure any single series has the right statistical properties, but they give essentially no guarantees about possible correlations between different series. It is not unheard of to even use the seed as an offset into a single global sequence.
This is fixed in the following code, which does not alter performance significantly:
computeNorms :: [Double] -> [Double]
computeNorms []        = []
computeNorms [_]       = []
computeNorms (x:y:xs2) = (x*x + y*y) : computeNorms xs2

monteCarloCircle :: Int -> Double
monteCarloCircle nn =
    let
        randomSeed   = 1
        gen0         = mkStdGen randomSeed
        -- use a single random series:
        uxys         = randomRs (-1.0, 1.0) gen0 :: [Double]
        norms        = take nn (computeNorms uxys)
        insiderCount = length $ filter (<= 1.0) norms
    in
        (4.0 :: Double) * (fromIntegral insiderCount / fromIntegral nn)
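For completeness, a small main wrapper in the style of the original program (assuming the same import System.Random and import System.Environment as before), so the fixed version can be timed the same way:

main :: IO ()
main = do
    num <- getArgs
    let n = read (head num)
    print $ monteCarloCircle n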
Side note:
The new Haskell random-1.2 package has been mentioned in this recent SO question, though in the context of the new monadic interface.
A word of conclusion:
Assuming the goal of the exercise is to gauge the relative runtime speeds of the C and Haskell languages, there are essentially two possibilities:
One is to avoid using PRNG altogether, because of the different algorithms in use.
The other one is to control random number generation by manually providing one and the same algorithm into both languages. For example, one could use the publicly available MRG32k3a algorithm designed by Pierre L'Ecuyer. Candidate Haskell MRG32k3a implementation here (as of March 2021). Left as an exercise for the reader.

Related

Why does refactoring data to newtype speed up my haskell program?

I have a program which traverses an expression tree that does algebra on probability distributions, either sampling or computing the resulting distribution.
I have two implementations computing the distribution: one (computeDistribution) nicely reusable with monad transformers and one (simpleDistribution) where I concretize everything by hand. I would like to not concretize everything by hand, since that would be code duplication between the sampling and computing code.
I also have two data representations:
type Measure a = [(a, Rational)]
-- data Distribution a = Distribution (Measure a) deriving Show
newtype Distribution a = Distribution (Measure a) deriving Show
When I use the data version with the reusable code, computing the distribution of 20d2 (ghc -O3 program.hs; time ./program 20 > /dev/null) takes about one second, which seems way too long. Pick higher values of n at your own peril.
When I use the hand-concretized code, or I use the newtype representation with either implementation, computing 20d2 (time ./program 20 s > /dev/null) takes the blink of an eye.
Why?
How can I find out why?
My knowledge of how Haskell is executed is almost nil. I gather there's a graph of thunks in basically the same shape as the program, but that's about all I know.
I figure with newtype the representation of Distribution is the same as that of Measure, i.e. it's just a list, whereas with the data version each Distribution is kinda' like a single-field record, except with a pointer to the contained list, and so the data version has to perform more allocations. Is this true? If true, is this enough to explain the performance difference?
I'm new to working with monad transformer stacks. Consider the Let and Uniform cases in simpleDistribution — do they do the same as the walkTree-based implementation? How do I tell?
Here's my program. Note that Uniform n corresponds to rolling an n-sided die (in case the unary-ness was surprising).
Update: based on comments I simplified my program by removing everything not contributing to the performance gap. I made two semantic changes: probabilities are now denormalized and all wonky and wrong, and the simplification step is gone. But the essential shape of my program is still there. (See question edit history for the non-simplified program.)
Update 2: I made further simplifications, reducing Distribution down to the list monad with a small twist, removing everything to do with probabilities, and shortening the names. I still observe large performance differences when using data but not newtype.
import Control.Monad (liftM2)
import Control.Monad.Trans (lift)
import Control.Monad.Reader (ReaderT, runReaderT)
import System.Environment (getArgs)
import Text.Read (readMaybe)

main = do
    args <- getArgs
    let dieCount = case map readMaybe args of Just n : _ -> n; _ -> 10
    let f = if ["s"] == (take 1 $ drop 1 $ args) then fast else slow
    print $ f dieCount

fast, slow :: Int -> P Integer
fast n = walkTree n
slow n = walkTree n `runReaderT` ()

walkTree 0 = uniform
walkTree n = liftM2 (+) (walkTree 0) (walkTree $ n - 1)

data P a = P [a] deriving Show
-- newtype P a = P [a] deriving Show

class Monad m => MonadP m where uniform :: m Integer
instance MonadP P where uniform = P [1, 1]
instance MonadP p => MonadP (ReaderT env p) where uniform = lift uniform

instance Functor P where fmap f (P pxs) = P $ fmap f pxs
instance Applicative P where
    pure x = P [x]
    (P pfs) <*> (P pxs) = P $ pfs <*> pxs
instance Monad P where
    (P pxs) >>= f = P $ do
        x <- pxs
        case f x of P fxs -> fxs
How can I find out why?
This is, in general, hard.
The extreme way to do it is to look at the Core (which you can produce by running GHC with -ddump-simpl). This can get complicated really quickly, and it's basically a whole new language to learn. Your program is already big enough that I had trouble learning much from the Core dump.
The other way to find out why is to just keep using GHC and asking questions and learning about GHC optimizations until you recognize certain patterns.
Why?
In short, I believe it's due to list fusion.
NOTE: I don't know for sure that this answer is correct, and it would take more time/work to verify than I'm willing to put in right now. That said, it fits the evidence.
First off, we can check whether the slowdown you're seeing is the result of something truly fundamental versus a GHC optimization triggering or not, by compiling with -O0, that is, without optimizations. In this mode, both Distribution representations result in about the same (excruciatingly long) runtime. This leads me to believe that it's not the data representation that is inherently the problem, but rather that there's an optimization that's triggered with the newtype version and isn't with the data version.
When GHC is run in -O1 or higher, it engages certain rewrite rules to fuse different folds and maps of lists together so that it doesn't need to allocate intermediate values. (See https://markkarpov.com/tutorial/ghc-optimization-and-fusion.html#fusion for a decent tutorial on this concept as well as https://stackoverflow.com/a/38910170/14802384 which additionally has a link to a gist with all of the rewrite rules in base.) Since computeDistribution is basically just a bunch of list manipulations (which are all essentially folds), there is the potential for these to fire.
The key is that with the newtype representation of Distribution, the newtype wrapper is erased during compilation, and the list operations are allowed to fuse. However, with the data representation, the wrappers are not erased, and the rewrite rules do not fire.
Therefore, I will make an unsubstantiated claim: If you want your data representation to be as fast as the newtype one, you will need to set up rewrite rules similar to the ones for list folding but that work over the Distribution type. This may involve writing your own special fold functions and then rewriting your Functor/Applicative/Monad instances to use them.
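To make that last claim concrete, here is a minimal, unverified sketch of what such rules could look like, modeled directly on GHC's own fold/build rule for lists. The names foldP and buildP are hypothetical, not part of the questioner's program:

{-# LANGUAGE RankNTypes #-}
module PFusion where

data P a = P [a]

-- a fold over P, inlined late so the rule below gets a chance to fire
foldP :: (a -> b -> b) -> b -> P a -> b
foldP f z (P xs) = foldr f z xs
{-# INLINE [0] foldP #-}

-- a builder for P, in the style of GHC.Base.build
buildP :: (forall b. (a -> b -> b) -> b -> b) -> P a
buildP g = P (g (:) [])
{-# INLINE [1] buildP #-}

-- analogue of GHC's "fold/build" rule: consuming a freshly built P
-- skips the intermediate structure entirely
{-# RULES
"foldP/buildP" forall f z (g :: forall b. (a -> b -> b) -> b -> b).
    foldP f z (buildP g) = g f z
  #-}

Whether this actually recovers the newtype performance would have to be checked, e.g. with -ddump-rule-firings.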

Do Haskell’s strict folds really use linear space?

I thought I understood the basics of fold performance in Haskell, as described in foldr, foldl, foldl' on the Haskell Wiki and many other places. In particular, I learned that for accumulating functions, one should use foldl', to avoid space leaks, and that the standard library functions are written to respect this. So I presumed that simple accumulators like length, applied to simple lists like replicate n 1, should require constant space (or at least sub-linear) in the length of the list. My intuition was that on sufficiently simple lists, they would behave roughly like a for loop in an imperative language.
But today I found that this seems not to hold in practice. For instance, length $ replicate n 1 seems to use space linear in n. In ghci:
ghci> :set +s
ghci> length $ replicate (10^6) 1
1000000
(0.02 secs, 56,077,464 bytes)
ghci> length $ replicate (10^7) 1
10000000
(0.08 secs, 560,078,360 bytes)
ghci> length $ replicate (10^8) 1
100000000
(0.61 secs, 5,600,079,312 bytes)
ghci> length $ replicate (10^9) 1
1000000000
(5.88 secs, 56,000,080,192 bytes)
Briefly, my question is: Do length and other strict folds really use linear space? If so, why? And is it inevitable? Below are more details of how I’ve played around trying to understand this, but they’re probably not worth reading — the tl;dr is that the linear-space usage seems to persist whatever variations I try.
(I originally used sum as the example function. As Willem Van Onsem points out, that was a badly-chosen example as default instances aren’t actually strict. However, the main question remains, since as noted below, this occurs with plenty of other functions that really are based on strict folds.)
Replacing length with foldl' (\n _ -> n+1) 0 appears to make performance worse by a constant factor; space usage still seems to be linear.
Versions defined with foldl and foldr had worse memory usage (as expected), but only by a small constant factor, not asymptotically worse (as most discussions seem to suggest).
Replacing length with sum, last, or other simple accumulators, or with the obvious definitions of these using foldl', also doesn’t seem to change the linear space usage.
Using [1..n] as the test list, and other similar variations, also seems to make no significant difference.
Switching between the general versions of sum, foldl', etc from Data.Foldable, the specialised ones in Data.List, and local versions defined directly by pattern-matching, also seems to make no difference.
Compiling instead of working in ghci also only seemed to improve space usage by a constant factor.
Switching between several recent versions of GHC — 8.8.4, 8.10.5, and 9.0.1 — also seemed to make no significant difference.
"Do they use linear space" is a slightly unclear question. Usually when we talk about the space an algorithm uses, we're talking about its working set: the maximum amount of memory it needs all at once. "If my computer only had X bytes of memory, could I run this program?" But that's not what GHCI's :set +s measures. It measures the sum of all memory allocations made, including those that were cleaned up partway through. And what is the biggest use of memory in your experiment? The list itself, of course.
So you've really just measured the number of bytes that a list of size N takes up. You can confirm this by using last instead of length, which I hope you'll agree allocates no intermediate results, and is strict. It takes the same amount of memory using your metric as length does - length does no extra allocation for the sums.
But a bigger problem is that GHCI is not an optimizing compiler. If you care about performance characteristics at all, GHCI is the wrong tool. Instead, use GHC with -O2, and turn on GHC's profiler.
import System.Environment (getArgs)

main = do
    n <- read . head <$> getArgs
    print $ length (replicate (10^n) 1)
And running it:
$ ghc -O2 -prof -fprof-auto stackoverflow.hs
$ ./stackoverflow 6 +RTS -p
1000000
$ grep "total alloc" stackoverflow.prof
total alloc = 54,856 bytes (excludes profiling overheads)
$ ./stackoverflow 9 +RTS -p
1000000000
$ grep "total alloc" stackoverflow.prof
total alloc = 55,008 bytes (excludes profiling overheads)
we can see that space usage is roughly constant despite a thousand-fold increase in input size.
Will Ness correctly points out in a comment that -s would be a better measuring tool than -p.
If I replace sum with foldl' (+) 0, performance improves noticeably in both time and space (which is itself a surprise; shouldn't the standard sum be at least as efficient?), but only by a constant factor; space usage still seems to be linear.
sum is implemented as [src]:
sum :: Num a => t a -> a
sum = getSum #. foldMap Sum
It thus makes use of the Sum data type and its Monoid instance, where mappend = (+) and mempty = 0. foldMap works right-associatively; indeed:
Map each element of the structure into a monoid, and combine the results with (<>). This fold is right-associative and lazy in the accumulator. For strict left-associative folds consider foldMap' instead.
foldMap is thus implemented with foldr [src]:
foldMap :: Monoid m => (a -> m) -> t a -> m
{-# INLINE foldMap #-}
-- This INLINE allows more list functions to fuse. See #9848.
foldMap f = foldr (mappend . f) mempty
While foldl' will indeed have a (much) smaller memory footprint and will likely be more efficient, a reason to work with foldr is that for Peano numbers, for example, one can make use of laziness: the head normal form will look like S(…), where … might not be evaluated (yet).
foldr can also terminate earlier: if, for example, you compute a sum over a certain algebraic structure, it is possible to terminate the looping early.
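A small standalone sketch of the Peano-number point (the Peano type here is illustrative, not from base): because foldr with a combining function that is lazy in its second argument yields the outermost constructor first, a consumer that only inspects that constructor can stop early, even on an infinite list.

data Peano = Z | S Peano

addP :: Peano -> Peano -> Peano
addP Z     m = m
addP (S n) m = S (addP n m)   -- lazy in the remaining sum

sumP :: [Peano] -> Peano
sumP = foldr addP Z

-- looks only at the head constructor, so it terminates even though
-- the full sum of the infinite list is never computed
isPositive :: Peano -> Bool
isPositive Z     = False
isPositive (S _) = True

main :: IO ()
main = print (isPositive (sumP (repeat (S Z))))   -- prints True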

Poor Haskell performance with lazy lists

I tried to test Haskell performance, but got some unexpectedly poor results:
-- main = do
--     putStrLn $ show $ sum' [1..1000000]

sum' :: [Int] -> Int
sum' []     = 0
sum' (x:xs) = x + sum' xs
I first ran it from ghci -O2:
> :set +s
> sum' [1..1000000]
1784293664
(4.81 secs, 163156700 bytes)
Then I compiled the code with ghc -O3, ran it using time, and got this:
1784293664
real 0m0.728s
user 0m0.700s
sys 0m0.016s
Needless to say, these results are abysmal compared to the C code:
#include <stdio.h>

int main(void)
{
    int i, n;
    n = 0;
    for (i = 1; i <= 1000000; ++i)
        n += i;
    printf("%d\n", n);
}
After compiling it with gcc -O3 and running it with time I got:
1784293664
real 0m0.022s
user 0m0.000s
sys 0m0.000s
What is the reason for such poor performance? I assumed that Haskell would never actually construct the list, am I wrong in that assumption? Is this something else?
UPD: Is the problem that Haskell doesn't know that addition is associative? Is there a way to make it see and use that?
First, don't bother to discuss GHCi when you're talking about performance. It's nonsense to use -Ox flags with GHCi.
You're Building Up A Huge Computation
Using GHC 7.2.2 x86-64 with -O2 I get:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
The reason this uses so much stack space is that on every iteration you build an expression of the form i + ..., so your computation is transformed into one huge thunk:
n = 1 + (2 + (3 + (4 + ...
That's going to take a lot of memory. There is a reason the standard sum isn't defined like your sum'.
With A Reasonable Definition for sum
If I change your sum' to sum or an equivalent such as foldl' (+) 0 then I get:
$ ghc -O2 -fllvm so.hs
$ time ./so
500000500000
real 0m0.049s
Which seems entirely reasonable to me. Keep in mind that with such a short-running piece of code, much of your measured time is noise (loading the binary, starting up the RTS and GC nursery, misc initializations, etc.). Use Criterion (a benchmarking tool) if you want accurate measurements of small-ish Haskell computations.
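For instance, a minimal Criterion benchmark for the sum' above might look like this (a sketch, assuming the criterion package is installed):

import Criterion.Main

sum' :: [Int] -> Int
sum' []     = 0
sum' (x:xs) = x + sum' xs

main :: IO ()
main = defaultMain
    [ bench "sum' [1..1000000]" $ whnf sum' [1 .. 1000000] ]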
Comparing to C
My gcc -O3 time is immeasurably low (reported as 0.002 seconds) because the main routine consists of 4 instructions - the entire computation is evaluated at compile time and the constant of 0x746a5a2920 is stored in the binary.
There is a rather long Haskell thread (here, but beware: it's something of an epic flame war that still burns in people's minds almost 3 years later) where people discuss the realities of doing this in GHC, starting from your exact benchmark. It isn't there yet, but they did come up with some Template Haskell work that would do this if you wish to achieve the same results selectively.
The GHC optimizer seems to not be doing as well as it should. Still, you can probably build a much better implementation of sum' using tail recursion and strict values.
Something like this (using BangPatterns):

{-# LANGUAGE BangPatterns #-}

sum' :: [Int] -> Int
sum' = sumt 0

sumt :: Int -> [Int] -> Int
sumt !n []     = n
sumt !n (x:xs) = sumt (n + x) xs
I haven't tested that, but I would bet it gets closer to the C version.
Of course, you are still counting on the optimizer to get rid of the list. You could just use the same algorithm as you do in C (using an int i and a goto):

sumToX :: Int -> Int
sumToX x = sumToX' 0 1 x

sumToX' :: Int -> Int -> Int -> Int
sumToX' !n !i x = if i <= x then sumToX' (n+i) (i+1) x else n
You still hope that GHC does loop unwinding at the imperative level.
I haven't tested any of this, btw.
EDIT: I thought I should point out that sum [1..1000000] really should be 500000500000 and is only 1784293664 because of integer overflow. Why you would ever need to calculate this becomes an open question. Anyway, using ghc -O2 and a naive tail-recursive version with no bang patterns (which should be exactly the sum in the standard library) got me
real 0m0.020s
user 0m0.015s
sys 0m0.003s
Which made me think that the problem was just your GHC. But it seems my machine is simply faster, because the C ran at
real 0m0.005s
user 0m0.001s
sys 0m0.002s
My sumToX (with or without bang patterns) gets halfway there:
real 0m0.010s
user 0m0.004s
sys 0m0.003s
Edit 2: After disassembling the code, I think my answer to why the C is still twice as fast (as the list-free version) is this: GHC has a lot more overhead before it ever gets to calling main, and it generates a fair bit of runtime junk. Obviously this gets amortized on real code, but compare with the beauty GCC generates:
0x0000000100000f00 <main+0>: push %rbp
0x0000000100000f01 <main+1>: mov %rsp,%rbp
0x0000000100000f04 <main+4>: mov $0x2,%eax
0x0000000100000f09 <main+9>: mov $0x1,%esi
0x0000000100000f0e <main+14>: xchg %ax,%ax
0x0000000100000f10 <main+16>: add %eax,%esi
0x0000000100000f12 <main+18>: inc %eax
0x0000000100000f14 <main+20>: cmp $0xf4241,%eax
0x0000000100000f19 <main+25>: jne 0x100000f10 <main+16>
0x0000000100000f1b <main+27>: lea 0x14(%rip),%rdi # 0x100000f36
0x0000000100000f22 <main+34>: xor %eax,%eax
0x0000000100000f24 <main+36>: leaveq
0x0000000100000f25 <main+37>: jmpq 0x100000f30 <dyld_stub_printf>
Now, I'm not much of an x86 assembly programmer, but that looks more or less perfect.
Okay, I have graduate school applications to work on. No more.

How to make my Haskell program faster? Comparison with C

I'm working on an implementation of one of the SHA3 candidates, JH. I'm at the point where the algorithm pass all KATs (Known Answer Tests) provided by NIST, and have also made it an instance of the Crypto-API. Thus I have began looking into its performance. But I'm quite new to Haskell and don't really know what to look for when profiling.
At the moment my code is consistently slower than the reference implementation written in C, by a factor of 10 for all input lengths (C code found here: http://www3.ntu.edu.sg/home/wuhj/research/jh/jh_bitslice_ref64.h).
My Haskell code is found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/JHInternal.hs.
Now I don't expect you to wade through all my code, rather I would just want some tips on a couple of functions. I have run some performance tests and this is (part of) the performance file generated by GHC:
Tue Oct 25 19:01 2011 Time and Allocation Profiling Report (Final)

main +RTS -sstderr -p -hc -RTS jh e False

total time  = 6.56 secs (328 ticks @ 20 ms)
total alloc = 4,086,951,472 bytes (excludes profiling overheads)

COST CENTRE        MODULE                   %time  %alloc
roundFunction      Data.Digest.JHInternal    28.4    37.4
word128Shift       Data.BigWord.Word128      14.9    19.7
blockMap           Data.Digest.JHInternal    11.9    12.9
getBytes           Data.Serialize.Get         6.7     2.4
unGet              Data.Serialize.Get         5.5     1.3
sbox               Data.Digest.JHInternal     4.0     7.4
getWord64be        Data.Serialize.Get         3.7     1.6
e8                 Data.Digest.JHInternal     3.7     0.0
swap4              Data.Digest.JHInternal     3.0     0.7
swap16             Data.Digest.JHInternal     3.0     0.7
swap8              Data.Digest.JHInternal     1.8     0.7
swap32             Data.Digest.JHInternal     1.8     0.7
parseBlock         Data.Digest.JHInternal     1.8     1.2
swap2              Data.Digest.JHInternal     1.5     0.7
swap1              Data.Digest.JHInternal     1.5     0.7
linearTransform    Data.Digest.JHInternal     1.5     8.6
shiftl_w64         Data.Serialize.Get         1.2     1.1

Detailed breakdown omitted ...
Now quickly about the JH algorithm:
It's a hash algorithm built around a compression function F8, which is repeated as long as there are input blocks (of length 512 bits) left; this is just how the SHA functions operate. The F8 function consists of the E8 function, which applies a round function 42 times. The round function itself consists of three parts: an sbox, a linear transformation, and a permutation (called swap in my code).
Thus it's reasonable that most of the time is spent in the round function. Still, I would like to know how those parts could be improved. For instance: the blockMap function is just a utility function, mapping a function over the elements of a 4-tuple. So why is it performing so badly? Any suggestions would be welcome, and not just on single functions; i.e., are there structural changes you would have made in order to improve the performance?
I have tried looking at the Core output, but unfortunately that's way over my head.
I attach some of the heap profiles at the end as well in case that could be of interest.
EDIT :
I forgot to mention my setup and build. I run it on a x86_64 Arch Linux machine, GHC 7.0.3-2 (I think), with compile options:
ghc --make -O2 -funbox-strict-fields
Unfortunately there seems to be a bug on the Linux platform when compiling via C or LLVM, giving me the error:
Error: .size expression for XXXX does not evaluate to a constant
so I have not been able to see the effect of that.
Switch to unboxed Vectors (from Array, used for constants); a short sketch follows this list
Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significant in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants, as these values will be explicit in the code and result in a more cache-friendly binary.
Unpack your Word128 values.
Define your own addition for Word128; don't lift Integer. See LargeWord for an example of how this can be done.
rem, not mod
Compile with optimization (-O2) and try LLVM (-fllvm)
EDIT: And cabalize your git repo along with a benchmark so we can help you more easily ;-). Good work on including a crypto-api instance.
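A minimal sketch of the first two points combined (the table contents below are placeholders, not JH's real constants):

import qualified Data.Vector.Unboxed as V
import Data.Word (Word64)

-- constants in an unboxed Vector instead of an Array
roundConstants :: V.Vector Word64
roundConstants = V.fromList (replicate 42 0)   -- placeholder values

-- unsafeIndex skips the bounds check that (!) performs
constantAt :: Int -> Word64
constantAt i = V.unsafeIndex roundConstants i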
The lower graph shows that a lot of memory is occupied by lists. Unless more are lurking in other modules, they can only come from e8. Maybe you'll have to bite the bullet and make that a loop instead of a fold, but for starters, since Block1024 is a pair, the foldl' doesn't do much evaluation on the fly (unless the strictness analyser has become significantly better). Try making it stricter, data Block1024 = B1024 !Block512 !Block512; perhaps it also needs {-# UNPACK #-} pragmas. In roundFunction, use rem instead of mod (this will have only minor impact, but it's a bit faster) and make the let bindings strict. In the swapN functions, you might get better performance by giving the constants in the form W x y rather than as 128-bit hex numbers.
I can't guarantee those changes will help, but that's what looks most promising after a short glance.
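A sketch of that stricter Block1024, with the {-# UNPACK #-} pragmas added (Block512 is shown here as a strict stand-in for the real type in the repository):

import Data.Word (Word64)

-- stand-in for the real Block512 in Data.Digest.JHInternal
data Block512 = B512 !Word64 !Word64 !Word64 !Word64

data Block1024 = B1024 {-# UNPACK #-} !Block512
                       {-# UNPACK #-} !Block512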
Ok, so I thought I would chime in with an update of what I have done and the results obtained thus far. Changes made:
Switched from Array to UnboxedArray (made Word128 an instance type)
Used UnboxedArray + fold in e8 instead of lists and (prelude) fold
Used unsafeIndex instead of !
Changed type of Block1024 to a real datatype (similar to Block512), and unpacked its arguments
Updated GHC to version 7.2.1 on Arch Linux, thus fixing the problem with compiling via C or LLVM
Switched mod to rem in some places, but NOT in roundFunction. When I do it there, compilation suddenly takes an awfully long time, and the run time becomes 10 times slower! Does anyone know why that might be? It only happens with GHC-7.2.1, not GHC-7.0.3
I compile with the following options:
ghc-7.2.1 --make -O2 -funbox-strict-fields main.hs ./Tests/testframe.hs -fvia-C -optc-O2
And the results? Roughly a 50% reduction in time. On an input of ~107 MB, the code now takes 3 minutes, compared to the previous 6-7 minutes. The C version uses 42 seconds.
Things I tried, but which didn't result in better performance:
Unrolled the e8 function like this:
e8 !h = go h 0
  where
    go !x !n
      | n == 42   = x
      | otherwise = go h' (n + 1)
      where
        !h' = roundFunction x n
Tried breaking up the swapN functions to use the underlying Word64s directly:

swap1 (W xh xl) =
      shiftL (W (xh .&. 0x5555555555555555) (xl .&. 0x5555555555555555)) 1
  .|. shiftR (W (xh .&. 0xaaaaaaaaaaaaaaaa) (xl .&. 0xaaaaaaaaaaaaaaaa)) 1
Tried using the LLVM backend
All of these attempts gave worse performance than what I have currently. I don't know if that's because I'm doing it wrong (especially the unrolling of e8), or because they just are worse options.
Still I have some new questions with these new tweaks.
Suddenly I have gotten a peculiar bump in memory usage. Take a look at the following heap profiles:
Why has this happened? Is it because of the UnboxedArray? And what does SYSTEM mean?
When I compile via C I get the following warning:
Warning: The -fvia-C flag does nothing; it will be removed in a future GHC release
Is this true? Why, then, do I see better performance using it rather than not?
It looks like you did a fair amount of tweaking already; I'm curious what the performance is like without explicit strictness annotations (BangPatterns) and the various compiler pragmas (UNPACK, INLINE)... Also, a dumb question: what optimization flags are you using?
Anyway, two suggestions which may be completely awful:
Use unboxed primitive types where you can (e.g. replace Data.Word.Word64 with GHC.Word.Word64#, make sure word128Shift is using Int#, etc.) to avoid heap allocation. This is, of course, non-portable. (A small sketch follows below.)
Try Data.Sequence instead of []
At any rate, rather than looking at the Core output, try looking at the intermediate C files (*.hc) instead. It can be hard to wade through, but sometimes makes it obvious where the compiler wasn't quite as sharp as you'd hoped.
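For what it's worth, a minimal sketch of the unboxed-primitives idea in its simplest form (written against modern GHC.Exts; the names here are illustrative, not from the JH code):

{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- a loop carried entirely in unboxed Int# values, so the accumulator
-- never touches the heap
sumTo :: Int -> Int
sumTo (I# m) = I# (go 0# 1#)
  where
    go acc i
      | isTrue# (i ># m) = acc
      | otherwise        = go (acc +# i) (i +# 1#)

main :: IO ()
main = print (sumTo 1000000)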

Analyzing slow performance of a Haskell program

I was trying to solve ITA Software's "Word Numbers" puzzle using a brute force approach. It looks like my Haskell version is more than 10 times slower than a C#/C++ version.
The answer
Thanks to Bryan O'Sullivan's answer, I was able to "correct" my program to acceptable performance. You can read his code which is much cleaner than mine. I am going to outline the key points here.
Int is Int64 on x64 Linux GHC. Unless you unsafeCoerce, you should just use Int; this saves you from having to fromIntegral. Doing Int64 on 32-bit Windows GHC is just darn slow; avoid it. (This is in fact not GHC's fault. As mentioned in my blog post below, 64-bit integers in 32-bit programs are slow in general, at least on Windows.)
-fllvm or -fvia-C for performance.
Prefer quotRem to divMod when quotRem already suffices; that gave me a 20% speed-up (see the short sketch below).
In general, prefer Data.Vector to Data.Array as an "array"
Use the worker-wrapper pattern liberally.
The above points were enough to give me about 100% boost over my original version.
In my blog post, I have detailed a step-by-step illustrated example of how I turned the original program to match Bryan's program. There are other points mentioned there as well.
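To illustrate the quotRem point: a single quotRem call produces both the quotient and the remainder, and quot/rem avoid the sign-adjustment work that div/mod do. A sketch (digits is a made-up helper, not part of the solution):

digits :: Int -> [Int]
digits 0 = []
digits n = let (q, r) = n `quotRem` 10   -- one division, two results
           in r : digits q

main :: IO ()
main = print (digits 51000000000)   -- least-significant digit first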
The original question
(This may sound like a "could you do the work for me" post, but I argue that such a concrete example would be very instructive since profiling Haskell performance is often seen as a myth)
(As noted in the comments, I think I have misinterpreted the problem. But who cares, we can focus on performance in a different problem)
Here's my quick recap of the problem:
A wordNumber is defined as
wordNumber 1 = "one"
wordNumber 2 = "onetwo"
wordNumber 3 = "onetwothree"
wordNumber 15 = "onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
...
Problem: Find the 51-billion-th letter of (wordNumber Infinity); assume that letter is found at 'wordNumber x', also find 'sum [1..x]'
From an imperative perspective, a naive algorithm would be to have 2 counters, one for sum of numbers and one for sum of lengths. Keep counting the length of each wordNumber and "break" to return the result.
The imperative brute-force approach is implemented in C# here: http://ideone.com/JjCb3. It takes about 1.5 minutes to find the answer on my computer. There is also a C++ implementation that runs in 45 seconds on my computer.
Then I implemented a brute-force Haskell version: http://ideone.com/ngfFq. It cannot finish the calculation in 5 minutes on my machine. (Irony: it has more lines than the C# version)
Here is the -p profile of the Haskell program: http://hpaste.org/49934
Question: How to make it perform comparatively to the C# version? Are there obvious mistakes I am making?
(Note: I am fully aware that brute-forcing it is not the correct solution to this problem. I am mainly interested in making the Haskell version perform comparatively to the C# version. Right now it is at least 5x slower so obviously I am missing something obvious)
(Note 2: It does not seem to be space leaking. The program runs with constant memory (about 2MB) on my computer)
(Note 3: I am compiling with ghc -O2 WordNumber.hs)
To make the question more reader friendly, I include the "gist" of the two versions.
// C#
long sumNum = 0;
long sumLen = 0;
long target = 51000000000;
long i = 1;
for (; i < 999999999; i++)
{
    // WordiLength(1) = 3 "one"
    // WordiLength(101) = 13 "onehundredone"
    long newLength = sumLen + WordiLength(i);
    if (newLength >= target)
        break;
    sumNum += i;
    sumLen = newLength;
}
Console.WriteLine(Wordify(i)[Convert.ToInt32(target - sumLen - 1)]);
-- Haskell
-- This has become totally ugly during my squeeze for performance

-- Tail recursive
-- n-th letter index (51000000000 in our problem) -> accumulated result -> list of 'zipped' numbers left to try
-- the accumulator has the format (sum of numbers, current length of the whole chain, the current number)
solve :: Int64 -> (Int64, Int64, Int64) -> [(Int64, Int64)] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) ((!num, !len):xs)
    | sumLen' >= n = (sumNum', sumLen, num)
    | otherwise    = solve n (sumNum', sumLen', num) xs
  where
    sumNum' = sumNum + num
    sumLen' = sumLen + len

-- wordLength 1 = 3 "one"
-- wordLength 101 = 13 "onehundredone"
wordLength :: Int64 -> Int64
-- wordLength = ...

solution :: Int64 -> (Int64, Char)
solution !x =
    let (sumNum, sumLen, n) = solve x (0, 0, 1) (map (\n -> (n, wordLength n)) [1..])
    in  (sumNum, wordify n !! fromIntegral (x - sumLen - 1))
I've written a gist that contains both a C++ version (a copy of yours from a Haskell-cafe message, with a bug fixed) and a Haskell translation.
Notice that the two are structurally almost identical. When compiled with -fllvm, the Haskell code runs at about half the speed of the C++ code, which is pretty good.
Now let's compare my Haskell wordLength code to yours. You're passing around an extra parameter that's unnecessary (you apparently figured that out when writing the C++ code that I translated). Also, the large number of bang patterns suggests panic; they're almost all useless.
Your solve function is also very confused.
You're passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
This function is necessarily not very regular in its behaviour, so while you gain nothing stylistically by using a list to supply your counter, you probably force GHC to allocate memory. In other words, this both obfuscates the code and makes it slower.
By using a tuple for three parameters (for no obvious reason), you're again working hard to force GHC to allocate memory for every step through the loop, when it could avoid doing so if you passed the parameters directly (a sketch follows this list).
Only your n parameter is dealt with in a sensible way, but you don't need a bang pattern on it.
The only parameter that needs a bang pattern is sumNum, because you never inspect its value until after the loop has finished. GHC's strictness analyser will deal with the others. All of your other bang patterns are unnecessary at best, misdirections at worst.
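In that spirit, here is a hedged sketch of the same loop with the parameters passed directly, following the advice above (wordLength is stubbed out so the fragment stands alone; the real one is the questioner's function):

{-# LANGUAGE BangPatterns #-}
import Data.Int (Int64)

-- stub standing in for the questioner's real wordLength
wordLength :: Int64 -> Int64
wordLength _ = 3

-- same loop as the questioner's solve, but with plain accumulator
-- arguments instead of a tuple and a list of counters
solve :: Int64 -> Int64 -> Int64 -> Int64 -> (Int64, Int64, Int64)
solve n !sumNum sumLen num
    | sumLen' >= n = (sumNum + num, sumLen, num)
    | otherwise    = solve n (sumNum + num) sumLen' (num + 1)
  where
    sumLen' = sumLen + wordLength num

main :: IO ()
main = print (solve 1000 0 0 1)   -- small target so the stubbed demo is quick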
Here are two pointers I could come up with in a quick investigation:
Note that using Int64 is really slow when you are using a 32-bit build of GHC, as is currently the default for the Haskell Platform. This also turned out to be the main villain in a previous performance problem (where I give a few more details).
For reasons I don't quite understand, the divMod function does not seem to get inlined. As a result, the numbers are returned on the heap. When using div and mod separately, wordLength' executes purely on the stack as it should.
Sadly I currently have no 64-bit GHC around to test whether this is enough to solve the problem.
