Haskell - simple way to cache a function call - performance

I have functions like:
millionsOfCombinations = [[a, b, c, d] |
a <- filter (...some filter...) someListOfAs,
b <- (...some other filter...) someListOfBs,
c <- someListOfCs, d <- someListOfDs]
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
comb1 <- millionsOfCombinations,
comb2 <- millionsOfCombinations,
comb3 <- someList,
...around 10 function calls to find if
[comb1, comb2, comb3] is actually useful]
Evaluating millionsOfCombinations takes 40s. on a very fast workstation. Evaluating aLotOfCombinationsOfCombinations!!0 took 2 days :-(
How can I speed up this code? So far I've had 2 ideas - use a profiler. Tried running myapp +RTS -sstderr after compiling with GHC, but get a blank screen and don't want to wait days for it to finish.
2nd thought was to somehow cache millionsOfCombinations. Do I understand correctly that for each value in aLotOfCombinationsOfCombinations, millionsOfCombinations gets evaluated multiple times? If that is so, how can I cache the result? Obviously I've just started learning Haskell. I know there is a way to do call caching with a monad, but I still don't understand those things.

Use the -fforce-recomp, -O2 and -fllvm flags
If you aren't already, be sure to use the above flags. I wouldn't normally mention it, but I've seen some questions recently that didn't know powerful optimization isn't a default.
Profile Your Code
The -sstderr flag isn't exactly profiling. When people say profiling they're usually talking about either heap profiling or time profiling via -prof and -auto-all flags.
Avoid Costly Primitives
If you need the entire list in memory (i.e. it isn't going to be optimized away) then consider unboxed vectors. If Int will do instead of Integer, consider that (but Integer is a reasonable default when you don't know!). Use worker/wrapping transforms at the right times. If you're leaning heavily on Data.Map, try using Data.HashMap from the unordered-containers library. This list can go on and on, but since you don't already have an intuition on where your computation time is going the profiling should come first!

I think, that there is no way. Please notice, that the time to generate the list is growing with each list involved. So you get around 10000003 combinations to check, which indeed takes a lot of time. Caching the list ist possible but is unlikely to change anything, since new elements can be generated almost instantly. The only way is probably to change the algorithm.
If millionsOfCombinations is a constant (and not a function with arguments), it is cached automatically. Else, make it a constant by using a where clause:
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
comb1 <- millionsOfCombinations,
comb2 <- millionsOfCombinations,
comb3 <- someList,
...around 10 function calls to find if
[comb1, comb2, comb3] is actually useful] where
millionsOfCombinations = makeCombination xyz

Related

How can this function in Haskell be optimised

As part of an advent of code challenge, I've written the following functions in Haskell:
simulateUntilRepeat_int a b i = if (a /= b) then (simulateUntilRepeat_int a (updateCycle b) (i+1)) else i
simulateUntilRepeat a = simulateUntilRepeat_int a (updateCycle a) 1
The purpose of this is to take a list of moons and simulate their movement until they resume their original position, returning the number of cycles it took for them to get there. (the function updateCycle does one iteration of the simulation). However, when I attempt to run this it uses all available memory and then gets killed by the operating system. The question does admit that this may take a very large number of cycles.
Googling around about this problem I find the usual fix is to make some of the parameters strict, but I think I've experimented with all possible permutations of strictness on the parameters to no avail. By the looks of this function, I'd have anticipated the compiler would be able to use the tail recursion optimisation and turn it into a loop, but this seems to not be happening somehow.
A friend of mine, who is knowledgeable in haskell suggested changing the form of the function to the following:
f a b0 = length (takeWhile (/= a) (iterate updateCycle b0))
But doing this didn't fix it either, leaving me out of ideas.
The comments are undoubtedly correct that your approach is not the intended solution method.
However, the functions you've posted would not, in and of themselves, cause a memory leak, fail to tail recurse, or lead to poor performance. Given your code above plus the definitions:
updateCycle 4686774942 = 0
updateCycle n = n+1
main = do
print $ simulateUntilRepeat (0 :: Int)
and compiling with -O2, the program runs in constant memory on my laptop in about 30 seconds. Adding explicit type signatures to use Int in place of Integer for the iteration count:
simulateUntilRepeat_int :: Int -> Int -> Int -> Int
simulateUntilRepeat :: Int -> Int
it runs in about 2.4 seconds.
So, to understand why your program is gobbling all available memory or why your strictness annotations failed to make a difference, it would probably be necessary to see the whole working program (or preferably a minimal example that illustrates the performance problem). If the program is short, and the question is "why is the performance of this program totally unreasonable?" instead of "how can I optimize my program to run as fast as possible?", it might still be a good SO question. Otherwise, the Code Review site might be better -- you can post a larger program there and ask for general performance advice, and that's considered on-topic for that site.

Replicate() versus a for loop?

Does anyone know how the replicate() function works in R and how efficient it is relative to using a for loop?
For example, is there any efficiency difference between...
means <- replicate(100000, mean(rnorm(50)))
And...
means <- c()
for(i in 1:100000) {
means <- c(means, mean(rnorm(50)))
}
(I may have typed something slightly off above, but you get the idea.)
You can just benchmark the code and get your answer empirically. Note that I also added a second for loop flavor which circumvents the growing vector problem by preallocating the vector.
repl_function = function(no_rep) means <- replicate(no_rep, mean(rnorm(50)))
for_loop = function(no_rep) {
means <- c()
for(i in 1:no_rep) {
means <- c(means, mean(rnorm(50)))
}
means
}
for_loop_prealloc = function(no_rep) {
means <- vector(mode = "numeric", length = no_rep)
for(i in 1:no_rep) {
means[i] <- mean(rnorm(50))
}
means
}
no_loops = 50e3
benchmark(repl_function(no_loops),
for_loop(no_loops),
for_loop_prealloc(no_loops),
replications = 3)
test replications elapsed relative user.self sys.self
2 for_loop(no_loops) 3 18.886 6.274 17.803 0.894
3 for_loop_prealloc(no_loops) 3 3.209 1.066 3.189 0.000
1 repl_function(no_loops) 3 3.010 1.000 2.997 0.000
user.child sys.child
2 0 0
3 0 0
1 0 0
Looking at the relative column, the un-preallocated for loop is 6.2 times slower. However, the preallocated for loop is just as fast as replicate.
replicate is a wrapper for sapply, which itself is a wrapper for lapply. lapply is ultimately an .Internal function that is written in C and performs the looping in an optimised way, rather than through the interpreter. It's main advantages are efficient memory management, especially compared to the highly inefficient vector growing method you present above.
I have a very different experience with replicate which also confuses me. It often happens that my R crashes and my laptop hangs when I use replicate compared to for and this surprises me, as for the reasons mentioned above, I also expected a C-written function to outperform the for loop. For example, if you execute the functions below, you'll see that for loop is faster than replicate
system.time(for (i in 1:10) runif(1e7))
# user system elapsed
# 3.340 0.218 3.558
system.time(replicate(10, runif(1e7)))
# user system elapsed
# 4.622 0.484 5.109
so with 10 replicates, the for loop is clearly faster. If you repeat it for 100 replicates you get similar results. So I wonder if anyone can come with an example that shows its practical privileges compared to for.
PS I also created a function for the runif(1e7) and that made no difference in the comparison. Basically I failed to come with any example that shows the advantage of replicate.
Vectorization is the key difference between them. I will tray to explain this point. R is an high-level-interpreted computer language. It takes care of many basic computer tasks for you. When you write
x <- 2.0
you don’t have to tell your computer that
“2.0” is a floating-point number;
“x” should store numeric-type data;
it has to find a place in memory to put “5”;
it has to register “x” as a pointer to a certain place in memory.
R figures these things by itself.
But, for such comfortable issue, there is a price: it is slower than low level languages.
In C or FORTRAN, much of this "test if" would be accomplished during the compilation step, not during the program execution. They are translated into binary computer language (0/1) after they are written, BUT before they are run. This allows the compiler to organize the binary machine code in an optimal way for the computer to interpret.
What does this have to do with vectorization in R? Well, many R functions are actually written in a a compiled language, such as C, C++, and FORTRAN, and have a small R “wrapper”. This is the difference between yours approach. for loops add further test if operations that the machine has to do on data, making it slower

How to make my Haskell program faster? Comparison with C

I'm working on an implementation of one of the SHA3 candidates, JH. I'm at the point where the algorithm pass all KATs (Known Answer Tests) provided by NIST, and have also made it an instance of the Crypto-API. Thus I have began looking into its performance. But I'm quite new to Haskell and don't really know what to look for when profiling.
At the moment my code is consistently slower then the reference implementation written in C, by a factor of 10 for all input lengths (C code found here: http://www3.ntu.edu.sg/home/wuhj/research/jh/jh_bitslice_ref64.h).
My Haskell code is found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/JHInternal.hs.
Now I don't expect you to wade through all my code, rather I would just want some tips on a couple of functions. I have run some performance tests and this is (part of) the performance file generated by GHC:
Tue Oct 25 19:01 2011 Time and Allocation Profiling Report (Final)
main +RTS -sstderr -p -hc -RTS jh e False
total time = 6.56 secs (328 ticks # 20 ms)
total alloc = 4,086,951,472 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
roundFunction Data.Digest.JHInternal 28.4 37.4
word128Shift Data.BigWord.Word128 14.9 19.7
blockMap Data.Digest.JHInternal 11.9 12.9
getBytes Data.Serialize.Get 6.7 2.4
unGet Data.Serialize.Get 5.5 1.3
sbox Data.Digest.JHInternal 4.0 7.4
getWord64be Data.Serialize.Get 3.7 1.6
e8 Data.Digest.JHInternal 3.7 0.0
swap4 Data.Digest.JHInternal 3.0 0.7
swap16 Data.Digest.JHInternal 3.0 0.7
swap8 Data.Digest.JHInternal 1.8 0.7
swap32 Data.Digest.JHInternal 1.8 0.7
parseBlock Data.Digest.JHInternal 1.8 1.2
swap2 Data.Digest.JHInternal 1.5 0.7
swap1 Data.Digest.JHInternal 1.5 0.7
linearTransform Data.Digest.JHInternal 1.5 8.6
shiftl_w64 Data.Serialize.Get 1.2 1.1
Detailed breakdown omitted ...
Now quickly about the JH algorithm:
It's a hash algorithm which consists of a compression function F8, which is repeated as long as there exists input blocks (of length 512 bits). This is just how the SHA-functions operate. The F8 function consists of the E8 function which applies a round function 42 times. The round function itself consists of three parts:
a sbox, a linear transformation and a permutation (called swap in my code).
Thus it's reasonable that most of the time is spent in the round function. Still I would like to know how those parts could be improved. For instance: the blockMap function is just a utility function, mapping a function over the elements in a 4-tuple. So why is it performing so badly? Any suggestions would be welcome, and not just on single functions, i.e. are there structural changes you would have done in order to improve the performance?
I have tried looking at the Core output, but unfortunately that's way over my head.
I attach some of the heap profiles at the end as well in case that could be of interest.
EDIT :
I forgot to mention my setup and build. I run it on a x86_64 Arch Linux machine, GHC 7.0.3-2 (I think), with compile options:
ghc --make -O2 -funbox-strict-fields
Unfortunately there seems to be a bug on the Linux plattform when compiling via C or LLVM, giving me the error:
Error: .size expression for XXXX does not evaluate to a constant
so I have not been able to see the effect of that.
Switch to unboxed Vectors (from Array, used for constants)
Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significat in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants as these values will be explicit in the code and result in a more cache friendly binary.
Unpack your Word128 values.
Define your own addition for Word128, don't lift Integer. See LargeWord for an example of how this can be done.
rem not mod
Compile with optimization (-O2) and try llvm (-fllvm)
EDIT: And cabalize your git repo along with a benchmark so we can help you easier ;-). Good work on including a crypto-api instance.
The lower graph shows that a lot of memory is occupied by lists. Unless there are more lurking in other modules, they can only come from e8. Maybe you'll have to bite the bullet and make that a loop instead of a fold, but for starters, since Block1024 is a pair, the foldl' doesn't do much evaluation on the fly (unless the strictness analyser has become significantly better). Try making that stricter, data Block1024 = B1024 !Block512 !Block512, perhaps it also needs {-# UNPACK #-} pragmas. In roundFunction, use rem instead of mod (this will only have minor impact, but it's a bit faster) and make the let bindings strict. In the swapN functions, you might get better performance giving the constants in the form W x y rather than as 128-bit hex numbers.
I can't guarantee those changes will help, but that's what looks most promising after a short glance.
Ok, so I thought I would chime in with an update of what I have done and the results obtained thus far. Changes made:
Switched from Array to UnboxedArray (made Word128 an instance type)
Used UnboxedArray + fold in e8 instead of lists and (prelude) fold
Used unsafeIndex instead of !
Changed type of Block1024 to a real datatype (similiar to Block512), and unpacked its arguments
Updated GHC to version 7.2.1 on Arch Linux, thus fixing the problem with compiling via C or LLVM
Switched mod to rem in some places, but NOT in roundFunction. When I do it there, the compile time suddenly takes an awful lot of time, and the run time becomes 10 times slower! Does anyone know why that may be? It is only happening with GHC-7.2.1, not GHC-7.0.3
I compile with the following options:
ghc-7.2.1 --make -O2 -funbox-strict-fields main.hs ./Tests/testframe.hs -fvia-C -optc-O2
And the results? Roughly 50 % reduction in time. On an input of ~107 MB, the code now use 3 minutes as compared to the previous 6-7 minutes. The C version uses 42 seconds.
Things I tried, but which didn't result in better performance:
Unrolled the e8 function like this:
e8 !h = go h 0
where go !x !n
| n == 42 = x
| otherwise = go h' (n + 1)
where !h' = roundFunction x n
Tried breaking up the swapN functions to use the underlying Word64' directly:
swap1 (W xh hl) =
shiftL (W (xh .&. 0x5555555555555555) (xl .&. 0x5555555555555555)) 1
.|.
shiftR (W (xh .&. 0xaaaaaaaaaaaaaaaa) (xl .&. 0xaaaaaaaaaaaaaaaa)) 1
Tried using the LLVM backend
All of these attempts gave worse performance than what I have currently. I don't know if thats because I'm doing it wrong (especially the unrolling of e8), or because they just are worse options.
Still I have some new questions with these new tweaks.
Suddenly I have gotten this peculiar bump in memory usage. Take a look at following heap profiles:
Why has this happened? Is it because of the UnboxedArray? And what does SYSTEM mean?
When I compile via C I get the following warning:
Warning: The -fvia-C flag does nothing; it will be removed in a future GHC release
Is this true? Why then, do I see better performance using it, rather than not?
It looks like you did a fair amount of tweaking already; I'm curious what the performance is like without explicit strictness annotations (BangPatterns) and the various compiler pragmas (UNPACK, INLINE)... Also, a dumb question: what optimization flags are you using?
Anyway, two suggestions which may be completely awful:
Use unboxed primitive types where you can (e.g. replace Data.Word.Word64 with GHC.Word.Word64#, make sure word128Shift is using Int#, etc.) to avoid heap allocation. This is, of course, non-portable.
Try Data.Sequence instead of []
At any rate, rather than looking at the Core output, try looking at the intermediate C files (*.hc) instead. It can be hard to wade through, but sometimes makes it obvious where the compiler wasn't quite as sharp as you'd hoped.

Are Haskell List Comprehensions Inefficient?

I started doing Project Euler and got to problem number 9. Since I was using Project Euler to learn Haskell, I decided to use list comprehensions (as shown in Learn You A Haskell). I do that and GHCI takes awhile to figure out the triplet, which I figured is normal because of the calculations involved. Now, at work yesterday (I don't work as a programmer professionally, yet) I was talking to a friend who knows VBA and he wanted to try to find the answers in VBA. I thought it would be a fun challenge as well, and I churn out some basic for loops and if statements, but what got me was that it was much faster than Haskell was.
My question is: are Haskell's list comprehension incredibly inefficient? At first I thought it was just because I was in GHC's interactive mode, but then I realized VBA is interpreted too.
Please note, I didn't post my code because of it being an answer to project euler. If it will answer my question (as in I'm doing something wrong) then I will gladly post the code.
[edit]
Here is my Haskell list comprehension:
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c=1000, a^2+b^2=c^2]
I guess I could've lowered the range on c but is that what is really slowing it down?
There are two things you could be doing with this problem that could make your code slow. One is how you are trying values for a, b and c. If you loop through all possible values for a, b, c from 1 to 1000, you'll be spending a long time. To give a hint, you can make use of a+b+c=1000 if you rearrange it for c. The other is that if you only use a list comprehension, it will process every possible value for a, b and c. The problem tells you that there is only one unique set of numbers that satisfies the problem, so if you change your answer from this:
[ a * b * c | .... ]
to:
head [ a * b * c | ... ]
then Haskell's lazy evaluation means that it will stop after finding the first answer. This is the Haskell equivalent of breaking out of your VBA loop when you find the first answer. When I used both these tips, I had an answer that completed very quickly (under a second) in ghci.
Addendum: I missed at first the condition a < b < c. You can also make use of this in your list comprehensions; it is valid to say things along the lines of:
[(a, b) | b <- [1..100], a <- [1..b-1]]
Consider this simplified version of your list comprehension:
[(a,b,c) | a <- [1..1000], b <- [1..1000], c <- [1..1000]]
This will give all possible combinations of a, b, and c. It's kind of like saying, "how many ways can three one-thousand-sided dice land?" The answer is 1000*1000*1000 = 1,000,000,000 different combinations. If it took 0.001 seconds to generate each combination, it would therefore take 1,000,000 seconds (~11.5 days) to finish all combinations. (OK, 0.001 seconds is actually pretty slow for a computer, but you get the idea)
When you add predicates to your list comprehension, it still takes the same amount of time to compute; in fact, it takes longer since it needs to check the predicate for each of the 1 billion combinations it computes.
Now consider your comprehension. It looks like it should be much faster, right?
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c=1000, a^2+b^2=c^2]
There are 1000 choices for c. How many are there for b and a? Well, the average choice for c is 500. For all choices of c, then, there are an average of 500 choices for b (since b can range from 1 to c). Likewise, for all choices of c and b, there are an average of 250 choices for a. That's very hand-wavy, but I'm fairly sure it's accurate. So 1000 choices for c * 1000/2 choices for b * 1000/4 choices for a = 1 billion / 8 ~= 100 million. It's 8x faster, but if you paid attention, you'll notice it's actually the same big-Oh complexity as the simplified version above. If we compared "simplified" vs "improved" versions of the same problem, but from [1..100000] instead of [1..1000], the "improved" would still only be 8x faster than the "simplified".
Don't get me wrong, 8x is a wonderful constant-factor speedup. But unless you want to wait a couple hours to get the solution, you'll need to get a better big-Oh.
As Neil noted, the way to reduce the complexity of this problem is, for a given b and c, choose the a that satisfies a+b+c=1000. That way, you're not trying a bunch of as that will fail. This will drop the big-Oh complexity; you'll only be considering approximately 1000 * 500 * 1 = 500,000 combinations, instead of ~100,000,000.
Once you get the solution to the problem you can check out other peoples versions of Haskell solutions on the Project Euler site to get an idea of how other people have solved the problem. Incidentally, here is a link to the referenced problem: http://projecteuler.net/index.php?section=problems&id=9
In addition to what everyone else has said about generating fewer elements in the generators, you can also switch to using Int instead of Integer as the type of the numbers. The default is Integer, but your numbers are small enough to fit in an Int.
(Also, to nitpick, Haskell list comprehensions have no speed. Haskell is a language definition with very little operational semantics. A particular Haskell implementation might have slow list comprehensions, though.)

Mathematica running out of memory

I'm trying to run the following program, which calculates roots of polynomials of degree up to d with coefficients only +1 or -1, and then store it into files.
d = 20; n = 18000;
f[z_, i_] := Sum[(2 Mod[Floor[(i - 1)/2^k], 2] - 1) z^(d - k), {k, 0, d}];
Here f[z,i] gives a polynomial in z with plus or minus signs counting in binary. Say d=2, we would have
f[z,1] = -z2 - z - 1
f[z,2] = -z2 - z + 1
f[z,3] = -z2 + z - 1
f[z,4] = -z2 + z + 1
DistributeDefinitions[d, n, f]
ParallelDo[
Do[
root = N[Root[f[z, i], j]];
{a, b} = Round[n ({Re[root], Im[root]}/1.5 + 1)/2];
{i, 1, 2^d}],
{j, 1, d}]
I realise reading this probably isn't too enjoyable, but it's relatively short anyway. I would've tried to cut down to the relevant parts, but here I really have no clue what the trouble is. I'm calculating all roots of f[z,i], and then just round them to make them correspond to a point in a n by n grid, and save that data in various files.
For some reason, the memory usage in Mathematica creeps up until it fills all the memory (6 GB on this machine); then the computation continues extremely slowly; why is this?
I am not sure what is using up the memory here - my only guess was the stream of files used up memory, but that's not the case: I tried appending data to 2GB files and there was no noticeable memory usage for that. There seems to be absolutely no reason for Mathematica to be using large amounts of memory here.
For small values of d (15 for example), the behaviour is the following: I have 4 kernels running. As they all run through the ParallelDo loop (each doing a value of j at a time), the memory use increases, until they all finish going through that loop once. Then the next times they go through that loop, the memory use does not increase at all. The calculation eventually finishes and everything is fine.
Also, quite importantly, once the calculation stops, the memory use does not go back down.
If I start another calculation, the following happens:
-If the previous calculation stopped when memory use was still increasing, it continues to increase (it might take a while to start increasing again, basically to get to the same point in the computation).
-If the previous calculation stopped when memory use was not increasing, it does not increase further.
Edit: The issue seems to come from the relative complexity of f - changing it into some easier polynomial seems to fix the issue. I thought the problem might be that Mathematica remembers f[z,i] for specific values of i, but setting f[z,i] :=. just after calculating a root of f[z,i] complains that the assignment did not exist in the first place, and the memory is still used.
It's quite puzzling really, as f is the only remaining thing I can imagine taking up memory, but defining f in the inner Do loop and clearing it each time after a root is calculated does not solve the problem.
Ouch, this is a nasty one.
What's going on is that N will do caching of results in order to speed up future calculations if you need them again. Sometimes this is absolutely what you want, but sometimes it just breaks the world. Fortunately, you do have some options. One is to use the ClearSystemCache command, which does just what it said on the tin. After I ran your un-parallelized loop for a little while (before getting bored and aborting the calculation), MemoryInUse reported ~160 MiB in use. Using ClearSystemCache got that down to about 14 MiB.
One thing you should look at doing, instead of calling ClearSystemCache programmatically, is to use SetSystemOptions to change the caching behavior. You should take a look at SystemOptions["CacheOptions"] to see what the possibilities are.
EDIT: It's not terribly surprising that the caching causes a bigger problem for more complex expressions. It's got to be stashing copies of those expressions somewhere, and more complex expressions require more memory.

Resources