How can this function in Haskell be optimised - performance

As part of an advent of code challenge, I've written the following functions in Haskell:
simulateUntilRepeat_int a b i = if (a /= b) then (simulateUntilRepeat_int a (updateCycle b) (i+1)) else i
simulateUntilRepeat a = simulateUntilRepeat_int a (updateCycle a) 1
The purpose of this is to take a list of moons and simulate their movement until they resume their original position, returning the number of cycles it took for them to get there. (the function updateCycle does one iteration of the simulation). However, when I attempt to run this it uses all available memory and then gets killed by the operating system. The question does admit that this may take a very large number of cycles.
Googling around about this problem I find the usual fix is to make some of the parameters strict, but I think I've experimented with all possible permutations of strictness on the parameters to no avail. By the looks of this function, I'd have anticipated the compiler would be able to use the tail recursion optimisation and turn it into a loop, but this seems to not be happening somehow.
A friend of mine, who is knowledgeable in haskell suggested changing the form of the function to the following:
f a b0 = length (takeWhile (/= a) (iterate updateCycle b0))
But doing this didn't fix it either, leaving me out of ideas.

The comments are undoubtedly correct that your approach is not the intended solution method.
However, the functions you've posted would not, in and of themselves, cause a memory leak, fail to tail recurse, or lead to poor performance. Given your code above plus the definitions:
updateCycle 4686774942 = 0
updateCycle n = n+1
main = do
print $ simulateUntilRepeat (0 :: Int)
and compiling with -O2, the program runs in constant memory on my laptop in about 30 seconds. Adding explicit type signatures to use Int in place of Integer for the iteration count:
simulateUntilRepeat_int :: Int -> Int -> Int -> Int
simulateUntilRepeat :: Int -> Int
it runs in about 2.4 seconds.
So, to understand why your program is gobbling all available memory or why your strictness annotations failed to make a difference, it would probably be necessary to see the whole working program (or preferably a minimal example that illustrates the performance problem). If the program is short, and the question is "why is the performance of this program totally unreasonable?" instead of "how can I optimize my program to run as fast as possible?", it might still be a good SO question. Otherwise, the Code Review site might be better -- you can post a larger program there and ask for general performance advice, and that's considered on-topic for that site.

Related

Factorial using recursion(backtracking vs a new approach)

I am trying to learn recursion. for the starting problem of it , which is calculating factorial of a number I have accomplished it using two methods.
the first one being the normal usual approach.
the second one i have tried to do something different.
in the second one i return the value of n at the end rather than getting the starting value as in the first one which uses backtracking.
my question is that does my approach has any advantages over backtracking?
if asked to chose which one would be a better solution?
//first one is ,
ll factorial(int n)
{
if(n==1)
return 1 ;
return n*factorial(n-1) ;
}
int main()
{
factorial(25) ;
return 0 ;
}
// second one is ,
ll fact(ll i,ll n)
{
if(i==0)return n ;
n=(n*i)
i--;
n=fact(i,n);
}
int main()
{
int n ;
cin>>n ;
cout<<fact(n,1) ;
return 0 ;
}
// ll is long long int
First of all I want to point out that premature optimization at the expense of readibility is almost always a mistake, especially when the need for optimization comes from intuition and not from measurements.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%" - Donald Knuth
But let's say we care about the 3% in this case, because all our program ever does is compute lots of factorials. To keep it short: You will never be smarter than the compiler.
If this seems slightly crazy then this definitely applies to you and you should stop thinking about 'micromanaging/optimizing your code'. If you are a very skilled C++ programmer this will still apply to you in most cases, but you will recognize the opportunities to help your compiler out.
To back this up with some fact, we can compile the code (with automatic optimization) and (roughly) compare the assembly output. I will use the wonderful website godbolt.org
Don't be discouraged by the crazy assembler code, we don't need to understand it. But we can see that both methods
are basically the same length when compiled as assembler code
contain almost the same instructions
So to recap, readability should be your number one priority. In case a speed measurement shows that this one part of your code really is a big performance problem, really think about if you can make a change that structurally improves the algorithm (i.e. by decreasing the complexity). Otherwise your compiler will take care of it for you.

What's more costly on current CPUs: arithmetic operations or conditionals?

20-30 years ago arithmetic operations like division were one of the most costly operations for CPUs. Saving one division in a piece of repeatedly called code was a significant performance gain. But today CPUs have fast arithmetic operations and since they heavily use instruction pipelining, conditionals can disrupt efficient execution. If I want to optimize code for speed, should I prefer arithmetic operations in favor of conditionals?
Example 1
Suppose we want to implement operations modulo n. What will perform better:
int c = a + b;
result = (c >= n) ? (c - n) : c;
or
result = (a + b) % n;
?
Example 2
Let's say we're converting 24-bit signed numbers to 32-bit. What will perform better:
int32_t x = ...;
result = (x & 0x800000) ? (x | 0xff000000) : x;
or
result = (x << 8) >> 8;
?
All the low hanging fruits are already picked and pickled by authors of compilers and guys who build hardware. If you are the kind of person who needs to ask such question, you are unlikely to be able to optimize anything by hand.
While 20 years ago it was possible for a relatively competent programmer to make some optimizations by dropping down to assembly, nowadays it is the domain of experts, specializing in the target architecture; also, optimization requires not only knowing the program, but knowing the data it will process. Everything comes down to heuristics, tests under different conditions etc.
Simple performance questions no longer have simple answers.
If you want to optimise for speed, you should just tell your compiler to optimise for speed. Modern compilers will generally outperform you in this area.
I've sometimes been surprised trying to relate assembly code back to the original source for this very reason.
Optimise your source code for readability and let the compiler do what it's best at.
I expect that in example #1, the first will perform better. The compiler will probably apply some bit-twiddling trick to avoid a branch. But you're taking advantage of knowledge that it's extremely unlikely that the compiler can deduce: namely that the sum is always in the range [0:2*n-2] so a single subtraction will suffice.
For example #2, the second way is both faster on modern CPUs and simpler to follow. A judicious comment would be appropriate in either version. (I wouldn't be surprised to see the compiler convert the first version into the second.)

What are some obvious optimizations for a virtual machine implementing a functional language?

I'm working on an intermediate language and a virtual machine to run a functional language with a couple of "problematic" properties:
Lexical namespaces (closures)
Dynamically growing call stack
A slow integer type (bignums)
The intermediate language is stack based, with a simple hash-table for the current namespace. Just so you get an idea of what it looks like, here's the McCarthy91 function:
# McCarthy 91: M(n) = n - 10 if n > 100 else M(M(n + 11))
.sub M
args
sto n
rcl n
float 100
gt
.if
.sub
rcl n
float 10
sub
.end
.sub
rcl n
float 11
add
list 1
rcl M
call-fast
list 1
rcl M
tail
.end
call-fast
.end
The "big loop" is straightforward:
fetch an instruction
increment the instruction pointer (or program counter)
evaluate the instruction
Along with sto, rcl and a whole lot more, there are three instructions for function calls:
call copies the namespace (deep copy) and pushes the instruction pointer onto the call stack
call-fast is the same, but only creates a shallow copy
tail is basically a 'goto'
The implementation is really straightforward. To give you a better idea, here's just a random snippet from the middle of the "big loop" (updated, see below)
} else if inst == 2 /* STO */ {
local[data] = stack[len(stack) - 1]
if code[ip + 1][0] != 3 {
stack = stack[:len(stack) - 1]
} else {
ip++
}
} else if inst == 3 /* RCL */ {
stack = append(stack, local[data])
} else if inst == 12 /* .END */ {
outer = outer[:len(outer) - 1]
ip = calls[len(calls) - 1]
calls = calls[:len(calls) - 1]
} else if inst == 20 /* CALL */ {
calls = append(calls, ip)
cp := make(Local, len(local))
copy(cp, local)
outer = append(outer, &cp)
x := stack[len(stack) - 1]
stack = stack[:len(stack) - 1]
ip = x.(int)
} else if inst == 21 /* TAIL */ {
x := stack[len(stack) - 1]
stack = stack[:len(stack) - 1]
ip = x.(int)
The problem is this: Calling McCarthy91 16 times with a value of -10000 takes, near as makes no difference, 3 seconds (after optimizing away the deep-copy, which adds nearly a second).
My question is: What are some common techniques for optimizing interpretation of this kind of language? Is there any low-hanging fruit?
I used slices for my lists (arguments, the various stacks, slice of maps for the namespaces, ...), so I do this sort of thing all over the place: call_stack[:len(call_stack) - 1]. Right now, I really don't have a clue what pieces of code make this program slow. Any tips will be appreciated, though I'm primarily looking for general optimization strategies.
Aside:
I can reduce execution time quite a bit by circumventing my calling conventions. The list <n> instruction fetches n arguments of the stack and pushes a list of them back onto the stack, the args instruction pops off that list and pushes each item back onto the stack. This is firstly to check that functions are called with the correct number of arguments and secondly to be able to call functions with variable argument-lists (i.e. (defun f x:xs)). Removing that, and also adding an instruction sto* <x>, which replaces sto <x>; rcl <x>, I can get it down to 2 seconds. Still not brilliant, and I have to have this list/args business anyway. :)
Another aside (this is a long question I know, sorry):
Profiling the program with pprof told me very little (I'm new to Go in case that's not obvious) :-). These are the top 3 items as reported by pprof:
16 6.1% 6.1% 16 6.1% sweep pkg/runtime/mgc0.c:745
9 3.4% 9.5% 9 3.4% fmt.(*fmt).fmt_qc pkg/fmt/format.go:323
4 1.5% 13.0% 4 1.5% fmt.(*fmt).integer pkg/fmt/format.go:248
These are the changes I've made so far:
I removed the hash table. Instead I'm now passing just pointers to arrays, and I only efficiently copy the local scope when I have to.
I replaced the instruction names with integer opcodes. Before, I've wasted quite a bit of time comparing strings.
The call-fast instruction is gone (the speedup wasn't measurable anymore after the other changes)
Instead of having "int", "float" and "str" instructions, I just have one eval and I evaluate the constants at compile time (compilation of the bytecode that is). Then eval just pushes a reference to them.
After changing the semantics of .if, I could get rid of these pseudo-functions. it's now .if, .else and .endif, with implicit gotos ànd block-semantics similar to .sub. (some example code)
After implementing the lexer, parser, and bytecode compiler, the speed went down a little bit, but not terribly so. Calculating MC(-10000) 16 times makes it evaluate 4.2 million bytecode instructions in 1.2 seconds. Here's a sample of the code it generates (from this).
The whole thing is on github
There are decades of research on things you can optimize:
Implementing functional languages: a tutorial, Simon Peyton Jones and David Lester. Published by Prentice Hall, 1992.
Practical Foundations for Programming Languages, Robert Harper
You should have efficient algorithmic representations for the various concepts of your interpreter. Doing deep copies on a hashtable looks like a terrible idea, but I see that you have already removed that.
(Yes, your stack-popping operation using array slices look suspect. You should make sure they really have the expected algorithmic complexity, or else use a dedicated data structure (... a stack). I'm generally wary of using all-purposes data structure such as Python lists or PHP hashtables for this usage, because they are not necessarily designed to handle this particular use case well, but it may be that slices do guarantee an O(1) pushing and popping cost under all circumstances.)
The best way to handle environments, as long as they don't need to be reified, is to use numeric indices instead of variables (de Bruijn indices (0 for the variable bound last), or de Bruijn levels (0 for the variable bound first). This way you can only keep a dynamically resized array for the environment and accessing it is very fast. If you have first-class closures you will also need to capture the environment, which will be more costly: you have to copy the part of it in a dedicated structure, or use a non-mutable structure for the whole environment. Only experiment will tell, but my experience is that going for a fast mutable environment structure and paying a higher cost for closure construction is better than having an immutable structure with more bookkeeping all the time; of course you should make an usage analysis to capture only the necessary variables in your closures.
Finally, once you have rooted out the inefficiency sources related to your algorithmic choices, the hot area will be:
garbage collection (definitely a hard topic; if you don't want to become a GC expert, you should seriously consider reusing an existing runtime); you may be using the GC of your host language (heap-allocations in your interpreted language are translated into heap-allocations in your implementation language, with the same lifetime), it's not clear in the code snippet you've shown; this strategy is excellent to get something reasonably efficient in a simple way
numerics implementation; there are all kind of hacks to be efficient when the integers you manipulate are in fact small. Your best bet is to reuse the work of people that have invested tons of effort on this, so I strongly recommend you reuse for example the GMP library. Then again, you may also reuse your host language support for bignum if it has some, in your case Go's math/big package.
the low-level design of your interpreter loop. In a language with "simple bytecode" such as yours (each bytecode instruction is translated in a small number of native instructions, as opposed to complex bytecodes having high-level semantics such as the Parrot bytecode), the actual "looping and dispatching on bytecodes" code can be a bottleneck. There has been quite a lot of research on what's the best way to write such bytecode dispatch loops, to avoid the cascade of if/then/else (jump tables), benefit from the host processor branch prediction, simplify the control flow, etc. This is called threaded code and there are a lot of (rather simple) different techniques : direct threading, indirect threading... If you want to look into some of the research, there is for example work by Anton Ertl, The Structure and Performance of Efficient Interpreters in 2003, and later Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. The benefits of those techniques tend to be fairly processor-sensitive, so you should test the various possibilities yourself.
While the STG work is interesting (and Peyton-Jones book on programming language implementation is excellent), it is somewhat oriented towards lazy evaluation. Regarding the design of efficient bytecode for strict functional languages, my reference is Xavier Leroy's 1990 work on the ZINC machine: The ZINC experiment: An Economical Implementation of the ML Language, which was ground-breaking work for the implementation of ML languages, and is still in use in the implementation of the OCaml language: there are both a bytecode and a native compiler, but the bytecode still uses a glorified ZINC machine.

Analyzing slow performance of a Haskell program

I was trying to solve ITA Software's "Word Nubmers" puzzle using a brute force approach. It looks like my Haskell version is more than 10 times slower than a C#/C++ version.
The answer
Thanks to Bryan O'Sullivan's answer, I was able to "correct" my program to acceptable performance. You can read his code which is much cleaner than mine. I am going to outline the key points here.
Int is Int64 on Linux GHC x64. Unless you unsafeCoerce, you should just use Int. This saves you from having to fromIntegral. Doing Int64 on Windows 32-bit GHC is just darn slow, avoid it. (This is in fact not GHC's fault. As mentioned in my blog post below, 64 bit integers in 32-bit programs is slow in general (at least in Windows))
-fllvm or -fvia-C for performance.
Prefer quotRem to divMod, quotRem already suffices. That gave me 20% speed up.
In general, prefer Data.Vector to Data.Array as an "array"
Use the wrapper-worker pattern liberally.
The above points were enough to give me about 100% boost over my original version.
In my blog post, I have detailed a step-by-step illustrated example of how I turned the original program to match Bryan's program. There are other points mentioned there as well.
The original question
(This may sound like a "could you do the work for me" post, but I argue that such a concrete example would be very instructive since profiling Haskell performance is often seen as a myth)
(As noted in the comments, I think I have misinterpreted the problem. But who cares, we can focus on performance in a different problem)
Here's a my version of a quick recap of the problem:
A wordNumber is defined as
wordNumber 1 = "one"
wordNumber 2 = "onetwo"
wordNumber 3 = "onethree"
wordNumber 15 = "onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
...
Problem: Find the 51-billion-th letter of (wordNumber Infinity); assume that letter is found at 'wordNumber x', also find 'sum [1..x]'
From an imperative perspective, a naive algorithm would be to have 2 counters, one for sum of numbers and one for sum of lengths. Keep counting the length of each wordNumber and "break" to return the result.
The imperative brute-force approach is implemented in C# here: http://ideone.com/JjCb3. It takes about 1.5 minutes to find the answer on my computer. There is also an C++ implementation that runs in 45 seconds on my computer.
Then I implemented a brute-force Haskell version: http://ideone.com/ngfFq. It cannot finish the calculation in 5 minutes on my machine. (Irony: it's has more lines than the C# version)
Here is the -p profile of the Haskell program: http://hpaste.org/49934
Question: How to make it perform comparatively to the C# version? Are there obvious mistakes I am making?
(Note: I am fully aware that brute-forcing it is not the correct solution to this problem. I am mainly interested in making the Haskell version perform comparatively to the C# version. Right now it is at least 5x slower so obviously I am missing something obvious)
(Note 2: It does not seem to be space leaking. The program runs with constant memory (about 2MB) on my computer)
(Note 3: I am compiling with `ghc -O2 WordNumber.hs)
To make the question more reader friendly, I include the "gist" of the two versions.
// C#
long sumNum = 0;
long sumLen = 0;
long target = 51000000000;
long i = 1;
for (; i < 999999999; i++)
{
// WordiLength(1) = 3 "one"
// WordiLength(101) = 13 "onehundredone"
long newLength = sumLen + WordiLength(i);
if (newLength >= target)
break;
sumNum += i;
sumLen = newLength;
}
Console.WriteLine(Wordify(i)[Convert.ToInt32(target - sumLen - 1)]);
-
-- Haskell
-- This has become totally ugly during my squeeze for
-- performance
-- Tail recursive
-- n-th number (51000000000 in our problem) -> accumulated result -> list of 'zipped' left to try
-- accumulated has the format (sum of numbers, current lengths of the whole chain, the current number)
solve :: Int64 -> (Int64, Int64, Int64) -> [(Int64, Int64)] -> (Int64, Int64, Int64)
solve !n !acc#(!sumNum, !sumLen, !curr) ((!num, !len):xs)
| sumLen' >= n = (sumNum', sumLen, num)
| otherwise = solve n (sumNum', sumLen', num) xs
where
sumNum' = sumNum + num
sumLen' = sumLen + len
-- wordLength 1 = 3 "one"
-- wordLength 101 = 13 "onehundredone"
wordLength :: Int64 -> Int64
-- wordLength = ...
solution :: Int64 -> (Int64, Char)
solution !x =
let (sumNum, sumLen, n) = solve x (0,0,1) (map (\n -> (n, wordLength n)) [1..])
in (sumNum, (wordify n) !! (fromIntegral $ x - sumLen - 1))
I've written a gist that contains both a C++ version (a copy of yours from a Haskell-cafe message, with a bug fixed) and a Haskell translation.
Notice that the two are structurally almost identical. When compiled with -fllvm, the Haskell code runs at about half the speed of the C++ code, which is pretty good.
Now let's compare my Haskell wordLength code to yours. You're passing around an extra unnecessary parameter, which is unnecessary (you apparently figured that out when writing the C++ code that I translated). Also, the large number of bang patterns suggests panic; they're almost all useless.
Your solve function is also very confused.
You're passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
This function is necessarily not very regular in its behaviour, so while you gain nothing stylistically by using a list to supply your counter, you probably force GHC to allocate memory. In other words, this both obfuscates the code and makes it slower.
By using a tuple for three parameters (for no obvious reason), you're again working hard to force GHC to allocate memory for every step through the loop, when it could avoid doing so if you passed the parameters directly.
Only your n parameter is dealt with in a sensible way, but you don't need a bang pattern on it.
The only parameter that needs a bang pattern is sumNum, because you never inspect its value until after the loop has finished. GHC's strictness analyser will deal with the others. All of your other bang patterns are unnecessary at best, misdirections at worst.
Here are two pointers I could come up with in a quick investigation:
Note that using Int64 is really slow when you are using a 32 bit build of GHC, as is the default for Haskell Platform, currently. This also turned out to be the main villain in a previous performance problem (there I give a few more details).
For reasons I don't quite understand the divMod function does not seem to get inlined. As a result, the numbers are returned on the heap. When using div and mod separately, wordLength' executes purely on the stack as it should be.
Sadly I currently have no 64-bit GHC around to test whether this is enough to solve the problem.

Haskell - simple way to cache a function call

I have functions like:
millionsOfCombinations = [[a, b, c, d] |
a <- filter (...some filter...) someListOfAs,
b <- (...some other filter...) someListOfBs,
c <- someListOfCs, d <- someListOfDs]
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
comb1 <- millionsOfCombinations,
comb2 <- millionsOfCombinations,
comb3 <- someList,
...around 10 function calls to find if
[comb1, comb2, comb3] is actually useful]
Evaluating millionsOfCombinations takes 40s. on a very fast workstation. Evaluating aLotOfCombinationsOfCombinations!!0 took 2 days :-(
How can I speed up this code? So far I've had 2 ideas - use a profiler. Tried running myapp +RTS -sstderr after compiling with GHC, but get a blank screen and don't want to wait days for it to finish.
2nd thought was to somehow cache millionsOfCombinations. Do I understand correctly that for each value in aLotOfCombinationsOfCombinations, millionsOfCombinations gets evaluated multiple times? If that is so, how can I cache the result? Obviously I've just started learning Haskell. I know there is a way to do call caching with a monad, but I still don't understand those things.
Use the -fforce-recomp, -O2 and -fllvm flags
If you aren't already, be sure to use the above flags. I wouldn't normally mention it, but I've seen some questions recently that didn't know powerful optimization isn't a default.
Profile Your Code
The -sstderr flag isn't exactly profiling. When people say profiling they're usually talking about either heap profiling or time profiling via -prof and -auto-all flags.
Avoid Costly Primitives
If you need the entire list in memory (i.e. it isn't going to be optimized away) then consider unboxed vectors. If Int will do instead of Integer, consider that (but Integer is a reasonable default when you don't know!). Use worker/wrapping transforms at the right times. If you're leaning heavily on Data.Map, try using Data.HashMap from the unordered-containers library. This list can go on and on, but since you don't already have an intuition on where your computation time is going the profiling should come first!
I think, that there is no way. Please notice, that the time to generate the list is growing with each list involved. So you get around 10000003 combinations to check, which indeed takes a lot of time. Caching the list ist possible but is unlikely to change anything, since new elements can be generated almost instantly. The only way is probably to change the algorithm.
If millionsOfCombinations is a constant (and not a function with arguments), it is cached automatically. Else, make it a constant by using a where clause:
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
comb1 <- millionsOfCombinations,
comb2 <- millionsOfCombinations,
comb3 <- someList,
...around 10 function calls to find if
[comb1, comb2, comb3] is actually useful] where
millionsOfCombinations = makeCombination xyz

Resources