OCaml performance according to matching order - performance

In OCaml, is there any relation between the order in a pattern-matching and performance?
For instance, if I declare a type:
type t = A | B | C
and then perform some pattern-matching as follows:
match t1 with
| A -> ...
| _ -> ...
From a performance point of view, is it equivalent to
match t1 with
| B -> ...
| _ -> ...
assuming in the first case there are as many A's as there are B's in the second?
In other words, should I worry about the order of declaration of constructors in a type, when considering performance?

There is a paper explaining how pattern-matching is compiled in OCaml:
“Optimizing Pattern Matching”, L. Maranget and F. Le Fessant, ICFP’01
It basically says that the semantics is "in order", but that it is usually compiled in the optimal way, independently of the order of lines. Values of constructors don't matter either, it's the number of constructors that makes the difference, i.e. if it is compiled by a tree of comparisons, or by a jump table.
Optimality + exhaustivity test makes pattern-matching in OCaml probably the most wonderful feature of the language, and is much more efficient that writting cascades of "if" manually.

This is an impossible question to answer carefully. However, in practice if you have a type whose constructors are all nullary (i.e., equivalent to small integers), and there are more than a very few of them, but less than a whomping huge pile of them, the code generator will almost certainly use a hardware jump table, which has essentially the same performance for each possible value.
In general, I wouldn't worry about things like this at all until you have identified the slow parts of your code. But there's almost no chance you would be able to speed things up by reordering a set of nullary constructors.

Related

Is having only one argument functions efficient? Haskell

I have started learning Haskell and I have read that every function in haskell takes only one argument and I can't understand what magic happens under the hood of Haskell that makes it possible and I am wondering if it is efficient.
Example
>:t (+)
(+) :: Num a => a -> a -> a
Signature above means that (+) function takes one Num then returns another function which takes one Num and returns a Num
Example 1 is relatively easy but I have started wondering what happens when functions are a little more complex.
My Questions
For sake of the example I have written a zipWith function and executed it in two ways, once passing one argument at the time and once passing all arguments.
zipwithCustom f (x:xs) (y:ys) = f x y : zipwithCustom f xs ys
zipwithCustom _ _ _ = []
zipWithAdd = zipwithCustom (+)
zipWithAddTo123 = zipWithAdd [1,2,3]
test1 = zipWithAddTo123 [1,1,1]
test2 = zipwithCustom (+) [1,2,3] [1,1,1]
>test1
[2,3,4]
>test2
[2,3,4]
Is passing one argument at the time (scenario_1) as efficient as passing all arguments at once (scenario_2)?
Are those scenarios any different in terms of what Haskell is actually doing to compute test1 and test2 (except the fact that scenario_1 probably takes more memory as it needs to save zipWithAdd and zipWithAdd123)
Is this correct and why? In scenario_1 I iterate over [1,2,3] and then over [1,1,1]
Is this correct and why? In scenario_1 and scenario_2 I iterate over both lists at the same time
I realise that I have asked a lot of questions in one post but I believe those are connected and will help me (and other people who are new to Haskell) to better understand what actually is happening in Haskell that makes both scenarios possible.
You ask about "Haskell", but Haskell the language specification doesn't care about these details. It is up to implementations to choose how evaluation happens -- the only thing the spec says is what the result of the evaluation should be, and carefully avoids giving an algorithm that must be used for computing that result. So in this answer I will talk about GHC, which, practically speaking, is the only extant implementation.
For (3) and (4) the answer is simple: the iteration pattern is exactly the same whether you apply zipWithCustom to arguments one at a time or all at once. (And that iteration pattern is to iterate over both lists at once.)
Unfortunately, the answer for (1) and (2) is complicated.
The starting point is the following simple algorithm:
When you apply a function to an argument, a closure is created (allocated and initialized). A closure is a data structure in memory, containing a pointer to the function and a pointer to the argument. When the function body is executed, any time its argument is mentioned, the value of that argument is looked up in the closure.
That's it.
However, this algorithm kind of sucks. It means that if you have a 7-argument function, you allocate 7 data structures, and when you use an argument, you may have to follow a 7-long chain of pointers to find it. Gross. So GHC does something slightly smarter. It uses the syntax of your program in a special way: if you apply a function to multiple arguments, it generates just one closure for that application, with as many fields as there are arguments.
(Well... that might be not quite true. Actually, it tracks the arity of every function -- defined again in a syntactic way as the number of arguments used to the left of the = sign when that function was defined. If you apply a function to more arguments than its arity, you might get multiple closures or something, I'm not sure.)
So that's pretty nice, and from that you might think that your test1 would then allocate one extra closure compared to test2. And you'd be right... when the optimizer isn't on.
But GHC also does lots of optimization stuff, and one of those is to notice "small" definitions and inline them. Almost certainly with optimizations turned on, your zipWithAdd and zipWithAddTo123 would both be inlined anywhere they were used, and we'd be back to the situation where just one closure gets allocated.
Hopefully this explanation gets you to where you can answer questions (1) and (2) yourself, but just in case it doesn't, here's explicit answers to those:
Is passing one argument at the time as efficient as passing all arguments at once?
Maybe. It's possible that passing arguments one at a time will be converted via inlining to passing all arguments at once, and then of course they will be identical. In the absence of this optimization, passing one argument at a time has a (very slight) performance penalty compared to passing all arguments at once.
Are those scenarios any different in terms of what Haskell is actually doing to compute test1 and test2?
test1 and test2 will almost certainly be compiled to the same code -- possibly even to the point that only one of them is compiled and the other is an alias for it.
If you want to read more about the ideas in the implementation, the Spineless Tagless G-machine paper is much more approachable than its title suggests, and only a little bit out of date.

Why are epsilon transitions used in NFA?

I'm trying to understand how to create NFA-s from regular expressions, but I am really confused from epsilon transitions. I have this example in my textbook , but I don't understand why epsilon transitions are used and how does one know when to use them.
In general, espilon-transitions are used when they are convenient. For example, when constructing an NFA from a regular expression, you start by constructing small parts of the automaton corresponding to parts of the expression. To connect them, you need to put a transition. But if there is no symbol to be read there, an epsilon transition is a simple way to do this. They are, however never necessary, you can always find a solution without them.
In your example, just apply the algorithm described in your textbook. It tells you when to use them.
The epsilon transitions
from 1 to 2 probably connects the parts for (a|b)* and for ac
1->5 and 8->1 probably result from the *
5->6 and 5->7 probably result from the alternative in |
Epsilon-transitions in NFAs are a natural representation of choice or disjunction or union in regular expressions. That is, a regular expression like r + s (or r | s or r U s depending on your preferred notation) is naturally represented as an NFA consisting of two independent NFAs, one for r and one for s, joined using e-transitions as follows:
e
----->q0----->(r)
|
| e
|
V
(s)
When used to connect states in more complicated ways, the effect may not be as easy or natural to describe, but essentially these transitions let you choose unconditionally among multiple options. So, if I have seen a part of the input already and there are a few different ways the string could end, I can represent that by using e-transitions to states that handle the different possibilities.
In your example, the e-transitions are not really serving any very useful function and are merely artifacts of the conversion algorithm you have used. That algorithm includes them because, in the general case, they may be useful or necessary. In your specific case this was not true, so they look out of place.

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is high-end intel i7-Haswell processor that supports 64 bit data type, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and f are 64 bits data type,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means, the calculation of Eqn. 2 should take the exact number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled up using a random generator in the runtime.
Now my question is,
Considering I am using GCC or GCC-7 with -O3, How far can the compiler optimize the expression? for example, if A_64 becomes all zeroes (can happen with probability 2^{-64} ) Then we don't need to calculate the first part of eqn.2 then O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a c compiler to optimize such a way? We have to remember that the values are filled up at runtime and the boolean expressions have around 120 variables.
Is it possible for a for a processor to do such an optimization (List 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined, now is it possible for a processor to pull out an operation out of the pipeline in if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a c compiler to optimize such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). If you need a hard guarantee of course I can't give you that, if you want to be sure you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom, if an input is accidentally zero that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions, they still have to go through (most of) the front-end of the processor even if it is just to find out that they didn't need to be executed after all.

When to use various language pragmas and optimisations?

I have a fair bit of understanding of haskell but I am always little unsure about what kind of pragmas and optimizations I should use and where. Like
Like when to use SPECIALIZE pragma and what performance gains it has.
Where to use RULES. I hear people taking about a particular rule not firing? How do we check that?
When to make arguments of a function strict and when does that help? I understand that making argument strict will make the arguments to be evaluated to normal form, then why should I not add strictness to all function arguments? How do I decide?
How do I see and check I have a space leak in my program? What are the general patterns which constitute to a space leak?
How do I see if there is a problem with too much lazyness? I can always check the heap profiling but I want to know what are the general cause, examples and patterns where lazyness hurts?
Is there any source which talks about advanced optimizations (both at higher and very low levels) especially particular to haskell?
Like when to use SPECIALIZE pragma and what performance gains it has.
You let the compiler specialise a function if you have a (type class) polymorphic function, and expect it to be called often at one or a few instances of the class(es).
The specialisation removes the dictionary lookup where it is used, and often enables further optimisation, the class member functions can often be inlined then, and they are subject to strictness analysis, both give potentially huge performance gains. If the only optimisation possible is the elimination of the dicitonary lookup, the gain won't generally be huge.
As of GHC-7, it's probably more useful to give the function an {-# INLINABLE #-} pragma, which makes its (nearly unchanged, some normalising and desugaring is performed) source available in the interface file, so the function can be specialised and possibly even inlined at the call site.
Where to use RULES. I hear people taking about a particular rule not firing? How do we check that?
You can check which rules have fired by using the -ddump-rule-firings command line option. That usually dumps a large number of fired rules, so you have to search a bit for your own rules.
You use rules
when you have a more efficient version of a function for special types, e.g.
{-# RULES
"realToFrac/Float->Double" realToFrac = float2Double
#-}
when some functions can be replaced with a more efficient version for special arguments, e.g.
{-# RULES
"^2/Int" forall x. x ^ (2 :: Int) = let u = x in u*u
"^3/Int" forall x. x ^ (3 :: Int) = let u = x in u*u*u
"^4/Int" forall x. x ^ (4 :: Int) = let u = x in u*u*u*u
"^5/Int" forall x. x ^ (5 :: Int) = let u = x in u*u*u*u*u
"^2/Integer" forall x. x ^ (2 :: Integer) = let u = x in u*u
"^3/Integer" forall x. x ^ (3 :: Integer) = let u = x in u*u*u
"^4/Integer" forall x. x ^ (4 :: Integer) = let u = x in u*u*u*u
"^5/Integer" forall x. x ^ (5 :: Integer) = let u = x in u*u*u*u*u
#-}
when rewriting an expression according to general laws might produce code that's better to optimise, e.g.
{-# RULES
"map/map" forall f g. (map f) . (map g) = map (f . g)
#-}
Extensive use of RULES in the latter style is made in fusion frameworks, for example in the text library, and for the list functions in base, a different kind of fusion (foldr/build fusion) is implemented using rules.
When to make arguments of a function strict and when does that help? I understand that making argument strict will make the arguments to be evaluated to normal form, then why should I not add strictness to all function arguments? How do I decide?
Making an argument strict will ensure that it is evaluated to weak head normal form, not to normal form.
You do not make all arguments strict because some functions must be non-strict in some of their arguments to work at all and some are less efficient if strict in all arguments.
For example partition must be non-strict in its second argument to work at all on infinite lists, more general every function used in foldr must be non-strict in the second argument to work on infinite lists. On finite lists, having the function non-strict in the second argument can make it dramatically more efficient (foldr (&&) True (False:replicate (10^9) True)).
You make an argument strict, if you know that the argument must be evaluated before any worthwhile work can be done anyway. In many cases, the strictness analyser of GHC can do that on its own, but of course not in all.
A very typical case are accumulators in loops or tail recursions, where adding strictness prevents the building of huge thunks on the way.
I know no hard-and-fast rules for where to add strictness, for me it's a matter of experience, after a while you learn in what places adding strictness is likely to help and where to harm.
As a rule of thumb, it makes sense to keep small data (like Int) evaluated, but there are exceptions.
How do I see and check I have a space leak in my program? What are the general patterns which constitute to a space leak?
The first step is to use the +RTS -s option (if the programme was linked with rtsopts enabled). That shows you how much memory was used overall, and you can often judge by that whether you have a leak.
A more informative output can be obtained from running the programme with the +RTS -hT option, that produces a heap profile that can help locating the space leak (also, the programme needs to be linked with enabled rtsopts).
If further analysis is required, the programme needs to be compiled with profiling enabled (-rtsops -prof -fprof-auto, in older GHCs, the -fprof-auto option wasn't available, the -prof-auto-all option is the closest correspondence there).
Then you run it with various profiling options and look at the generated heap profiles.
The two most common causes for space leaks are
too much laziness
too much strictness
the third place is probably taken by unwanted sharing, GHC does little common subexpression elimination, but it occasionally shares long lists even where not wanted.
For finding the cause of a leak, I know again no hard-and-fast rules, and occasionally, a leak can be fixed by adding strictness in one place or by adding laziness in another.
How do I see if there is a problem with too much lazyness? I can always check the heap profiling but I want to know what are the general cause, examples and patterns where lazyness hurts?
Generally, laziness is wanted where results can be built up incrementally, and unwanted where no part of the result can be delivered before processing is complete, like in left folds or generally in tail-recursive functions.
I recommend reading the GHC documentation on Pragmas and Rewrite Rules, as they address many of your questions about SPECIALIZE and RULES.
To briefly address your questions:
SPECIALIZE is used to force the compiler to build a specialized version of a polymorphic function for a particular type. The advantage is that applying the function in that case will no longer require the dictionary. The disadvantage is that it will increase the size of your program. Specialization is particularly valuable for functions called in "inner-loops", and it's essentially useless for infrequently called top-level functions. Refer to the GHC documentation for interactions with INLINE.
RULES allows you to specify rewrite rules that you know to be valid but the compiler couldn't infer on its own. The common example is {-# RULES "mapfusion" forall f g xs. map f (map g xs) = map (f.g) xs #-}, which tells GHC how to fuse map. It can be finicky to get GHC to use the rules because of interference with INLINE. 7.19.3 touches on how to avoid conflicts and also how to force GHC to use a rule even when it would normally avoid it.
Strict arguments are most vital for something like an accumulator in a tail-recursive function. You know that the value will ultimately be fully calculated, and building up a stack of closures to delay the computation completely defeats the purpose. Enforced strictness must naturally be avoided anytime the function may be applied to a value which must be processed lazily, like an infinite list. Generally, the best idea is to initially only force strictness where it's obviously useful (like accumulators), and then add more later only as profiling shows it's needed.
My experience has been that most show-stopping space leaks came from lazy accumulators and unevaluated lazy values in very large data-structures, although I'm sure this is specific to the kinds of programs you're writing. Using unboxed data-structures whenever possible fixes a lot of the problems.
Outside of the instances where laziness causes space-leaks, the major situation where it should be avoided is in IO. Lazily processing resource inherently increases the amount of wall-clock time that the resource is needed. This can be bad for cache performance, and it's obviously bad if something else wants exclusive rights to use the same resource.

Reusing memory of immutable state in eager evaluation?

I'm studying purely functional language and currently thinking about some immutable data implementation.
Here is a pseudo code.
List a = [1 .. 10000]
List b = NewListWithoutLastElement a
b
When evaluating b, b must be copied in eager/strict implementation of immutable data.
But in this case, a is not used anymore in any place, so memory of 'a' can be re-used safely to avoid copying cost.
Furthermore, programmer can force compiler always do this by marking the type List with some keyword meaning must-be-disposed-after-using. Which makes compile time error on logic cannot avoid copying cost.
This can gain huge performance. Because it can be applied to huge object graph too.
How do you think? Any implementations?
This would be possible, but severely limited in scope. Keep in mind that the vast majority of complex values in a functional program will be passed to many functions to extract various properties from them - and, most of the time, those functions are themselves arguments to other functions, which means you cannot make any assumptions about them.
For example:
let map2 f g x = f x, g x
let apply f =
let a = [1 .. 10000]
f a
// in another file :
apply (map2 NewListWithoutLastElement NewListWithoutFirstElement)
This is fairly standard in functional code, and there is no way to place a must-be-disposed-after-using attribute on a because no specific location has enough knowledge about the rest of the program. Of course, you could try adding that information to the type system, but type inference on this is decidedly non-trivial (not to mention that types would grow quite large).
Things get even worse when you have compound objects, such as trees, that might share sub-elements between values. Consider this:
let a = binary_tree [ 1; 2; 5; 7; 9 ]
let result_1 = complex_computation_1 (insert a 6)
let result_2 = complex_computation_2 (remove a 5)
In order to allow memory reuse within complex_computation_2, you would need to prove that complex_computation_1 does not alter a, does not store any part of a within result_1 and is done using a by the time complex_computation_2 starts working. While the two first requirements might seem the hardest, keep in mind that this is a pure functional language: the third requirement actually causes a massive performance drop because complex_computation_1 and complex_computation_2 cannot be run on different threads anymore!
In practice, this is not an issue in the vast majority of functional languages, for three reasons:
They have a garbage collector built specifically for this. It is faster for them to just allocate new memory and reclaim the abandoned one, rather than try to reuse existing memory. In the vast majority of cases, this will be fast enough.
They have data structures that already implement data sharing. For instance, NewListWithoutFirstElement already provides full reuse of the memory of the transformed list without any effort. It's fairly common for functional programmers (and any kind of programmers, really) to determine their use of data structures based on performance considerations, and rewriting a "remove last" algorithm as a "remove first" algorithm is kind of easy.
Lazy evaluation already does something equivalent: a lazy list's tail is initially just a closure that can evaluate the tail if you need to—so there's no memory to be reused. On the other hand, this means that reading an element from b in your example would read one element from a, determine if it's the last, and return it without really requiring storage (a cons cell would probably be allocated somewhere in there, but this happens all the time in functional programming languages and short-lived small objects are perfectly fine with the GC).

Resources