Which is better in OCaml pattern matching, `when` or `if-then-else`? - performance

Let's say we have a type called d:
type d = D of int * int
And we want to do some pattern matching over it, is it better to do it this way:
let dcmp = function
| D (x, y) when x > y -> 1
| D (x, y) when x < y -> -1
| _ -> 0
or
let dcmp = function
| D (x, y) ->
if x > y then 1 else if x < y then -1 else 0
In general, is it better to match patterns with many "when" cases, or to match one pattern and then put an "if-then-else" in it?
And where can I get more information about such matters, like good practices in OCaml and syntactic sugars and such?

Both approaches have their pros and cons, so the choice should depend on the context.
A when clause is easier to understand than an if because it has only one branch, so you can digest one branch at a time. The price is that, to understand the path condition of a clause, we have to analyze all the branches before it (and negate them). For example, compare your variant with the following definition, which is equivalent:
let dcmp = function
| D (x, y) when x > y -> 1
| D (x, y) when x = y -> 0
| _ -> -1
Of course, the same is true for the if/then/else construct; it is just harder to accidentally rearrange branches (e.g., during refactoring) in an if/then/else expression and completely change its logic.
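To make the reordering hazard concrete, here is a minimal sketch. The classify functions and their thresholds are made up for illustration and are not from the question. Swapping the two guarded clauses silently changes the result, and since the compiler does not look inside guards, I would not expect it to warn about the clause that can no longer fire:

```ocaml
(* Hypothetical example: names and thresholds are illustrative only. *)
let classify = function
  | n when n > 100 -> "large"
  | n when n > 10 -> "medium"
  | _ -> "small"

(* Same clauses with the first two swapped: the "large" clause can no
   longer fire, because every n > 100 already satisfies n > 10. *)
let classify_swapped = function
  | n when n > 10 -> "medium"
  | n when n > 100 -> "large"
  | _ -> "small"

let () =
  assert (classify 500 = "large");
  assert (classify_swapped 500 = "medium")
```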
In addition, when guards may prevent the compiler from performing decision tree optimizations [1] and may confuse [2] the refutation mechanism.
Given this, the only advantage of using when instead of if in this particular example is that the when syntax looks more appealing: it is perfectly lined up, and it is easier for the human brain to find the conditions and their corresponding values, i.e., it looks more like a truth table. However, if we write
let dcmp (D (x,y)) =
if x = y then 0 else
if x > y then 1 else -1
we can achieve the same level of readability.
To summarize, it is better to use when only when it is impossible or nearly impossible to express the same code with if/then/else. To improve readability, it is better to factor your logic into helper functions with readable names. For example, with dcmp the best solution is to use neither if nor when, e.g.,
let dcmp (D (x,y)) = compare x y
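As a quick sanity check, the three definitions can be put side by side. This is only a sketch: the suffixed names and the sign helper are mine, and I normalise the result of compare through sign because, as far as I know, only its sign is documented.

```ocaml
type d = D of int * int

let dcmp_when = function
  | D (x, y) when x > y -> 1
  | D (x, y) when x < y -> -1
  | _ -> 0

let dcmp_if (D (x, y)) =
  if x > y then 1 else if x < y then -1 else 0

let dcmp_compare (D (x, y)) = compare x y

(* Hypothetical helper: reduce any int to -1, 0 or 1. *)
let sign n = if n > 0 then 1 else if n < 0 then -1 else 0

let () =
  List.iter
    (fun v ->
      assert (dcmp_when v = dcmp_if v);
      assert (sign (dcmp_compare v) = dcmp_when v))
    [D (1, 2); D (2, 1); D (3, 3)]
```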
1) In this particular case the compiler will generate the same code for when and if/then/else. But in more general cases, guards may prevent the matching compiler from generating efficient code, especially when branches are disjoint. In our case, the compiler just noticed that we're repeating the same branch, coalesced the repeats into a single branch, and turned it back into an if/then/else expression, e.g., here is the cmm output of the function with the when guards,
(if (> x y) 3 (if (< x y) -1 1))
which is exactly the same code as generated by the if/then/else version of the dcmp function.
2) Not to the state where it will not notice a missing branch, of course, but to the state where it will report missing branches less precisely or will ask you to add unnecessary branches.
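A small illustration of note 2, as a sketch (sign_guarded and sign_if are made-up names). When every clause is guarded, the exhaustiveness checker cannot see that the guards cover all integers, so it asks for a clause that can never run. As far as I know, dropping the last clause makes the compiler report a non-exhaustive match.

```ocaml
(* All clauses are guarded, so the checker wants a catch-all even though
   the three guards already cover every int. *)
let sign_guarded = function
  | n when n > 0 -> 1
  | n when n < 0 -> -1
  | n when n = 0 -> 0
  | _ -> assert false (* logically unreachable, kept only for the checker *)

(* The if/then/else version needs no such clause. *)
let sign_if n = if n > 0 then 1 else if n < 0 then -1 else 0

let () =
  List.iter (fun n -> assert (sign_guarded n = sign_if n)) [-7; 0; 42]
```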

Quoting the OCaml Towards Clarity and Grace style guide:
Code is more often read than written - make the life of the reader easy
and
Less code is better, cryptic code is worse
The first makes me think that the version with multiple when clauses is the better choice, as it makes it easy to predict or evaluate the result for each condition when reading the code. The second goes further against the if-then-else because, even if shorter, it is cryptic when viewed from afar.
Also, in the section Functions, we find out that "Pattern matching is the preferred way to define functions"
From a Haskell functional programmer's point of view.

Related

Blending Boolean Algebra and Numeric Algebra to assign variables

I have written some code which assigns variables using the results of condition expressions without the explicit use of IF-ELSE statements.
In the simplest form, the problem looks like this:
Version 1
if (x < K)
y = A;
else
y = B;
I've seen a "trick" in the past in which people accomplish the same task in one line without the conditional like this:
Version 2
y = (x < K) * A + !(x < K) * B;
This approach extends relatively easily to handle IF-ELSE IF-ELSE assignments. The trick is to ensure that the conditions are all mutually exclusive.
From a unit testing perspective, I'm required to achieve 100% code path coverage.
My coworkers agree that the Version 2 is more elegant, but they contend it is less readable. Furthermore, they argue that I am "side-stepping" the path coverage requirement and that I would be able to achieve 100% path coverage by "hiding" the conditional logic inside the single line of code without actually exercising both conditions ((x < K) and !(x < K)).
I argue that I am able to blend Boolean algebra and numeric algebra to perform variable assignment because the computer treats Boolean 'true' and 'false' as '1' and '0' which can be multiplied by 'float' and 'int' variables. To me, it becomes simply an arithmetic expression with zeros and ones multiplying variables.
Why am I doing this?
I am doing this blend of Boolean and numeric algebra to minimize the number of IF-statements, minimize lines of code, and generally clean up the code. Obviously performance can be improved by saving the result of the condition to a variable and referencing it.
The Question
Is this practice (and ternary operators) frowned upon from a unit testing perspective?
If this question is too subjective, please suggest edits.
I'd suggest avoiding it (this trick is actually useful when the intention is to avoid branching, which may be the context you've seen it in). Given that the language doesn't have a conditional operator, you should be able to define the equivalent of
cond(bool, x, y) { if (bool) return x; else return y; }
yourself and write y = cond(x < K, A, B). It's more readable, harder to make a mistake when writing, is usable with non-number types, and is considered correctly in path coverage. It evaluates both sides, unlike the actual conditional operator (unless the language has macros or lazy evaluation), but so does the described trick.
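Here is a sketch of the three variants in OCaml. The question's language is not stated, so this is only an analogy, and all names below are mine. It shows the arithmetic trick, a plain cond helper, and a lazy variant, with a counter demonstrating that the plain helper evaluates both sides while the lazy one evaluates only the selected side.

```ocaml
(* Version 1: ordinary branch. *)
let select_branch x k a b = if x < k then a else b

(* Version 2: the arithmetic trick. Bool.to_int maps false/true to 0/1. *)
let select_arith x k a b =
  Bool.to_int (x < k) * a + Bool.to_int (not (x < k)) * b

(* The suggested helper. Both x and y are evaluated before the call. *)
let cond c x y = if c then x else y

(* Lazy variant: only the selected side is forced. *)
let cond_lazy c x y = if c then Lazy.force x else Lazy.force y

let () =
  assert (select_arith 1 2 10 20 = select_branch 1 2 10 20);
  let evaluated = ref 0 in
  let tick v = incr evaluated; v in
  ignore (cond true (tick 1) (tick 2));
  assert (!evaluated = 2);
  evaluated := 0;
  ignore (cond_lazy true (lazy (tick 1)) (lazy (tick 2)));
  assert (!evaluated = 1)
```

Unlike the arithmetic version, cond also works for non-numeric types such as strings.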

Conditional Dependencies in Compiler Semantic Analysis Passes

Imagine that we have been given an Excel spreadsheet with three columns, labeled COND, X and Y.
COND = TRUE or FALSE (user input)
X = if(COND == TRUE) then 0 else Y
Y = if(COND == TRUE) then X else 1;
These formulas evaluate perfectly fine in Excel, and Excel does not generate a Circular Dependency error.
I am writing a compiler that tries to convert these Excel formulas to C code. In my compiler, these formulas do generate a circular dependency error. The issue is that (naïvely) the expression of X depends on Y and the expression for Y depends on X and my compiler is unable to logically continue.
Excel is able to accomplish this feat because it is a lazy, interpreted language. Excel will just lazily evaluate the formulas at run-time (with user inputs), and since no circular dependency occurs at run-time Excel has no problem evaluating such logic.
Unfortunately, I need to convert these formulas to a compiled language (not an interpreted one). The actual formulas, in the actual spreadsheets, have more complicated dependencies between multiple cells/variables (involving up to over half a dozen different cells). This means that my compiler has to perform some kind of sophisticated static, semantic analysis of the formulas and be smart enough to detect that there are no circular references if we "look inside" the conditional branches. The compiler would then have to generate the following C code from the above Excel formulas:
bool COND;
int X, Y;
if(COND) { X = 0; Y = X; } else { Y = 1; X = Y; }
Notice that the order of the assignment instructions is different in each branch of the if-statement in C.
My question is, is there any established algorithm or literature on compilers that explains how to implement this type of analysis in a compiler? Do functional programming language compilers have to solve this problem?
Why aren't standard optimization techniques adequate?
Presumably, the Excel formulas form a DAG with the leaves being primitive values and the nodes being computations/assignments. (If the Excel computation forms a cycle, then you need some kind of iterative solver, assuming you want a fixpoint.)
If you simply propagate the conditional by lifting it (a classic compiler optimization), we start with your original equations, where each computation may be evaluated in any order with respect to the others, as long as the result is computed DAG-like (the "anyorder" operator below is intended to model that):
X = if(COND == TRUE) then 0 else Y;
anyorder
Y = if(COND == TRUE) then X else 1;
then lifting the conditional:
if (COND) { X=0; } else { X = Y; }
anyorder
if (COND) { Y=X; } else { Y = 1; }
then
if (COND) { X=0; anyorder Y=X; } else { X = Y; anyorder Y = 1; }
Each of the arms must be dag-like.
The first arm is daglike evaluating the X=0 assignment first.
The second arm is daglike evaluating Y=1 first. So, we get the answer you wanted:
if (COND) { X=0; Y=X; } else { Y = 1; X = Y; }
So conventional transformations, together with the knowledge that anyorder is valid only if each arm is DAG-like, seem to give the right effect.
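The lifting steps above can be sketched in OCaml roughly as follows. Everything here is an assumption made for illustration: the tiny expr type, the function names, and the restriction that a condition is just the name of a boolean input cell. The idea is to specialize every formula for one value of COND, then topologically sort each arm separately. A cycle that survives specialization is a real circular reference.

```ocaml
(* Hypothetical mini-language for cell formulas. *)
type expr =
  | Const of int
  | Ref of string               (* reference to another cell *)
  | If of string * expr * expr  (* branch on a boolean input cell *)

(* Specialize a formula for a known value of the boolean input c. *)
let rec specialize c value = function
  | If (c', t, e) when c' = c -> specialize c value (if value then t else e)
  | If (c', t, e) -> If (c', specialize c value t, specialize c value e)
  | e -> e

(* Names read by a formula (inputs and cells alike). *)
let rec refs = function
  | Const _ -> []
  | Ref r -> [r]
  | If (c, t, e) -> c :: (refs t @ refs e)

type mark = Visiting | Done
exception Cycle of string

(* Depth-first topological sort; names without a formula are inputs. *)
let order cells =
  let marks = Hashtbl.create 16 in
  let out = ref [] in
  let rec visit name =
    match List.assoc_opt name cells, Hashtbl.find_opt marks name with
    | None, _ -> ()
    | Some _, Some Done -> ()
    | Some _, Some Visiting -> raise (Cycle name)
    | Some e, None ->
      Hashtbl.replace marks name Visiting;
      List.iter visit (refs e);
      Hashtbl.replace marks name Done;
      out := name :: !out
  in
  List.iter (fun (name, _) -> visit name) cells;
  List.rev !out

let specialize_all c value cells =
  List.map (fun (name, e) -> (name, specialize c value e)) cells

let sheet =
  [ ("X", If ("COND", Const 0, Ref "Y"));
    ("Y", If ("COND", Ref "X", Const 1)) ]

let () =
  assert (order (specialize_all "COND" true sheet) = ["X"; "Y"]);
  assert (order (specialize_all "COND" false sheet) = ["Y"; "X"])
```

Sorting the unspecialized sheet raises Cycle, which mirrors the naive circular-dependency error from the question. The two specialized orders match the assignment order in the two arms of the C code.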
I'm not sure what you do if COND is computed as a function of the cells.
I suspect the way to do this is to generate a dependency graph of computations with conditionals on the dependencies. You probably have to propagate/group those conditionals over the arcs, more or less as I did over the syntax.
Yes, literature exists. Sorry, I cannot quote any; I simply don't remember it, and I would just google it up, just as you can.
Basic algos for dependency and cycle analysis are really simple. I.e. detect symbols in the expression, build a set of expressions and dependencies in form:
inps expr outs
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5
cell_A1, cell_A5 -> expr2 -> cell_A6
and then by comparing and iteratively expanding/replacing sets of inputs/outputs:
step0:
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5 <--1 note that cell_A5 ~ (A1,B4)
cell_A1, cell_A5 -> expr2 -> cell_A6 <--1 apply that knowledge here
so dependency
cell_A1, cell_A5 -> expr2 -> cell_A6
morphs into
cell_A1, cell_B4 -> expr2 -> cell_A6 <--2 note that cell_A6 ~ (A1,B4) and so on
Finally, you will get either a set of full dependencies, where you can easily detect circular dependencies, like for example:
cell_A1, cell_D6, cell_F7 -> exprN -> cell_D6
or, if none found - you will be able to determine a safe, incremental order of the execution.
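A minimal sketch of that expansion in OCaml, under my own assumptions: rules are stored as an association list from output cell to direct inputs, and instead of rewriting the rules in place I compute the full (transitive) input set of a cell. A cell is circular when it ends up in its own input set.

```ocaml
module S = Set.Make (String)

(* Each rule maps an output cell to the cells its formula reads directly. *)
let rules =
  [ ("cell_A7", ["cell_A6"; "cell_B7"]);
    ("cell_A5", ["cell_A1"; "cell_B4"]);
    ("cell_A6", ["cell_A1"; "cell_A5"]) ]

let direct_inputs rules cell =
  Option.value (List.assoc_opt cell rules) ~default:[]

(* Worklist expansion: keep replacing cells by their own inputs until
   nothing new shows up. *)
let transitive_inputs rules cell =
  let rec go acc = function
    | [] -> acc
    | c :: rest when S.mem c acc -> go acc rest
    | c :: rest -> go (S.add c acc) (direct_inputs rules c @ rest)
  in
  go S.empty (direct_inputs rules cell)

let is_circular rules cell = S.mem cell (transitive_inputs rules cell)

let () =
  assert (not (List.exists (fun (cell, _) -> is_circular rules cell) rules));
  (* The circular rule from the text, added to the same rule set. *)
  let bad = ("cell_D6", ["cell_A1"; "cell_D6"; "cell_F7"]) :: rules in
  assert (is_circular bad "cell_D6")
```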
If the expressions contain branches or sideeffects other than the 'returned value', you can apply various transformations to reduce/expand the expressions into new ones, or into groups of new expressions that will be of the form above. For example:
B5 = { if(A5 + A3 > 0) A3-1 else A5+1 }
so
inps ... outs
A3, A5 -> theExpr -> B5
the condition can be 'lifted' and form two conditional rules:
A5 + A3 > 0 : A3 -> reducedexpr "A3-1" -> B5
A5 + A3 <= 0 : A5 -> reducedexpr "A5+1" -> B5
but now, your execution/analysis must also take care of the conditions before applying the rules. Lifting is only one of possible transformations.
However, you still need something more than that, or at least some 'extension' of it. The hard part of your problem is that your expressions are complex and have branches, and you need arbitrary user input to resolve the branches, eliminate the dead ones, and break the dead dependencies.
Since the key is elimination of dead dependencies, you have to somehow detect dead branches. Conditions can be of arbitrary complexity, and user input is arbitrary, so you cannot really work it out completely statically. After playing with transformations, you would still have to analyze the conditions and generate code accordingly. To do so, you would need to generate code for all possible combinations of the outcomes of the conditions, and all the resulting branching and rule combinations, which is simply infeasible except for some trivial cases. With N unknown conditions, the number of leaves can grow exponentially (2^N), which is a huge bloat after crossing some threshold.
Of course while analyzing conditions based on Bools, you can analyze, group and eliminate conflicting conditions like (a & b & !a)..
..but if your input values and conditions include NON-BOOL data, like integers, floats, or strings, just imagine that you have a condition that executes some external weird statistical function and checks its result.. Ignore the 'weird' part and focus on 'external'. If you meet some expressions that use complex functions like AVG or MAX, you cannot chew through something like that statically(*). Even simple arithmetic is hard to analyze: (a+b)*(c+d) - you could derive the fact that c+d can be ignored when a+b==0, but this is a really tough task to cover fully..
IIRC, doing a satisfiability analysis (SAT) for boolean expressions with basic operators is an NP-hard problem, not to mention integers or floating points with all their math.. Calculating the result of an expression is much easier than telling which values it really depends on!!
So, since input values may be either hardcoded (cool) or user-supplied at runtime (doh!), your compiler most probably will not be able to fully analyze it up front. Now link it with the fact marked as (*) and it's quite obvious that you can include some static analysis and try to eliminate some branches at 'compilation time', but still there might be some parts that must be delayed until the user provides the actual input.
So, if part of the analysis must be done at runtime, all the branch elimination is just an optional optimisation and I think you should focus on the runtime part now.
In a minimal, unoptimized version, your generated program could simply remember all the excel-expressions and wait for input data. Once the program is run and input is given, the program has to substitute the input into the expressions, and then try to iteratively reduce them to output values.
Writing such an algorithm in an imperative language is completely possible. Actually, you'd need to write it once, and later you'd just merge it with different sets of rules derived from the cell formulas, and done. The runtime part of the program would stay the same; only the formulas would change.
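For what it's worth, here is one way the "resolve it at runtime" idea could look in a compiled language with lazy values, using the COND/X/Y example from the question. This is only a sketch, and eval_sheet is a made-up name. Each cell becomes a lazy value that is computed on demand and only follows the dependency selected by the actual input.

```ocaml
(* Each cell is a suspended computation; forcing X only forces Y on the
   branch that actually needs it, so no cycle is followed at runtime. *)
let eval_sheet cond =
  let rec x = lazy (if cond then 0 else Lazy.force y)
  and y = lazy (if cond then Lazy.force x else 1) in
  (Lazy.force x, Lazy.force y)

let () =
  assert (eval_sheet true = (0, 0));
  assert (eval_sheet false = (1, 1))
```

If the formulas contained a genuine cycle for some input, I would expect forcing a cell to raise Lazy.Undefined instead of looping, which would at least turn the problem into a runtime error message.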
You could then expand the 'compiler' side to try to help by i.e. preliminarily partially analyzing the dependencies and trying to reorder the rules so later they will be checked in a "better order", or by precalculating constants, or inlining some expressions and so on but as I said, it's all optimizations, not core feature.
Sadly, I cannot really tell you much anything serious about the "functional languages", but since usually their runtimes are 'very dynamic' and sometimes they even execute the code in terms of symbols and transformations, it could reduce the complexity of your 'compiler' and 'engine' part. The most valuable asset here is the dynamism. So, even a Ruby would do much better than C - but in no way it's a "compiled" language as you'd say.
For example, you could try to transform excel rules directly into functions:
def cell_A5 = expr1(cell_A1, cell_B4)
def cell_A7 = expr3(cell_A6, cell_B7)
def cell_A6 = expr2(cell_A1, cell_A5)
Write it down as part of the program; then at runtime, when the user provides some values, those would just redefine some parts of the program:
cell_B7 = 11.2 // filling up undefined variable
cell_A1 = 23 // filling up undefined variable
cell_A5 = 13 // overwriting the function with a value
That's the power of dynamic platforms, nothing very 'functional' here. Dynamic platforms make it easy to fill/override bits. But then, once the user provided some bits and once the program has been "corrected on the fly", which one function would you call first?
The answer is somewhat sad.. You don't know.
If your dynamic language has some rule-engine built into it, you can try generating rules instead of functions and later rely on that engine to "fill up" everything that is possible to calculate.
But if it doesn't have rule engine, you are back to point one..
afterthought:
Hm.. sorry, I think I just wrote too much and too vaguely/chatty. If you think it's helpful, please drop me a comment. Otherwise I'll delete it after few days or a week.

OCaml: order of input two values

In a functional language, the order in which function arguments are evaluated should not matter.
However, even the simplest programs may not be quite functional. Here the code reads two integers and raises one to the power of the other:
let pwr x y =
let rec pwrx = function 0 -> 1 | y -> x * pwrx (y - 1)
in pwrx y;;
print_int (pwr (read_int ()) (read_int ()));;
The code, obviously, reads the second argument first: if 5 and 4 are entered, result is 1024.
I suppose the problem is that I am misusing the language and don't understand its philosophy. How should I write such things properly? Should I read the two values on separate lines before calling the function?
let x = read_int();;
let y = read_int();;
print_int (pwr x y);;
It works, but it looks like a bit of overhead, doesn't it?
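The problem is not a lack of functional purity in the language, but the fact that a feature like read_line is not functional per se, since it relies on external input from stdin. OCaml leaves the evaluation order of function arguments unspecified, and current compilers happen to evaluate them right to left, which is why the second argument is read first.
You should use local declarations like you did (and it's even pointed out here in the official documentation). There's no real overhead since they're just declarations.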
If you want this to be part of a single function and use purely local variables the code would be:
let x = read_int() in
let y = read_int() in
print_int (pwr x y)
Generally speaking, if you want to enforce a particular order, you should use let bindings to do it. It's slightly less pretty, but not everything can look elegant all the time, especially input and output in functional programming.
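A small sketch of the let-sequencing idea that can be run without a terminal. The make_reader helper is a stand-in I made up for read_int; it hands out values from a fixed list in the order it is called.

```ocaml
let pwr x y =
  let rec pwrx = function 0 -> 1 | n -> x * pwrx (n - 1) in
  pwrx y

(* Hypothetical stand-in for read_int: each call returns the next value. *)
let make_reader values =
  let remaining = ref values in
  fun () ->
    match !remaining with
    | [] -> failwith "no more input"
    | v :: rest -> remaining := rest; v

let () =
  let next = make_reader [5; 4] in
  (* Nested lets fix the order: x is read first, then y. *)
  let x = next () in
  let y = next () in
  assert (pwr x y = 625) (* 5 to the power of 4 *)
```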

Haskell: Caches, memoization, and referential transparency [duplicate]

I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.
GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
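The same "inside or outside the lambda" distinction can be imitated in a strict language, which may make it easier to see. The following OCaml sketch is only an analogy (OCaml is not lazy, and the names and the builds counter are mine): a table built outside the function is shared by all calls, while a table built inside the function body is rebuilt on every call.

```ocaml
let builds = ref 0

(* The first 1000 odd numbers; the counter records each construction. *)
let build_table () =
  incr builds;
  Array.init 1000 (fun i -> (2 * i) + 1)

(* Like m1: the table sits outside the lambda and is built once. *)
let m1 = let table = build_table () in fun n -> table.(n)

(* Like m2: the table sits inside the lambda and is built on each call. *)
let m2 = fun n -> let table = build_table () in table.(n)

let () =
  let before = !builds in
  ignore (m1 10); ignore (m1 20);
  assert (!builds = before);
  ignore (m2 10); ignore (m2 20);
  assert (!builds = before + 2)
```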
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations as it is when compiled with -O2.
m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the GHC wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form
There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, where m1's is not.
I'm not sure, because I'm quite new to Haskell myself, but it appears that it's because the second function is parametrized and the first one is not. The nature of a function is that its result depends on the input value, and in the functional paradigm especially it depends ONLY on the input. The obvious implication is that a function with no parameters always returns the same value, over and over, no matter what.
Apparently there's an optimizing mechanism in the GHC compiler that exploits this fact to compute the value of such a function only once for the whole program runtime. It does it lazily, to be sure, but does it nonetheless. I noticed it myself when I wrote the following function:
primes = filter isPrime [2..]
    where isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
            where f `divides` n = (n `mod` f) == 0
Then to test it, I entered GHCI and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly in an instant I got the result for take 1000 primes, because Haskell had to compute the whole thousand-element list to return 1001st element before.
Thus if you can write your function such that it takes no parameters, you probably want it. ;)

Are there Mathematica packages for presenting proofs/derivations?

When I write out a proof or derivation on paper I frequently make sign errors or drop terms as I move from one step to the next. I'd like to use Mathematica to save myself from these silly mistakes. I don't want Mathematica to solve the expression, I just want to use it to carry out and display a series of algebraic manipulations. For a (trivial) example
In[111]:= MultBothSides[Equal[a_, b_], c_] := Equal[c a, c b];
In[112]:= expression = 2 a == a b
Out[112]= 2 a == a b
In[113]:= MultBothSides[expression, 1/a]
Out[113]= 2 == b
Can anyone point me to a package that would support this kind of manipulation?
Edit
Thanks for the input, not quite what I'm looking for though. The symbol manipulation isn't really the problem. I'm really looking for something that will make explicit the algebraic or mathematical justification of each step of a derivation. My goal here is really pedagogical.
Mathematica also provides a number of high-level functions for manipulating algebraic expressions. Among these are Expand, Apart and Together, and Cancel, though there are quite a few more.
Also, for your specific example of applying the same transformation to both sides of an equation (that is, an expression with the head Equal), you can use the Thread function, which works just like your MultBothSides function, but with a great deal more generality.
In[1]:= expression = 2 a == a b
Out[1]= 2 a == a b
In[2]:= Thread[expression / a, Equal]
Out[2]= 2 == b
In[3]:= Thread[expression - c, Equal]
Out[3]= 2 a - c == a b - c
In either of the presented solutions, it should be relatively easy to see what the step entailed. If you want something a little more explicit, you can write your own function like so:
In[4]:= ApplyToBothSides[f_, eq_Equal] := Map[f, eq]
In[5]:= ApplyToBothSides[4 * #&, expression]
Out[5]= 8 a == 4 a b
It's a generalization of your MultBothSides function that takes advantage of the fact that Map works on expressions with any head, not just head List. If you're trying to communicate with an audience that is unfamiliar with Mathematica, using these sorts of names can help you communicate more clearly. In a related vein, if you want to use replacement rules as suggested by Ira Baxter, it may be helpful to write out Replace or ReplaceAll instead of using the /. syntactic sugar.
In[6]:= ReplaceAll[expression, a -> (x + y)]
Out[6]= 2 (x + y) == b (x + y)
If you think it would be clearer to have the actual equation, instead of the variable name expression, in your input, and you're using the notebook interface, highlight the word expression with your mouse, call up the contextual menu, and select "Evaluate in Place".
The notebook interface is also a very pleasant environment for doing "literate programming", so you can also explain any steps that are not immediately obvious in words. I believe this is a good practice when writing mathematical proofs regardless of the medium.
I don't think you need a package. What you want to do is to manipulate each formula according to an inference rule. In MMa, you can model inference rules on a formula using transformations. So, if you have a formula f, you can apply an inference rule I by executing (my MMa syntax is 15 years rusty)
f /. I
to produce the next formula in your sequence.
MMa will of course try to simplify your formulas if they contain standard algebraic operators and terms, such as constant numbers and arithmetic operators. You can prevent MMa from applying its own "inference" rules by enclosing your formula in a Hold[...] form.

Resources