Is having only one argument functions efficient? Haskell - performance

I have started learning Haskell and I have read that every function in haskell takes only one argument and I can't understand what magic happens under the hood of Haskell that makes it possible and I am wondering if it is efficient.
Example
>:t (+)
(+) :: Num a => a -> a -> a
Signature above means that (+) function takes one Num then returns another function which takes one Num and returns a Num
Example 1 is relatively easy but I have started wondering what happens when functions are a little more complex.
My Questions
For sake of the example I have written a zipWith function and executed it in two ways, once passing one argument at the time and once passing all arguments.
zipwithCustom f (x:xs) (y:ys) = f x y : zipwithCustom f xs ys
zipwithCustom _ _ _ = []
zipWithAdd = zipwithCustom (+)
zipWithAddTo123 = zipWithAdd [1,2,3]
test1 = zipWithAddTo123 [1,1,1]
test2 = zipwithCustom (+) [1,2,3] [1,1,1]
>test1
[2,3,4]
>test2
[2,3,4]
Is passing one argument at the time (scenario_1) as efficient as passing all arguments at once (scenario_2)?
Are those scenarios any different in terms of what Haskell is actually doing to compute test1 and test2 (except the fact that scenario_1 probably takes more memory as it needs to save zipWithAdd and zipWithAdd123)
Is this correct and why? In scenario_1 I iterate over [1,2,3] and then over [1,1,1]
Is this correct and why? In scenario_1 and scenario_2 I iterate over both lists at the same time
I realise that I have asked a lot of questions in one post but I believe those are connected and will help me (and other people who are new to Haskell) to better understand what actually is happening in Haskell that makes both scenarios possible.

You ask about "Haskell", but Haskell the language specification doesn't care about these details. It is up to implementations to choose how evaluation happens -- the only thing the spec says is what the result of the evaluation should be, and carefully avoids giving an algorithm that must be used for computing that result. So in this answer I will talk about GHC, which, practically speaking, is the only extant implementation.
For (3) and (4) the answer is simple: the iteration pattern is exactly the same whether you apply zipWithCustom to arguments one at a time or all at once. (And that iteration pattern is to iterate over both lists at once.)
Unfortunately, the answer for (1) and (2) is complicated.
The starting point is the following simple algorithm:
When you apply a function to an argument, a closure is created (allocated and initialized). A closure is a data structure in memory, containing a pointer to the function and a pointer to the argument. When the function body is executed, any time its argument is mentioned, the value of that argument is looked up in the closure.
That's it.
However, this algorithm kind of sucks. It means that if you have a 7-argument function, you allocate 7 data structures, and when you use an argument, you may have to follow a 7-long chain of pointers to find it. Gross. So GHC does something slightly smarter. It uses the syntax of your program in a special way: if you apply a function to multiple arguments, it generates just one closure for that application, with as many fields as there are arguments.
(Well... that might be not quite true. Actually, it tracks the arity of every function -- defined again in a syntactic way as the number of arguments used to the left of the = sign when that function was defined. If you apply a function to more arguments than its arity, you might get multiple closures or something, I'm not sure.)
So that's pretty nice, and from that you might think that your test1 would then allocate one extra closure compared to test2. And you'd be right... when the optimizer isn't on.
But GHC also does lots of optimization stuff, and one of those is to notice "small" definitions and inline them. Almost certainly with optimizations turned on, your zipWithAdd and zipWithAddTo123 would both be inlined anywhere they were used, and we'd be back to the situation where just one closure gets allocated.
Hopefully this explanation gets you to where you can answer questions (1) and (2) yourself, but just in case it doesn't, here's explicit answers to those:
Is passing one argument at the time as efficient as passing all arguments at once?
Maybe. It's possible that passing arguments one at a time will be converted via inlining to passing all arguments at once, and then of course they will be identical. In the absence of this optimization, passing one argument at a time has a (very slight) performance penalty compared to passing all arguments at once.
Are those scenarios any different in terms of what Haskell is actually doing to compute test1 and test2?
test1 and test2 will almost certainly be compiled to the same code -- possibly even to the point that only one of them is compiled and the other is an alias for it.
If you want to read more about the ideas in the implementation, the Spineless Tagless G-machine paper is much more approachable than its title suggests, and only a little bit out of date.

Related

Recursive function call hanging, Erlang

I am currently teaching my self Erlang. Everything is going well until I found a problem with this function.
-module(chapter).
-compile(export_all).
list_length([]) -> 0;
list_length([_|Xs]) -> 1+list_length([Xs]).
This was taken out of an textbook. When I run this code using OTP 17, it just hangs, meaning it just sits as shown below.
1> c(chapter).
{ok,chapter}
2> chapter:list_length([]).
0
3> chapter:list_length([1,2]).
When looking in the task manager the Erlang OTP is using 200 Mb to 330 Mb of memory. What causes this.
It is not terminating because you are creating a new non-empty list in every case: [Anything] is always a non-empty list, even if that list contains an empty list as its only member ([[]] is a non-empty list of one member).
A proper list terminates like this: [ Something | [] ].
So with that in mind...
list_length([]) -> 0;
list_length([_|Xs]) -> 1 + list_length(Xs).
In most functional languages "proper lists" are cons-style lists. Check out the Wikipedia entry on "cons" and the Erlang documentation about lists, and then meditate on examples of list operations you see written in example code for a bit.
NOTES
Putting whitespace around operators is a Good Thing; it will prevent you from doing confused stuff with arrows and binary syntax operators next to each other along with avoiding a few other ambiguities (and its easier to read anyway).
As Steve points out, the memory explosion you noticed is because while your function is recursive, it is not tail recursive -- that is, 1 + list_length(Xs) leaves pending work to be done, which must leave a reference on the stack. To have anything to add 1 to it must complete execution of list_length/1, return a value, and in this case remember that pending value as many times as there are members in the list. Read Steve's answer for an example of how tail recursive functions can be written using an accumulator value.
Since the OP is learning Erlang, note also that the list_length/1 function isn't amenable to tail call optimization because of its addition operation, which requires the runtime to call the function recursively, take its return value, add 1 to it, and return the result. This requires stack space, which means that if the list is long enough, you can run out of stack.
Consider this approach instead:
list_length(L) -> list_length(L, 0).
list_length([], Acc) -> Acc;
list_length([_|Xs], Acc) -> list_length(Xs, Acc+1).
This approach, which is very common in Erlang code, creates an accumulator in list_length/1 to hold the length value, initializing it to 0 and passing it to list_length/2, which does the recursion. Each call of list_length/2 then increments the accumulator, and when the list is empty, the first clause of list_length/2 returns the accumulator as the result. But note that the addition operation here occurs before the recursive call takes place, which means the calls are true tail calls and thus do not require extra stack space.
For non-beginner Erlang programmers, it can be instructive to compile both the original and modified versions of this module with erlc -S and examine the generated Erlang assembler. With the original version, the assembler contains allocate calls for stack space and uses call for recursive invocation, where call is the instruction for normal function calls. But for this modified version, no allocate calls are generated, and instead of using call it performs recursion using call_only, which is optimized for tail calls.

About speed of procedures between user-made and built-in in scheme (related with SICP exercise 1.23)

//My question was so long So I reduced.
In scheme, user-made procedures consume more time than built-in procedures?
(If both's functions are same)
//This is my short version question.
//Below is long long version question.
EX 1.23 is problem(below), why the (next) procedure isn't twice faster than (+ 1)?
This is my guess.
reason 1 : (next) contains 'if' (special-form) and it consumes time.
reason 2 : function call consumes time.
http://community.schemewiki.org/?sicp-ex-1.23 says reason 1 is right.
And I want to know reason 2 is also right.
SO I rewrote the (next) procedure. I didn't use 'if' and checked the number divided by 2 just once before use (next)(so (next) procedure only do + 2). And I remeasured the time. It was more fast than before BUT still not 2. SO I rewrote again. I changed (next) to (+ num 2). Finally It became 2 or almost 2. And I thought why. This is why I guess the 'reason 2'. I want to know what is correct answer.
ps. I'm also curious about why some primes are being tested (very?) fast than others? It doesn't make sense because if a number n is prime, process should see from 2 to sqrt(n). But some numbers are tested faster. Do you know why some primes are tested faster?
Exercise 1.23. The smallest-divisor procedure shown at the start of this section does lots of needless testing: After it checks to see if the number is divisible by 2 there is no point in checking to see if it is divisible by any larger even numbers. This suggests that the values used for test-divisor should not be 2, 3, 4, 5, 6, ..., but rather 2, 3, 5, 7, 9, .... To implement this change, define a procedure next that returns 3 if its input is equal to 2 and otherwise returns its input plus 2. Modify the smallest-divisor procedure to use (next test-divisor) instead of (+ test-divisor 1). With timed-prime-test incorporating this modified version of smallest-divisor, run the test for each of the 12 primes found in exercise 1.22. Since this modification halves the number of test steps, you should expect it to run about twice as fast. Is this expectation confirmed? If not, what is the observed ratio of the speeds of the two algorithms, and how do you explain the fact that it is different from 2?
Where the book is :
http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-11.html#%_sec_1.2.6
Your long ans short questions are actually addressing two different problems.
EX 1.23 is problem(below), why the (next) procedure isn't twice faster than (+ 1)?
You provide two possible reasons to explain the relative lack of speed up (and 30% speed gain is already a good achievement):
The apparent use of a function next instead of a simple arithmetic expression (as I understand it from your explanations),
the inclusion of a test in that very next function,
The first reason is an illusion: (+ 1) is a function incrementing its argument, so in both cases there is a function call (although the increment function is certainly a builtin one, a point which is addressed by your other question).
The second reason is indeed relevant: any test in a block of code will introduce a potential redirection in the code flow (a jump from the current executing instruction to some other address in the program), which may incur a delay. Note that this is analogous to a function call, which will also induce an address jump, so both reasons actually resolve to only one potential cause.
Regarding your short question, builtin functions are indeed usually faster, because the compiler is able to apply a special treatment to them in certain cases. This is due to the facts that:
knowing the semantics of the builtins, compiler designers are able to include special rules pertaining to the algebraic properties of these builtins, and for instance fuse successive incréments in a single call, or suppress a combination of increment and decrement called in sequence.
A builtin function call, when not optimised away, will be converted into a native machine code function call, which may not have to abide to all the calling convention rules. If your scheme compiler produce machine code from the source, then there might be only a marginal gain, but if it produce so called bytecode, the gain might be quite substantial, since user written functions will be translated to that bytecode format, and still require some form of interpretation. If you are only using an interpreter, then the gain is even more important.
I believe this is highly implementation and setting dependent. In many implementations there are different kinds of optimizations and in some there are none. To get the best performance you may need to compile your code or use settings to reduce debug information / stack traces. Getting the best performance in one implementation can worsen the performance in another.
Primitive procedures are usually compiled to be native and in some implementations, like ikarus, it's even inlined. (When you do (map car lst) ikarus changes it to (map (lambda (x) (car x)) lst) since car isn't a procedure) Lambdas are supposed to be cheap.. Remember many scheme implementations change your code to CPS and that is one procedure call for each expression in the body of a procedure call. It will never be as fast as machine code since it needs to do load closure variables.
To check which of the two options which are correct for your implementation make next do the same as it originally did. eg. no if but just increment the argument. The difference now is the extra call and nothing else. Then you can inline next by writing it's code directly in your procedure and substituting arguments for the operands. Is it still slower, then it's if. you need to run the tests several times, preferably with large enough number of primes to produce that it runs for a minute or so. Use time or similar in both the Scheme implementations to get the differences in ms. I use unix time command as well to see how the OS reflects on it.
You should also test to see if you get the same reason in some other implementation. It's not like it's not enough Scheme implementations out there so know yourself out! The differences between them might amaze you. I always use racket (raco exe source to make executable) and Ikarus. WHen doing a large test I include Chicken, Gambit and Chibi.

How to recognize variables that don't affect the output of a program?

Sometimes the value of a variable accessed within the control-flow of a program cannot possibly have any effect on a its output. For example:
global var_1
global var_2
start program hello(var_3, var_4)
if (var_2 < 0) then
save-log-to-disk (var_1, var_3, var_4)
end-if
return ("Hello " + var_3 + ", my name is " + var_1)
end program
Here only var_1 and var_3 have any influence on the output, while var_2 and var_4 are only used for side effects.
Do variables such as var_1 and var_3 have a name in dataflow-theory/compiler-theory?
Which static dataflow analysis techniques can be used to discover them?
References to academic literature on the subject would be particularly appreciated.
The problem that you stated is undecidable in general,
even for the following very narrow special case:
Given a single routine P(x), where x is a parameter of type integer. Is the output of P(x) independent of the value of x, i.e., does
P(0) = P(1) = P(2) = ...?
We can reduce the following still undecidable version of the halting problem to the question above: Given a Turing machine M(), does the program
never stop on the empty input?
I assume that we use a (Turing-complete) language in which we can build a "Turing machine simulator":
Given the program M(), construct this routine:
P(x):
if x == 0:
return 0
Run M() for x steps
if M() has terminated then:
return 1
else:
return 0
Now:
P(0) = P(1) = P(2) = ...
=>
M() does not terminate.
M() does terminate
=> P(x) = 1 for a sufficiently large x
=> P(x) != P(0) = 0
So, it is very difficult for a compiler to decide whether a variable actually does not influence the return value of a routine; in your example, the "side effect routine" might manipulate one of its values (or even loop infinitely, which would most definitely change the return value of the routine ;-)
Of course overapproximations are still possible. For example, one might conclude that a variable does not influence the return value if it does not appear in the routine body at all. You can also see some classical compiler analyses (like Expression Simplification, Constant propagation) having the side effect of eliminating appearances of such redundant variables.
Pachelbel has discussed the fact that you cannot do this perfectly. OK, I'm an engineer, I'm willing to accept some dirt in my answer.
The classic way to answer you question is to do dataflow tracing from program outputs back to program inputs. A dataflow is the connection of a program assignment (or sideeffect) to a variable value, to a place in the application that consumes that value.
If there is (transitive) dataflow from a program output that you care about (in your example, the printed text stream) to an input you supplied (var2), then that input "affects" the output. A variable that does not flow from the input to your desired output is useless from your point of view.
If you focus your attention only the computations involved in the dataflows, and display them, you get what is generally called a "program slice" . There are (very few) commercial tools that can show this to you.
Grammatech has a good reputation here for C and C++.
There are standard compiler algorithms for constructing such dataflow graphs; see any competent compiler book.
They all suffer from some limitation due to Turing's impossibility proofs as pointed out by Pachelbel. When you implement such a dataflow algorithm, there will be places that it cannot know the right answer; simply pick one.
If your algorithm chooses to answer "there is no dataflow" in certain places where it is not sure, then it may miss a valid dataflow and it might report that a variable does not affect the answer incorrectly. (This is called a "false negative"). This occasional error may be satisfactory if
the algorithm has some other nice properties, e.g, it runs really fast on a millions of code. (The trivial algorithm simply says "no dataflow" in all places, and it is really fast :)
If your algorithm chooses to answer "yes there is a dataflow", then it may claim that some variable affects the answer when it does not. (This is called a "false positive").
You get to decide which is more important; many people prefer false positives when looking for a problem, because then you have to at least look at possibilities detected by the tool. A false negative means it didn't report something you might care about. YMMV.
Here's a starting reference: http://en.wikipedia.org/wiki/Data-flow_analysis
Any of the books on that page will be pretty good. I have Muchnick's book and like it lot. See also this page: (http://en.wikipedia.org/wiki/Program_slicing)
You will discover that implementing this is pretty big effort, for any real langauge. You are probably better off finding a tool framework that does most or all this for you already.
I use the following algorithm: a variable is used if it is a parameter or it occurs anywhere in an expression, excluding as the LHS of an assignment. First, count the number of uses of all variables. Delete unused variables and assignments to unused variables. Repeat until no variables are deleted.
This algorithm only implements a subset of the OP's requirement, it is horribly inefficient because it requires multiple passes. A garbage collection may be faster but is harder to write: my algorithm only requires a list of variables with usage counts. Each pass is linear in the size of the program. The algorithm effectively does a limited kind of dataflow analysis by elimination of the tail of a flow ending in an assignment.
For my language the elimination of side effects in the RHS of an assignment to an unused variable is mandated by the language specification, it may not be suitable for other languages. Effectiveness is improved by running before inlining to reduce the cost of inlining unused function applications, then running it again afterwards which eliminates parameters of inlined functions.
Just as an example of the utility of the language specification, the library constructs a thread pool and assigns a pointer to it to a global variable. If the thread pool is not used, the assignment is deleted, and hence the construction of the thread pool elided.
IMHO compiler optimisations are almost invariably heuristics whose performance matters more than effectiveness achieving a theoretical goal (like removing unused variables). Simple reductions are useful not only because they're fast and easy to write, but because a programmer using a language who understand basics of the compiler operation can leverage this knowledge to help the compiler. The most well known example of this is probably the refactoring of recursive functions to place the recursion in tail position: a pointless exercise unless the programmer knows the compiler can do tail-recursion optimisation.

most readable way in XPath to write "is value X a member of sequence S"?

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of theset don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
exists(index-of(X, S)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="#type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="#type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) subset of relation between two sets of values, that are represented as sequences. This implementation of subset of has just linear time complexity -- vs many other ways of expressing this, that have O(N^2)) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
Marklogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.

When to use various language pragmas and optimisations?

I have a fair bit of understanding of haskell but I am always little unsure about what kind of pragmas and optimizations I should use and where. Like
Like when to use SPECIALIZE pragma and what performance gains it has.
Where to use RULES. I hear people taking about a particular rule not firing? How do we check that?
When to make arguments of a function strict and when does that help? I understand that making argument strict will make the arguments to be evaluated to normal form, then why should I not add strictness to all function arguments? How do I decide?
How do I see and check I have a space leak in my program? What are the general patterns which constitute to a space leak?
How do I see if there is a problem with too much lazyness? I can always check the heap profiling but I want to know what are the general cause, examples and patterns where lazyness hurts?
Is there any source which talks about advanced optimizations (both at higher and very low levels) especially particular to haskell?
Like when to use SPECIALIZE pragma and what performance gains it has.
You let the compiler specialise a function if you have a (type class) polymorphic function, and expect it to be called often at one or a few instances of the class(es).
The specialisation removes the dictionary lookup where it is used, and often enables further optimisation, the class member functions can often be inlined then, and they are subject to strictness analysis, both give potentially huge performance gains. If the only optimisation possible is the elimination of the dicitonary lookup, the gain won't generally be huge.
As of GHC-7, it's probably more useful to give the function an {-# INLINABLE #-} pragma, which makes its (nearly unchanged, some normalising and desugaring is performed) source available in the interface file, so the function can be specialised and possibly even inlined at the call site.
Where to use RULES. I hear people taking about a particular rule not firing? How do we check that?
You can check which rules have fired by using the -ddump-rule-firings command line option. That usually dumps a large number of fired rules, so you have to search a bit for your own rules.
You use rules
when you have a more efficient version of a function for special types, e.g.
{-# RULES
"realToFrac/Float->Double" realToFrac = float2Double
#-}
when some functions can be replaced with a more efficient version for special arguments, e.g.
{-# RULES
"^2/Int" forall x. x ^ (2 :: Int) = let u = x in u*u
"^3/Int" forall x. x ^ (3 :: Int) = let u = x in u*u*u
"^4/Int" forall x. x ^ (4 :: Int) = let u = x in u*u*u*u
"^5/Int" forall x. x ^ (5 :: Int) = let u = x in u*u*u*u*u
"^2/Integer" forall x. x ^ (2 :: Integer) = let u = x in u*u
"^3/Integer" forall x. x ^ (3 :: Integer) = let u = x in u*u*u
"^4/Integer" forall x. x ^ (4 :: Integer) = let u = x in u*u*u*u
"^5/Integer" forall x. x ^ (5 :: Integer) = let u = x in u*u*u*u*u
#-}
when rewriting an expression according to general laws might produce code that's better to optimise, e.g.
{-# RULES
"map/map" forall f g. (map f) . (map g) = map (f . g)
#-}
Extensive use of RULES in the latter style is made in fusion frameworks, for example in the text library, and for the list functions in base, a different kind of fusion (foldr/build fusion) is implemented using rules.
When to make arguments of a function strict and when does that help? I understand that making argument strict will make the arguments to be evaluated to normal form, then why should I not add strictness to all function arguments? How do I decide?
Making an argument strict will ensure that it is evaluated to weak head normal form, not to normal form.
You do not make all arguments strict because some functions must be non-strict in some of their arguments to work at all and some are less efficient if strict in all arguments.
For example partition must be non-strict in its second argument to work at all on infinite lists, more general every function used in foldr must be non-strict in the second argument to work on infinite lists. On finite lists, having the function non-strict in the second argument can make it dramatically more efficient (foldr (&&) True (False:replicate (10^9) True)).
You make an argument strict, if you know that the argument must be evaluated before any worthwhile work can be done anyway. In many cases, the strictness analyser of GHC can do that on its own, but of course not in all.
A very typical case are accumulators in loops or tail recursions, where adding strictness prevents the building of huge thunks on the way.
I know no hard-and-fast rules for where to add strictness, for me it's a matter of experience, after a while you learn in what places adding strictness is likely to help and where to harm.
As a rule of thumb, it makes sense to keep small data (like Int) evaluated, but there are exceptions.
How do I see and check I have a space leak in my program? What are the general patterns which constitute to a space leak?
The first step is to use the +RTS -s option (if the programme was linked with rtsopts enabled). That shows you how much memory was used overall, and you can often judge by that whether you have a leak.
A more informative output can be obtained from running the programme with the +RTS -hT option, that produces a heap profile that can help locating the space leak (also, the programme needs to be linked with enabled rtsopts).
If further analysis is required, the programme needs to be compiled with profiling enabled (-rtsops -prof -fprof-auto, in older GHCs, the -fprof-auto option wasn't available, the -prof-auto-all option is the closest correspondence there).
Then you run it with various profiling options and look at the generated heap profiles.
The two most common causes for space leaks are
too much laziness
too much strictness
the third place is probably taken by unwanted sharing, GHC does little common subexpression elimination, but it occasionally shares long lists even where not wanted.
For finding the cause of a leak, I know again no hard-and-fast rules, and occasionally, a leak can be fixed by adding strictness in one place or by adding laziness in another.
How do I see if there is a problem with too much lazyness? I can always check the heap profiling but I want to know what are the general cause, examples and patterns where lazyness hurts?
Generally, laziness is wanted where results can be built up incrementally, and unwanted where no part of the result can be delivered before processing is complete, like in left folds or generally in tail-recursive functions.
I recommend reading the GHC documentation on Pragmas and Rewrite Rules, as they address many of your questions about SPECIALIZE and RULES.
To briefly address your questions:
SPECIALIZE is used to force the compiler to build a specialized version of a polymorphic function for a particular type. The advantage is that applying the function in that case will no longer require the dictionary. The disadvantage is that it will increase the size of your program. Specialization is particularly valuable for functions called in "inner-loops", and it's essentially useless for infrequently called top-level functions. Refer to the GHC documentation for interactions with INLINE.
RULES allows you to specify rewrite rules that you know to be valid but the compiler couldn't infer on its own. The common example is {-# RULES "mapfusion" forall f g xs. map f (map g xs) = map (f.g) xs #-}, which tells GHC how to fuse map. It can be finicky to get GHC to use the rules because of interference with INLINE. 7.19.3 touches on how to avoid conflicts and also how to force GHC to use a rule even when it would normally avoid it.
Strict arguments are most vital for something like an accumulator in a tail-recursive function. You know that the value will ultimately be fully calculated, and building up a stack of closures to delay the computation completely defeats the purpose. Enforced strictness must naturally be avoided anytime the function may be applied to a value which must be processed lazily, like an infinite list. Generally, the best idea is to initially only force strictness where it's obviously useful (like accumulators), and then add more later only as profiling shows it's needed.
My experience has been that most show-stopping space leaks came from lazy accumulators and unevaluated lazy values in very large data-structures, although I'm sure this is specific to the kinds of programs you're writing. Using unboxed data-structures whenever possible fixes a lot of the problems.
Outside of the instances where laziness causes space-leaks, the major situation where it should be avoided is in IO. Lazily processing resource inherently increases the amount of wall-clock time that the resource is needed. This can be bad for cache performance, and it's obviously bad if something else wants exclusive rights to use the same resource.

Resources