I've tried to search, but was unable to find: what are the prerequisites a function must meet for gcc to optimize its tail recursion? Is there any reference or list covering the most important cases? Since this is version specific, I'm interested in version 4.6.3 or higher (the newer the better). That said, any concrete reference would be greatly appreciated.
Thanks in advance!
With -O2 enabled, gcc will perform tail call optimization, if possible. Now, the question is: When is it possible?
It is possible to eliminate a single recursive call whenever the recursive call happens either immediately before or inside the (single) return statement, so there are no observable side effects afterwards other than possibly returning a different value. This includes returning expressions involving the ternary operator.
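For illustration, here is a minimal C sketch of both cases (gcc's exact behaviour varies by version, so compile with -O2 and inspect the assembly to verify):

unsigned gcd(unsigned a, unsigned b) {
    if (b == 0)
        return a;
    return gcd(b, a % b);        /* plain tail call: nothing runs after it */
}

unsigned factorial(unsigned n) {
    if (n <= 1)
        return 1;
    return n * factorial(n - 1); /* call inside the return expression; gcc
                                    can still eliminate it by introducing
                                    an accumulator for the multiplication */
}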
Note that even if there are several recursive calls in the return expression (such as in the well-known fibonacci example: return (n <= 1) ? n : fib(n-1) + fib(n-2);), tail call optimization is still applied, but only ever to the last recursive call.
As an obvious special case, if the compiler can prove that the recursive function (and its arguments) qualifies as a constant expression, it can (up to a configurable maximum recursion depth, by default 500) eliminate not only the tail call, but the entire execution of the function. This is something GCC does routinely and quite reliably at -O2; it even includes calls to certain library functions, such as strlen on literals.
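As a hedged example (whether the folding actually happens depends on the gcc version and its inlining parameters):

#include <string.h>

static int fact(int n) {
    return (n <= 1) ? 1 : n * fact(n - 1);
}

int compile_time_demo(void) {
    /* With -O2, gcc typically folds this whole expression to the constant
       125 at compile time: fact(5) becomes 120 and strlen("hello")
       becomes 5, leaving no calls in the generated code. */
    return fact(5) + (int)strlen("hello");
}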
I have started learning Haskell, and I have read that every function in Haskell takes only one argument. I can't understand what magic happens under the hood of Haskell that makes this possible, and I am wondering whether it is efficient.
Example
>:t (+)
(+) :: Num a => a -> a -> a
The signature above means that (+) takes one Num and returns another function, which takes one Num and returns a Num.
The example above is relatively easy, but I have started wondering what happens when functions are a little more complex.
My Questions
For the sake of the example I have written a zipWith function and executed it in two ways: once passing one argument at a time and once passing all arguments at once.
zipwithCustom f (x:xs) (y:ys) = f x y : zipwithCustom f xs ys
zipwithCustom _ _ _ = []
zipWithAdd = zipwithCustom (+)
zipWithAddTo123 = zipWithAdd [1,2,3]
test1 = zipWithAddTo123 [1,1,1]
test2 = zipwithCustom (+) [1,2,3] [1,1,1]
>test1
[2,3,4]
>test2
[2,3,4]
Is passing one argument at a time (scenario_1) as efficient as passing all arguments at once (scenario_2)?
Are those scenarios any different in terms of what Haskell is actually doing to compute test1 and test2 (except that scenario_1 probably takes more memory, as it needs to save zipWithAdd and zipWithAddTo123)?
Is this correct and why? In scenario_1 I iterate over [1,2,3] and then over [1,1,1]
Is this correct and why? In scenario_1 and scenario_2 I iterate over both lists at the same time
I realise that I have asked a lot of questions in one post, but I believe they are connected and will help me (and other people who are new to Haskell) better understand what is actually happening in Haskell that makes both scenarios possible.
You ask about "Haskell", but Haskell the language specification doesn't care about these details. It is up to implementations to choose how evaluation happens -- the only thing the spec says is what the result of the evaluation should be, and carefully avoids giving an algorithm that must be used for computing that result. So in this answer I will talk about GHC, which, practically speaking, is the only extant implementation.
For (3) and (4) the answer is simple: the iteration pattern is exactly the same whether you apply zipWithCustom to arguments one at a time or all at once. (And that iteration pattern is to iterate over both lists at once.)
Unfortunately, the answer for (1) and (2) is complicated.
The starting point is the following simple algorithm:
When you apply a function to an argument, a closure is created (allocated and initialized). A closure is a data structure in memory, containing a pointer to the function and a pointer to the argument. When the function body is executed, any time its argument is mentioned, the value of that argument is looked up in the closure.
That's it.
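To make that concrete, here is a conceptual sketch in C of the naive one-closure-per-argument scheme just described. GHC's real heap objects are more elaborate, and every name below is invented for illustration:

#include <stdio.h>
#include <stdlib.h>

typedef struct Closure Closure;
struct Closure {
    int (*code)(Closure *self, int next_arg); /* pointer to the function */
    int captured;                             /* the argument saved so far
                                                 (a plain int here, rather
                                                 than a pointer, for brevity) */
};

static int add_code(Closure *self, int y) {
    return self->captured + y;  /* the earlier argument is looked up
                                   in the closure */
}

static Closure *apply_add(int x) {
    Closure *c = malloc(sizeof *c); /* one allocation per application */
    c->code = add_code;
    c->captured = x;
    return c;
}

int main(void) {
    Closure *add3 = apply_add(3);        /* partial application: "add 3" */
    printf("%d\n", add3->code(add3, 4)); /* applying it to 4 prints 7 */
    free(add3);
    return 0;
}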
However, this algorithm kind of sucks. It means that if you have a 7-argument function, you allocate 7 data structures, and when you use an argument, you may have to follow a 7-long chain of pointers to find it. Gross. So GHC does something slightly smarter. It uses the syntax of your program in a special way: if you apply a function to multiple arguments, it generates just one closure for that application, with as many fields as there are arguments.
(Well... that might not be quite true. Actually, it tracks the arity of every function -- defined, again syntactically, as the number of arguments used to the left of the = sign when that function was defined. If you apply a function to more arguments than its arity, you might get multiple closures or something, I'm not sure.)
So that's pretty nice, and from that you might think that your test1 would then allocate one extra closure compared to test2. And you'd be right... when the optimizer isn't on.
But GHC also does lots of optimization stuff, and one of those is to notice "small" definitions and inline them. Almost certainly with optimizations turned on, your zipWithAdd and zipWithAddTo123 would both be inlined anywhere they were used, and we'd be back to the situation where just one closure gets allocated.
Hopefully this explanation gets you to where you can answer questions (1) and (2) yourself, but just in case it doesn't, here are explicit answers:
Is passing one argument at a time as efficient as passing all arguments at once?
Maybe. It's possible that passing arguments one at a time will be converted via inlining to passing all arguments at once, and then of course they will be identical. In the absence of this optimization, passing one argument at a time has a (very slight) performance penalty compared to passing all arguments at once.
Are those scenarios any different in terms of what Haskell is actually doing to compute test1 and test2?
test1 and test2 will almost certainly be compiled to the same code -- possibly even to the point that only one of them is compiled and the other is an alias for it.
If you want to read more about the ideas in the implementation, the Spineless Tagless G-machine paper is much more approachable than its title suggests, and only a little bit out of date.
I think that I understand the textbook definition of a tail recursive function: a function that doesn't perform any calculations after the function call. I also get that, as a consequence, a tail recursive function will be more memory efficient, because it needs only one activation record for every call, instead of keeping a record for each call (as in normal recursion).
What is less clear to me is how this definition applies to nested calls. I'll provide an example:
func foo91(x int)
    if (x > 100):
        return x - 10
    else:
        return foo91(foo91(x + 11))
The answer I originally came up with was that it is not tail recursive by definition, because the external call is performed after evaluating the internal one, so other calculations are done after the first call; by that reasoning, functions with nested recursive calls are generally not tail recursive. On the other hand, it seems the same in practice, as it shares the benefits of a tail recursive function: it seems to me that a single activation record is needed for the whole function. Is that true?
Are nested recursive function calls generally considered tail recursive?
Your understanding is only partially correct.
You defined tail recursion correctly. But whether it is actually efficient is implementation dependent. In some languages, like Scheme, it is. But most languages are like JavaScript, where it is not. In particular, one of the key tradeoffs is that making tail recursion efficient means you lose stack backtraces.
Now to your actual question: the inner call is not tail recursive, because control has to come back and do something else with its result (namely, perform the outer call). But the outer call is tail recursive.
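Written out in C to make the two calls explicit (this is the same 91 function from the question):

int foo91(int x) {
    if (x > 100)
        return x - 10;
    int inner = foo91(x + 11); /* inner call: NOT a tail call, because its
                                  result is still needed below */
    return foo91(inner);       /* outer call: a true tail call, since
                                  nothing runs after it returns */
}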
I am currently teaching myself Erlang. Everything was going well until I found a problem with this function.
-module(chapter).
-compile(export_all).
list_length([]) -> 0;
list_length([_|Xs]) -> 1+list_length([Xs]).
This was taken out of a textbook. When I run this code using OTP 17, it just hangs, meaning it just sits as shown below.
1> c(chapter).
{ok,chapter}
2> chapter:list_length([]).
0
3> chapter:list_length([1,2]).
When looking in the task manager, the Erlang OTP process is using 200 MB to 330 MB of memory. What causes this?
It is not terminating because you are creating a new non-empty list in every case: [Anything] is always a non-empty list, even if that list contains an empty list as its only member ([[]] is a non-empty list of one member).
A proper list terminates like this: [ Something | [] ].
So with that in mind...
list_length([]) -> 0;
list_length([_|Xs]) -> 1 + list_length(Xs).
In most functional languages "proper lists" are cons-style lists. Check out the Wikipedia entry on "cons" and the Erlang documentation about lists, and then meditate on examples of list operations you see written in example code for a bit.
NOTES
Putting whitespace around operators is a Good Thing; it will prevent you from doing confused stuff with arrows and binary syntax operators next to each other, along with avoiding a few other ambiguities (and it's easier to read anyway).
As Steve points out, the memory explosion you noticed is because while your function is recursive, it is not tail recursive -- that is, 1 + list_length(Xs) leaves pending work to be done, which must leave a reference on the stack. To have anything to add 1 to, it must complete execution of list_length/1 and return a value, and in this case it must remember that pending addition as many times as there are members in the list. Read Steve's answer for an example of how tail recursive functions can be written using an accumulator value.
Since the OP is learning Erlang, note also that the list_length/1 function isn't amenable to tail call optimization because of its addition operation, which requires the runtime to call the function recursively, take its return value, add 1 to it, and return the result. This requires stack space, which means that if the list is long enough, you can run out of stack.
Consider this approach instead:
list_length(L) -> list_length(L, 0).
list_length([], Acc) -> Acc;
list_length([_|Xs], Acc) -> list_length(Xs, Acc+1).
This approach, which is very common in Erlang code, creates an accumulator in list_length/1 to hold the length value, initializing it to 0 and passing it to list_length/2, which does the recursion. Each call of list_length/2 then increments the accumulator, and when the list is empty, the first clause of list_length/2 returns the accumulator as the result. But note that the addition operation here occurs before the recursive call takes place, which means the calls are true tail calls and thus do not require extra stack space.
For non-beginner Erlang programmers, it can be instructive to compile both the original and modified versions of this module with erlc -S and examine the generated Erlang assembler. With the original version, the assembler contains allocate calls for stack space and uses call for recursive invocation, where call is the instruction for normal function calls. But for this modified version, no allocate calls are generated, and instead of using call it performs recursion using call_only, which is optimized for tail calls.
Sometimes the value of a variable accessed within the control flow of a program cannot possibly have any effect on its output. For example:
global var_1
global var_2

start program hello(var_3, var_4)
    if (var_2 < 0) then
        save-log-to-disk (var_1, var_3, var_4)
    end-if
    return ("Hello " + var_3 + ", my name is " + var_1)
end program
Here only var_1 and var_3 have any influence on the output, while var_2 and var_4 are only used for side effects.
Do variables such as var_1 and var_3 have a name in dataflow-theory/compiler-theory?
Which static dataflow analysis techniques can be used to discover them?
References to academic literature on the subject would be particularly appreciated.
The problem that you stated is undecidable in general, even for the following very narrow special case:
Given a single routine P(x), where x is a parameter of type integer. Is the output of P(x) independent of the value of x, i.e., does
P(0) = P(1) = P(2) = ...?
We can reduce the following (still undecidable) version of the halting problem to the question above: given a Turing machine M(), does M() never stop on the empty input?
I assume that we use a (Turing-complete) language in which we can build a "Turing machine simulator":
Given the program M(), construct this routine:
P(x):
    if x == 0:
        return 0
    Run M() for x steps
    if M() has terminated then:
        return 1
    else:
        return 0
Now:

P(0) = P(1) = P(2) = ...  =>  M() does not terminate

and, conversely:

M() does terminate  =>  P(x) = 1 for a sufficiently large x
                    =>  P(x) != P(0) = 0
So it is very difficult for a compiler to decide whether a variable actually does not influence the return value of a routine; in your example, the "side effect routine" might manipulate one of its values (or even loop infinitely, which would most definitely change the return value of the routine) ;-)
Of course overapproximations are still possible. For example, one might conclude that a variable does not influence the return value if it does not appear in the routine body at all. You can also see some classical compiler analyses (like expression simplification and constant propagation) as having the side effect of eliminating appearances of such redundant variables.
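For instance, in this illustrative C sketch, y can be recognised as irrelevant because it never appears in the body at all, and z falls out as a by-product of constant propagation:

int f(int x, int y) {
    int z = 0;
    /* constant propagation rewrites x + z to x, after which z and the
       assignment to it can be deleted; y is never mentioned anywhere */
    return x + z;
}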
Pachelbel has discussed the fact that you cannot do this perfectly. OK, I'm an engineer, I'm willing to accept some dirt in my answer.
The classic way to answer your question is to do dataflow tracing from program outputs back to program inputs. A dataflow is the connection from a program assignment (or side effect) that sets a variable's value to a place in the application that consumes that value.
If there is (transitive) dataflow from a program output that you care about (in your example, the printed text stream) to an input you supplied (var_2), then that input "affects" the output. A variable that does not flow to your desired output is useless from your point of view.
If you focus your attention on only the computations involved in the dataflows, and display them, you get what is generally called a "program slice". There are (very few) commercial tools that can show this to you.
Grammatech has a good reputation here for C and C++.
There are standard compiler algorithms for constructing such dataflow graphs; see any competent compiler book.
They all suffer from some limitation, due to Turing's impossibility proofs as pointed out by Pachelbel. When you implement such a dataflow algorithm, there will be places where it cannot know the right answer; it simply has to pick one.
If your algorithm chooses to answer "there is no dataflow" in certain places where it is not sure, then it may miss a valid dataflow and it might report that a variable does not affect the answer incorrectly. (This is called a "false negative"). This occasional error may be satisfactory if
the algorithm has some other nice properties, e.g, it runs really fast on a millions of code. (The trivial algorithm simply says "no dataflow" in all places, and it is really fast :)
If your algorithm chooses to answer "yes there is a dataflow", then it may claim that some variable affects the answer when it does not. (This is called a "false positive").
You get to decide which is more important; many people prefer false positives when looking for a problem, because then you have to at least look at possibilities detected by the tool. A false negative means it didn't report something you might care about. YMMV.
Here's a starting reference: http://en.wikipedia.org/wiki/Data-flow_analysis
Any of the books on that page will be pretty good. I have Muchnick's book and like it a lot. See also this page: http://en.wikipedia.org/wiki/Program_slicing
You will discover that implementing this is a pretty big effort for any real language. You are probably better off finding a tool framework that does most or all of this for you already.
I use the following algorithm: a variable is used if it is a parameter or it occurs anywhere in an expression, excluding as the LHS of an assignment. First, count the number of uses of all variables. Delete unused variables and assignments to unused variables. Repeat until no variables are deleted.
This algorithm only implements a subset of the OP's requirement, and it is horribly inefficient because it requires multiple passes. A garbage collection may be faster but is harder to write: my algorithm only requires a list of variables with usage counts. Each pass is linear in the size of the program. The algorithm effectively does a limited kind of dataflow analysis by eliminating the tail of a flow ending in an assignment.
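Here is a minimal sketch of that pass in C, assuming statements have already been reduced to (assigned variable, variables read) pairs; the representation and all names are invented for illustration:

#include <stdbool.h>
#include <string.h>

#define MAX_VARS  64
#define MAX_READS  8

typedef struct {
    int  lhs;              /* index of the assigned variable, or -1 */
    int  reads[MAX_READS]; /* variables read on the RHS */
    int  nreads;
    bool deleted;
} Stmt;

void eliminate_unused(Stmt *stmts, int nstmts, const bool *is_param) {
    bool changed = true;
    while (changed) {                       /* repeat until nothing is deleted */
        changed = false;
        int uses[MAX_VARS];
        memset(uses, 0, sizeof uses);
        for (int i = 0; i < nstmts; i++)    /* count uses: RHS occurrences only */
            if (!stmts[i].deleted)
                for (int j = 0; j < stmts[i].nreads; j++)
                    uses[stmts[i].reads[j]]++;
        for (int i = 0; i < nstmts; i++) {  /* delete assignments to unused vars */
            Stmt *s = &stmts[i];
            if (!s->deleted && s->lhs >= 0 &&
                uses[s->lhs] == 0 && !is_param[s->lhs]) {
                s->deleted = true;          /* its RHS reads vanish on the
                                               next iteration */
                changed = true;
            }
        }
    }
}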
For my language, the elimination of side effects in the RHS of an assignment to an unused variable is mandated by the language specification, so this may not be suitable for other languages. Effectiveness is improved by running the pass before inlining, to reduce the cost of inlining unused function applications, and then running it again afterwards, which eliminates parameters of inlined functions.
Just as an example of the utility of the language specification: the library constructs a thread pool and assigns a pointer to it to a global variable. If the thread pool is not used, the assignment is deleted, and hence the construction of the thread pool is elided.
IMHO compiler optimisations are almost invariably heuristics whose performance matters more than their effectiveness at achieving a theoretical goal (like removing unused variables). Simple reductions are useful not only because they're fast and easy to write, but because a programmer who understands the basics of the compiler's operation can leverage this knowledge to help the compiler. The most well known example of this is probably the refactoring of recursive functions to place the recursion in tail position: a pointless exercise unless the programmer knows the compiler can do tail-recursion optimisation.
Does Scala support tail recursion optimization?
Scala does tail recursion optimisation at compile-time, as other posters have said. That is, a tail recursive function is transformed into a loop by the compiler (a method invoke is transformed into a jump), as can be seen from the stack trace when running a tail recursive function.
Try the following snippet:
def boom(n: Int): Nothing = if (n <= 0) throw new Exception else boom(n - 1)
boom(10)
and inspect the stack trace. It will show only one call to the function boom - therefore the compiled bytecode is not recursive.
There is a proposal floating around to implement tail calls at the JVM level - which in my opinion would be a great thing to do, as then the JVM could do runtime optimizations, rather than just compile-time optimizations of the code - and could possibly mean more flexible tail recursion. Basically a tailcall invoke would behave exactly like a normal method invoke, but would drop the stack frame of the caller when it is safe to do so - the specification of the JVM states that stack frames must be preserved, so the JIT has to do some static code analysis to find out whether the stack frames are ever going to be used.
The current status of it is proto 80%. I don't think it will be done in time for Java 7 (invokedynamic has a greater priority, and the implementation is almost done) but Java 8 might see it implemented.
In Scala 2.8 you can use @tailrec to mark a specific method that you expect the compiler to optimise:
import scala.annotation.tailrec
@tailrec def factorialAcc(acc: Int, n: Int): Int = {
if (n <= 1) acc
else factorialAcc(n * acc, n - 1)
}
If a method cannot be optimised, you get a compile-time error.
Scala 2.7.x supports tail-call optimization for self-recursion (a function calling itself) of final methods and local functions.
Scala 2.8 might come with library support for trampoline too, which is a technique to optimize mutually recursive functions.
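To show the idea, here is a rough trampoline sketch in C (Scala's library version is structured differently; all names below are invented). Each step returns either a finished result or the next step to run, so a plain loop replaces the mutual recursion and the stack stays flat:

#include <stdbool.h>
#include <stdio.h>

typedef struct Bounce Bounce;
struct Bounce {
    bool done;
    int  result;         /* valid when done */
    Bounce (*next)(int); /* next step to run when not done */
    int  arg;
};

static Bounce is_odd(int n);

static Bounce is_even(int n) {
    if (n == 0) return (Bounce){ .done = true, .result = 1 };
    return (Bounce){ .done = false, .next = is_odd, .arg = n - 1 };
}

static Bounce is_odd(int n) {
    if (n == 0) return (Bounce){ .done = true, .result = 0 };
    return (Bounce){ .done = false, .next = is_even, .arg = n - 1 };
}

static int trampoline(Bounce b) {
    while (!b.done)
        b = b.next(b.arg);  /* each step returns immediately; no deep stack */
    return b.result;
}

int main(void) {
    printf("%d\n", trampoline(is_even(1000001))); /* prints 0: not even */
    return 0;
}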
A good deal of information about the state of Scala recursion can be found in Rich Dougherty's blog.
Only in very simple cases where the function is self-recursive.
Proof of tail recursion ability.
It looks like Scala 2.8 might be improving tail-recursion recognition, though.