Tail call optimization for procs possible? - ruby

I see that tail call optimization can be enabled in Ruby and that it works with method calls following these instructions, but I need it to work for Procs too, for use with blocks/lambdas, since only these can pass closures around as first-class functions (that I'm aware of). But I can't find any information about tail call optimization and Procs. From my experiments, Procs don't seem to use the setting that regular method calls do. Is it possible, or is there an alternative that could be useful?
I'm implementing a lazy language and I need to support deep recursion in that language, since that is how the target language works. I've tried switching the code to continuation-passing style (CPS) and returning thunks that get evaluated trampoline style (this avoids the stack overflow that deep recursion would cause), but it is about 3 times slower than my previous implementation, which just used the stack. Essentially I need tail call optimization because every call in my CPS code is a tail call, so I could eliminate the trampoline, which is expensive (using it maybe doubles the number of Procs created). I realize Ruby probably isn't the best language for implementing this, and it isn't a super serious project, but it would still be nice for it not to be very slow.

Related

Performance of recursive function in register based compiler

I have a question about whether there is a performance hit when we write recursive functions for register-based virtual machines like the DVM. I'm aware that recursion isn't recommended in implementations with a limited recursion depth, such as Python's.
Being register-based does not help with recursive functions; they still have the same problem: conceptually, every call creates a new stack frame. If that is implemented literally, then a recursive call is inherently a little slower than looping and, perhaps more importantly, uses up a finite resource, so the recursion depth is limited. A register-based code representation does not have the concept of an operand stack, but that concept is mostly disjoint from the concept of a call stack, which is still necessary just to have general subroutines. Subroutines can be implemented without a call stack if recursion is banned, in which case they need not be re-entrant, so the local variables and the variable that holds the return address can be statically allocated.
Going through a trampoline works around the stack growth by quickly returning to a special caller that then calls the continuation. That way recursion doesn't grow the stack at all, since the old frame gets deallocated before a new one is created, but it adds even more run-time overhead. Tail call elimination, which rewrites the call into a jump, achieves a similar effect by reusing the same frame, with less associated overhead, but it requires explicit support from the VM.
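To make the trampoline idea concrete, here is a minimal Python sketch (Python is used only for illustration; the names trampoline and count_down are made up for this example). The recursive step returns a thunk instead of calling itself, and a small driver loop keeps calling thunks until a plain value comes back. It assumes the final result is never itself callable.

```python
def trampoline(step):
    """The 'special caller': keep running thunks until a non-callable value appears."""
    while callable(step):
        step = step()
    return step


def count_down(n, acc=0):
    """Sum n + (n-1) + ... + 1 without growing the call stack."""
    if n == 0:
        return acc
    # Return a thunk describing the next call instead of making the call here,
    # so this frame is gone before the next one is created.
    return lambda: count_down(n - 1, acc + n)


result = trampoline(count_down(100_000))
print(result)
```

Each iteration allocates a closure and goes through an extra call and return, which is where the overhead mentioned above comes from.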
Both of those techniques apply equally to stack based and register based representations of the code, which incidentally is primarily a difference in the format in which the code is stored, and need not reflect a difference in the way the code is actually executed: a JIT compiler can turn both of them into whatever form the machine requires.

Will there be functions if there were no stacks?

I know that a stack data structure is used to store the local variables among many other things of a function that is being run.
I also understand how a stack can be used to elegantly manage recursion.
Suppose there was a machine that did not provide a stack area in memory. I don't think there would be programming languages for that machine that support recursion. I am also wondering whether programming languages for the machine would support functions without recursion.
Please, could someone shed some light on this for me?
A bit of theoretical framework is needed to understand that recursion is not tied to functions at all; rather, it is tied to expressiveness.
I won't dig into that; Google can fill any gaps.
Yes, we can have functions without a stack.
We don't even need the call/ret machinery for functions; we can just have the compiler inline every function call.
So there is no need for a stack at all.
This considers only functions in the programming sense, not in the mathematical sense.
A better name would be routines.
Anyway, that is a simple proof of concept that functions, understood as reusable code, don't need a stack.
However, not all functions in the mathematical sense can be implemented this way.
This is analogous to saying: "We can have dogs on the bed, but not all dogs can be on the bed".
You are on the right track in citing recursion; however, when it comes to recursion we need to be a lot more formal, as there are various forms of recursion.
For example, inlining every function call may make the compiler loop if the function being inlined is not constrained somehow.
Without digging into the theory, the only way to always be sure that our compiler won't loop is to allow only primitive (bounded) recursion.
What you probably mean by "recursion" is general recursion (GR), which cannot be achieved by inlining. We can show that GR needs an unbounded amount of memory, and that requirement, rather than the presence of a stack, is the demarcation between primitive recursion (PR) and GR.
So we can have functions without a stack, even recursive functions (for some forms of recursion).
If your question was more practical, then just consider MIPS.
There are no stack instructions and no stack pointer register in the MIPS ISA; everything related to the stack is just convention.
The compiler could use any memory area and treat it like a stack.
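As a rough illustration of "any memory area treated like a stack", here is a small Python sketch. It is not MIPS code; the memory list, the sp index and the function name are assumptions made up for this example. A recursive factorial is computed using nothing but a flat array and an index that the code itself maintains by convention.

```python
def factorial_explicit_stack(n):
    """Factorial with a hand-managed stack in a plain array (handles n up to 65)."""
    memory = [0] * 64   # an ordinary memory area; nothing marks it as a stack
    sp = 0              # the "stack pointer" is just an index we agree to use

    # "Call" phase: push each pending argument, as a recursive call would.
    while n > 1:
        memory[sp] = n
        sp += 1
        n -= 1

    # "Return" phase: pop the arguments and combine them, innermost first.
    result = 1
    while sp > 0:
        sp -= 1
        result *= memory[sp]
    return result


print(factorial_explicit_stack(5))
```

The fixed size of the array also shows the point about memory: the depth of recursion that can be supported is bounded by however much memory is set aside.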

Inlining Algorithm

Does anyone know of any papers discussing inlining algorithms? And, closely related, the relationship of the parent-child graph to the call graph.
Background: I have a compiler written in OCaml which aggressively inlines functions. Primarily as a result of this and some other optimisations, it generates faster code for my programming language than most others in many circumstances (including even C).
Problem #1: The algorithm has trouble with recursion. For this my rule is only to inline children into parents, to prevent infinite recursion, but this precludes sibling functions inlining once into each other.
Problem #2: I do not know of a simple way to optimise inlining operations. My algorithm is imperative with mutable representation of function bodies because it does not seem even remotely possible to make an efficient functional inlining algorithm. If the call graph is a tree, it is clear that a bottom up inlining is optimal.
Technical information: Inlining consists of a number of inlining steps. The problem is the ordering of the steps.
Each step works as follows:
We make a copy of the function to be inlined and beta-reduce it by replacing both type parameters and value parameters with arguments. We then replace each return statement with an assignment to a new variable followed by a jump to the end of the function body. The original call to the function is then replaced by this body. However, we're not finished. We must also clone all the children of the function, beta-reducing them as well, and reparent the clones to the calling function.
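A toy Python sketch of a single inlining step may make the description easier to follow. It is not Felix's implementation: the tuple-based statement format, the functions table and the names substitute and inline_call are all assumptions for illustration, and it leaves out type parameters, renaming of callee locals, and the cloning and reparenting of children.

```python
import itertools

_fresh = itertools.count()  # source of unique label suffixes


def substitute(expr, env):
    """Beta-reduce an expression tree: replace parameter names with arguments."""
    if isinstance(expr, str):            # a variable reference
        return env.get(expr, expr)
    if isinstance(expr, tuple):          # (operator, operand, ...)
        op, *operands = expr
        return (op, *(substitute(e, env) for e in operands))
    return expr                          # a literal


def inline_call(call, functions):
    """Expand ('call', dest, name, args) into a copy of the callee's body."""
    _, dest, name, args = call
    callee = functions[name]
    env = dict(zip(callee["params"], args))
    end = f"end_{name}_{next(_fresh)}"
    body = []
    for stmt in callee["body"]:
        if stmt[0] == "return":
            # return e  becomes  dest := e; jump to the end of the inlined body
            body.append(("assign", dest, substitute(stmt[1], env)))
            body.append(("jump", end))
        elif stmt[0] == "assign":
            body.append(("assign", stmt[1], substitute(stmt[2], env)))
        else:
            body.append(stmt)
    body.append(("label", end))
    return body


functions = {"double": {"params": ["x"], "body": [("return", ("+", "x", "x"))]}}
inlined = inline_call(("call", "r", "double", [("*", "a", 2)]), functions)
for stmt in inlined:
    print(stmt)
```

Note that the argument expression is duplicated wherever the parameter occurs, which is exactly the normal-order substitution whose safety is discussed under Problem #3.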
The cloning operation makes it extremely hard to inline recursive functions. The usual trick of keeping a list of what is already in progress and just checking to see if we're already processing this call does not work in naive form because the recursive call is now moved into the beta-reduced code being stuffed into the calling function, and the recursion target may have changed to a cloned child. However that child, in calling the parent, is still calling the original parent which calls its child, and now the unrolling of the recursion will not stop. As mentioned I broke this regress by only allowing inlining a recursive call to a child, preventing sibling recursions being inlined.
The cost of inlining is further complicated by the need to garbage collect unused functions. Since inlining is potentially exponential, this is essential. If all the calls to a function are inlined, we should get rid of the function if it has not been inlined into yet, otherwise we'll waste time inlining into a function which is no longer used. Actually keeping track of who calls what is extremely difficult, because when inlining we're not working with an actual function representation, but an "unravelled" one: for example, the list of instructions is being processed sequentially and a new list built up, and at any one point in time there may not be a coherent instruction list.
In his ML compiler, Stephen Weeks chose to use a number of small optimisations applied repeatedly, since this made the optimisations easy to write and easy to control, but unfortunately this misses a lot of optimisation opportunities compared to a recursive algorithm.
Problem #3: when is it safe to inline a function call?
To explain this problem generically: in a lazy functional language, arguments are wrapped in closures and then we can inline an application; this is the standard model for Haskell. However, it also explains why Haskell is so slow. The closures are not required if the argument is known: the parameter can then be replaced directly with its argument wherever it occurs (this is normal-order beta-reduction).
However, if the evaluation of the argument is known to terminate, eager evaluation can be used instead: the parameter is assigned the value of the expression once, and then reused. A hybrid of these two techniques is to use a closure but cache the result inside the closure object. Still, GHC hasn't succeeded in producing very efficient code: it is clearly very difficult, especially if you have separate compilation.
In Felix, I took the opposite approach. Instead of demanding correctness and gradually improving efficiency by proving that optimisations preserve semantics, I mandate that the optimisation defines the semantics. This guarantees correct operation of the optimiser at the expense of uncertainty about how certain code will behave. The idea is to provide ways for the programmer to force the optimiser to conform to the intended semantics if the default optimisation strategy is too aggressive.
For example, the default parameter passing mode allows the compiler to choose whether to wrap the argument in a closure, replace the parameter with the argument, or assign the argument to the parameter. If the programmer wants to force a closure, they can just pass in a closure. If the programmer wants to force eager evaluation, they mark the parameter var.
The complexity here is much greater than in a functional programming language: Felix is a procedural language with variables and pointers. It also has Haskell-style typeclasses. This makes the inlining routine extremely complex. For example, typeclass instances replace abstract functions whenever possible (due to type specialisation when calling a polymorphic function, it may be possible to find an instance whilst inlining, so now we have a new function we can inline).
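The hybrid mentioned above (a closure that caches its result) can be sketched in a few lines of Python. This is only an illustration of the idea, not how GHC or Felix represent closures; the Thunk class and the helper names are invented for the example.

```python
class Thunk:
    """A delayed computation that is evaluated at most once."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None   # drop the closure once its result is cached
        return self._value


evaluations = []


def expensive():
    evaluations.append("evaluated")
    return 6 * 7


def use_twice(arg):
    # The parameter occurs twice, but the argument is only computed once.
    return arg.force() + arg.force()


def ignore(arg):
    # The parameter is never used, so the argument is never computed.
    return 0


total = use_twice(Thunk(expensive))
unused = ignore(Thunk(expensive))
print(total, len(evaluations))
```

Plain substitution would evaluate the argument twice in use_twice, and eager evaluation would evaluate it needlessly in ignore; the cached closure avoids both, at the cost of allocating the thunk and checking the flag on every use.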
Just to be clear I have to add some more notes.
Inlining and several other optimisations such as user defined term reductions, typeclass instantiations, linear data flow checks for variable elimination, tail rec optimisation, are done all at once on a given function.
The ordering problem isn't the order to apply different optimisations, the problem is to order the functions.
I use a brain-dead algorithm to detect recursion: I build up a list of everything used directly by each function, find the closure, and then check whether the function is in the result. Note that the usage set is built up many times during optimisation, and this is a serious bottleneck.
Unfortunately, whether a function is recursive or not can change. A recursive function might become non-recursive after tail rec optimisation. But there is a much harder case: instantiating a typeclass "virtual" function can make what appeared to be non-recursive recursive.
As for sibling calls, the problem is that, given f and g where f calls g and g calls f, I actually want to inline f into g and g into f, but only once. My rule for stopping the infinite regress is that, if f and g are mutually recursive, f may be inlined into g only if f is a child of g, which excludes inlining siblings.
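For reference, the recursion check described above (direct uses, then the closure, then a membership test) looks roughly like this in Python. The uses table and the function names are assumptions for illustration, not the Felix compiler's data structures.

```python
def reachable(uses, start):
    """Everything used directly or indirectly by start (the closure of its uses)."""
    seen = set()
    todo = list(uses.get(start, ()))
    while todo:
        f = todo.pop()
        if f not in seen:
            seen.add(f)
            todo.extend(uses.get(f, ()))
    return seen


def is_recursive(uses, f):
    """A function is recursive if it appears in the closure of its own uses."""
    return f in reachable(uses, f)


# Direct usage sets: f and g are mutually recursive, k calls itself.
uses = {"f": {"g"}, "g": {"f", "h"}, "h": set(), "k": {"k"}, "m": {"h"}}
print({name: is_recursive(uses, name) for name in uses})
```

Each query walks the graph again from scratch, which is consistent with the bottleneck described: asking the question for every function, repeatedly, recomputes the same closures many times.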
Basically I want to "flatten out" all code "as much as possible".
I realize you probably already know all this, but it seems important to still write it in full, at least for further reference.
In the functional community, there is some literature, mostly from the GHC people. Note that they consider inlining as a transformation in the source language, while you seem to work at a lower level. Working in the source language -- or an intermediate language of reasonably similar semantics -- is, I believe, a big help for simplicity and correctness.
GHC Wiki : Inlining (contains a bibliography)
Secrets of the Glasgow Haskell inliner
The question of how to order compiler passes is quite arcane. Still in a Haskell setting, there is the PhD thesis Compilation by Transformation in a Non-strict Functional Language, which discusses the ordering of different compiler passes (and also inlining).
There is also the quite recent paper on Compilation by Equality Saturation, which proposes a novel approach to ordering optimisation passes. I'm not sure it has yet demonstrated applicability at a large scale, but it's certainly an interesting direction to explore.
As for the recursion case, you could use Tarjan's algorithm on your call graph to detect circular dependency clusters and exclude them from inlining. It won't affect sibling calls.
http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm
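A compact Python sketch of that suggestion follows. It is the textbook recursive form of Tarjan's algorithm; the call_graph data and the helper recursive_clusters are made up for illustration. Components with more than one member, or with a self-call, are the circular clusters to treat specially.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each function to the functions it calls."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:       # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def recursive_clusters(graph):
    """Components that contain a cycle: several members, or a direct self-call."""
    return [c for c in strongly_connected_components(graph)
            if len(c) > 1 or any(v in graph.get(v, ()) for v in c)]


call_graph = {"main": ["f", "h"], "f": ["g"], "g": ["f"], "h": [], "k": ["k"]}
clusters = recursive_clusters(call_graph)
print(clusters)
```

The whole call graph is classified in one linear pass, and the components come out callees first, which is the bottom-up order that suits inlining. The recursive visit is limited by Python's own recursion depth, so a very deep call graph would need an iterative version.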

How to "debug" Haskell with printfs?

Coming from the OCaml community, I'm trying to learn a bit of Haskell. The transition is going quite well, but I'm a bit confused about debugging. I used to put (lots of) "printf" calls in my OCaml code, to inspect some intermediate values or as flags to see where exactly the computation failed.
Since printf is an IO action, do I have to lift all my Haskell code into the IO monad to be able to do this kind of debugging? Or is there a better way to do this? (I really don't want to do it by hand if it can be avoided.)
I also found the trace function:
http://www.haskell.org/haskellwiki/Debugging#Printf_and_friends
which seems to be exactly what I want, but I don't understand its type: there is no IO anywhere!
Can someone explain the behaviour of the trace function to me?
trace is the easiest method to use for debugging. It's not in IO for exactly the reason you pointed out: there is no need to lift your code into the IO monad. It's implemented like this:
trace :: String -> a -> a
trace string expr = unsafePerformIO $ do
    putTraceMsg string
    return expr
So there is IO behind the scenes, but unsafePerformIO is used to escape out of it. That's a function which potentially breaks referential transparency, which you can guess by looking at its type, IO a -> a, and also its name.
trace is simply made impure. The point of the IO monad is to preserve purity (no IO unnoticed by the type system) and to define the order of execution of statements, which would otherwise be practically undefined because of lazy evaluation.
At your own risk, however, you can hack together some IO a -> a, i.e. perform impure IO. This is a hack and of course "suffers" from lazy evaluation, but that is simply what trace does for the sake of debugging.
Nevertheless, you should probably go other ways for debugging:
Reducing the need for debugging intermediate values
Write small, reusable, clear, generic functions whose correctness is obvious.
Combine the correct pieces to greater correct pieces.
Write tests or try out pieces interactively.
Use breakpoints etc. (compiler-based debugging)
Use generic monads. If your code is monadic anyway, write it independently of a concrete monad. Use type M a = ... instead of plain IO .... You can afterwards easily combine monads through transformers and put a debugging monad on top of it. Even if the need for monads is gone, you could just insert Identity a for pure values.
For what it's worth, there are actually two kinds of "debugging" at issue here:
Logging intermediate values, such as the value a particular subexpression has on each call into a recursive function
Inspecting the runtime behavior of the evaluation of an expression
In a strict imperative language these usually coincide. In Haskell, they often do not:
Recording intermediate values can change the runtime behavior, such as by forcing the evaluation of terms that would otherwise be discarded.
The actual process of computation can dramatically differ from the apparent structure of an expression due to laziness and shared subexpressions.
If you just want to keep a log of intermediate values, there are many ways to do so--for instance, rather than lifting everything into IO, a simple Writer monad will suffice, this being equivalent to making functions return a 2-tuple of their actual result and an accumulator value (some sort of list, typically).
It's also not usually necessary to put everything into the monad, only the functions that need to write to the "log" value--for instance, you can factor out just the subexpressions that might need to do logging, leaving the main logic pure, then reassemble the overall computation by combining pure functions and logging computations in the usual manner with fmaps and whatnot. Keep in mind that Writer is kind of a sorry excuse for a monad: with no way to read from the log, only write to it, each computation is logically independent of its context, which makes it easier to juggle things around.
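The "2-tuple of result and accumulator" view of Writer can be shown outside Haskell too. Below is a small Python sketch of that equivalence; unit, bind and logged_half are names invented for this illustration, and it mirrors the idea rather than the actual Control.Monad.Writer API.

```python
def unit(value):
    """Wrap a pure value with an empty log."""
    return value, []


def bind(logged, f):
    """Feed the value to f and append f's log entries to the existing log."""
    value, log = logged
    new_value, new_log = f(value)
    return new_value, log + new_log


def logged_half(n):
    """A logging computation: returns (result, log entries)."""
    half = n // 2
    return half, [f"halved {n} to {half}"]


def double(n):
    """A pure function; it knows nothing about logging."""
    return n * 2


result, log = bind(bind(unit(double(20)), logged_half), logged_half)
print(result)
print(log)
```

Only logged_half deals with the log; double stays pure and is simply applied before the value enters the logging pipeline, which is the factoring described above.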
But in some cases even that's overkill--for many pure functions, just moving subexpressions to the toplevel and trying things out in the REPL works pretty well.
If you want to actually inspect run-time behavior of pure code, however--for instance, to figure out why a subexpression diverges--there is in general no way to do so from other pure code--in fact, this is essentially the definition of purity. So in that case, you have no choice but to use tools that exist "outside" the pure language: either impure functions such as unsafePerformPrintfDebugging--errr, I mean trace--or a modified runtime environment, such as the GHCi debugger.
trace also tends to over-evaluate its argument for printing, losing a lot of the benefits of laziness in the process.
If you can wait until the program is finished before studying the output, then stacking a Writer monad is the classic approach to implementing a logger. I use this here to return a result set from impure HDBC code.
Well, since the whole of Haskell is built around the principle of lazy evaluation (so that the order of evaluation is not evident from the source), the use of printfs makes very little sense in it.
If a REPL plus inspecting the resulting values is really not enough for your debugging, wrapping everything into IO is the only choice (but it's not THE RIGHT WAY of Haskell programming).

Cons of first class continuations

What are some of the criticisms leveled against exposing continuations as first class objects?
I feel that it is good to have first class continuations. They allow complete control over the execution flow of instructions. Advanced programmers can develop intuitive solutions to certain kinds of problems. For instance, continuations are used to manage state on web servers. A language implementation can also provide useful abstractions on top of continuations, for example green threads.
Despite all this, are there strong arguments against first class continuations?
The reality is that many of the useful situations where you could use continuations are already covered by specialized language constructs: throw/catch, return, C#/Python yield. Thus, language implementers don't really have all that much incentive to provide them in a generalized form usable for roll-your-own solutions.
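Python's yield, mentioned above, is a handy way to see what such a specialized construct covers. In this small sketch (the countdown generator is just an illustrative name), each yield suspends the function and hands control back to the caller, and next resumes it exactly where it left off.

```python
def countdown(n):
    """Suspend at each yield; the frame's state survives until it is resumed."""
    while n > 0:
        yield n      # control returns to the caller here
        n -= 1       # execution picks up again here on the next resume


gen = countdown(3)
first = next(gen)    # runs the body up to the first yield
rest = list(gen)     # resumes repeatedly until the generator finishes
print(first, rest)
```

The restriction is what keeps this cheap: only a single frame is suspended, and it can only be resumed forward from its latest suspension point, so there is no need to capture or copy the rest of the call stack and no way to re-enter an earlier point twice.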
In some languages, generalized continuations are quite hard to implement efficiently. Stack-based languages (i.e. most languages) basically have to copy the whole stack every time you create a continuation.
Those languages can implement certain continuation-like features, those that don't break the basic stack-based model, a lot more efficiently than the general case, but implementing generalized continuations is quite a bit harder and not worth it.
Functional languages are more likely to implement continuations for a couple of reasons:
They are frequently implemented in continuation passing style, which means the "call stack" is probably a linked list allocated on the heap. This makes it trivial to pass a pointer to the stack as a continuation, since you don't need to overwrite the stack context when you pop the current frame and push a new one. (I've never implemented CPS but that's my understanding of it.)
They favor immutable data bindings, which make your old continuation a lot more useful because you will not have altered the contents of variables that the stack pointed to when you created it.
For these reasons, continuations are likely to remain mostly just in the domain of functional languages.
First up, there is more than just call/cc when it comes to continuations. I suggest starting with Marc Feeley's paper: A better API for first class continuations
Next up, I suggest reading about the control operators shift and reset, which are a different way of representing continuations.
A significant objection is implementation cost. If the runtime uses a stack, then first-class continuations require a stack copy at some point. The copy cost can be controlled (see Representing Control in the Presence of First-Class Continuations for a good strategy), but it also means that mutable variables cannot be allocated on the stack. This isn't an issue for functional or mostly-functional (e.g., Scheme) languages, but this adds significant overhead for OO languages.
Most programmers don't understand them. If you have code that uses them, it's harder to find replacement programmers who will be able to work with it.
Continuations are hard to implement on some platforms. For example, JRuby doesn't support continuations.
First-class continuations undermine the ability to reason about code, especially in languages that allow continuations to be imperatively assigned to variables, because the insides of closures can be brought alive again in hairy ways.
Cf. Kent Pitman's complaint about continuations, about the tricky way that unwind-protect interacts with call/cc
Call/cc is the 'goto' of advanced functional programming (a la the example here).
In Ruby 1.8 the implementation was extremely slow. It is better in 1.9, and of course most Schemes have had them built in and performing well from the outset.

Resources