Performance of recursive functions in register-based compilers

I have a question about whether there will be a performance hit when we write recursive functions for register-based VMs like DVM. I'm aware that recursion isn't recommended in implementations with limited call depth, like those for Python.

Being register-based does not help with recursive functions; they still have the same problem: conceptually, every call creates a new stack frame. If that is implemented literally, then a recursive call is inherently a little slower than looping and, perhaps more importantly, uses up a finite resource, so the recursion depth is limited. A register-based code representation does not have the concept of an operand stack, but that concept is mostly disjoint from the concept of a call stack, which is still necessary just to have general subroutines. Subroutines can be implemented without a call stack if recursion is banned: in that case they need not be re-entrant, so the local variables and the variable that holds the return address can be statically allocated.
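To make that last sentence concrete, here is a minimal sketch (in C#, with illustrative names) of a non-re-entrant subroutine whose locals are statically allocated instead of living in a stack frame; old FORTRAN implementations commonly worked this way. (A statically allocated return address has no C# analogue, so the sketch covers only the data.)

public static class StaticFrame
{
    // Statically allocated "frame": the argument and the local live in
    // fixed storage rather than in a per-call stack frame.
    private static int _n;
    private static int _total;

    // Safe only because this routine never calls itself, directly or
    // indirectly: a recursive call would overwrite _n and _total.
    public static int SumUpTo(int n)
    {
        _n = n;
        _total = 0;
        while (_n > 0) { _total += _n; _n--; }
        return _total;
    }
}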
Going through a trampoline works around the stack growth by quickly returning to a special caller that then calls the continuation. That way, recursion doesn't grow the stack at all, since the old frame is deallocated before a new one is created, but it adds even more run-time overhead. Tail call elimination, which rewrites the call into a jump, achieves a similar effect by reusing the same frame, with less associated overhead; this requires explicit support from the VM.
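For concreteness, here is a minimal trampoline sketch (in C#; the Step type and all names are illustrative): each step returns either a final result or a continuation describing the next step, and a driver loop invokes the continuations one at a time, so the stack never grows.

using System;

public static class TrampolineDemo
{
    // A step either holds a finished result or the next step to run.
    private sealed class Step
    {
        public long? Result;        // set when the computation is finished
        public Func<Step> Next;     // set when more work remains
    }

    // Tail-recursive factorial, expressed as trampoline steps: instead of
    // calling itself, it returns a thunk for the next iteration.
    private static Step Factorial(long n, long acc) =>
        n <= 1 ? new Step { Result = acc }
               : new Step { Next = () => Factorial(n - 1, n * acc) };

    public static long RunFactorial(long n)
    {
        var step = Factorial(n, 1);
        while (step.Result == null)  // the driver loop: one frame at a time
            step = step.Next();
        return step.Result.Value;
    }
}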
Both of those techniques apply equally to stack-based and register-based representations of the code. Incidentally, that distinction is primarily a difference in the format in which the code is stored, and need not reflect a difference in the way the code is actually executed: a JIT compiler can turn both of them into whatever form the machine requires.

Related

Would there be functions if there were no stacks?

I know that a stack data structure is used to store the local variables, among many other things, of a function that is being run.
I also understand how a stack can be used to elegantly manage recursion.
Suppose there was a machine that did not provide a stack area in memory; I don't think there would be programming languages for that machine that support recursion. I am also wondering whether programming languages for the machine would support functions without recursion.
Please, someone shed some light on this for me.
A bit of theoretical framework is needed to understand that recursion is indeed not tied to functions at all; rather, it is tied to expressiveness.
I won't dig into that, leaving Google to fill any gaps.
Yes, we can have functions without a stack.
We don't even need the call/ret machinery for functions; we can just have the compiler inline every function call.
So there is no need for a stack at all.
This considers only functions in the programming sense, not the mathematical sense.
A better name would be routines.
Anyway, that is a simple proof of concept that functions, intended as reusable code, don't need a stack.
However, not all functions, in the mathematical sense, can be implemented this way.
This is analogous to saying: "We can have dogs on the bed, but not all dogs can be on the bed."
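As a tiny sketch of the idea (in C#, with illustrative names): the compiler can replace every call with the callee's body, leaving no call/ret and hence no need for a call stack.

public static class InlineDemo
{
    private static int Square(int x) => x * x;

    // Before inlining: a call that would normally need call/ret machinery.
    public static int SumOfSquares(int a, int b) => Square(a) + Square(b);

    // After full inlining, as a compiler might rewrite it: no calls remain.
    public static int SumOfSquaresInlined(int a, int b) => (a * a) + (b * b);
}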
You are on the right track by citing recursion; however, when it comes to recursion, we need to be a lot more formal, as there are various forms of recursion.
For example, inlining every function call may cause the compiler itself to loop forever if the function being inlined is not constrained somehow.
Without digging into the theory: to always be sure that our compiler won't loop, we can only allow primitive (bounded) recursion.
What you probably mean by "recursion" is general recursion, which cannot be achieved by inlining. We can show that general recursion (GR) needs an unbounded amount of memory, and that is the demarcation between PR and GR, not the presence or absence of a stack.
So we can have functions without a stack, even recursive functions (for some forms of recursion).
If your question was more practical, then just consider MIPS.
There are no stack instructions or stack-pointer register in the MIPS ISA; everything related to the stack is just a convention.
The compiler could use any memory area and treat it like a stack.
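As a practical sketch of that last point (in C#, with illustrative names): recursion can be replaced by an explicit stack kept in ordinary heap memory, which is exactly the kind of "any memory area treated like a stack" described above.

using System.Collections.Generic;

public sealed class TreeNode
{
    public TreeNode Left, Right;
    public int Value;
}

public static class ExplicitStackWalk
{
    // In-order tree walk with no recursion: the pending nodes live in a
    // Stack<T>, which is just an object in ordinary (heap) memory.
    public static IEnumerable<int> InOrder(TreeNode root)
    {
        var stack = new Stack<TreeNode>();
        var node = root;
        while (node != null || stack.Count > 0)
        {
            while (node != null) { stack.Push(node); node = node.Left; }
            node = stack.Pop();
            yield return node.Value;
            node = node.Right;
        }
    }
}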

Does recursive function parameter order affect performance

Consider a frequently called recursive function having some parameters which vary a lot between executions, and some which don't, representing some kind of context information. For example, a tree traversal might look like this:
private void Visit(Node node, List<Node> results)
{
    if (node == null) {
        return; // stop at missing children, or the recursion never terminates
    }
    if (IsMatch(node)) {
        results.Add(node);
    }
    Visit(node.Left, results);
    Visit(node.Right, results);
}
...
Visit(root, new List<Node>());
Obviously, the results collection is created once and the same reference is used throughout all traversal calls.
The question is, does it matter for performance whether the function is declared as Visit(Node, List<Node>) or Visit(List<Node>, Node)? Is there a convention for the arguments order?
The vague idea is that fixed parameters might not need to be pushed onto and popped off the stack constantly, improving performance, but I'm not sure how feasible that would be.
I'm primarily interested in C# or Java, but would like to hear about any language for which the order matters.
Note: sometimes I happen to have three or four parameters overall. I realize that it is possible to create a class holding a context, or an anonymous function closing over the context variables, but the question is about plain recursion.
This is going into the deepest conjecture territory for me, since there is an explosive number of variables that could affect the outcome, ranging from the calling conventions of the platform to the characteristics of the compiler.
The question is, does it matter for performance whether the function is declared as Visit(Node, List<Node>) or Visit(List<Node>, Node)? Is there a convention for the arguments order?
In this kind of simple example, probably the most concise answer I can give is, "no unless proven otherwise" -- an "innocent until proven guilty" sort of thing. It's highly unlikely.
But there are interesting scenarios that crop up in a more complex example like this:
API_EXPORT void f1(small a, small b, small c, small d, big e);
... vs:
API_EXPORT void f2(big a, small b, small c, small d, small e);
In this case, f1 might actually be more efficient than f2 in some cases.
It's because some calling conventions, like Win x64, allow the first four arguments passed to a function to be passed directly through registers without stack spills provided that they fit into a register and make up the first four parameters of the function.
In this case, with f1, the first four small arguments might fit into registers but the fifth one (regardless of size) needs to be spilled to the stack.
With f2, the first argument might be too big to fit into a register, so it needs to be spilled into the stack along with the fifth argument, so we could end up with twice as many variables being spilled to the stack and take a small performance hit (I've never actually measured a case like this, however).
This is a rare scenario even when it occurs, and it was critical for me to put that API_EXPORT specifier there: unless this function is exported, the compiler (or linker in some cases) can do all kinds of magic, such as inlining the function, and obliterate the need to pass arguments in the precise way defined by the ABI.
In this kind of case, it might help a little bit to have parameters sorted in ascending order from smallest types that fit in registers to biggest types that don't.
But even then, is List<Node> a big type or a small type? Can it fit into a general-purpose register, for example? Languages like Java and C# treat everything referring to a UDT as a reference to garbage-collected memory. That might boil down to a simple memory address, which does fit into a register. I don't know exactly what goes on under the hood here, but I would suspect the reference itself can fit into a register and is actually a small type. A big type would be something like a 4x4 matrix being passed by value (as a deep copy).
Anyway, this is all case-by-case and drawing upon a lot of conjecture, but these are potential factors that might sway things one way or another in some rare scenario.
Nevertheless, for the example you cited, I'd suggest a strong "no unless proven otherwise -- highly improbable".
I realize that it is possible to create a class holding a context, or an anonymous function closing over the context variables [...].
Often languages that provide classes generate machine instructions as though there is one extra implicit parameter in a function. For example:
class Foo
{
public:
    void f();
};
... might translate to something analogous to:
void f(Foo* this);
It's worth keeping that in mind when it comes to thinking about calling conventions, aliasing, and how parameters are passed through registers and stack.
... but the question is about plain recursion
Recursion probably wouldn't affect this much, except that there might come a point where the compiler ceases to inline recursive functions, e.g. after a few levels, at which point we inevitably pay the full cost required by the ABI. But otherwise it's mostly the same: a function call is a function call, whether it's recursive or not, and the same concerns tend to apply to both. The potentially relevant things here most likely have much more to do with the compiler and platform ABI than with whether a function is being called recursively or non-recursively.
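As a side note, C# does expose a real inlining hint, MethodImplOptions.AggressiveInlining, but it is only a hint; for a recursive method the JIT may inline a few levels or decline entirely, at which point the usual ABI call cost applies. A sketch:

using System.Runtime.CompilerServices;

public static class InliningHint
{
    // The attribute is a hint, not a command; the JIT decides what happens,
    // and direct recursion typically limits or prevents inlining.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int CountDown(int n) => n <= 0 ? 0 : CountDown(n - 1);
}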

Iteration vs Recursion efficiency

I got a basic idea of how recursion works - but I've always programmed iteratively.
When we look at the keywords CPU/stack/calls and space, how is recursion different from iteration?
It needs more memory because it creates many stack frames(?), each of which (most likely) stores some values. It therefore takes up much more space than an iterative solution to the same problem. Generally speaking, that is. There are some cases where recursion would be better, such as programming the Towers of Hanoi and such.
Am I all wrong? I've got an exam soon and I have to prepare a lot of subjects. Recursion is not my strong suit, so I would appreciate some help on this matter :)
This really depends on the nature of the language and compiler/interpreter.
Some functional languages implement tail recursion, for example, to recognize specific cases where the stack frame can be destroyed/freed prior to recursing into the next call. In those special cases among the languages/compilers/interpreters that support it, you can actually have infinite recursion without overflowing the stack.
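As a concrete sketch of what tail-call elimination buys (in C#, which does not guarantee it): the tail-recursive form below can be rewritten mechanically into the loop that follows, which is effectively what such compilers/interpreters do.

public static class TailCallDemo
{
    // Tail-recursive: the recursive call is the last thing that happens,
    // so the current frame could be reused for it.
    public static long SumTo(long n, long acc = 0) =>
        n == 0 ? acc : SumTo(n - 1, acc + n);

    // The loop a tail-call-eliminating implementation effectively produces.
    public static long SumToLoop(long n)
    {
        long acc = 0;
        while (n != 0) { acc += n; n--; }
        return acc;
    }
}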
If you're working with languages that use the hardware stack and don't implement tail recursion, then typically you have to push arguments to the stack prior to branching into a function and pop them off along with return values, so there's somewhat of an implicit data structure there under the hood (if we can call it that). There's all kinds of additional things that can happen here as well, like register shadowing to optimize it.
The hardware stack is usually very efficient, typically just incrementing and decrementing a stack pointer register to push and pop, but it does involve a bit more state and instructions than branching with a loop counter or a condition. Perhaps more importantly, it tends to involve more distant branching to jump into another function's code as opposed to looping within the same body of code which could involve more instruction cache and page misses.
In these types of languages/compilers/interpreters that use the hardware stack and will always overflow it with enough recursion, the loopy routes often provide a performance advantage (but can be more tedious to code).
As a twist, you also sometimes have aggressive optimizers which do all kinds of magic with your code in the process of translating it to machine instructions and linking it, like inlining functions and unrolling loops. When you take all these factors into account, it's often better to just code things a bit more naturally and then measure with a profiler to see if it could use some tweaking. Of course you shouldn't use recursion in cases that can overflow, but I generally wouldn't worry about the overhead most of the time.

Does the 64-bit calling convention make a difference to the cost of recursive algorithms

When I was taught computer science there was some discussion of the cost of recursion due to function call overhead, and of how to convert recursion to something more efficient: e.g., to iteration (see http://stackoverflow.com/questions/159590/way-to-go-from-recursion-to-iteration?rq=1), or turning a naturally recursive algorithm into an iterative one, e.g. running an algorithm bottom-up rather than top-down.
One of the interesting things about 64 bit architectures is the support for passing more parameters to and fro using registers. To quote Agner Fog
It is more efficient to use registers for transferring parameters to a function and for receiving the return value than to store these values on the stack... In 64-bit systems, the use of registers for parameter transfer is standardized. All systems use registers for return values if the returned object fits into the registers that are assigned for this purpose.
Does this mean that I don't need to worry so much about the cost of recursive function calls on 64 bit architectures?
Even if the parameters are passed in registers, the function will need to save some state on the stack as it recurses. If any of that state is a pointer, its size has doubled relative to a 32-bit system, which could potentially halve the maximum recursion depth you can achieve for a given stack size.
Passing parameters in registers won't change the memory-consumption properties of the algorithm; it will still require the same stack space. Calling conventions fix the registers used for each positional parameter, and obviously the same register cannot be used to pass a different value deeper down the call stack, so the compiler will have to spill it to memory.
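A small illustration of why (in C#, with illustrative names): each live activation of Fib still needs its own copy of n, because n is used again after the first recursive call returns. Register passing changes where n arrives, not how many copies of it must be kept alive, so the O(depth) stack cost remains.

public static class SpillDemo
{
    public static long Fib(long n)
    {
        if (n < 2) return n;
        long left = Fib(n - 1);   // n must survive this call, so each frame
        return left + Fib(n - 2); // keeps its own saved copy of n
    }
}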

Inlining Algorithm

Does anyone know of any papers discussing inlining algorithms? And, closely related, the relationship of the parent-child graph to the call graph?
Background: I have a compiler written in OCaml which aggressively inlines functions; primarily as a result of this and some other optimisations, it generates faster code for my programming language than most others in many circumstances (including even C).
Problem #1: the algorithm has trouble with recursion. Here my rule is only to inline children into parents, to prevent infinite recursion, but this precludes sibling functions inlining into each other even once.
Problem #2: I do not know of a simple way to optimise inlining operations. My algorithm is imperative, with a mutable representation of function bodies, because it does not seem even remotely possible to make an efficient purely functional inlining algorithm. If the call graph is a tree, it is clear that bottom-up inlining is optimal.
Technical information: inlining consists of a number of inlining steps. The problem is the ordering of the steps.
Each step works as follows: we make a copy of the function to be inlined and beta-reduce it by replacing both type parameters and value parameters with arguments. We then replace the return statement with an assignment to a new variable, followed by a jump to the end of the function body. The original call to the function is then replaced by this body. However, we're not finished: we must also clone all the children of the function, beta-reducing them as well, and reparent the clones to the calling function.
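A stripped-down sketch of a single step (in C#, on a toy expression AST; every type and name here is illustrative, and the return-rewriting and child-cloning parts are omitted for brevity):

using System.Collections.Generic;

// Toy AST: variables, literals, addition, and calls to known functions.
public abstract class Expr { }
public sealed class Var  : Expr { public string Name; }
public sealed class Lit  : Expr { public int Value; }
public sealed class Add  : Expr { public Expr L, R; }
public sealed class Call : Expr { public Fn Target; public List<Expr> Args; }
public sealed class Fn   { public List<string> Params; public Expr Body; }

public static class Inliner
{
    // Beta-reduce: clone the body, substituting parameters with arguments.
    public static Expr BetaReduce(Expr body, Dictionary<string, Expr> subst) =>
        body switch
        {
            Var v  => subst.TryGetValue(v.Name, out var arg) ? arg : v,
            Lit l  => l,
            Add a  => new Add { L = BetaReduce(a.L, subst), R = BetaReduce(a.R, subst) },
            Call c => new Call { Target = c.Target,
                                 Args = c.Args.ConvertAll(e => BetaReduce(e, subst)) },
            _      => body
        };

    // One inlining step: replace a call node with the callee's reduced body.
    public static Expr InlineCall(Call call)
    {
        var subst = new Dictionary<string, Expr>();
        for (int i = 0; i < call.Target.Params.Count; i++)
            subst[call.Target.Params[i]] = call.Args[i];
        return BetaReduce(call.Target.Body, subst);
    }
}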
The cloning operation makes it extremely hard to inline recursive functions. The usual trick of keeping a list of what is already in progress and just checking whether we're already processing this call does not work in its naive form, because the recursive call has now moved into the beta-reduced code being stuffed into the calling function, and the recursion target may have changed to a cloned child. However, that child, in calling the parent, is still calling the original parent, which calls its child, and now the unrolling of the recursion will not stop. As mentioned, I broke this regress by only allowing inlining of a recursive call into a child, preventing sibling recursions from being inlined.
The cost of inlining is further complicated by the need to garbage-collect unused functions. Since inlining is potentially exponential, this is essential. If all the calls to a function are inlined, we should get rid of the function, provided it has not been inlined into yet; otherwise we'll waste time inlining into a function which is no longer used. Actually keeping track of who calls what is extremely difficult, because when inlining we're not working with an actual function representation but an "unravelled" one: for example, the list of instructions is processed sequentially and a new list built up, and at any one point in time there may not be a coherent instruction list.
In his ML compiler, Stephen Weeks chose to use a number of small optimisations applied repeatedly, since this made the optimisations easy to write and easy to control, but unfortunately this misses a lot of optimisation opportunities compared to a recursive algorithm.
Problem #3: when is it safe to inline a function call?
To explain this problem generically: in a lazy functional language, arguments are wrapped in closures and then we can inline an application; this is the standard model for Haskell. However, it also explains why Haskell is so slow. The closures are not required if the argument is known; then the parameter can be replaced directly with its argument wherever it occurs (this is normal-order beta-reduction).
However, if the argument's evaluation is known to terminate, eager evaluation can be used instead: the parameter is assigned the value of the expression once and then reused. A hybrid of these two techniques is to use a closure but cache the result inside the closure object. Still, GHC hasn't succeeded in producing very efficient code: it is clearly very difficult, especially with separate compilation.
In Felix, I took the opposite approach. Instead of demanding correctness and gradually improving efficiency by proving that optimisations preserve semantics, I mandate that the optimisation defines the semantics. This guarantees correct operation of the optimiser at the expense of uncertainty about how certain code will behave. The idea is to provide ways for the programmer to force the optimiser to conform to the intended semantics if the default optimisation strategy is too aggressive.
For example, the default parameter passing mode allows the compiler to choose whether to wrap the argument in a closure, replace the parameter with the argument, or assign the argument to the parameter. If the programmer wants to force a closure, they can just pass in a closure. If the programmer wants to force eager evaluation, they mark the parameter var.
The complexity here is much greater than in a functional programming language: Felix is a procedural language with variables and pointers. It also has Haskell-style typeclasses. This makes the inlining routine extremely complex: for example, typeclass instances replace abstract functions whenever possible (due to type specialisation when calling a polymorphic function, it may be possible to find an instance whilst inlining, so now we have a new function we can inline).
Just to be clear, I have to add some more notes.
Inlining and several other optimisations, such as user-defined term reductions, typeclass instantiations, linear data flow checks for variable elimination, and tail recursion optimisation, are done all at once on a given function.
The ordering problem isn't the order in which to apply the different optimisations; the problem is how to order the functions.
I use a brain-dead algorithm to detect recursion: I build up a list of everything used directly by each function, compute the transitive closure, and then check whether the function is in the result. Note that the usage set is built up many times during optimisation, and this is a serious bottleneck.
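A stripped-down sketch of that check (in C#, with illustrative names): gather each function's direct uses, walk the transitive closure, and call a function recursive if it can reach itself.

using System.Collections.Generic;

public static class RecursionCheck
{
    public static bool IsRecursive(string fn, Dictionary<string, HashSet<string>> directUses)
    {
        var reachable = new HashSet<string>();
        var work = new Stack<string>(directUses[fn]);
        while (work.Count > 0)
        {
            var g = work.Pop();
            if (!reachable.Add(g)) continue;       // already visited
            if (directUses.TryGetValue(g, out var next))
                foreach (var h in next) work.Push(h);
        }
        return reachable.Contains(fn);             // reaches itself => recursive
    }
}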
Unfortunately, whether a function is recursive or not can change. A recursive function might become non-recursive after tail recursion optimisation. But there is a much harder case: instantiating a typeclass "virtual" function can make what appeared to be non-recursive into something recursive.
As for sibling calls, the problem is that given f and g, where f calls g and g calls f, I actually want to inline f into g, and g into f .. once. My rule for stopping the infinite regress is to allow inlining of f into g, when they're mutually recursive, only if f is a child of g, which excludes inlining siblings.
Basically I want to "flatten out" all code "as much as possible".
I realize you probably already know all this, but it seems important to still write it in full, at least for further reference.
In the functional community, there is some literature, mostly from the GHC people. Note that they consider inlining as a transformation in the source language, while you seem to work at a lower level. Working in the source language -- or an intermediate language of reasonably similar semantics -- is, I believe, a big help for simplicity and correctness.
GHC Wiki : Inlining (contains a bibliography)
Secrets of the Glasgow Haskell inliner
For the question of ordering compiler passes, this is quite arcane. Still in a Haskell setting, there is the PhD thesis Compilation by Transformation in a Non-strict Functional Language, which discusses the ordering of different compiler passes (and also inlining).
There is also the quite recent paper on Compilation by Equality Saturation, which proposes a novel approach to ordering optimisation passes. I'm not sure it has yet demonstrated applicability at a large scale, but it's certainly an interesting direction to explore.
As for the recursion case, you could use Tarjan's algorithm on your call graph to detect circular dependency clusters and exclude them from inlining. It won't affect sibling calls.
http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm
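For reference, a compact sketch of that suggestion (in C#, with illustrative names): Tarjan's algorithm finds the strongly connected components of the call graph, and any component with more than one function (or with a self-call) is a mutually recursive cluster that could be excluded from inlining.

using System;
using System.Collections.Generic;

public static class TarjanScc
{
    public static List<List<string>> Find(Dictionary<string, List<string>> calls)
    {
        var index = new Dictionary<string, int>();
        var low = new Dictionary<string, int>();
        var onStack = new HashSet<string>();
        var stack = new Stack<string>();
        var sccs = new List<List<string>>();
        int counter = 0;

        void StrongConnect(string v)
        {
            index[v] = low[v] = counter++;
            stack.Push(v); onStack.Add(v);
            foreach (var w in calls.TryGetValue(v, out var ws) ? ws : new List<string>())
            {
                if (!index.ContainsKey(w)) { StrongConnect(w); low[v] = Math.Min(low[v], low[w]); }
                else if (onStack.Contains(w)) low[v] = Math.Min(low[v], index[w]);
            }
            if (low[v] == index[v])                // v is the root of an SCC
            {
                var scc = new List<string>();
                string w;
                do { w = stack.Pop(); onStack.Remove(w); scc.Add(w); } while (w != v);
                sccs.Add(scc);
            }
        }

        foreach (var v in calls.Keys)
            if (!index.ContainsKey(v)) StrongConnect(v);
        return sccs;
    }
}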
