Compiling container operations to efficient incremental code - algorithm

Most programming languages nowadays provide containers such as list, set, multiset, and map. Operations over all elements of a container, e.g. copy_if or transform, normally take O(n) time. Lazy evaluation can make this sublinear if you only need the first few elements of the result, but it is back to linear if you need the full result.
Consider for example the following implementation of the DPLL algorithm for propositional satisfiability. It's essentially a translation of the pseudocode to C++, thus optimized for readability. But each step takes time linear in the number of variables, so even if it could guess all the assignments correctly the first time, total time and memory consumption would be quadratic in the number of variables, making the implementation too slow to use in practice. This is true even if we apply well-known optimizations like passing containers by constant reference and using lazy evaluation anywhere it might avoid unnecessary work.
Efficient implementations of DPLL use incremental techniques, where instead of scanning and copying entire containers, only the consequences of making a small change like assigning a single variable are computed, using O(1) time and memory per step.
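To make "incremental" concrete, here is a minimal sketch (not part of the reference implementation below, and glossing over backtracking) of the classic counter-based scheme: each clause tracks how many of its literals are still unassigned and whether it is already satisfied, and each variable keeps an occurrence list of the clauses it appears in, so assigning one variable touches only those clauses instead of rescanning the whole formula. Modern solvers use watched literals instead, but the idea is the same.
#include <vector>

struct Lit { int var; bool pol; };

struct Clause {
    std::vector<Lit> lits;
    int unassigned;   // literals not yet assigned false
    bool satisfied;
};

struct Solver {
    std::vector<Clause> clauses;
    std::vector<std::vector<int>> occurs;  // var -> indices of clauses containing it

    // Cost is proportional to the number of occurrences of x,
    // independent of the total formula size.
    // Returns false if some clause just became empty (conflict).
    bool assign(int x, bool value) {
        for (int ci : occurs[x]) {
            Clause &c = clauses[ci];
            if (c.satisfied) continue;
            for (const Lit &l : c.lits) {
                if (l.var != x) continue;
                if (l.pol == value)
                    c.satisfied = true;           // clause is now true
                else if (--c.unassigned == 0)
                    return false;                 // clause is now empty
                // if c.unassigned == 1 the clause has become unit and could
                // be pushed onto a propagation queue here
            }
        }
        return true;
    }
};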
A human can translate pseudocode or an unoptimized reference implementation into an efficient incremental implementation; this is what we do when we write practical SAT solvers, theorem provers, etc. The price we pay is that we must thenceforth work with a large, complex body of optimized code, which is much more difficult than working with a concise description of the mathematical logic.
What we ideally want is a higher-level compiler that can compile the unoptimized reference implementation into efficient incremental code.
My question is: has any work been done on such a thing? Are there any existing implementations, partial implementations or discussions to look at? Higher-level optimization of container operations is not entirely unknown in principle, e.g. SQL query optimizers and Haskell loop fusion, but I'm not aware of anything that tries to go as far as I'm looking for here.
Unoptimized reference implementation (simple translation of pseudocode) of DPLL:
bool dpll(map<Var, bool> m, set<set<Literal>> clauses) {
  clauses = eval(m, clauses);
  // Solved
  if (isFalse(clauses))
    return false;
  if (isTrue(clauses))
    return true;
  // Unit clause
  auto unitClauses =
      copy_if(clauses, [](set<Literal> clause) { return clause.size() == 1; });
  if (unitClauses.size()) {
    auto x = front(front(unitClauses));
    return dpll(m + makeMap(x.var, x.pol), clauses);
  }
  // Pure literal
  auto pureVars =
      copy_if(vars(clauses), [=](Var x) { return pure(x, clauses); });
  if (pureVars.size()) {
    auto x = front(pureVars);
    return dpll(m + makeMap(x, pol(x, clauses)), clauses);
  }
  // Choice
  auto x = choose(vars(clauses));
  return dpll(m + makeMap(x, false), clauses) ||
         dpll(m + makeMap(x, true), clauses);
}
Similar reference implementation of helper functions: https://github.com/russellw/ayane/blob/master/logic/dpll.cpp

Related

OpenCL, substituting branches with arithmetic operations

The following question is more related to design than to actual coding. I don't know if there's a technical term for such a problem, so I'll proceed with an example.
I have some OpenCL code that is not optimized at all, and in the kernel there is essentially a switch statement similar to the following:
switch (const) {
case const_a: do_something_a(...); break;
case const_b: do_something_b(...); break;
... // etc
}
I cannot write the actual statement since it is quite long. As a simple example, consider the following switch statement:
int a;
switch (input) {
case 13: { a = 3; break; }
case 1:  { a = 7; break; }
case 23: { a = 1; break; }
default: { ... }
}
The question is... would it be better to replace such a switch with an expression like
a = (input == 13)*3 + (input == 1)*7 + (input == 23)
?
If it's not, is it possible to make it more efficient anyway?
You can assume input only takes values in the set of cases of the switch statement.
You've discovered an interesting question that GPU compilers wrestle with. The general advice is: try not to branch. Tricks to make that possible are splitting kernels up (as suggested above) and using the preprocessor (program-time definitions). Research in GPU algorithm development basically works from this axiom.
Branching all over the place won't get great efficiency because of the inherent divergence (channel = work item within the SIMD thread/warp). Remember that all these channels must execute together, so in a switch where all of them take different paths, everyone else goes along for the ride, silently waiting for their "case" to execute. Now, if input is always the same value, it can still be a win.
Another popular option is a table indirection.
kernel void foo(global const int *tbl, ...)
{
    ...
    a = tbl[input];
    ...
}
This case has a few problems too depending on hardware, inputs, and problem size.
Without more specific context, I can conjure up a case where any of these can run well or poorly.
Switching (or big if-then-else chains).
PROS: If all work items generally take the same path (input is mostly the same value), it's going to be efficient. You could also write an if-then-else chain putting the most common cases first. (On GPUs a switch is not necessarily as easy as an indirect jump since there are multiple work items and they may take different paths.)
CONS: Might generate lots of program code and could blow out the instruction cache. Branching all over the place can get a little costly depending on how many cases need to be evaluated. It might just be better to grind through the compute with the predicated code.
Predicated Code (Your (input == 13)*3 ... code).
PROS: This will probably generate smaller programs and stress the I$ less. (Look up the OpenCL select function to see a more general approach for your case.)
CONS: We've basically punted and decided to evaluate every "case in the switch". If input is usually the same value, we're wasting time here.
Lookup-table based approaches (my example).
PROS: If the switch you are evaluating has a massive number of cases (branches), but can be indexed by an integer, you might be ahead to just use a lookup table. On some hardware this means a read from global memory (far far away). Other architectures have a dedicated constant cache, but I understand that a vector lookup will serialize (K cycles for each channel), so it might be only marginally better than the global memory table. However, the generated table-lookup code will be short (I$ friendly), and as the number of branches (case statements) grows this will win in the limit. This approach also deals well with uniform/scattered distributions of input's value.
CONS: The read from global memory (or serialized access from the constant cache) has a big latency even compared to branching. In some cases, to eliminate the extra memory traffic I've seen compilers convert lookup tables into if-then-else/switch chains. It's rare that we have 100 element case statements.
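For the concrete cases in the question (13 -> 3, 1 -> 7, 23 -> 1), a table version might look like the following rough host-side C++ sketch (names are illustrative; on the device the table would arrive as a buffer argument, as in the earlier snippet):
#include <array>

// A dense table indexed by `input` works here because input only takes
// values from the case labels (1, 13, 23), all smaller than 24.
static const std::array<int, 24> tbl = [] {
    std::array<int, 24> t{};  // unused slots stay 0
    t[1]  = 7;
    t[13] = 3;
    t[23] = 1;
    return t;
}();

int lookup(int input) {
    return tbl[input];  // one indexed load, no branching
}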
I am now inspired to go study this cutoff. :-)

How to assess maximum number of recursive calls before stack overflows

Let's take a recursive function, for example factorial. Let's also assume that we have a stack of 1 MB size. Using pen and paper, how can I estimate the number of recursive calls to the function before the stack overflows? I'm not interested in any particular language but rather in an abstract approach.
There are questions on SO that look similar, but most of them are concerned with a specific language, or with extending the stack size, or with estimating the limit by running a specific function, or with preventing overflow. I would like to find a mathematical way to estimate it.
I found a similar question in an algorithmic challenge but couldn't come up with any reasonable solution.
Any suggestion is highly appreciated.
EDIT
In response to the provided replies: if the language truly cannot be taken out of the equation, let's assume it's C#. Also, since we are passing a simple int or long to the function, it's not passed by reference but as a copy. Also, assume a naive implementation, without hashing, without multi-threading, an implementation that resembles the mathematical representation of the function as closely as possible:
private static long Factorial(long n)
{
    if (n < 0)
    {
        throw new ArgumentException("Negative numbers not supported");
    }
    switch (n)
    {
        case 0:
            return 1;
        case 1:
            return 1;
        default:
            return n * Factorial(n - 1);
    }
}
It highly depends on the implementation of the function: how much memory does the function use before calling itself again? When it recurses 100 times, you will also have 100 function scopes in memory, including the function arguments and variables. It also reserves 100 places on the stack to store the return values.
I don't think the language can easily be taken out of the equation, because you need to know exactly how the stack is used. For example, are objects passed by reference, or are objects copied as new instances onto the stack?
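As a rough back-of-envelope sketch (assuming a 64-bit runtime, no inlining and no tail-call elimination; actual frame sizes are implementation specific), each activation of a function like Factorial(long) needs roughly a return address, a saved frame pointer, the argument and a few spill/alignment slots, i.e. on the order of 32 to 48 bytes, which bounds the depth a 1 MB stack can support:
#include <cstdio>

int main() {
    const long stackBytes   = 1L << 20;  // 1 MB stack
    const long frameLowEnd  = 32;        // optimistic bytes per activation
    const long frameHighEnd = 48;        // pessimistic bytes per activation
    std::printf("roughly %ld to %ld recursive calls before overflow\n",
                stackBytes / frameHighEnd,   // 1048576 / 48 = 21845
                stackBytes / frameLowEnd);   // 1048576 / 32 = 32768
    return 0;
}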

Is Scala functional programming slower than traditional coding?

In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first:Array[Float] ...
var second:Array[Float] ...
var sum=0f;
for (ix <- 0 until first.length)
  sum += first(ix) * second(ix);
Here is how I reformed the work:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?
The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still does boxing/unboxing and java.lang.Float creation at each iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't save much, actually, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives a much better result than the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combine zipped with foldLeft because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements and the way the JVM handles primitives and generics. Similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you write but the
functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested Boxed variants which work on java.lang.Double's instead of primitive Doubles to measure
the effect of primitive type boxing and unboxing. Here is what I got (running Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test).
loopArray            461   437   436   437   435
reduceArray         6573  6544  6718  6828  6554
loopVector          5877  5773  5775  5791  5657
reduceVector        5064  4880  4844  4828  4926
loopArrayBoxed      2627  2551  2569  2537  2546
reduceArrayBoxed    4809  4434  4496  4434  4365
loopVectorBoxed     7577  7450  7456  7463  7432
reduceVectorBoxed   5116  4903  5006  4957  5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.
This is a microbenchmark, and it depends on how the compiler optimizes your code. You have 3 loops composed here,
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
So, in summary:
naive translations may have unexpected copies and inefficiencies
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecodes, and not only does the Scala compiler not fuse the three loops into one, but it is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one reason the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, plus you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.
The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
  def fastMapOnto(f: Float => Float) = {
    var i = 0
    while (i < a.length) { a(i) = f(a(i)); i += 1 }
    this
  }
  def fastMapWith(b: Array[Float])(f: (Float, Float) => Float) = {
    val len = a.length min b.length
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i), b(i)); i += 1 }
    c
  }
  def fastReduce(f: (Float, Float) => Float) = {
    if (a.length == 0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r, a(i)); i += 1 }
      r
    }
  }
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.
I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this? It can be tail-call optimized, so performance should be OK.
def multiply_and_sum(l1: List[Int], l2: List[Int], sum: Int): Int = {
  if (l1 != Nil && l2 != Nil) {
    multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
  }
  else {
    sum
  }
}
val first = Array(1,2,3,4,5)
val second = Array(6,7,8,9,10)
multiply_and_sum(first.toList, second.toList, 0) //Returns: 130
To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look at the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design, and a cleaner design leads to performance gains in the overall operation of the application, even if some low-level methods are slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow language; Maven is written in Java. A build in BuildR is twice as fast as one in Maven. This is due mostly to the design of BuildR, which is lightweight compared with that of Maven.
Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) |> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
Here is dbyrnes solution with arrays (assuming Arrays are to be used) and just iterating over the index:
def multiplyAndSum(l1: Array[Int], l2: Array[Int]): Int = {
  def productSum(idx: Int, sum: Int): Int =
    if (idx < l1.length)
      productSum(idx + 1, sum + (l1(idx) * l2(idx)))
    else
      sum
  if (l2.length == l1.length)
    productSum(0, 0)
  else
    error("lengths don't fit " + l1.length + " != " + l2.length)
}
val first = (1 to 500).toArray   // Array[Int], matching multiplyAndSum's signature
val second = (11 to 510).toArray
def loopi (n: Int) = (1 to n).foreach (dummy => multiplyAndSum (first, second))
println (timed (loopi (100*1000)))
That needs about 1/40 of the time of the list approach. I don't have 2.8 installed, so you'll have to test @tailrec yourself. :)

Derivative of a program

Let us assume you can represent a program as mathematical function, that's possible. How does the program representation of the first derivative of that function look like? Is there a way to transform a program to its "derivative" form, and does this make sense at all?
Yes it does make sense, it's known as Automatic Differentiation. There are one or two experimental compilers which can do this, for example NAGware's Differentiation Enabled Fortran Compiler Technology. And there are a lot of research papers on the topic. I suggest you get Googling.
First, it only makes sense to try to get the derivative of a pure function (one that does not affect external state and returns the exact same output for every input). Second, the type system of many programming languages involves a lot of step functions (e.g. integers), meaning you'd have to get your program to work in terms of continuous functions in order to get a valid first derivative. Third, getting the derivative of any function involves breaking it down and manipulating it symbolically. Thus, you can't get the derivative of a function without knowing how what operations it is made of. This could be achieved with reflection.
You could create a derivative approximation function if your programming language supports closures (that is, nested functions and the ability to put functions into variables and return them). Here is a JavaScript example taken from http://en.wikipedia.org/wiki/Closure_%28computer_science%29 :
function derivative(f, dx) {
    return function(x) {
        return (f(x + dx) - f(x)) / dx;
    };
}
Thus, you could say:
function f(x) { return x*x; }
f_prime = derivative(f, 0.0001);
Here, f_prime will approximate function(x) {return 2*x;}
If a programming language implemented higher-order functions and enough algebra, one could implement a real derivative function in it. That would be really cool.
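As a sketch of that idea in C++ (illustrative only; the Dual type and derivative helper below are not from any particular library): with operator overloading, dual numbers give exact forward-mode derivatives of anything built from the overloaded operations, instead of the finite-difference approximation above.
#include <cstdio>

struct Dual {
    double val;  // f(x)
    double der;  // f'(x)
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }
Dual operator*(Dual a, Dual b) { return {a.val * b.val, a.der * b.val + a.val * b.der}; }

// Differentiate any function built from the overloaded operators at the point x.
template <typename F>
double derivative(F f, double x) {
    Dual seed{x, 1.0};  // seed the derivative: dx/dx = 1
    return f(seed).der;
}

int main() {
    auto square = [](Dual x) { return x * x; };      // f(x) = x^2
    std::printf("%f\n", derivative(square, 3.0));    // prints 6.000000, exactly f'(3)
}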
See Lambda the Ultimate discussions on Derivatives and dissections of data types and Derivatives of Regular Expressions
How do you define the mathematical function of a program?
A derivative represent the rate of change of a function. If your function isn't continuous its derivative will be undefined over most of the domain.
I'm just gonna say that this doesn't make a lot of sense, as a program is much more abstract and "ruleless" than a mathematical function. As a derivative is a measure of the change in output as the input changes, there are certainly some programs where this could apply. However, you'd need to be able to quantify your input/output both in numerical terms.
Since input/output would both numerical, it's reasonable to assume that your program represents or operates similarly to a mathematical function, or series of functions. Hence, you can easily represent a derivative, but it would be no different than converting the mathematical derivative of a function to a computer program.
If the program is denoted as a distribution (Schwartz) then you have some notion of derivative assuming that tests functions models your postcondition (you can still take the limit to get a characteristic function). For instance, the assignment x:=x+1 is associated to the Dirac distribution \delta_{x_0+1} where x_0 is the initial value of the variable x. However, I have no idea what is the computational meaning of \delta_{x_0+1}'.
I am wondering, what if the program your're trying to "derive" uses some form of heursitics ? How can it be derived then ?
Half-jokingly, we all know that all real programs use at least a rand().

Why should recursion be preferred over iteration?

Iteration is more performant than recursion, right? Then why do some people opine that recursion is better (more elegant, in their words) than iteration? I really don't see why some languages like Haskell do not allow iteration and encourage recursion. Isn't it absurd to encourage something that has bad performance (and that too when the more performant option, i.e. iteration, is available)? Please shed some light on this. Thanks.
Iteration is more performant than recursion, right?
Not necessarily.
This conception comes from many C-like languages, where calling a function, recursive or not, had a large overhead and created a new stackframe for every call.
For many languages this is not the case, and recursion is equally or more performant than an iterative version. These days, even some C compilers rewrite some recursive constructs to an iterative version, or reuse the stack frame for a tail recursive call.
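As an illustration of the tail-call case, a factorial written with an accumulator leaves nothing to do after the recursive call, so an optimizing C or C++ compiler can typically compile it into the same loop an iterative version would produce. A minimal sketch:
// Tail-recursive: the recursive call is the last thing the function does,
// so the current stack frame can be reused (e.g. gcc/clang at -O2).
long factorialAcc(long n, long acc) {
    if (n <= 1) return acc;
    return factorialAcc(n - 1, acc * n);
}

long factorial(long n) { return factorialAcc(n, 1); }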
Try implementing depth-first search recursively and iteratively and tell me which one gave you an easier time of it. Or merge sort. For a lot of problems it comes down to explicitly maintaining your own stack vs. leaving your data on the function stack.
I can't speak to Haskell as I've never used it, but this is to address the more general part of the question posed in your title.
Haskell does not allow iteration because iteration involves mutable state (the index).
As others have stated, there's nothing intrinsically less performant about recursion. There are some languages where it will be slower, but it's not a universal rule.
That being said, to me recursion is a tool, to be used when it makes sense. There are some algorithms that are better represented as recursion (just as some are better via iteration).
Case in point:
fib 0 = 0
fib 1 = 1
fib n = fib(n-1) + fib(n-2)
I can't imagine an iterative solution that could possibly make the intent clearer than that.
Here is some information on the pros & cons of recursion & iteration in C:
http://www.stanford.edu/~blp/writings/clc/recursion-vs-iteration.html
Mostly, for me, recursion is sometimes easier to understand than iteration.
Iteration is just a special form of recursion.
Recursion is one of those things that seem elegant or efficient in theory, but in practice it is generally less efficient (unless the compiler or dynamic recompiler is changing what the code does). In general, anything that causes unnecessary subroutine calls is going to be slower, especially when more than one argument is being pushed/popped. Anything you can do to remove processor cycles, i.e. instructions the processor has to chew on, is fair game. Compilers can do a pretty good job of this these days in general, but it's always good to know how to write efficient code by hand.
Several things:
Iteration is not necessarily faster
Root of all evil: encouraging something just because it might be moderately faster is premature; there are other considerations.
Recursion often much more succinctly and clearly communicates your intent
By eschewing mutable state generally, functional programming languages are easier to reason about and debug, and recursion is an example of this.
Recursion takes more memory than iteration.
I don't think there's anything intrinsically less performant about recursion - at least in the abstract. Recursion is a special form of iteration. If a language is designed to support recursion well, it's possible it could perform just as well as iteration.
In general, recursion makes one be explicit about the state you're bringing forward in the next iteration (it's the parameters). This can make it easier for language processors to parallelize execution. At least that's a direction that language designers are trying to exploit.
At a low level, iteration deals with the CX register to count loops, and of course data registers.
Recursion not only deals with that, it also adds references to the stack pointer to keep track of the previous calls and how to go back.
My university teacher told me that whatever you do with recursion can be done with iteration and vice versa; however, it is sometimes simpler (more elegant) to do it with recursion than with iteration, but at a performance level it is better to use iteration.
In Java, recursive solutions generally outperform non-recursive ones. In C it tends to be the other way around. I think this holds in general for adaptively compiled languages vs. ahead-of-time compiled languages.
Edit:
By "generally" I mean something like a 60/40 split. It is very dependent on how efficiently the language handles method calls. I think JIT compilation favors recursion because it can choose how to handle inlining and use runtime data in optimization. It's very dependent on the algorithm and compiler in question though. Java in particular continues to get smarter about handling recursion.
Quantitative study results with Java (PDF link). Note that these are mostly sorting algorithms, and are using an older Java Virtual Machine (1.5.x if I read right). They sometimes get a 2:1 or 4:1 performance improvement by using the recursive implementation, and rarely is recursion significantly slower. In my personal experience, the difference isn't often that pronounced, but a 50% improvement is common when I use recursion sensibly.
I find it hard to reason that one is better than the other all the time.
I'm working on a mobile app that needs to do background work on the user's file system. One of the background threads needs to sweep the whole file system from time to time, to keep the data shown to the user up to date. So, for fear of stack overflow, I had written an iterative algorithm. Today I wrote a recursive one for the same job. To my surprise, the iterative algorithm is faster: recursive -> 37 s, iterative -> 34 s (working over the exact same file structure).
Recursive:
private long recursive(File rootFile, long counter) {
    long duration = 0;
    sendScanUpdateSignal(rootFile.getAbsolutePath());
    if (rootFile.isDirectory()) {
        File[] files = getChildren(rootFile, MUSIC_FILE_FILTER);
        for (int i = 0; i < files.length; i++) {
            duration += recursive(files[i], counter);
        }
        if (duration != 0) {
            dhm.put(rootFile.getAbsolutePath(), duration);
            updateDurationInUI(rootFile.getAbsolutePath(), duration);
        }
    }
    else if (!rootFile.isDirectory() && checkExtension(rootFile.getAbsolutePath())) {
        duration = getDuration(rootFile);
        dhm.put(rootFile.getAbsolutePath(), getDuration(rootFile));
        updateDurationInUI(rootFile.getAbsolutePath(), duration);
    }
    return counter + duration;
}
Iterative: an iterative depth-first search, with recursive backtracking
private void traversal(File file) {
    int pointer = 0;
    File[] files;
    boolean hadMusic = false;
    long parentTimeCounter = 0;
    while (file != null) {
        sendScanUpdateSignal(file.getAbsolutePath());
        try {
            Thread.sleep(Constants.THREADS_SLEEP_CONSTANTS.TRAVERSAL);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        files = getChildren(file, MUSIC_FILE_FILTER);
        if (!file.isDirectory() && checkExtension(file.getAbsolutePath())) {
            hadMusic = true;
            long duration = getDuration(file);
            parentTimeCounter = parentTimeCounter + duration;
            dhm.put(file.getAbsolutePath(), duration);
            updateDurationInUI(file.getAbsolutePath(), duration);
        }
        if (files != null && pointer < files.length) {
            file = getChildren(file, MUSIC_FILE_FILTER)[pointer];
        }
        else if (files != null && pointer + 1 < files.length) {
            file = files[pointer + 1];
            pointer++;
        }
        else {
            pointer = 0;
            file = getNextSybling(file, hadMusic, parentTimeCounter);
            hadMusic = false;
            parentTimeCounter = 0;
        }
    }
}
private File getNextSybling(File file, boolean hadMusic, long timeCounter) {
    File result = null;
    // if the file is /mnt, stop
    if (file.getAbsolutePath().compareTo(userSDBasePointer.getParentFile().getAbsolutePath()) == 0) {
        return result;
    }
    File parent = file.getParentFile();
    long parentDuration = 0;
    if (hadMusic) {
        if (dhm.containsKey(parent.getAbsolutePath())) {
            long savedValue = dhm.get(parent.getAbsolutePath());
            parentDuration = savedValue + timeCounter;
        }
        else {
            parentDuration = timeCounter;
        }
        dhm.put(parent.getAbsolutePath(), parentDuration);
        updateDurationInUI(parent.getAbsolutePath(), parentDuration);
    }
    // look for the next sibling
    File[] syblings = getChildren(parent, MUSIC_FILE_FILTER);
    for (int i = 0; i < syblings.length; i++) {
        if (syblings[i].getAbsolutePath().compareTo(file.getAbsolutePath()) == 0) {
            if (i + 1 < syblings.length) {
                result = syblings[i + 1];
            }
            break;
        }
    }
    // backtracking - add the parent, if it had music children
    if (result == null) {
        result = getNextSybling(parent, hadMusic, parentDuration);
    }
    return result;
}
Sure, the iterative one isn't elegant, but although it's currently implemented in an inefficient way, it is still faster than the recursive one. And I have better control over it, as I don't want it running at full speed, and I will let the garbage collector do its work more frequently.
Anyway, I won't take for granted that one method is better than the other, and I will review the other algorithms that are currently recursive. But at least from the two algorithms above, the iterative one will be the one in the final product.
I think it would help to get some understanding of what performance is really about. This link shows how a perfectly reasonably-coded app actually has a lot of room for optimization - namely a factor of 43! None of this had anything to do with iteration vs. recursion.
When an app has been tuned that far, it gets to the point where the cycles saved by iteration as against recursion might actually make a difference.
Recursion is the typical implementation of iteration. It's just a lower level of abstraction (at least in Python):
class iterator(object):
    def __init__(self, max):
        self.count = 0
        self.max = max

    def __iter__(self):
        return self

    # I believe this changes to __next__ in Python 3000
    def next(self):
        if self.count == self.max:
            raise StopIteration
        else:
            self.count += 1
            return self.count - 1

# At this level, iteration is the name of the game, but
# in the implementation, recursion is clearly what's happening.
for i in iterator(50):
    print(i)
I would compare recursion with explosives: you can reach a big result in no time, but if you use it without caution the result can be disastrous.
I was very much impressed by the proof of complexity for the recursion that calculates Fibonacci numbers here. Recursion in that case has complexity O((3/2)^n) while iteration is just O(n). Calculating n=46 with recursion written in C# takes half a minute! Wow...
IMHO recursion should be used only when the nature of the entities is well suited to recursion (trees, syntax parsing, ...) and never for aesthetics alone. Performance and resource consumption of any "divine" recursive code need to be scrutinized.
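For comparison, a minimal sketch (in C++ here, just to show the shape) of the O(n) iterative version, which keeps only the last two values instead of recomputing overlapping subproblems:
// Linear-time Fibonacci: one pass, constant extra space.
unsigned long long fib(int n) {
    unsigned long long a = 0, b = 1;  // fib(0), fib(1)
    for (int i = 0; i < n; ++i) {
        unsigned long long next = a + b;
        a = b;
        b = next;
    }
    return a;
}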
Iteration is more performant than recursion, right?
Yes.
However, when you have a problem which maps perfectly to a recursive data structure, the better solution is always recursive.
If you try to solve the problem with iteration, you'll end up reinventing the stack and creating messier and uglier code, compared to the elegant recursive version.
That said, iteration will always be faster than recursion (on a von Neumann architecture), so if you always use recursion, even where a loop would suffice, you'll pay a performance penalty.
Is recursion ever faster than looping?
"Iteration is more performant than recursion" is really language- and/or compiler-specific. The case that comes to mind is when the compiler does loop-unrolling. If you've implemented a recursive solution in this case, it's going to be quite a bit slower.
This is where it pays to be a scientist (testing hypotheses) and to know your tools...
On NTFS, the maximum UNC path length is about 32K characters; with paths like C:\A\B\X\C..., more than 16K nested folders can be created.
But you cannot even count the number of folders with any recursive method; sooner or later all of them will give a stack overflow.
Only good, lightweight iterative code should be used to scan folders professionally.
Believe it or not, most top antivirus products cannot scan the maximum depth of UNC folders.

Resources