Efficiency: recursion vs loop - performance

This is just curiosity on my part, but what is more efficient, recursion or a loop?
Given two functions (using common lisp):
(defun factorial_recursion (x)
  (if (> x 0)
      (* x (factorial_recursion (decf x)))
      1))
and
(defun factorial_loop (x)
  (loop for i from 1 to x
        for result = 1 then (* result i)
        finally (return result)))
Which is more efficient?

I don't even have to read your code.
The loop is more efficient for factorials: when you recurse, you have up to x function calls on the stack.
You almost never use recursion for performance reasons; you use recursion to make the problem simpler.

Mu.
Seriously now, it doesn't matter. Not for examples of this size. They both have the same complexity. If your code is not fast enough for you, this is probably one of the last places you'd look.
Now, if you really want to know which is faster, measure them. On SBCL, you can call each function in a loop and measure the time. Since you have two simple functions, time is enough. If your program were more complicated, a profiler would be more useful. Hint: if you don't need a profiler for your measurements, you probably don't need to worry about performance.
On my machine (SBCL 64 bit), I ran your functions and got this:
CL-USER> (time (loop repeat 1000 do (factorial_recursion 1000)))
Evaluation took:
0.540 seconds of real time
0.536034 seconds of total run time (0.496031 user, 0.040003 system)
[ Run times consist of 0.096 seconds GC time, and 0.441 seconds non-GC time. ]
99.26% CPU
1,006,632,438 processor cycles
511,315,904 bytes consed
NIL
CL-USER> (time (loop repeat 1000 do (factorial_loop 1000)))
Evaluation took:
0.485 seconds of real time
0.488030 seconds of total run time (0.488030 user, 0.000000 system)
[ Run times consist of 0.072 seconds GC time, and 0.417 seconds non-GC time. ]
100.62% CPU
902,043,247 processor cycles
511,322,400 bytes consed
NIL
After putting your functions in a file with (declaim (optimize speed)) at the top, the recursion time dropped to 504 milliseconds and the loop time dropped to 475 milliseconds.
And if you really want to know what's going on, try disassemble on your functions and see what's in there.
Again, this looks like a non-issue to me. Personally, I try to use Common Lisp like a scripting language for prototyping, then profile and optimize the parts that are slow. Getting from 500 ms to 475 ms is nothing.

For instance, in some personal code, I got a couple of orders of magnitude of speedup by simply adding an element type to an array (thus making the array storage 64 times smaller in my case). Sure, in theory it would have been faster to reuse that array (after making it smaller) and not allocate it over and over. But simply adding :element-type bit to it was enough for my situation; more changes would have required more time for very little extra benefit.

Maybe I'm sloppy, but 'fast' and 'slow' don't mean much to me. I prefer 'fast enough' and 'too slow'. Both your functions are 'fast enough' in most cases (or both are 'too slow' in some cases), so there's no real difference between them.
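To illustrate the array point, a minimal sketch (names and size are illustrative, not from my actual code): declaring the element type lets the implementation pack the storage instead of spending one word per element.
(defvar *flags-boxed*
  (make-array 1000000 :initial-element 0))   ; general array: one word per element

(defvar *flags-packed*
  (make-array 1000000 :element-type 'bit     ; bit array: 64 elements per word on 64-bit SBCL
                      :initial-element 0))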

If you can write a recursive function in such a way that the recursive call is the very last thing done (making the function tail-recursive), and the language and compiler/interpreter you are using support tail calls, then the recursive function can (usually) be optimised into code that is genuinely iterative, and is as fast as an iterative version of the same function.
Sam I Am is correct, though: iterative functions are usually faster than their recursive counterparts. If a recursive function is to be as fast as an iterative function that does the same thing, you have to rely on the optimiser.
The reason is that a function call is much more expensive than a jump, and you also consume stack space, a (very) finite resource.
The function you give is not tail-recursive, because after the call to factorial_recursion returns you still multiply the result by x. A tail-recursive version carries the partial product along in an accumulator:
(defun factorial-recursion-assist (x cur)
  (if (> x 1)
      (factorial-recursion-assist (- x 1) (* cur x))
      cur))

(defun factorial-recursion (x)
  (factorial-recursion-assist x 1))

(print (factorial-recursion 4)) ; => 24

Here's a tail-recursive factorial (I think):
(defun fact (x)
  (funcall (alambda (i ret)   ; ALAMBDA (from Graham's On Lisp) binds the function itself to SELF
             (cond ((> i 1)
                    (self (1- i) (* ret i)))
                   (t
                    ret)))
           x 1))
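Since alambda is the anaphoric lambda from On Lisp rather than standard Common Lisp, here is a portable sketch of the same idea using labels:
(defun fact (x)
  (labels ((self (i ret)
             (if (> i 1)
                 (self (1- i) (* ret i)) ; tail call: accumulator carries the product
                 ret)))
    (self x 1)))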

Related

Clojure comp doesn't tail-call-optimize (can create StackOverflow exception)

I got stuck on a Clojure program handling a very large amount of data (image data). When the image was larger than 128x128, the program would crash with a StackOverflow exception. Because it worked for smaller images, I knew it wasn't an infinite loop.
There were lots of possible causes of high memory usage, so I spent time digging around. Making sure I was using lazy sequences correctly, making sure to use recur as appropriate, etc. The turning point came when I realized that this:
at clojure.core$comp$fn__5792.invoke(core.clj:2569)
at clojure.core$comp$fn__5792.invoke(core.clj:2569)
at clojure.core$comp$fn__5792.invoke(core.clj:2569)
referred to the comp function.
So I looked at where I was using it:
(defn pipe [val funcs]
  ((apply comp funcs) val))

(pipe the-image-vec
      (map (fn [index]
             (fn [image-vec] ( ... )))
           (range image-size)))
I was doing per-pixel operations, mapping a function over each pixel to process it. Interestingly, comp doesn't appear to benefit from tail-call optimization, or to do any kind of sequential application of functions. It seems it was just composing things in the basic way, which, when there are 65k functions, understandably overflows the stack. Here's the fixed version:
(defn pipe [val funcs]
  (cond
    (= (count funcs) 0) val
    :else (recur ((first funcs) val) (rest funcs))))
recur ensures the recursion gets tail-call optimized, preventing a stack buildup.
If anybody can explain why comp works this way (or rather, doesn't work this way), I'd love to be enlightened.
First, a more straightforward MCVE:
(def fs (repeat 1e6 identity))
((apply comp fs) 99)
#<StackOverflowError...
But why does this happen? If you look at the (abridged) comp source:
(defn comp
  ([f g]
   (fn
     ([x] (f (g x)))
     ...))
  ([f g & fs]
   (reduce1 comp (list* f g fs))))
You can see that the whole thing is basically just two parts:
The fixed-arity overloads, which do the main work of wrapping each pair of composed function calls in another function.
The variadic overload, which reduces over the functions using comp itself.
I believe the first point is the problem. comp works by taking the list of functions and continually wrapping each set of calls in functions. Eventually, this will exhaust the stack space if you try to compose too many functions, as it ends up creating a massive function that's wrapping many other functions.
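Schematically, here is my reading of what that builds for three functions (a sketch, not an actual expansion):
;; (reduce1 comp (list f g h)) builds (comp (comp f g) h), which behaves like
;;   (fn [x] ((fn [y] (f (g y))) (h x)))
;; Each composed function contributes one more nested call frame when the
;; final composite is invoked, so ~65k of them exhaust the stack.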
So, why can TCO not help here? Because, unlike with most StackOverflowErrors, recursion is not the problem. The recursive calls only ever go one frame deep in the variadic case at the bottom. The problem is the building up of a massive function, which can't simply be optimized away.
Why were you able to "fix" it though? Because you have access to val, so you're able to evaluate the functions as you go instead of building up one function to call later. comp was written using a simple implementation that works fine for most cases, but fails for extreme cases like the one you presented. It's fairly trivial to write a specialized version that handles massive collections though:
(defn safe-comp [& fs]
  (fn [value]
    (reduce (fn [acc f] (f acc))
            value
            (reverse fs))))
Note, though, that this doesn't handle multiple arities like the core version does.
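Revisiting the MCVE from above with this version (a sketch):
(def fs (repeat 1e6 identity))
((apply safe-comp fs) 99)
;;=> 99, where ((apply comp fs) 99) overflowed the stack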
Honestly though, in 3 and a bit years of using Clojure, I've never once written (apply comp ...). While it's certainly possible you have experienced a case I've never needed to deal with, it's more likely that you're using the wrong tool for the job here. When this code is complete, post it on Code Review and we may be able to suggest better ways of accomplishing what you're trying to do.

how is the efficiency and cost of nested function in scheme language

In SICP, I have learned to use functions; it's amazing and useful. But I am confused about the cost of nesting functions, as in the code below:
(define (sqrt x)
  (define (good-enough? guess)
    (< (abs (- (square guess) x)) 0.001))
  (define (improve guess)
    (average guess (/ x guess)))
  (define (sqrt-iter guess)
    (if (good-enough? guess)
        guess
        (sqrt-iter (improve guess))))
  (sqrt-iter 1.0))
It defines three internal functions. What are the efficiency and cost? And what if I use more nested function calls like this?
UPDATE:
Look at the code below, from the section Searching for divisors:
(define (smallest-divisor n)
  (find-divisor n 2))

(define (find-divisor n test-divisor)
  (cond ((> (square test-divisor) n) n)
        ((divides? test-divisor n) test-divisor)
        (else (find-divisor n (+ test-divisor 1)))))

(define (divides? a b)
  (= (remainder b a) 0))

(define (prime? n)
  (= n (smallest-divisor n)))
Q1: divides? and smallest-divisor are not strictly necessary; they are there for clarity. What are the efficiency and cost? Does a Scheme compiler optimize for this situation? I think I should learn something about compilers. (͡๏̯͡๏)
Q2: How about in an interpreter?
It's implementation dependent. The standard does not say anything about how much a closure should cost, but since Scheme is designed around closures, any implementation should strive to make closures cheap. Many implementations do CPS conversion, and that introduces a closure per evaluation operation.
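For a flavor of what CPS conversion does, here is a sketch; the CPS primitives +& and *& are hypothetical names for versions of + and * that take an explicit continuation argument:
;; CPS version of (+ 1 (* 2 3)): the continuation of the multiplication
;; becomes an allocated closure that captures k.
(define (example k)
  (*& 2 3 (lambda (v)
            (+& 1 v k))))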
There is a compiler technique called lambda lifting, where local functions are turned into global ones by converting their free variables into additional parameters and lifting the procedure out of the procedure it was defined in. The SICP code might get translated to something like:
(define (lift:good-enough? x guess)
  (< (abs (- (square guess) x)) 0.001))

(define (lift:improve x guess)
  (average guess (/ x guess)))

(define (lift:sqrt-iter x guess)
  (if (lift:good-enough? x guess)
      guess
      (lift:sqrt-iter x (lift:improve x guess))))

(define (sqrt x)
  (lift:sqrt-iter x 1.0))
Here all the lift:-prefixed identifiers are unique, so that they do not collide with existing global bindings.
The efficiency question comes down to compiler optimization. Generally, any time you have a lexical procedure, such as your improve procedure, that references a free variable, such as your x, a closure needs to be created. The closure has an 'environment' that must be allocated to store all the free variables. Thus there is some space overhead and some time overhead (to allocate the memory and to fill it).
Where does compiler optimization come into play? When a lexical procedure does not 'escape' its lexical block, as is the case for all of yours, the compiler can inline the procedures. In that case a closure with its environment need not be created.
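A sketch of the escaping/non-escaping distinction (illustrative code, not from the question):
;; make-adder's result escapes, so x must live in an allocated environment:
(define (make-adder x)
  (lambda (y) (+ x y)))

;; add1 never escapes add-twice, so a compiler can inline it and
;; allocate no closure at all:
(define (add-twice x y)
  (define (add1 z) (+ z x))
  (add1 (add1 y)))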
But, importantly, in every day use, you shouldn't worry about the use of lexical procedures.
In one word: negligible. Using nested functions instead of top-level definitions won't have a noticeable effect on performance. We use nested functions ("block structure" in SICP's terms) for clarity and for better structuring a procedure, not for efficiency:
Such nesting of definitions, called block structure, is basically the right solution to the simplest name-packaging problem
There might be a small difference in the time it takes to look up a function depending on where it was defined, but that will depend on implementation details of the interpreter. It's not worth worrying about it.
Not relevant.
One of the important aspects of designing a programming language is often the choice between efficiency on one side and expressiveness on the other. In most situations these two aspects define the characteristics of a low-level and a high-level language, respectively.
Scheme is a small but powerful high-level language from the family of Lisp languages. One of its most powerful features is its expressiveness and ability to abstract. As a Scheme programmer you use block structure inside procedures because it encapsulates related behaviour and solves your problem in a structured way; you don't consider low-level properties of this behaviour, such as the runtime cost of calling procedures or allocating lists. This is part of the joy of programming in a high-level language such as Scheme.
As you say, it's amazing and useful, so continue your work and create something nice. Until the program becomes considerably slow in operation, I wouldn't care about these things; concentrate instead on the harder problems, like defining the concepts and behaviour of your program.

How can I make this clojure code run faster?

I have a version implemented in Lisp (SBCL) which runs in under 0.001 seconds with 12 samples. However, this version (in Clojure) takes more than 1.1 seconds.
What should I do to make this code run as fast as the original Lisp version?
To be clear, my numbers do not include the time for starting the REPL and so on; they come from the time function of both SBCL and Clojure. (Yes, my laptop is a rather old Atom-based one.)
Also, this application is/will be used from the REPL, not compiled into a single app, so running it a thousand times before benchmarking doesn't seem meaningful.
Oh, and the fbars look like this: [[10.0 10.5 9.8 10.1] [10.1 10.8 10.1 10.7] ...], which is Open-High-Low-Close price bars for stocks.
(defn- build-new-smpl [fmx fmn h l c o]
  (let [fmax (max fmx h)
        fmin (min fmn l)
        fc (/ (+ c fmax fmin) 3.0)
        fcd (Math/round (* (- fc o) 1000))
        frd (Math/round (* (- (* 2.0 c) fmax fmin) 1000))]
    (if (and (> fcd 0) (> frd 0))
      [1 fmax fmin]
      (if (and (< fcd 0) (< frd 0))
        [-1 fmax fmin]
        [0 fmax fmin]))))

(defn binary-smpls-using [fbars]
  (let [fopen (first (first fbars))]
    (loop [fbars fbars, smpls [], fmax fopen, fmin fopen]
      (if (> (count fbars) 0)
        (let [bar (first fbars)
              [_ h l c _] bar
              [nsmpl fmx fmn] (build-new-smpl fmax fmin h l c fopen)]
          (recur (rest fbars) (conj smpls nsmpl) fmx fmn))
        smpls))))
================================================
Thank you. I managed to reduce the difference for 1000 iterations to 0.5 seconds (1.3 seconds on SBCL vs. 1.8 on Clojure). The major factor was that I should have created fbars not as a lazy sequence but as a concrete vector or array; that solved my problem.
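The fix amounts to realizing the sequence up front; a sketch, where lazy-bars stands for whatever lazily produced the OHLC bars:
(def fbars (vec lazy-bars)) ; concrete vector: count, first, and rest are now cheap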
You need to use a proper benchmarking library; the standard Clojure solution is Hugo Duncan's Criterium.
The reason is that code on the JVM starts running in interpreted mode and then eventually gets compiled by the JIT compiler; it's the steady-state behaviour after JIT compilation that you want to benchmark and not behaviour during the profiling stage. This is, however, quite tricky, since the JIT compiler optimizes no-ops away where it can see them, so you need to make sure your code causes side effects that won't be optimized away, but then you still need to run it in a loop to obtain meaningful results etc. -- quick and dirty solutions just don't cut it. (See the Elliptic Group, Inc. Java benchmarking article, also linked to by Criterium's README, for an extended discussion of the issues involved.)
Cycling the two samples you listed in a vector of length 1000 results in a timing of ~327 µs in a Criterium benchmark on my machine:
(require '[criterium.core :as c])
(def v (vec (take 1000 (cycle [[10.0 10.5 9.8 10.1] [10.1 10.8 10.1 10.7]]))))
(c/bench (binary-smpls-using v))
WARNING: Final GC required 4.480116525558204 % of runtime
Evaluation count : 184320 in 60 samples of 3072 calls.
Execution time mean : 327.171892 µs
Execution time std-deviation : 3.129050 µs
Execution time lower quantile : 322.731261 µs ( 2.5%)
Execution time upper quantile : 333.117724 µs (97.5%)
Overhead used : 1.900032 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
A really good benchmark would actually involve an interesting dataset (all different samples, preferably coming from the real world).
When I run this with 1000 samples I get an answer in 46 ms. Here are some common tips on making Clojure code faster:
turn on reflection warnings (see the sketch after this list):
(set! *warn-on-reflection* true)
add type hints until the reflection warnings go away, which isn't a problem here
make it a lazy sequence so you don't have to construct huge sequences in RAM, which results in lots of GC overhead (in some cases this adds overhead, though I think it's a decent idea here)
in cases like this, see if Incanter can do the job for you (though that may be cheating)
time the creation of the result, excluding the time it takes the REPL to print it
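A sketch of the first two tips; the fcd helper is illustrative, mirroring a local from the question's code rather than quoting it:
(set! *warn-on-reflection* true)

;; ^double parameter hints let Math/round resolve without reflection,
;; and ^long declares the primitive return type:
(defn fcd ^long [^double fc ^double o]
  (Math/round (* (- fc o) 1000.0)))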

P-NP problems solved? FindBugs solves the halting prob?

There is a tool called FindBugs that can detect infinite, never-ending loops in a given program/code base.
This implies FindBugs can detect whether a program will end or not by analyzing its code.
Halting problem is the problem defining that:
Given a description of an arbitrary computer program, decide whether
the program finishes running or continues to run forever
So does this imply that the halting problem is solved or a subset of the halting problem is solved?
No, it's not solved. FindBugs only finds some cases of infinite loops, such as this one:
public void myMethod() {
    int a = 0;
    while (true) {
        a++;
    }
}
IIRC, the only false positive it suffers from is when the above method myMethod is never called, in which case you'll still want to delete it as dead code.
It does suffer from false negatives: there are many non-terminating programs that FindBugs will not detect.
Imagine that you have a tool that always detects infinite loops.
Suppose there exists a universal machine HALT(CODE, INPUT) that returns true iff CODE halts on INPUT. Now consider this program:
if HALT(CODE, CODE), loop forever
else halt
If CODE halts on CODE you get a contradiction, and likewise if it doesn't. Why?
Assume CODE halts on CODE: then the program loops forever, meaning it doesn't stop. Now assume CODE doesn't halt on CODE: then you get that it does stop.
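The same argument, sketched in Scheme with a hypothetical decider halt? standing in for HALT:
;; halt? is assumed, not real: (halt? code input) => #t iff code halts on input
(define (paradox code)
  (if (halt? code code)
      (paradox code)   ; halt? claims it halts, so loop forever
      'halted))        ; halt? claims it loops, so halt
;; Feeding paradox its own source yields the contradiction: it halts
;; exactly when halt? says it doesn't, so no such halt? can exist.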
If you were to write a program to analyze programs for the same platform, with the same limits as the analyzing program itself, such an analyzer cannot exist. This is known as the halting problem.
That said, the halting problem is solvable for programs that have much less memory and code length available to them than the analyzing program does. E.g., I can write a halt? procedure for all 2-byte BrainFuck programs like this:
;; takes a valid 2-byte BF program
;; and returns whether it will halt
(define (halt? x)
  (cond ((equal? x "[]") #f)
        (else #t)))
A larger example works by writing an interpreter and hashing memory states and program-counter locations. If a previously seen state recurs, it's an infinite loop. Even with a very good data model, the memory used by the interpreter must be considerably larger than that of what it interprets.
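A sketch of that idea; initial-state, halted?, and step are hypothetical helpers for the interpreted program, and a plain list stands in for the hash for brevity:
(define (loops-forever? program)
  (let run ((state (initial-state program))
            (seen '()))
    (cond ((halted? state) #f)        ; reached a final state: it halts
          ((member state seen) #t)    ; a state repeated: provably loops
          (else (run (step state)
                     (cons state seen))))))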
I've been thinking about constant-folding programs by evaluating them at compile time, where the halting problem becomes an issue. My idea is to have a data structure counting the number of times a particular branch of the AST has been seen, with a very large cutoff limit. If the interpreter visits a branch more times than the cutoff allows, that expression ends up in the compiled program instead of its computed value. This takes a lot less memory and will establish that some or all parts of a program certainly do return (halt).
Imagine this code:
(define (make-list n f)
  (if (zero? n)
      '()
      (cons (f) (make-list (- n 1) f))))

(define (a) (b))
(define (b) (c))
(define (c) (b))

(display (make-list 4 read))
(display (make-list 4 a))
It's actually pretty bad code, since you don't know in which order the input will be read. The compiler gets to choose what's best, and it might turn it into:
(display-const "(")
(display (read))
(display-const " ")
(display (read))
(display-const " ")
(display (read))
(display-const " ")
(display (read))
(display-const ")")
(display (cons (b) (cons (b) (cons (b) (cons (b) '()))))) ; gave up on (b)

Generating list of million random elements

How can I efficiently generate a list of a million random elements in Scheme? The following code hits the maximum recursion depth even with 0.1 million:
(unfold (lambda (x) (= x 1000000))
        (lambda (x) (random 1000))
        (lambda (x) (+ x 1))
        0)
It really depends on the system you're using, but here's a common way to do that in plain scheme:
(let loop ([n 1000000] [r '()])
  (if (zero? n)
      r
      (loop (- n 1) (cons (random 1000) r))))
One note about running this code as is: if you just type it into a REPL, it will lead to printing the resulting list, and that will usually involve using much more memory than the list holds. So it's better to do something like
(define l ...same...)
There are many other tools that can be used with varying degrees of convenience. unfold is one of them; another is the for loops found in PLT Scheme:
(for/list ([i (in-range 1000000)]) (random 1000))
I don't know much Scheme, but couldn't you just use tail recursion (which is really just looping) instead of unfold (or any other higher-order function)?
Use the do-loop-construct as described here.
Someone correct me if I am wrong, but Fakrudeen's code should end up being optimized, since it is tail-recursive. Or it should be, given a proper implementation of unfold; it should never reach a maximum recursion depth.
Which version of Scheme are you using, Fakrudeen?
DrScheme does not choke on a mere million random numbers.
Taking CHICKEN Scheme as the implementation, here is an attempt with some results.
(use srfi-1)
(use extras)

(time (unfold (lambda (x) (= x 1000000))
              (lambda (x) (random 1000))
              (lambda (x) (+ x 1))
              0))

(time (let loop ([n 1000000] [r '()])
        (if (zero? n)
            r
            (loop (- n 1) (cons (random 1000) r)))))

(define (range min max body)
  (let loop ((current min) (ret '()))
    (if (= current max)
        ret
        (loop (+ current 1) (cons (body current ret) ret)))))

(time (range 0 1000000 (lambda params (random 1000))))
The results, compiled with csc -O3 t.scm, are:
0.331s CPU time, 0.17s GC time (major), 12/660 GCs (major/minor)
0.107s CPU time, 0.02s GC time (major), 1/290 GCs (major/minor)
0.124s CPU time, 0.022s GC time (major), 1/320 GCs (major/minor)
As you can see, the author's version is much slower than plain tail-recursive calls. It's hard to say why the unfold call is so much slower, but I'd guess it's because it spends a lot more time doing function calls.
The two other versions are quite similar. Mine is almost the same thing, except that I'm creating a higher-order function that can be reused.
Unlike the plain loop, it can be reused to build other range-based functions. The position and the list built so far are passed to the function in case they are needed.
The higher-order version is probably the best way to go, even if it takes a bit more time to execute; that too is probably because of the function calls. It could be optimized by removing the parameters, which would make it almost as fast as the named let.
The advantage of the higher-order version is that the user doesn't have to write the loop itself and can pass in an arbitrary lambda.
Edit
Looking at this specific case: if we are to create a million elements ranging between 0 and 999, we could create a fixed-length vector of a million entries with values from 0 to 999 in it, and then shuffle it. The whole random process would then depend on the shuffle function, which doesn't have to allocate new memory; swapping values might be faster than generating random numbers. That said, the shuffle method still relies on random.
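A sketch of the in-place shuffle step using the Fisher-Yates algorithm (assuming the usual (random n)):
(define (shuffle! vec)
  (let loop ((i (- (vector-length vec) 1)))
    (when (> i 0)
      (let* ((j (random (+ i 1)))          ; random index in [0, i]
             (tmp (vector-ref vec i)))
        (vector-set! vec i (vector-ref vec j))
        (vector-set! vec j tmp))
      (loop (- i 1))))
  vec)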
Edit 2
Unless you really need a list, you could get away with a vector instead.
Here is my second implementation with vector-map
(time (vector-map (lambda (x y) (random 1000)) (make-vector 1000000)))
# 0.07s CPU time, 0/262 GCs (major/minor)
As you can see, it is dramatically faster than using a list.
Edit 3 fun
(define-syntax bigint
  (er-macro-transformer
   (lambda (exp rename compare)
     (let ((lst (map (lambda (x) (random 1000)) (make-list (cadr exp)))))
       (cons 'list lst)))))
100000
0.004s CPU time, 0/8888 GCs (major/minor)
It's probably not a good idea to use this, but I felt it might be interesting. Since it's a macro, it gets executed at compile time; the compile time is huge, but as you can see, the speed improvement is also huge. Unfortunately, using CHICKEN, I couldn't get it to build a list of a million. My guess is that the type used to build the list overflows and accesses invalid memory.
To answer the question in the comments:
I'm not a Scheme professional. I'm pretty new to it too, and as I understand it, the named loop or the higher-order function should be the way to go. The higher-order function is good because it's reusable. You could define a
(define (make-random-list quantity maxran)
...)
That's the other interesting part: since Scheme is all about higher-order functions, you could then replace the implementation of make-random-list with anything you like. If you need some compile-time execution, define the macro; otherwise use a function. All that really matters is being able to reuse it. It has to be fast and not waste memory.
Common sense tells you that doing less work makes it faster, and tail-recursive calls aren't supposed to consume memory. When you're not sure, you can hide the implementation behind a function that can be optimized later.
MIT Scheme limits a computation's stack. Given the size of your problem, you are likely running out of stack size. Fortunately, you can provide a command-line option to change the stack size. Try:
$ mit-scheme --stack <number-of-1024-word-blocks>
There are other command-line options, check out mit-scheme --help
Note that MIT Scheme, in my experience, is one of the few schemes that has a limited stack size. This explains why trying your code in others Schemes will often succeed.
As to your question about efficiency: the routine unfold is probably not implemented with a tail-recursive/iterative algorithm. Here is a tail-recursive version, along with a tail-recursive 'reverse list in place':
(define (unfold stop value incr n0)
  (let collecting ((n n0) (l '()))
    (if (stop n)
        (reverse! l)
        (collecting (incr n) (cons (value n) l)))))

(define (reverse! list)
  (let reving ((list list) (rslt '()))
    (if (null? list)
        rslt
        (let ((rest (cdr list)))
          (set-cdr! list rslt)
          (reving rest list)))))
Note:
$ mit-scheme --version
MIT/GNU Scheme microcode 15.3
Copyright (C) 2011 Massachusetts Institute of Technology
This is free software; see the source for copying conditions. There is NO warranty; not even
for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Image saved on Tuesday November 8, 2011 at 10:45:46 PM
Release 9.1.1 || Microcode 15.3 || Runtime 15.7 || SF 4.41 || LIAR/x86-64 4.118 || Edwin 3.116
Moriturus te saluto.
