How can I speed up compilation of Common Lisp `IF` statements? - performance

I have a system that generates decision trees and converts them into nested Common Lisp if statements with predicates that check if a variable value is >= or <= a given integer e.g.
(LAMBDA (V1 V2)
  (IF (>= V1 2)
      (IF (<= V1 3)
          (IF (<= V2 3)
              (IF (>= V2 2) 16 (IF (>= V2 1) 6 0))
              (IF (<= V2 4) 10 0))
          (IF (<= V1 4)
              (IF (>= V2 1) (IF (<= V2 3) 6 0) 0)
              0))
      (IF (>= V1 1)
          (IF (>= V2 2) (IF (<= V2 4) 10 0) 0)
          0)))
I then use eval to compile the Lisp code, producing functions that run much faster than interpreting the original decision tree. This compilation step takes surprisingly long, though: a function with 5000 nested ifs takes over a minute to compile (in Clozure Common Lisp on a powerbook), even though generating the if statement took about 100 milliseconds. Why does such a simple structure take so long? Is there anything I can do to substantially speed it up, some declaration maybe? I'd greatly appreciate any pointers you can offer.

The actual portable function to compile functions is called COMPILE.
You can tell the Common Lisp compiler to invest less work via the optimize qualities: low values for speed, space and debug, and a high value for compilation-speed. Whether this has any influence depends on the implementation.
The Clozure CL compiler is usually not the brightest one, but it is relatively fast. Generally I think the compiler maintainers might be able to give you more hints on how to speed up compilation. Generally I would look at three things:
1. Tell the compiler to do less work: no type inference, no code optimization, no generation of debug information, no space-saving effort, ...
2. If necessary, tell the compiler things which it would otherwise have to infer. For example, instead of relying on type inference by the compiler, declare all the types during code generation. But that only makes sense if you actually get some advantage from type declarations, like increased runtime safety or code optimizations.
3. The compiler itself may have speed penalties which depend on the size of the source code. For example, if compile time is quadratic in code size, it increases by a factor of four when the code size doubles. Only the compiler maintainers may know what to do in those cases; maybe they would need to implement more efficient data structures or similar.
The next option is to use a Lisp interpreter. Interpreters usually have very little definition-time overhead, but the code usually runs much slower. In some problem domains it may be possible to follow a mixed approach: compile code which rarely changes and interpret code which changes often.

You could certainly (declare (optimize (compilation-speed 3))), and maybe reduce other qualities (see http://clhs.lisp.se/Body/d_optimi.htm#optimize).
However, I'd guess that the slow compilation is caused by the optimizations the compiler makes, so the result seems likely to be not so fast at execution time. But maybe not, you'd have to experiment.
I'd also think about what optimizations you could make yourself using your domain knowledge. Hints for that might also come from analyzing the output of disassemble on your generated functions.
Finally, maybe you can translate your decision trees into lookup tables, if the number of distinct values is not too big.
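As a rough illustration of the lookup-table idea, here is a minimal sketch. It is written in JavaScript rather than Lisp purely for illustration, and it assumes that both variables are integers in a small known range. The bounds MIN and MAX and the names tree, buildTable and lookup are made up for this example and do not come from the question.

```javascript
// Hand translation of the decision tree from the question (illustration only).
function tree(v1, v2) {
  if (v1 >= 2) {
    if (v1 <= 3) {
      if (v2 <= 3) return v2 >= 2 ? 16 : (v2 >= 1 ? 6 : 0);
      return v2 <= 4 ? 10 : 0;
    }
    if (v1 <= 4) return (v2 >= 1 && v2 <= 3) ? 6 : 0;
    return 0;
  }
  if (v1 >= 1) return (v2 >= 2 && v2 <= 4) ? 10 : 0;
  return 0;
}

// Assumed input range; these bounds are hypothetical, not from the question.
const MIN = 0, MAX = 5, SIZE = MAX - MIN + 1;

// Evaluate the tree once for every (v1, v2) pair in the assumed range.
function buildTable(fn) {
  const table = new Int32Array(SIZE * SIZE);
  for (let v1 = MIN; v1 <= MAX; v1++) {
    for (let v2 = MIN; v2 <= MAX; v2++) {
      table[(v1 - MIN) * SIZE + (v2 - MIN)] = fn(v1, v2);
    }
  }
  return table;
}

const table = buildTable(tree);

// Table lookup for in-range inputs; fall back to the tree otherwise.
function lookup(v1, v2) {
  if (v1 < MIN || v1 > MAX || v2 < MIN || v2 > MAX) return tree(v1, v2);
  return table[(v1 - MIN) * SIZE + (v2 - MIN)];
}
```

Building the table costs one tree evaluation per cell, so this only pays off while the product of the value ranges stays small.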

Related

About speed of procedures between user-made and built-in in scheme (related with SICP exercise 1.23)

My original question was very long, so here is a short version first.
In Scheme, do user-defined procedures take more time than built-in procedures
(assuming both do the same thing)?
That is the short version of my question.
The long version follows.
Exercise 1.23 (quoted below): why isn't the (next) procedure twice as fast as (+ 1)?
These are my guesses.
Reason 1: (next) contains an if (a special form), which takes time.
Reason 2: the function call itself takes time.
http://community.schemewiki.org/?sicp-ex-1.23 says reason 1 is right.
I want to know whether reason 2 is also right.
So I rewrote the (next) procedure. I removed the if and checked divisibility by 2 just once before using (next), so that (next) only does + 2. Then I measured the time again. It was faster than before, but the speedup was still not 2. So I rewrote it again, replacing (next) with (+ num 2). Finally the speedup became 2, or almost 2. That is why I suspect reason 2. I want to know what the correct answer is.
PS: I'm also curious why some primes are tested (much?) faster than others. That doesn't make sense to me, because if a number n is prime, the process has to check every divisor from 2 to sqrt(n). But some numbers are tested faster. Do you know why?
Exercise 1.23. The smallest-divisor procedure shown at the start of this section does lots of needless testing: After it checks to see if the number is divisible by 2 there is no point in checking to see if it is divisible by any larger even numbers. This suggests that the values used for test-divisor should not be 2, 3, 4, 5, 6, ..., but rather 2, 3, 5, 7, 9, .... To implement this change, define a procedure next that returns 3 if its input is equal to 2 and otherwise returns its input plus 2. Modify the smallest-divisor procedure to use (next test-divisor) instead of (+ test-divisor 1). With timed-prime-test incorporating this modified version of smallest-divisor, run the test for each of the 12 primes found in exercise 1.22. Since this modification halves the number of test steps, you should expect it to run about twice as fast. Is this expectation confirmed? If not, what is the observed ratio of the speeds of the two algorithms, and how do you explain the fact that it is different from 2?
Where the book is :
http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-11.html#%_sec_1.2.6
Your long and short questions actually address two different problems.
Exercise 1.23 (quoted below): why isn't the (next) procedure twice as fast as (+ 1)?
You provide two possible reasons to explain the relative lack of speedup (and a 30% speed gain is already a good achievement):
the apparent use of a function, next, instead of a simple arithmetic expression (as I understand it from your explanation);
the inclusion of a test in that same next function.
The first reason is an illusion: (+ 1) is a function incrementing its argument, so in both cases there is a function call (although the increment function is certainly a builtin one, a point which is addressed by your other question).
The second reason is indeed relevant: any test in a block of code will introduce a potential redirection in the code flow (a jump from the current executing instruction to some other address in the program), which may incur a delay. Note that this is analogous to a function call, which will also induce an address jump, so both reasons actually resolve to only one potential cause.
Regarding your short question, builtin functions are indeed usually faster, because the compiler is able to apply a special treatment to them in certain cases. This is due to the facts that:
Knowing the semantics of the builtins, compiler designers are able to include special rules pertaining to the algebraic properties of these builtins, and for instance fuse successive increments into a single call, or suppress a combination of increment and decrement called in sequence.
A builtin function call, when not optimised away, will be converted into a native machine code function call, which may not have to abide by all the calling convention rules. If your Scheme compiler produces machine code from the source, then there might be only a marginal gain, but if it produces so-called bytecode, the gain might be quite substantial, since user-written functions will be translated to that bytecode format and still require some form of interpretation. If you are only using an interpreter, then the gain is even more important.
I believe this is highly implementation and setting dependent. In many implementations there are different kinds of optimizations and in some there are none. To get the best performance you may need to compile your code or use settings to reduce debug information / stack traces. Getting the best performance in one implementation can worsen the performance in another.
Primitive procedures are usually compiled to native code, and in some implementations, like Ikarus, they are even inlined. (When you do (map car lst), Ikarus changes it to (map (lambda (x) (car x)) lst), since car isn't a procedure.) Lambdas are supposed to be cheap. Remember that many Scheme implementations convert your code to CPS, which means one procedure call for each expression in the body of a procedure. It will never be as fast as machine code, since it needs to load closure variables.
To check which of the two explanations holds for your implementation, make next do the same as it originally did but without the if, i.e. just increment the argument. The only difference now is the extra call and nothing else. Then you can inline next by writing its code directly in your procedure, substituting the arguments for the parameters. If it is still slower, then the if is the cause. You need to run the tests several times, preferably with enough primes that a run takes a minute or so. Use time or similar in the Scheme implementation to get the differences in ms. I use the Unix time command as well to see how the OS reports it.
You should also test whether you get the same result in some other implementation. It's not as if there aren't enough Scheme implementations out there, so knock yourself out! The differences between them might amaze you. I always use Racket (raco exe source to make an executable) and Ikarus. When doing a large test I include Chicken, Gambit and Chibi.

Improve performance of a ClojureScript program

I have a ClojureScript program that mainly performs math calculations on collections. It was developed in idiomatic, host-independent Clojure, so it's easy to benchmark. To my surprise (and contrary to what the answers to "Which is faster, Clojure or ClojureScript (and why)?" would suggest), the same code in ClojureScript runs 5-10 times slower than its Clojure equivalent.
Here is what I did. I opened a lein repl and a browser repl at http://clojurescript.net/. Then I tried these snippets in both REPLs.
(time (dotimes [x 1000000] (+ 2 8)))
(let [coll (list 1 2 3)] (time (dotimes [x 1000000] (first coll))))
Then I opened a javascript console at the browser repl and wrote a minimalist benchmark function,
// Calls fun() count times and returns the elapsed time in milliseconds.
function benchmark(count, fun) {
  var t0 = new Date();
  for (var i = 0; i < count; i++) {
    fun();
  }
  var t1 = new Date();
  return t1.getTime() - t0.getTime();
}
Back to the browser REPL:
(defn multiply [] (* 42 1.2))
Then try both native javascript multiplication, and its clojurescript variant in the javascript console,
benchmark(1000000, cljs.user.multiply);
benchmark(1000000, function(){ 42 * 1.2 });
What I found
Native javascript math is comparable to math in clojure
ClojureScript is 5-10 times slower than either of them
Now my question is, how can I improve the performance of my ClojureScript program?
There are some approaches I've considered so far
Fall back to using mutable javascript arrays and objects behind the scenes. (Is this possible at all?)
Fall back to using native javascript math operators. (Is this possible at all?)
Use javascript arrays explicitly with (aget js/v 0)
Use a less ambitious implementation of clojure-for-javascript, like https://github.com/chlorinejs/chlorine or https://github.com/gozala/wisp. They generate more idiomatic JavaScript, but they don't support namespaces, which I'm using a lot.
JavaScript has explicit return, so
function () { 42 * 1.2 }
does nothing; you'll want to benchmark
function () { return 42 * 1.2 }
instead. This happens to be exactly what the ClojureScript version compiles to, so there won't be any difference (in ClojureScript, basic arithmetic functions in non-higher-order usage get inlined as regular operator-based JavaScript expressions).
Now, Clojure is definitely faster than ClojureScript at this point. Part of the reason is that Clojure is still more carefully tuned than ClojureScript, although ClojureScript is improving at a pretty great pace in this department. Another part is that Clojure has a more mature JIT to take advantage of (the modern JS engines, V8 in particular, are pretty great, but not quite HotSpot-grade just yet).
The magnitude of the difference is somewhat tricky to measure, though; the fact that JITs are involved means that a loop with a body free of any side effects, such as the one in the question, will likely be optimized away, possibly even on the first run through it (through the use of on-stack replacement, used by HotSpot and I think also V8 -- I'd have to check to be sure though). So, better to benchmark something like
(def arr (long-array 1))
;;; benchmark this
(dotimes [_ 1000000]
  (aset (longs arr) 0 (inc (aget (longs arr) 0))))
(longs call to avoid reflection in Clojure; could also use ^longs hint).
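The same precaution can be applied to the JavaScript-side benchmark from the question: return the value and consume it, so that the call has an observable effect. A minimal sketch follows. The name benchmarkWithSink and the sink accumulator are invented here, and the snippet does not prove whether a given engine would really eliminate the original loop.

```javascript
// Like the question's benchmark, but every result is added to a sink
// so that the work inside fun() has an observable effect.
function benchmarkWithSink(count, fun) {
  var sink = 0;
  var t0 = Date.now();
  for (var i = 0; i < count; i++) {
    sink += fun();
  }
  var t1 = Date.now();
  return { elapsedMs: t1 - t0, sink: sink };
}

function noReturn() { 42 * 1.2; }          // evaluates and discards: returns undefined
function withReturn() { return 42 * 1.2; } // returns the product

var r = benchmarkWithSink(1000000, withReturn);
console.log(r.elapsedMs + " ms, sink = " + r.sink);
```

Passing noReturn instead makes the sink NaN, which is a quick way to notice a benchmarked function that forgot its return.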
Finally, it certainly is the case, in both Clojure and ClojureScript, that for certain kinds of particularly performance-sensitive code it's best to use native arrays and such. Happily, there's no problem with doing so: on the ClojureScript side, you've got array, js-obj, aget, aset, make-array, you can use :mutable metadata on fields in deftype to be able to set! them in method bodies etc.
ClojureScript math is JavaScript math. Yes, if performance is critical, use JavaScript arrays and the provided low-level operators, these are guaranteed to produce optimal code where possible (i.e. no higher order usage). The ClojureScript persistent data structures are written this way: array mutation, arithmetic, bit twiddling.
I have a small example of efficient ClojureScript - http://github.com/swannodette/cljs-stl/blob/master/src/cljs_stl/spectral/demo.cljs that you might find useful as a guide.

Get the least significant bit of a number in Scheme

How do I get the least significant bit of a (whole) number in Scheme?
Is this a fairly easy thing to do?
Thanks,
If your Scheme implementation supports SRFI 60, you can simply use:
(logand n 1)
(bitwise-and is a synonym that some Scheme implementations prefer, so if you don't have logand, check for bitwise-and too.)
Otherwise, if you are willing to assume you're on a two's-complement system (all mainstream platforms are), you can also use odd? or even?:
(if (odd? n) 1 0)
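For comparison, the same two approaches (a bitwise AND with 1, and a parity test) can be sketched in JavaScript. This is only an illustration in another language, not Scheme code; lsb and lsbByParity are names made up for this sketch, and the inputs are assumed to be integers.

```javascript
// Bitwise AND with 1 keeps only the lowest bit.
function lsb(n) {
  return n & 1;
}

// Parity test, the analogue of (if (odd? n) 1 0).
function lsbByParity(n) {
  return n % 2 === 0 ? 0 : 1;
}

console.log(lsb(6), lsb(7), lsb(-3)); // prints "0 1 1"
```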

What is the fastest way to do integer division?

Using Scheme, I need the following function. (All arguments are natural numbers, [0, inf).)
(define safe-div
  (lambda (num denom safe)
    (if (zero? denom)
        safe
        (div num denom))))
However, this function is called quite often and is not performing well enough (speed wise).
Is there a more efficient way of implementing the desired behavior (integer division of num and denom, return safe value if denom is zero)?
Note: I am using Chez Scheme; however, this is used in a library that imports only rnrs, not full Chez.
For maximum performance, you need to get as close to the silicon as possible. Adding safety checks like this isn't going to do it, unless they get just-in-time compiled into ultra-efficient machine code by the scheme system.
I see two options. One is to create a native (i.e. foreign) implementation in C (or assembly) and invoke that. That might not be compatible with packaging it as a lambda, but then again, the dynamic nature of lambdas leads to notational efficiency but not necessarily runtime efficiency. (Function pointers excepted, there's a reason lambda expressions are not present in C, despite being many years older.) If you go this route, it might be best to take a step back and see if the larger processing of which safe-div is a part should be taken native. There's little point in speeding up the division at the center of a loop if everything around it is still slow.
Assuming that division by zero is expected to be rare, another approach is to just use div and hope its implementation is fast. Yes, this can lead to division by zero, but when it comes to speed, sometimes it is better to beg forgiveness than to ask permission. In other words, skip the checking before the division and just do it. If it fails, the scheme runtime should catch the division by zero fault and you can install an exception handler for it. This leads to slower code in the exceptional case and faster code in the normal case. Hopefully this tradeoff works out to a performance win.
Lastly, depending on what you are dividing by, it might be faster to multiply by the reciprocal than to perform an actual division. This requires fast reciprocal computation or revising earlier computations to yield a reciprocal directly. Since you are dealing with integers, the reciprocal would be stored in fixed-point, which is essentially 2^32 * 1/denom. Multiply this by num and shift right by 32 bits to get the quotient. This works out to a win because more processors these days have single cycle multiply instructions, but division executes in a loop on the chip, which is much slower. This might be overkill for your needs, but could be useful at some point.
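The reciprocal trick can be sketched as follows, in JavaScript for illustration only. Two details are assumptions of this sketch rather than part of the answer above: the reciprocal is rounded up (ceil), and the inputs are limited to 0 <= num < 65536 and 1 <= denom <= 65536, a range in which the truncated product matches integer division. The names reciprocal and divByReciprocal are hypothetical.

```javascript
const TWO_32 = 4294967296; // 2^32

// Fixed-point reciprocal of denom, rounded up: ceil(2^32 / denom).
// Rounding up is a choice made for this sketch so that truncation
// never undershoots the true quotient.
function reciprocal(denom) {
  return Math.ceil(TWO_32 / denom);
}

// Quotient via multiply and "shift". JavaScript's >> works on 32-bit
// values only, so the shift by 32 is written as floor(x / 2^32).
// Assumes 0 <= num < 65536 and 1 <= denom <= 65536.
function divByReciprocal(num, recip) {
  return Math.floor((num * recip) / TWO_32);
}

const r7 = reciprocal(7);              // computed once, reused for many divisions
console.log(divByReciprocal(100, r7)); // prints 14
```

The gain only materialises when the same divisor is reused many times, since computing the reciprocal itself requires a division.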

How can I tell if my tail-recursive Scheme function is being optimized correctly

I have a Scheme function whose basic form looks like this:
(define (foo param var)
  (cond ((end-condition) (return-something))
        ((other-end-condit) (return-something-else))
        (else
         (let ((newvar (if some-condition
                           (make-some-updated var)
                           (destructive-update! var))))
           (foo param newvar)))))
I feel this is pretty clearly something that should be optimized into iteration at compile time, but when I compile it (with Chicken) it still runs incredibly slowly. (If I understand the R5RS spec correctly, http://groups.csail.mit.edu/mac/ftpdir/scheme-reports/r5rs-html.old/r5rs_22.html, this should work.)
I wrote the exact same algorithm with a while loop in Python, and the interpreted program terminated in seconds. My compiled Scheme takes about 15 minutes, and I am positive the algorithm is the same.
I think this is a case of tail recursion not getting optimized, as I can't think what else it could possibly be, but I can't figure it out. Any ideas? For what it's worth, var is a hash and the destructive update merely adds an element, although it also returns the updated hash to be passed in as newvar.
That function is indeed tail-recursive, so you're good there. However, tail recursion just means that stack space won't grow, not that your program is guaranteed to run fast. If you want to see if your program is really running tail-recursively, run it while watching the total memory taken by Chicken (and make sure you aren't allocating memory in make-some-updated, which you might be). If the memory grows, then Chicken isn't compiling your program correctly according to the standard.
