Why do function calls slow things down in Clojure?

I've been playing around with the code from Is Clojure Still Fast? (and its prequel, Clojure is Fast). It seemed unfortunate that inlining the differential equation (f) is one of the steps taken to improve performance. The cleanest/fastest thing I've been able to come up with without doing this is the following:
; As in the referenced posts, for giving a rough measure of cycles/iteration (I know this is a very rough
; estimate...)
(def cpuspeed 3.6) ;; My computer runs at 3.6 GHz
(defmacro cyclesperit [expr its]
  `(let [start# (. System (nanoTime))
         ret# ( ~@expr (/ 1.0 ~its) ~its )
         finish# (. System (nanoTime))]
     (println (int (/ (* cpuspeed (- finish# start#)) ~its)))))
;; My solution
(defn f [^double t ^double y] (- t y))
(defn mysolveit [^double t0 ^double y0 ^double h ^long its]
  (if (> its 0)
    (let [t1 (+ t0 h)
          y1 (+ y0 (* h (f t0 y0)))]
      (recur t1 y1 h (dec its)))
    [t0 y0 h its]))
; => 50-55 cycles/it
; The fastest solution presented by the author (John Aspden) is
(defn faster-solveit [^double t0 ^double y0 ^double h ^long its]
  (if (> its 0)
    (let [t1 (+ t0 h)
          y1 (+ y0 (* h (- t0 y0)))]
      (recur t1 y1 h (dec its)))
    [t0 y0 h its]))
; => 25-30 cycles/it
The type hinting in my solution helps quite a bit (it's 224 cycles/it without type hinting on either f or solveit), but it's still nearly 2x slower than the inlined version. Ultimately this performance is still pretty decent, but this hit is unfortunate.
Why is there such a performance hit for this? Is there a way around it? Are there plans to find ways of improving this? As pointed out by John in the original post, it seems funny/unfortunate for function calls to be inefficient in a functional language.
Note: I'm running Clojure 1.5 and have :jvm-opts ^:replace [] in a project.clj file so that I can use lein exec/run without it slowing things down (and it will if you don't do this I discovered...)

Benchmarking in the presence of a JIT compiler is tricky; you really must allow for a warm-up period, but then you also can't just run it all in a loop, since it may then be proved a no-op and optimized away. In Clojure, the usual solution is to use Hugo Duncan's Criterium.
Running a Criterium benchmark for (solveit 0.0 1.0 (/ 1.0 1000000) 1000000) for both versions of solveit results in pretty much exactly the same timings on my machine (mysolveit ~3.44 ms, faster-solveit ~3.45 ms). That's in a 64-bit JVM run with -XX:+UseConcMarkSweepGC, using Criterium 0.4.2 (criterium.core/bench). Presumably HotSpot just inlines f. In any case, there's no performance hit at all.
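To make the warm-up point concrete, here is a minimal sketch of a hand-rolled timing loop that uses only System/nanoTime. The helper name mean-ns and the iteration counts are assumptions made for illustration, and the optional ^double return hint on f is an addition. It is much cruder than Criterium (no GC control, no statistics), so treat its numbers as rough.

```clojure
;; f and the two solvers follow the question; the ^double return hint is optional.
(defn f ^double [^double t ^double y] (- t y))

(defn mysolveit [^double t0 ^double y0 ^double h ^long its]
  (if (> its 0)
    (let [t1 (+ t0 h)
          y1 (+ y0 (* h (f t0 y0)))]
      (recur t1 y1 h (dec its)))
    [t0 y0 h its]))

(defn faster-solveit [^double t0 ^double y0 ^double h ^long its]
  (if (> its 0)
    (let [t1 (+ t0 h)
          y1 (+ y0 (* h (- t0 y0)))]
      (recur t1 y1 h (dec its)))
    [t0 y0 h its]))

;; Hypothetical helper (not part of Criterium): call thunk `warmup` times so
;; the JIT gets a chance to compile it, then return the mean nanoseconds per
;; call over `runs` timed calls.
(defn mean-ns [thunk warmup runs]
  (let [sink (atom nil)]                      ; keep results reachable
    (dotimes [_ warmup] (reset! sink (thunk)))
    (let [start (System/nanoTime)]
      (dotimes [_ runs] (reset! sink (thunk)))
      (/ (- (System/nanoTime) start) (double runs)))))

(println "mysolveit      ns/call:"
         (mean-ns #(mysolveit 0.0 1.0 1.0e-4 10000) 200 200))
(println "faster-solveit ns/call:"
         (mean-ns #(faster-solveit 0.0 1.0 1.0e-4 10000) 200 200))
```

Both solvers perform the same floating-point operations in the same order, so they should return identical results; only the timings can differ.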

Adding to the already good answers, the JVM JIT most often does inline the primitive function calls when warmed up, and in this case, when you bench it with a warmed JIT you see the same results. Just wanted to say Clojure also has an inlining feature though for cases where that yields benefits.
(defn f
  {:inline-arities #{2}
   :inline (fn [t y] `(- (double ~t) (double ~y)))}
  ^double [^double t ^double y]
  (- t y))
Now Clojure will compile away the calls to f, inlining the function at compile time, whereas the JIT otherwise inlines the function at runtime as needed.
Also note that I added a ^double type hint to the return of f. If you don't do that, f gets compiled to return Object and a cast needs to be added. I'm not sure if that really affects performance much, but if you want a fully primitive function that takes primitives and returns primitives, you need to type hint the return as well.
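As a small illustration of what the :inline metadata does, the sketch below calls the inline function directly to show the form that replaces a two-argument call to f. The symbols t0 and y0 are placeholders chosen for this example.

```clojure
(defn f
  {:inline-arities #{2}
   :inline (fn [t y] `(- (double ~t) (double ~y)))}
  ^double [^double t ^double y]
  (- t y))

;; The :inline value is an ordinary function from argument forms to a
;; replacement form; the compiler applies it to calls of a matching arity.
(println ((:inline (meta #'f)) 't0 'y0))

;; f is still a real function, so it can be passed around as a value.
(println (reduce f [10.0 3.0 2.0]))
```

Because the expansion happens at compile time, callers compiled before f is redefined keep the old inlined body, which is one reason to reserve this for small, stable functions.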

Related

How does this Scheme code return a value?

This code is taken from Sussman and Wisdom's Structure and Interpretation of Classical Mechanics. Its purpose is to derive (close to) the smallest positive floating-point number the host machine supports.
https://github.com/hnarayanan/sicm/blob/e37f011db68f8efc51ae309cd61bf497b90970da/scmutils/src/kernel/numeric.scm
Running it in DrRacket results in 2.220446049250313e-016 on my machine.
My question: what causes this to even return a value? This code is tail recursive, and it makes sense at some point the computer can no longer divide by 2. Why does it not throw?
(define *machine-epsilon*
(let loop ((e 1.0))
(if (= 1.0 (+ e 1.0))
(* 2 e)
(loop (/ e 2)))))
*machine-epsilon*
"This code is tail recursive, and it makes sense at some point the computer can no longer divide by 2. Why does it not throw?"
No, the idea is different: at some point the computer still can divide by 2, but the result (e) becomes indistinguishable from 0 [upd: in the context of floating-point addition only - very good point mentioned in the comment] (e + 1.0 = 1.0, which is exactly what the if clause is checking). We know for sure that the previous e was still greater than zero "from the machine point of view" (otherwise we wouldn't get to the current execution point), so we simply return e*2.
This form of let-binding is syntactic sugar for recursion.
Until you master the language, you may want to avoid extra syntax and write as much as possible in the kernel language, so you can focus on the essential problem. For example, the full SICP text never introduces this syntactic sugar for iteration.
The r6rs definition for iteration is here.
The purpose of this code is not to find the smallest float that the machine can support: it is to find the smallest float epsilon such that (= (+ 1.0 epsilon) 1.0) is false. This number is useful because it's the upper bound on the error you get from adding numbers. In particular, what you know is that, say, (+ x y) is in the range [(x+y)*(1 - epsilon), (x+y)*(1 + epsilon)], where in the second expression + &c mean the ideal operations on numbers.
In particular (/ *machine-epsilon* 2) is a perfectly fine number, as is (/ *machine-epsilon* 10000) for instance, and (* (/ *machine-epsilon* x) x) will be very close to *machine-epsilon* for many reasonable values of x. It's just the case that (= (+ (/ *machine-epsilon* 2) 1.0) 1.0) is true.
I'm not familiar enough with floating-point standards, but the number you are probably thinking of is what Common Lisp calls least-positive-double-float (or its variants). In Racket you can derive some approximation to this by
(define *least-positive-mumble-float*
;; I don't know what float types Racket has if it even has more than one.
(let loop ([t 1.0])
(if (= (/ t 2) 0.0)
t
(loop (/ t 2)))))
I am not sure if this is allowed to raise an exception: it does not in practice and it gets a reasonable-looking answer.
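For readers following along on the JVM, here is a sketch of both loops transliterated to Clojure (the var names are chosen for this example). Assuming IEEE 754 doubles, which is what the JVM uses, the first loop should land on the spacing between 1.0 and the next double, and the second on the smallest subnormal double.

```clojure
;; Same loop as the Scheme original: halve e until adding it to 1.0
;; no longer changes the sum, then step back by one halving.
(def machine-epsilon
  (loop [e 1.0]
    (if (= 1.0 (+ e 1.0))
      (* 2 e)
      (recur (/ e 2)))))

;; Halve t until one more halving would underflow to 0.0.
(def least-positive-double
  (loop [t 1.0]
    (if (= (/ t 2) 0.0)
      t
      (recur (/ t 2)))))

(println machine-epsilon (= machine-epsilon (Math/ulp 1.0)))
(println least-positive-double (= least-positive-double Double/MIN_VALUE))
;; Half of epsilon is a perfectly good number; it just vanishes next to 1.0.
(println (/ machine-epsilon 2) (= 1.0 (+ 1.0 (/ machine-epsilon 2))))
```

Neither loop throws, because dividing a double by 2 always yields another double (eventually 0.0), never an error.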
It becomes clearer when you get rid of the confusing named let notation.
(define (calculate-epsilon (epsilon 1.0))
(if (= 1.0 (+ 1.0 epsilon))
(* epsilon 2)
(calculate-epsilon (/ epsilon 2))))
(define *machine-epsilon* (calculate-epsilon))
This is what the code actually does.
So now we see what the named let expression is good for.
It defines a function locally and runs it. The problem is that loop is a very imprecise and confusing name for that function, and abbreviating epsilon to e is a very unfortunate choice. Naming is the most important thing for readable code.
So this example of SICP should serve as an example of bad naming choices. (Okay, maybe they did it intentionally to train the students.)
The named let defines and calls a function/procedure. Avoiding it would lead to clearer, and therefore better, code.
In Common Lisp such a construct can be expressed much more clearly:
(defparameter *machine-epsilon*
(labels ((calculate-epsilon (&optional (epsilon 1.0))
(if (= 1.0 (+ 1.0 epsilon))
(* epsilon 2)
(calculate-epsilon (/ epsilon 2)))))
(calculate-epsilon)))
In CLISP implementation, this gives: 1.1920929E-7

How can I trace code execution in Clojure?

While learning Clojure, I sometimes need to see what a function does at each step. For example:
(defn kadane [coll]
(let [pos+ (fn [sum x] (if (neg? sum) x (+ sum x)))
ending-heres (reductions pos+ 0 coll)]
(reduce max ending-heres)))
Should I insert println here and there (where, how); or is there a suggested workflow/tool?
This may not be what you're after at the level of a single function (see Charles Duffy's comment below), but if you wanted to get an overview of what's going on at the level of a namespace (or several), you could use tools.trace (disclosure: I'm a contributor):
(ns foo.core)
(defn foo [x] x)
(defn bar [x] (foo x))
(in-ns 'user) ; standard REPL namespace
(require '[clojure.tools.trace :as trace])
(trace/trace-ns 'foo.core)
(foo.core/bar 123)
TRACE t20387: (foo.core/bar 123)
TRACE t20388: | (foo.core/foo 123)
TRACE t20388: | => 123
TRACE t20387: => 123
It won't catch inner functions and such (as pointed out by Charles), and might be overwhelming with large code graphs, but when exploring small-ish code graphs it can be quite convenient.
(It's also possible to trace individually selected Vars if the groups of interest aren't perfectly aligned with namespaces.)
If you use Emacs with CIDER as most Clojurians do, you already have a built-in debugger:
https://docs.cider.mx/cider/debugging/debugger.html
Chances are your favorite IDE/Editor has something built-in or a plugin already.
There is also (in no particular order):
spyscope
timbre/spy
tupelo/spyx
sayid
tools.trace
good old println
I would look at the above first. However there were/are other possibilities:
https://gist.github.com/ato/252421
https://github.com/philoskim/debux
https://github.com/pallet/ritz/tree/develop/nrepl-core
https://github.com/hozumi/eyewrap
probably many more
Also, if the function is simple enough you can add defs at development-time to peek inside the bindings at a given time inside your function.
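A minimal sketch of the development-time def trick, applied to the kadane function from the question. The var name dbg-ending-heres is made up for this example, and the def should be removed once you are done exploring.

```clojure
(defn kadane [coll]
  (let [pos+ (fn [sum x] (if (neg? sum) x (+ sum x)))
        ending-heres (reductions pos+ 0 coll)]
    ;; Development only: stash the intermediate value in a top-level var
    ;; so it can be inspected from the REPL after the call.
    (def dbg-ending-heres ending-heres)
    (reduce max ending-heres)))

(println (kadane [-2 1 -3 4 -1 2 1 -5 4]))
(println dbg-ending-heres)
```

Each call overwrites the var, so this only shows the bindings from the most recent invocation.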
Sayid is a tool presented at Clojure Conj 2016 that's directly appropriate to the purpose and comes with an excellent Emacs plugin. See the talk at which it was presented.
To see inside invocations of transient functions, see ws-add-inner-trace-fn (previously, ws-add-deep-trace-fn).
I frequently use the spyx and related functions like spy-let from the Tupelo library for this purpose:
(ns tst.clj.core
(:require [tupelo.core :as t] ))
(t/refer-tupelo)
(defn kadane [coll]
(spy-let [ pos+ (fn [sum x] (if (neg? sum) x (+ sum x)))
ending-heres (reductions pos+ 0 coll) ]
(spyx (reduce max ending-heres))))
(spyx (kadane (range 5)))
will produce output:
pos+ => #object[tst.clj.core$kadane$pos_PLUS___21786 0x3e7de165 ...]
ending-heres => (0 0 1 3 6 10)
(reduce max ending-heres) => 10
(kadane (range 5)) => 10
IMHO it is hard to beat a simple println or similar for debugging. Log files are also invaluable as you get closer to production.
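If you would rather not pull in a library, a println-based helper along these lines is easy to write yourself. This is a sketch only: spy-expr is a hypothetical name, not the Tupelo or Timbre implementation.

```clojure
;; Hypothetical helper: print the expression and its value, then return
;; the value so the macro can wrap any sub-expression in place.
(defmacro spy-expr [expr]
  `(let [v# ~expr]
     (println '~expr "=>" v#)
     v#))

(defn kadane [coll]
  (let [pos+ (fn [sum x] (if (neg? sum) x (+ sum x)))
        ending-heres (spy-expr (reductions pos+ 0 coll))]
    (spy-expr (reduce max ending-heres))))

(kadane (range 5))
```

Because the macro returns the wrapped value unchanged, it can be added and removed without altering what the function computes.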

Why isn't this function showing a performance speedup when its primary constituent function does?

I am optimizing a program I've been working on, and have hit a wall. The function julia-subrect maps over for-each-pixel a large number of times. I've optimized for-each-pixel to have a ~16x speedup. However, my optimized version of julia-subrect shows no evidence of this. Here are my benchmarks and relevant code:
; ======== Old `for-each-pixel` ========
;(bench (julia/for-each-pixel (->Complex rc ic) max-itrs radius r-min x-step y-step [xt yt])))
;Evaluation count : 3825300 in 60 samples of 63755 calls.
;Execution time mean : 16.018466 µs
; ======== New `for-each-pixel`. optimized 16x. ========
;(bench (julia/for-each-pixel-opt [rc ic] [max-itrs radius r-min] [x-step y-step] [xt yt])))
;Evaluation count : 59542860 in 60 samples of 992381 calls.
;Execution time mean : 1.038955 µs
(defn julia-subrect [^Long start-x ^Long start-y ^Long end-x ^Long end-y ^Long total-width ^Long total-height ^Complex constant ^Long max-itrs]
(let [grid (for [y (range start-y end-y)]
(vec (for [x (range start-x end-x)]
[x y])))
radius (calculate-r constant)
r-min (- radius)
r-max radius
x-step (/ (Math/abs (- r-max r-min)) total-width)
y-step (/ (Math/abs (- r-max r-min)) total-height)
; Uses old implementation of `for-each-pixel`
calculate-pixel (partial for-each-pixel constant max-itrs radius r-min x-step y-step)
for-each-row (fn [r] (map calculate-pixel r))]
(map for-each-row grid)))
; ======== Old `julia-subrect` ========
;(bench (doall (julia/julia-subrect start-x start-y end-x end-y total-width total-height c max-itrs))))
;Evaluation count : 22080 in 60 samples of 368 calls.
;Execution time mean : 2.746852 ms
(defn julia-subrect-opt [[^long start-x ^long start-y ^long end-x ^long end-y] [^double rc ^double ic] total-width total-height max-itrs ]
(let [grid (for [y (range start-y end-y)]
(vec (for [x (range start-x end-x)]
[x y])))
radius (calculate-r-opt rc ic)
r-min (- radius)
r-max radius
x-step (/ (Math/abs (- r-max r-min)) total-width)
y-step (/ (Math/abs (- r-max r-min)) total-height)
;Uses new implementation of `for-each-pixel`
calculate-pixel (fn [px] (for-each-pixel-opt [rc ic] [max-itrs radius r-min] [x-step y-step] px))
for-each-row (fn [r] (map calculate-pixel r))]
(map for-each-row grid)))
; ======== New `julia-subrect`, but no speedup ========
;(bench (doall (julia/julia-subrect-opt [start-x start-y end-x end-y] [rc ic] total-width total-height max-itrs))))
;Evaluation count : 21720 in 60 samples of 362 calls.
;Execution time mean : 2.831553 ms
Here is a gist containing source code for all the functions I've specified:
https://gist.github.com/johnmarinelli/adc5533c19fb0b6d74cf4ef04ae55ee6
So, can anyone tell me why julia-subrect is showing no signs of speedup? Also, I'm still new to clojure so bear with me if the code is unidiomatic/ugly. Right now, I'm focusing on making the program run quicker.
As a general guideline:
profile!
actually get around to profiling, like for real ;-)
remove reflection (looks like you did this)
split the operations into easy to think about functions
remove laziness (transducers should be the last step in this part)
combine steps using loop/recur to make your code impossible to figure out and slightly faster (this is the last step for a reason)
Specifically thinking about the code you posted:
At a glance, it looks like this function will spend much of its time generating a lazy list of values in the for loop, which are then immediately realized (evaluated so they are no longer lazy), so the time spent generating that structure is wasted. You may consider changing this to produce vectors directly; mapv is useful for this.
The second part is the call to map in for-each-row which will produce a lot of intermediate data structures. For that one you may consider using a non-lazy expression like mapv or loop/recur.
It looks like you have done steps 2-4 already, and there is no obvious reason for you to skip to the last step. I'd spend the next couple of hours on limiting laziness and, if you have to, learning about transducers.
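The mapv suggestion could look roughly like the sketch below. Here eager-grid and pixel-fn are hypothetical names; pixel-fn stands in for the real per-pixel computation and receives an [x y] pair, as for-each-pixel-opt does in the question.

```clojure
;; Build the rows and columns eagerly as vectors, with no intermediate
;; lazy seqs and no separate grid of [x y] coordinates.
(defn eager-grid [start-x start-y end-x end-y pixel-fn]
  (mapv (fn [y]
          (mapv (fn [x] (pixel-fn [x y]))
                (range start-x end-x)))
        (range start-y end-y)))

;; Toy pixel function so the shape of the result is easy to see.
(println (eager-grid 0 0 3 2 (fn [[x y]] (+ x (* 10 y)))))
```

Whether this actually helps is something only a profiler can confirm; the point is just that the result is fully realized as soon as the function returns.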

Clojure core.async for data computation

I've started using the clojure core.async library. I found the concepts of CSP, channels, go blocks really easy to use. However, I'm not sure if I'm using them right. I've got the following code -
(def x-ch (chan))
(def y-ch (chan))
(def w1-ch (chan))
(def w2-ch (chan))
; they all return matrices
(go (>! x-ch (Mat/* x (map #(/ 1.0 %) (max-fold x)))))
(go (>! y-ch (Mat/* y (map #(/ 1.0 %) (max-fold y)))))
(go (>! w1-ch (gen-matrix 200 300)))
(go (>! w2-ch (gen-matrix 300 100)))
(let [x1 (<!! (go (<! x-ch)))
y1 (<!! (go (<! y-ch)))
w1 (<!! (go (<! w1-ch)))
w2 (<!! (go (<! w2-ch)))]
;; do stuff w/ x1 y1 w1 w2
)
I've got predefined (matrix) vectors in symbols x and y. I need to modify both vectors before I use them. Those vectors are pretty large. I also need to generate two random matrices. Since the go macro starts the computation asynchronously, I split all four computation tasks into separate go blocks and put the resulting values into channels. Then I've got a let block where I take values from the channels and store them into symbols. They are all using blocking <!! take functions since they're on the main thread.
What I'm trying to do basically is speed up my computation time by splitting program fragments into async processes. Is this the right way to do it?
For this kind of processing, future may be slightly more appropriate.
The example from the link is simple to grasp:
(def f
(future
(Thread/sleep 10000)
(println "done")
100))
The future block starts processing immediately, so the above starts a thread, waits for 10 seconds, and prints "done" when finished.
When you need the value you can just use:
(deref f)
; or @f
Which will block and return the value of the code block of the future.
In the same example, if you call deref before the 10 seconds have gone, the call will block until the computation is finished.
In your example, since you are just waiting for computations to finish, and are not so much concerned about messages and interactions between the channel participants, future is what I would recommend. So:
(future
(Mat/* x (map #(/ 1.0 %) (max-fold x))))
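Extending that to all four computations might look like the sketch below. Since the matrix functions from the question are not available here, slow-inc is a hypothetical stand-in for an expensive computation.

```clojure
;; Hypothetical stand-in for an expensive matrix computation.
(defn slow-inc [n]
  (Thread/sleep 100)
  (inc n))

;; All four futures are created (and therefore started) before the first
;; deref, so the computations overlap instead of running back to back.
(defn compute-all []
  (let [x1 (future (slow-inc 1))
        y1 (future (slow-inc 2))
        w1 (future (slow-inc 3))
        w2 (future (slow-inc 4))]
    [@x1 @y1 @w1 @w2]))

(println (compute-all))
```

In a standalone script, calling (shutdown-agents) once all futures are done lets the JVM exit promptly, because futures run on a shared thread pool that otherwise lingers for a while.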
go blocks return a channel with the result of the expression, so you don't need to create intermediate channels for their results. The code below lets you kick off all 4 calculations at the same time, and then block on the values until they return. If you don't need some of the results straight away, you could block on the value only when you actually use it.
(let [x1-ch (go (Mat/* x (map #(/ 1.0 %) (max-fold x))))
y1-ch (go (Mat/* y (map #(/ 1.0 %) (max-fold y))))
w1-ch (go (gen-matrix 200 300))
w2-ch (go (gen-matrix 300 100))
x1 (<!! x1-ch)
y1 (<!! y1-ch)
w1 (<!! w1-ch)
w2 (<!! w2-ch)]
;; do stuff w/ x1 y1 w1 w2
)
If you're looking to speed up your program more generally by running code in parallel, then you could look at using Clojure's Reducers, or Aphyr's Tesser. These work by splitting up the work on a single computation into parallelisable parts, then combining them together. These will efficiently run the work over as many cores as your computer has. If you run each of your computations with a future or in a go block, then each computation will run on a single thread, some may finish before others and those cores will be idle.
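As a rough illustration of the Reducers approach, the sketch below folds over a vector in parallel. The data and the function name sum-of-squares are invented for this example; it assumes a Clojure version that ships clojure.core.reducers (1.5 or later).

```clojure
(require '[clojure.core.reducers :as r])

(def xs (vec (range 1000000)))

;; r/fold splits the vector into chunks, reduces each chunk on a
;; fork/join pool, and combines the partial sums with +.
(defn sum-of-squares [v]
  (r/fold + (r/map #(* % %) v)))

(println (sum-of-squares xs))
```

Parallel folding only applies to foldable collections such as vectors and maps; on a lazy seq, r/fold falls back to an ordinary sequential reduce.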

Concurrent cartesian product algorithm in Clojure

Is there a good algorithm to calculate the cartesian product of three seqs concurrently in Clojure?
I'm working on a small hobby project in Clojure, mainly as a means to learn the language, and its concurrency features. In my project, I need to calculate the cartesian product of three seqs (and do something with the results).
I found the cartesian-product function in clojure.contrib.combinatorics, which works pretty well. However, the calculation of the cartesian product turns out to be the bottleneck of the program. Therefore, I'd like to perform the calculation concurrently.
Now, for the map function, there's a convenient pmap alternative that magically makes the thing concurrent. Which is cool :). Unfortunately, such a thing doesn't exist for cartesian-product. I've looked at the source code, but I can't find an easy way to make it concurrent myself.
Also, I've tried to implement an algorithm myself using map, but I guess my algorithmic skills aren't what they used to be. I managed to come up with something ugly for two seqs, but three was definitely a bridge too far.
So, does anyone know of an algorithm that's already concurrent, or one that I can parallelize myself?
EDIT
Put another way, what I'm really trying to achieve, is to achieve something similar to this Java code:
for (ClassA a : someExpensiveComputation()) {
for (ClassB b : someOtherExpensiveComputation()) {
for (ClassC c : andAnotherOne()) {
// Do something interesting with a, b and c
}
}
}
If the logic you're using to process the Cartesian product isn't somehow inherently sequential, then maybe you could just split your inputs into halves (perhaps splitting each input seq in two), calculate 8 separate Cartesian products (first-half x first-half x first-half, first-half x first-half x second-half, ...), process them and then combine the results. I'd expect this to give you quite a boost already. As for tweaking the performance of the Cartesian product building itself, I'm no expert, but I do have some ideas & observations (one needs to calculate a cross product for Project Euler sometimes), so I've tried to summarise them below.
First of all, I find the c.c.combinatorics function a bit strange in the performance department. The comments say it's taken from Knuth, I believe, so perhaps one of the following obtains: (1) it would be very performant with vectors, but the cost of vectorising the input sequences kills its performance for other sequence types; (2) this style of programming doesn't necessarily perform well in Clojure in general; (3) the cumulative overhead incurred due to some design choice (like having that local function) is large; (4) I'm missing something really important. So, while I wouldn't like to dismiss the possibility that it might be a great function to use for some use cases (determined by the total number of seqs involved, the number of elements in each seq etc.), in all my (unscientific) measurements a simple for seems to fare better.
Then there are two functions of mine, one of which is comparable to for (somewhat slower in the more interesting tests, I think, though it seems to be actually somewhat faster in others... can't say I feel prepared to make a fully educated comparison), the other apparently faster with a long initial input sequence, as it's a restricted functionality parallel version of the first one. (Details follow below.) So, timings first (do throw in the occasional (System/gc) if you care to repeat them):
;; a couple warm-up runs elided
user> (time (last (doall (pcross (range 100) (range 100) (range 100)))))
"Elapsed time: 1130.751258 msecs"
(99 99 99)
user> (time (last (doall (cross (range 100) (range 100) (range 100)))))
"Elapsed time: 2428.642741 msecs"
(99 99 99)
user> (require '[clojure.contrib.combinatorics :as comb])
nil
user> (time (last (doall (comb/cartesian-product (range 100) (range 100) (range 100)))))
"Elapsed time: 7423.131008 msecs"
(99 99 99)
;; a second time, as no warm-up was performed earlier...
user> (time (last (doall (comb/cartesian-product (range 100) (range 100) (range 100)))))
"Elapsed time: 6596.631127 msecs"
(99 99 99)
;; umm... is syntax-quote that expensive?
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
`(~x ~x ~x)))))
"Elapsed time: 11029.038047 msecs"
(99 99 99)
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
(list x y z)))))
"Elapsed time: 2597.533138 msecs"
(99 99 99)
;; one more time...
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
(list x y z)))))
"Elapsed time: 2179.69127 msecs"
(99 99 99)
And now the function definitions:
(defn cross [& seqs]
  (when seqs
    (if-let [s (first seqs)]
      (if-let [ss (next seqs)]
        (for [x s
              ys (apply cross ss)]
          (cons x ys))
        (map list s)))))
(defn pcross [s1 s2 s3]
  (when (and (first s1)
             (first s2)
             (first s3))
    (let [l1 (count s1)
          [half1 half2] (split-at (quot l1 2) s1)
          s2xs3 (cross s2 s3)
          f1 (future (for [x half1 yz s2xs3] (cons x yz)))
          f2 (future (for [x half2 yz s2xs3] (cons x yz)))]
      (concat @f1 @f2))))
I believe that all versions produce the same results. pcross could be extended to handle more sequences or be more sophisticated in the way it splits its workload, but that's what I came up with as a first approximation... If you do test this out with your programme (perhaps adapting it to your needs, of course), I'd be very curious to know the results.
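One way the workload splitting could be generalised is to parallelise the outermost loop with pmap, which also lets the per-triple work run inside the workers. This is a sketch under that assumption, not a measured improvement; pmap-cross is a hypothetical name.

```clojure
;; Hypothetical helper: each element of `as` becomes one unit of work.
;; The inner loops run eagerly (doall) inside the worker, so the calls to
;; f happen in parallel rather than when the result is later consumed.
(defn pmap-cross [f as bs cs]
  (apply concat
         (pmap (fn [a]
                 (doall (for [b bs, c cs] (f a b c))))
               as)))

(println (pmap-cross list [1 2] [:a :b] ["x" "y"]))
```

pmap preserves the order of its input, so the results come out in the same order as the nested Java loops in the question. It pays off only when f does enough work per element to outweigh the coordination overhead.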
'clojure.contrib.combinatorics has a cartesian-product function.
It returns a lazy sequence and can cross any number of sequences.
