Clojure core.async for data computation - performance

I've started using the clojure core.async library. I found the concepts of CSP, channels, go blocks really easy to use. However, I'm not sure if I'm using them right. I've got the following code -
(def x-ch (chan))
(def y-ch (chan))
(def w1-ch (chan))
(def w2-ch (chan))
; they all return matrices
(go (>! x-ch (Mat/* x (map #(/ 1.0 %) (max-fold x)))))
(go (>! y-ch (Mat/* y (map #(/ 1.0 %) (max-fold y)))))
(go (>! w1-ch (gen-matrix 200 300)))
(go (>! w2-ch (gen-matrix 300 100)))
(let [x1 (<!! (go (<! x-ch)))
y1 (<!! (go (<! y-ch)))
w1 (<!! (go (<! w1-ch)))
w2 (<!! (go (<! w2-ch)))]
;; do stuff w/ x1 y1 w1 w2
)
I've got predefined (matrix) vectors in symbols x and y. I need to modify both vectors before I use them. Those vectors are pretty large. I also need to generate two random matrices. Since the go macro starts the computation asynchronously, I split all four computation tasks into separate go blocks and put each result into a channel. Then I've got a let block where I take values from the channels and bind them to symbols. They all use the blocking <!! take function since they're on the main thread.
What I'm trying to do basically is speed up my computation time by splitting program fragments into async processes. Is this the right way to do it?

For this kind of processing, a future may be a slightly better fit.
The example from the link is simple to grasp:
(def f
(future
(Thread/sleep 10000)
(println "done")
100))
The body of the future starts running immediately, so the code above starts a thread, waits for 10 seconds, and prints "done" when it finishes.
When you need the value you can just use:
(deref f)
; or @f
Which will block and return the value of the code block of the future.
In the same example, if you call deref before the 10 seconds have gone, the call will block until the computation is finished.
In your example, since you are just waiting for computations to finish and are not so much concerned with messages and interactions between the channel participants, future is what I would recommend. So:
(future
(Mat/* x (map #(/ 1.0 %) (max-fold x))))
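To make the pattern concrete, here is a minimal sketch with a stand-in workload. slow-sum and run-all are hypothetical names used only for illustration; replace the bodies with your own matrix code.

```clojure
;; slow-sum is a hypothetical stand-in for an expensive computation.
(defn slow-sum [n]
  (Thread/sleep 100)
  (reduce + (range n)))

(defn run-all []
  ;; All three futures start running as soon as they are created.
  (let [a (future (slow-sum 10))
        b (future (slow-sum 100))
        c (future (slow-sum 1000))]
    ;; Each deref blocks only until that particular future has finished.
    [@a @b @c]))

(run-all) ;=> [45 4950 499500]
```

Because the three sleeps overlap, the whole call should take roughly as long as the slowest future rather than the sum of all three.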

go blocks return a channel with the result of the expression, so you don't need to create intermediate channels for their results. The code below lets you kick off all 4 calculations at the same time, and then block on the values until they return. If you don't need some of the results straight away, you could block on the value only when you actually use it.
(let [x1-ch (go (Mat/* x (map #(/ 1.0 %) (max-fold x))))
y1-ch (go (Mat/* y (map #(/ 1.0 %) (max-fold y))))
w1-ch (go (gen-matrix 200 300))
w2-ch (go (gen-matrix 300 100))
x1 (<!! x1-ch)
y1 (<!! y1-ch)
w1 (<!! w1-ch)
w2 (<!! w2-ch)]
;; do stuff w/ x1 y1 w1 w2
)

If you're looking to speed up your program more generally by running code in parallel, then you could look at using Clojure's Reducers, or Aphyr's Tesser. These work by splitting up the work on a single computation into parallelisable parts, then combining them together. These will efficiently run the work over as many cores as your computer has. If you run each of your computations with a future or in a go block, then each computation will run on a single thread, some may finish before others and those cores will be idle.
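As a rough sketch of what the Reducers approach looks like, the snippet below folds over a vector with clojure.core.reducers. sum-of-squares is a made-up example function, not something from the question, and a real matrix workload would need its own combining function.

```clojure
(require '[clojure.core.reducers :as r])

(defn sum-of-squares [v]
  ;; r/map describes the transformation without building an intermediate seq;
  ;; r/fold partitions the vector and combines the partial sums with +.
  (r/fold + (r/map (fn [x] (* x x)) v)))

(sum-of-squares (vec (range 1 11))) ;=> 385
```

Note that r/fold only splits the work when the collection is larger than its partition size (512 by default), so a tiny input like this one is simply reduced on the calling thread.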

Related

Can I make this Clojure code (scoring a graph bisection) more efficient?

My code is spending most of its time scoring bisections: determining how many edges of a graph cross from one set of nodes to the other.
Assume bisect is a set of half of a graph's nodes (ints), and edges is a list of (directed) edges [ [n1 n2] ...] where n1,n2 are also nodes.
(defn tstBisectScore
"number of edges crossing bisect"
([bisect edges]
(tstBisectScore bisect 0 edges))
([bisect nx edge2check]
(if (empty? edge2check)
nx
(let [[n1 n2] (first edge2check)
inb1 (contains? bisect n1)
inb2 (contains? bisect n2)]
(if (or (and inb1 inb2)
(and (not inb1) (not inb2)))
(recur bisect nx (rest edge2check))
(recur bisect (inc nx) (rest edge2check))))
)))
The only clues I have, from sampling the execution of this code with VisualVM, show most of the time spent in clojure.core$empty_QMARK_, and most of the rest in clojure.core$contains_QMARK_. (first and rest take only a small fraction of the time.)
Any suggestions as to how I could tighten the code?
First I would say that you haven't expanded that profile deep enough. empty? is not an expensive function in general. The reason it is taking up all your time is almost surely because the input to your function is a lazy sequence, and empty? is the poor sap whose job it is to look at its elements first. So all the time in empty? is probably actually time you should be accounting to whatever generates the input sequence. You could confirm this by profiling (tstBisectScore bisect (doall edges)) and comparing to your existing profile of (tstBisectScore bisect edges).
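The effect described here can be seen in a small experiment. This is only a sketch: realized-by-empty? is a hypothetical helper, and the counter stands in for whatever work your real producer does per element.

```clojure
(defn realized-by-empty? []
  (let [realized (atom 0)
        ;; Lazy producer: nothing runs until somebody looks at the seq.
        edges    (map (fn [i] (swap! realized inc) [i (inc i)]) (range 100))
        before   @realized]
    ;; empty? is the first consumer, so the producer's work happens here.
    (empty? edges)
    [before @realized]))

(realized-by-empty?)
```

The first number is 0 and the second is positive: the elements were produced inside the empty? call, which is why a sampling profiler blames empty? for the producer's time.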
Assuming that my hypothesis is true, almost 80% of your real workload is probably in generating the bisects, not in scoring them. So anything we do in this function can get us at most a 20% speedup, even if we replaced the whole thing with (map (constantly 0) edges).
Still, there are many local improvements to be made. Let's imagine we've determined that producing the input argument is as efficient as we can get it, and we need more speed.
When iterating eagerly over something, use next instead of rest. The point of rest is that it's a bit lazier, and always returns a non-nil sequence instead of peeking to see if there is a next element. If you know you will need the next element anyway, use next to get both bits of information at once.
In general, empty? is not an efficient way to test a sequence. (defn empty? [x] (not (seq x))) is obviously a wasted not. If you care about efficiency, write (seq x) instead, and invert your if branches. Better still, if you know x is the result of a next call, it can never be an empty sequence: only nil, or a non-empty sequence. So just write (if x ...).
(or (and inb1 inb2)
(and (not inb1) (not inb2)))
is a very expensive way to write (= inb1 inb2).
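A small sketch of the rest/next difference and of the (if xs ...) idiom; count-items is a made-up function used only to show the loop shape.

```clojure
(rest [1]) ;=> ()
(next [1]) ;=> nil
(seq [])   ;=> nil

(defn count-items [coll]
  ;; Call seq once up front; after that, next yields either a
  ;; non-empty seq or nil, so a plain truthiness test is enough.
  (loop [xs (seq coll)
         n  0]
    (if xs
      (recur (next xs) (inc n))
      n)))

(count-items [:a :b :c]) ;=> 3
```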
So for starters, you could instead write
(defn tstBisectScore
([bisect edges] (tstBisectScore bisect 0 (seq edges)))
([bisect nx edges]
(if edges
(recur bisect (let [[n1 n2] (first edges)
inb1 (contains? bisect n1)
inb2 (contains? bisect n2)]
(if (= inb1 inb2) nx (inc nx)))
(next edges))
nx)))
Note that I've also rearranged things a bit, by putting the if and let inside of the recur instead of duplicating the other arguments to the recur. This isn't a very popular style, and it doesn't matter to efficiency. Here it serves a pedagogical purpose: to draw your attention to the basic structure of this function that you missed. Your whole function has the structure (if xs (recur (f acc x) (next xs))). This is exactly what reduce already does!
I could write out the translation to use reduce, but first I'll also point out that you also have a map step hidden in there, mapping some elements to 1 and some to 0, and then your reduce phase is just summing the list. So, instead of using lazy sequences to do that, we'll use a transducer, and avoid allocating the intermediate sequences:
(defn tstBisectScore [bisect edges]
  (transduce (map (fn [[n1 n2]]
                    (if (= (contains? bisect n1)
                           (contains? bisect n2))
                      0 1)))
             + 0 edges))
This is a lot less code because you let existing abstractions do the work for you, and it should be more efficient because (a) these abstractions don't make the local mistakes you did, and (b) they also handle chunked sequences more efficiently, which is a sizeable boost that comes up surprisingly often when using basic tools like map, range, and filter.
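For comparison, the plain reduce translation mentioned above might look like the sketch below. The name bisect-score-reduce is mine, chosen to avoid clashing with the versions already shown.

```clojure
(defn bisect-score-reduce [bisect edges]
  ;; The accumulator nx counts edges whose endpoints fall on
  ;; different sides of the bisection.
  (reduce (fn [nx [n1 n2]]
            (if (= (contains? bisect n1) (contains? bisect n2))
              nx
              (inc nx)))
          0
          edges))

(bisect-score-reduce #{1 2} [[1 2] [1 3] [3 4] [2 4]]) ;=> 2
```

Here [1 3] and [2 4] cross the bisection, while [1 2] and [3 4] stay on one side.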
This answer builds on the answer from amalloy and shows some additional ways to speed up this code:
Use Java arrays:
Convert edges with (into-array (map into-array edges)). This allows you to use operations like aget, aset and especially areduce.
Use Java functions
In the following code, I replaced = with .equals and contains? with .contains.
Use type hints
Using these tips, I rewrote your function like this:
(import '(java.util Collection HashSet))
(defn tst-bisect-score [^HashSet bisect
^"[[Ljava.lang.Long;" edges]
(areduce edges
i
ret
(long 0)
(+ ret
(let [^"[Ljava.lang.Long;" e (aget edges i)]
(if (.equals ^Boolean
(.contains ^HashSet bisect
^Long (aget e 0))
^Boolean
(.contains ^HashSet bisect
^Long (aget e 1)))
0 1)))))
Convert your arguments in advance with (HashSet. ^Collection bisect) and (into-array (map into-array edges)) and then call:
(tst-bisect-score bisect edges)

Why isn't this function showing a performance speedup when its primary constituent function does?

I am optimizing a program I've been working on, and have hit a wall. The function julia-subrect maps over for-each-pixel a large number of times. I've optimized for-each-pixel to have a ~16x speedup. However, my optimized version of julia-subrect shows no evidence of this. Here are my benchmarks and relevant code:
; ======== Old `for-each-pixel` ========
;(bench (julia/for-each-pixel (->Complex rc ic) max-itrs radius r-min x-step y-step [xt yt])))
;Evaluation count : 3825300 in 60 samples of 63755 calls.
;Execution time mean : 16.018466 µs
; ======== New `for-each-pixel`. optimized 16x. ========
;(bench (julia/for-each-pixel-opt [rc ic] [max-itrs radius r-min] [x-step y-step] [xt yt])))
;Evaluation count : 59542860 in 60 samples of 992381 calls.
;Execution time mean : 1.038955 µs
(defn julia-subrect [^Long start-x ^Long start-y ^Long end-x ^Long end-y ^Long total-width ^Long total-height ^Complex constant ^Long max-itrs]
(let [grid (for [y (range start-y end-y)]
(vec (for [x (range start-x end-x)]
[x y])))
radius (calculate-r constant)
r-min (- radius)
r-max radius
x-step (/ (Math/abs (- r-max r-min)) total-width)
y-step (/ (Math/abs (- r-max r-min)) total-height)
; Uses old implementation of `for-each-pixel`
calculate-pixel (partial for-each-pixel constant max-itrs radius r-min x-step y-step)
for-each-row (fn [r] (map calculate-pixel r))]
(map for-each-row grid)))
; ======== Old `julia-subrect` ========
;(bench (doall (julia/julia-subrect start-x start-y end-x end-y total-width total-height c max-itrs))))
;Evaluation count : 22080 in 60 samples of 368 calls.
;Execution time mean : 2.746852 ms
(defn julia-subrect-opt [[^long start-x ^long start-y ^long end-x ^long end-y] [^double rc ^double ic] total-width total-height max-itrs ]
(let [grid (for [y (range start-y end-y)]
(vec (for [x (range start-x end-x)]
[x y])))
radius (calculate-r-opt rc ic)
r-min (- radius)
r-max radius
x-step (/ (Math/abs (- r-max r-min)) total-width)
y-step (/ (Math/abs (- r-max r-min)) total-height)
;Uses new implementation of `for-each-pixel`
calculate-pixel (fn [px] (for-each-pixel-opt [rc ic] [max-itrs radius r-min] [x-step y-step] px))
for-each-row (fn [r] (map calculate-pixel r))]
(map for-each-row grid)))
; ======== New `julia-subrect`, but no speedup ========
;(bench (doall (julia/julia-subrect-opt [start-x start-y end-x end-y] [rc ic] total-width total-height max-itrs))))
;Evaluation count : 21720 in 60 samples of 362 calls.
;Execution time mean : 2.831553 ms
Here is a gist containing source code for all the functions I've specified:
https://gist.github.com/johnmarinelli/adc5533c19fb0b6d74cf4ef04ae55ee6
So, can anyone tell me why julia-subrect is showing no signs of speedup? Also, I'm still new to clojure so bear with me if the code is unidiomatic/ugly. Right now, I'm focusing on making the program run quicker.
As a general guideline:
profile!
actually get around to profiling, like for real ;-)
remove reflection (looks like you did this)
split the operations into easy to think about functions
remove laziness (transducers should be the last step in this part)
combine steps using loop/recur to make your code impossible to figure out and slightly faster (this is the last step for a reason)
Specifically thinking about the code you posted:
At a glance, it looks like this function will spend much of its time generating a lazy list of values in the for loop, which are then immediately realized (evaluated to no longer be lazy), so the time spent generating that structure is wasted. You may consider changing this to produce vectors directly; mapv is useful for this.
The second part is the call to map in for-each-row which will produce a lot of intermediate data structures. For that one you may consider using a non-lazy expression like mapv or loop/recur.
It looks like you have done steps 2-4 already, and there is no obvious reason for you to skip to the last step. I'd spend the next couple of hours on limiting laziness and, if you have to, learning about transducers.
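As a sketch of the mapv suggestion, the grid and the per-row mapping could be built eagerly like this. eager-grid and map-grid are hypothetical helpers, not functions from the gist, and whether this helps your benchmark is something only profiling can confirm.

```clojure
(defn eager-grid [start-x start-y end-x end-y]
  ;; mapv builds each row, and the outer grid, as a vector right away,
  ;; so no lazy seq is created only to be realized immediately.
  (mapv (fn [y]
          (mapv (fn [x] [x y]) (range start-x end-x)))
        (range start-y end-y)))

(defn map-grid [f grid]
  ;; Eager replacement for (map for-each-row grid).
  (mapv (fn [row] (mapv f row)) grid))

(eager-grid 0 0 2 2)
;=> [[[0 0] [1 0]] [[0 1] [1 1]]]
```

With this shape, calculate-pixel would be passed as f to map-grid, and the doall in the benchmark is no longer needed.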

Random Walk in Clojure

I have written the following piece of code for a random walk, which draws random values from {-1,1}.
(defn notahappyfoo [n]
(reverse (butlast (butlast (reverse (interleave (take n (iterate rand (- 0 1)))(take n (iterate rand 1))))))))
However, the code fails to generate a satisfactory walk. The main problem stems from the function rand. Its lower bound is 0, which forced the awkward code I wrote. Namely, the function interleave ends up causing wild shifts in the walk as values are forced to swing from positive to negative. It will be hard to garner any sense of a continuous path with this code.
I believe there should be an elegant form in Clojure to construct this walk. But I am not able to piece the right functions together to generate such a walk. The goals of the function I am looking to construct consist of lower and upper bounds for the random number. In the code above I have forced the interval -1 to 1. It would be nice to generalize this to -a and a. Moreover, how do I form a collection of random reals (floating points) between -a and a that has some notion of continuity?
You need a random function that takes a range
(defn myrand [a b]
(+ a (rand (- b a))))
You can then create a sequence
(def s (repeatedly #(myrand -1 1)))
Finally, you can use reductions to get a sample walk
(take 10 s)
(reductions + (take 10 s))
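Putting the pieces together for a general bound a, one possible sketch is below. rand-step and random-walk are names I made up; the walk positions are the running sums of steps drawn from the interval between -a and a.

```clojure
(defn rand-step [a]
  ;; (rand n) returns a float in [0, n), so shifting by a gives [-a, a).
  (- (rand (* 2 a)) a))

(defn random-walk [a n]
  ;; reductions yields the running sums, i.e. the positions of the walk.
  (take n (reductions + (repeatedly #(rand-step a)))))

(random-walk 0.5 10)
```

Consecutive positions never differ by more than a, so a smaller a gives a smoother-looking path.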

Why do function calls slow things down in clojure?

I've been playing around with the code from Is Clojure Still Fast? (and its prequel, Clojure is Fast). It seemed unfortunate that inlining the differential equation (f) is one of the steps taken to improve performance. The cleanest/fastest thing I've been able to come up with without doing this is the following:
; As in the referenced posts, for giving a rough measure of cycles/iteration (I know this is a very rough
; estimate...)
(def cpuspeed 3.6) ;; My computer runs at 3.6 GHz
(defmacro cyclesperit [expr its]
`(let [start# (. System (nanoTime))
ret# ( ~@expr (/ 1.0 ~its) ~its )
finish# (. System (nanoTime))]
(println (int (/ (* cpuspeed (- finish# start#)) ~its)))))
;; My solution
(defn f [^double t ^double y] (- t y))
(defn mysolveit [^double t0 ^double y0 ^double h ^long its]
(if (> its 0)
(let [t1 (+ t0 h)
y1 (+ y0 (* h (f t0 y0)))]
(recur t1 y1 h (dec its)))
[t0 y0 h its]))
; => 50-55 cycles/it
; The fastest solution presented by the author (John Aspden) is
(defn faster-solveit [^double t0 ^double y0 ^double h ^long its]
(if (> its 0)
(let [t1 (+ t0 h)
y1 (+ y0 (* h (- t0 y0)))]
(recur t1 y1 h (dec its)))
[t0 y0 h its]))
; => 25-30 cycles/it
The type hinting in my solution helps quite a bit (it's 224 cycles/it without type hinting on either f or solveit), but it's still nearly 2x slower than the inlined version. Ultimately this performance is still pretty decent, but this hit is unfortunate.
Why is there such a performance hit for this? Is there a way around it? Are there plans to find ways of improving this? As pointed out by John in the original post, it seems funny/unfortunate for function calls to be inefficient in a functional language.
Note: I'm running Clojure 1.5 and have :jvm-opts ^:replace [] in a project.clj file so that I can use lein exec/run without it slowing things down (and it will if you don't do this I discovered...)
Benchmarking in the presence of a JIT compiler is tricky; you really must allow for a warm-up period, but then you also can't just run it all in a loop, since it may then be proved a no-op and optimized away. In Clojure, the usual solution is to use Hugo Duncan's Criterium.
Running a Criterium benchmark for (solveit 0.0 1.0 (/ 1.0 1000000) 1000000) for both versions of solveit results in pretty much exactly the same timings on my machine (mysolveit ~3.44 ms, faster-solveit ~3.45 ms). That's in a 64-bit JVM run with -XX:+UseConcMarkSweepGC, using Criterium 0.4.2 (criterium.core/bench). Presumably HotSpot just inlines f. In any case, there's no performance hit at all.
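If you can't pull in Criterium, a very crude approximation of the warm-up idea is sketched below. timed-after-warmup is a hypothetical helper; it does none of the statistics or GC handling that Criterium does, so treat its numbers as rough at best.

```clojure
(defn timed-after-warmup [f warmups]
  ;; Run f repeatedly first so the JIT has a chance to compile it.
  (dotimes [_ warmups] (f))
  (let [start  (System/nanoTime)
        result (f)
        nanos  (- (System/nanoTime) start)]
    ;; Return the result too, so the timed call is not dead code.
    {:result result :nanos nanos}))

(timed-after-warmup #(reduce + (range 100)) 1000)
```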
Adding to the already good answers, the JVM JIT most often does inline the primitive function calls when warmed up, and in this case, when you bench it with a warmed JIT you see the same results. Just wanted to say Clojure also has an inlining feature though for cases where that yields benefits.
(defn f
{:inline-arities #{2}
:inline (fn [t y] `(- (double ~t) (double ~y)))}
^double [^double t ^double y]
(- t y))
Now Clojure will compile away the calls to f, inlining the function at compile time, whereas the JIT would otherwise inline it at runtime as needed.
Also note that I added a ^double type hint to the return of f. If you don't do that, it gets compiled to return Object, and a cast needs to be added. I'm not sure if that really affects performance much, but if you want a fully primitive function that takes primitives and returns primitives, you need to type hint the return as well.
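One way to check which signature you actually got is to look at the primitive interface the compiled function implements. This is a sketch; diff-obj and diff-prim are made-up names for the two variants.

```clojure
;; Primitive args, but no return hint: compiled as (double, double) -> Object.
(defn diff-obj [^double t ^double y] (- t y))

;; Primitive args and a ^double return hint: (double, double) -> double.
(defn diff-prim ^double [^double t ^double y] (- t y))

(instance? clojure.lang.IFn$DDO diff-obj)  ;=> true
(instance? clojure.lang.IFn$DDD diff-prim) ;=> true
```

In the interface names, each letter stands for an argument or the return type (D for double, O for Object), with the return type last.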

Concurrent cartesian product algorithm in Clojure

Is there a good algorithm to calculate the cartesian product of three seqs concurrently in Clojure?
I'm working on a small hobby project in Clojure, mainly as a means to learn the language, and its concurrency features. In my project, I need to calculate the cartesian product of three seqs (and do something with the results).
I found the cartesian-product function in clojure.contrib.combinatorics, which works pretty well. However, the calculation of the cartesian product turns out to be the bottleneck of the program. Therefore, I'd like to perform the calculation concurrently.
Now, for the map function, there's a convenient pmap alternative that magically makes the thing concurrent. Which is cool :). Unfortunately, such a thing doesn't exist for cartesian-product. I've looked at the source code, but I can't find an easy way to make it concurrent myself.
Also, I've tried to implement an algorithm myself using map, but I guess my algorithmic skills aren't what they used to be. I managed to come up with something ugly for two seqs, but three was definitely a bridge too far.
So, does anyone know of an algorithm that's already concurrent, or one that I can parallelize myself?
EDIT
Put another way, what I'm really trying to achieve is something similar to this Java code:
for (ClassA a : someExpensiveComputation()) {
for (ClassB b : someOtherExpensiveComputation()) {
for (ClassC c : andAnotherOne()) {
// Do something interesting with a, b and c
}
}
}
If the logic you're using to process the Cartesian product isn't somehow inherently sequential, then maybe you could just split your inputs into halves (perhaps splitting each input seq in two), calculate 8 separate Cartesian products (first-half x first-half x first-half, first-half x first-half x second-half, ...), process them and then combine the results. I'd expect this to give you quite a boost already. As for tweaking the performance of the Cartesian product building itself, I'm no expert, but I do have some ideas & observations (one needs to calculate a cross product for Project Euler sometimes), so I've tried to summarise them below.
First of all, I find the c.c.combinatorics function a bit strange in the performance department. The comments say it's taken from Knuth, I believe, so perhaps one of the following obtains: (1) it would be very performant with vectors, but the cost of vectorising the input sequences kills its performance for other sequence types; (2) this style of programming doesn't necessarily perform well in Clojure in general; (3) the cumulative overhead incurred due to some design choice (like having that local function) is large; (4) I'm missing something really important. So, while I wouldn't like to dismiss the possibility that it might be a great function to use for some use cases (determined by the total number of seqs involved, the number of elements in each seq etc.), in all my (unscientific) measurements a simple for seems to fare better.
Then there are two functions of mine, one of which is comparable to for (somewhat slower in the more interesting tests, I think, though it seems to be actually somewhat faster in others... can't say I feel prepared to make a fully educated comparison), the other apparently faster with a long initial input sequence, as it's a restricted functionality parallel version of the first one. (Details follow below.) So, timings first (do throw in the occasional (System/gc) if you care to repeat them):
;; a couple warm-up runs elided
user> (time (last (doall (pcross (range 100) (range 100) (range 100)))))
"Elapsed time: 1130.751258 msecs"
(99 99 99)
user> (time (last (doall (cross (range 100) (range 100) (range 100)))))
"Elapsed time: 2428.642741 msecs"
(99 99 99)
user> (require '[clojure.contrib.combinatorics :as comb])
nil
user> (time (last (doall (comb/cartesian-product (range 100) (range 100) (range 100)))))
"Elapsed time: 7423.131008 msecs"
(99 99 99)
;; a second time, as no warm-up was performed earlier...
user> (time (last (doall (comb/cartesian-product (range 100) (range 100) (range 100)))))
"Elapsed time: 6596.631127 msecs"
(99 99 99)
;; umm... is syntax-quote that expensive?
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
`(~x ~x ~x)))))
"Elapsed time: 11029.038047 msecs"
(99 99 99)
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
(list x y z)))))
"Elapsed time: 2597.533138 msecs"
(99 99 99)
;; one more time...
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
(list x y z)))))
"Elapsed time: 2179.69127 msecs"
(99 99 99)
And now the function definitions:
(defn cross [& seqs]
(when seqs
(if-let [s (first seqs)]
(if-let [ss (next seqs)]
(for [x s
ys (apply cross ss)]
(cons x ys))
(map list s)))))
(defn pcross [s1 s2 s3]
(when (and (first s1)
(first s2)
(first s3))
(let [l1 (count s1)
[half1 half2] (split-at (quot l1 2) s1)
s2xs3 (cross s2 s3)
f1 (future (for [x half1 yz s2xs3] (cons x yz)))
f2 (future (for [x half2 yz s2xs3] (cons x yz)))]
(concat @f1 @f2))))
I believe that all versions produce the same results. pcross could be extended to handle more sequences or be more sophisticated in the way it splits its workload, but that's what I came up with as a first approximation... If you do test this out with your programme (perhaps adapting it to your needs, of course), I'd be very curious to know the results.
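The eight-way split mentioned at the start of this answer could be sketched roughly as follows. halves and cross-in-eight-parts are hypothetical names, f is whatever you want to do with each triple, and I have not benchmarked this version.

```clojure
(defn halves [s]
  ;; Split a collection into its first and second half.
  (let [v (vec s)]
    (split-at (quot (count v) 2) v)))

(defn cross-in-eight-parts [f s1 s2 s3]
  ;; One future per combination of halves (2 x 2 x 2 = 8 jobs).
  ;; The outer doall starts every future before any deref blocks;
  ;; the inner doall forces the work to happen on the future's thread.
  (let [jobs (doall
              (for [a (halves s1), b (halves s2), c (halves s3)]
                (future (doall (for [x a, y b, z c] (f x y z))))))]
    (mapcat deref jobs)))

(count (cross-in-eight-parts list (range 4) (range 4) (range 4))) ;=> 64
```

The results contain the same triples as a plain for, but grouped by job, so the order differs from the sequential version.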
'clojure.contrib.combinatorics has a cartesian-product function.
It returns a lazy sequence and can cross any number of sequences.

Resources