Clojure Performance, How to Type hint to r/map - performance

Below, I have 2 functions computing the sum of squares of their arguments. The first one is nice and functional, but 20x slower than the second one. I presume that the r/map is not taking advantage of aget to retrieve elements from the double-array, whereas I'm explicitly doing this in function 2.
Is there any way I can further typehint or help r/map r/fold to perform faster?
(defn sum-of-squares
  "Given a vector v, compute the sum of the squares of elements."
  ^double [^doubles v]
  (r/fold + (r/map #(* % %) v)))
(defn sum-of-squares2
  "This is much faster than above. Post to stack-overflow to see."
  ^double [^doubles v]
  (loop [val 0.0
         i (dec (alength v))]
    (if (neg? i)
      val
      (let [x (aget v i)]
        (recur (+ val (* x x)) (dec i))))))
(def a (double-array (range 10)))
(quick-bench (sum-of-squares a))
800 ns
(quick-bench (sum-of-squares2 a))
40 ns
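The snippets in this thread assume that clojure.core.reducers is aliased as r and that quick-bench comes from Criterium; neither require is shown in the post. A minimal setup sketch under that assumption (Criterium is an external dependency, so it only appears in a comment here):

```clojure
;; Assumed setup: the original post does not show its requires.
(require '[clojure.core.reducers :as r])
;; For the timings you would also need Criterium on the classpath, e.g.
;; (require '[criterium.core :refer [quick-bench]])

(defn sum-of-squares
  "Sum of squares via reducers, as in the question."
  ^double [^doubles v]
  (r/fold + (r/map #(* % %) v)))

(def a (double-array (range 10)))

(sum-of-squares a) ; => 285.0
```

As far as I know, r/fold only runs in parallel for persistent vectors and maps; for other collections, such as a Java array, it falls back to a plain reduce, so the array probably goes through the generic, boxed path here.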

Before experimenting, I added the following line to project.clj:
:jvm-opts ^:replace [] ; Makes measurements more accurate
Basic measurements:
(def a (double-array (range 1000000))) ; 10 is too small for performance measurements
(quick-bench (sum-of-squares a)) ; ... Execution time mean : 27.617748 ms ...
(quick-bench (sum-of-squares2 a)) ; ... Execution time mean : 1.259175 ms ...
This is more or less consistent with time difference in the question. Let's try to not use Java arrays (which are not really idiomatic for Clojure):
(def b (mapv (partial * 1.0) (range 1000000))) ; Persistent vector
(quick-bench (sum-of-squares b)) ; ... Execution time mean : 14.808644 ms ...
Almost 2 times faster. Now let's remove type hints:
(defn sum-of-squares3
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold + (r/map #(* % %) v)))
(quick-bench (sum-of-squares3 a)) ; Execution time mean : 30.392206 ms
(quick-bench (sum-of-squares3 b)) ; Execution time mean : 15.583379 ms
Execution time increased only marginally compared to the version with type hints. By the way, a version with transducers has very similar performance and is much cleaner:
(defn sum-of-squares3 [v]
(transduce (map #(* % %)) + v))
Now about additional type hinting. We can indeed optimize the first sum-of-squares implementation:
(defn square ^double [^double x] (* x x))
(defn sum-of-squares4
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold + (r/map square v)))
(quick-bench (sum-of-squares4 b)) ; ... Execution time mean : 12.891831 ms ...
(defn pl
(^double [] 0.0)
(^double [^double x] (+ x))
(^double [^double x ^double y] (+ x y)))
(defn sum-of-squares5
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold pl (r/map square v)))
(quick-bench (sum-of-squares5 b)) ; ... Execution time mean : 9.441748 ms ...
Note #1: type hints on arguments and return value of sum-of-squares4 and sum-of-squares5 have no additional performance benefits.
Note #2: It's generally bad practice to start with optimizations. The straightforward version (apply + (map square v)) will have good enough performance for most situations. sum-of-squares2 is very far from idiomatic and uses literally no Clojure concepts. If this code really is performance critical, it is better to implement it in Java and use interop. The code will be much cleaner despite involving two languages. Or even implement it in unmanaged code (C, C++) and call it through JNI (not really maintainable, but if properly implemented it can give the best possible performance).
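To make Note #2 concrete, here is a small sketch of the straightforward variants side by side. The names sum-of-squares-apply, sum-of-squares-reduce and sum-of-squares-xf are made up for this illustration; only square comes from the answer above.

```clojure
(defn square ^double [^double x] (* x x))

;; Hypothetical names, for illustration only.
(defn sum-of-squares-apply  [v] (apply + (map square v)))
(defn sum-of-squares-reduce [v] (reduce + (map square v)))
(defn sum-of-squares-xf     [v] (transduce (map square) + v))

(def a (double-array (range 10)))

;; All three agree on the small array from the question.
(= 285.0
   (sum-of-squares-apply a)
   (sum-of-squares-reduce a)
   (sum-of-squares-xf a)) ; => true
```

Any of these is a reasonable starting point; it seems sensible to reach for the loop or areduce versions only once a benchmark shows this function actually matters.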

Why not use areduce:
(defn sum-of-squares3 ^double [^doubles v]
  (areduce v idx ret 0.0
           (let [item (aget v idx)]
             (+ ret (* item item)))))
On my machine running:
(criterium/bench (sum-of-squares3 (double-array (range 100000))))
Gives a mean execution time of 1.809103 ms, your sum-of-squares2 executes the same calculation in 1.455775 ms. I think this version using areduce is more idiomatic than your version.
To squeeze out a little more performance you can try unchecked math (unchecked-add and unchecked-multiply). But beware: you need to be sure that your calculation cannot overflow:
(defn sum-of-squares4 ^double [^doubles v]
(areduce v idx ret 0.0
(let [item (aget v idx)]
(unchecked-add ret (unchecked-multiply item item)))))
Running the same benchmark gives a mean execution time of 1.144197 ms. Your sum-of-squares2 can also benefit from unchecked math with a 1.126001 ms mean execution time.

Related

Performance of vector and array in Clojure

I am trying to solve the Maximum subarray problem on hacker rank. This is a standard DP problem and I write an O(n) solution:
(defn dp
[v]
(let [n (count v)]
(loop [i 1 f (v 0) best f]
(if (< i n)
(let [fi (max (v i) (+ f (v i)))]
(recur (inc i) fi (max fi best)))
best))))
(defn positive-only
[v]
(reduce + (filterv #(> % 0) v)))
(defn line->ints
[line]
(->>
(clojure.string/split line #" ")
(map #(Integer. %))
(into [])
))
(let [T (Integer. (read-line))]
(loop [test 0]
(when (< test T)
(let [_ (read-line)
x (read-line)
v (line->ints x)
a (dp-array v)
b (let [p (positive-only v)]
(if (= p 0) (reduce max v) p))]
(printf "%d %d\n" a b))
(recur (inc test)))))
To my surprise, I got "time limit exceeded" for a large test case. I downloaded the input file, and found that the above version needs about 3 seconds to run.
I thought the bottleneck is in (v i) (getting the i-th element in vector v). So I changed the data structure from vector to an array:
(defn dp-array
[v0]
(let [v (into-array v0)
n (int (alength v))]
(loop [i 1
f (aget v 0)
best f]
(if (< i n)
(let [fi (max (aget v i) (+ f (aget v i)))]
(recur (inc i) fi (max fi best)))
best))))
This array version is even slower. On the same input, it costs 33 seconds, much slower than the vector version. I think the slowness is due to boxing and unboxing. I tried to add type hints, but encountered run-time errors. Could anyone help me improve dp-array function? Thanks!
Also, I would greatly appreciate it if anyone knows how to improve the vector version.
UPDATE:
Finally I managed to get my Clojure program accepted, not by optimizing the dynamic programming function, but by changing (Integer. str) to (Integer/parseInt str). This avoids reflection when converting from string to integer.
I also replaced into-array with int-array. But the two versions still run at about the same speed. I would expect the array version to be faster than the vector version.
The Clojure compiler can't infer the type of v in dp-array, because its argument v0 has an unknown type. As a result, the subsequent alength and aget calls are resolved through reflection, which is costly. To avoid this unnecessary reflection, replace into-array with long-array.
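A minimal sketch of what the advice above could look like in practice. The names line->longs and dp-longs are hypothetical, and the sketch assumes a non-empty input of integers; it is one possible hinted version, not the accepted solution from the thread.

```clojure
(require '[clojure.string :as str])

(defn line->longs
  "Parses a space-separated line into a vector of longs.
   The ^String hint lets the compiler pick Long/parseLong without reflection."
  [line]
  (mapv (fn [^String s] (Long/parseLong s))
        (str/split line #" ")))

(defn dp-longs
  "Maximum subarray sum over a primitive long array.
   Hypothetical name; assumes coll is non-empty and holds integers."
  ^long [coll]
  (let [^longs v (long-array coll) ; the hint avoids reflective aget/alength
        n (alength v)]
    (loop [i 1
           f (aget v 0)
           best f]
      (if (< i n)
        (let [x  (aget v i)
              fi (max x (+ f x))]
          (recur (inc i) fi (max fi best)))
        best))))

(dp-longs (line->longs "1 -2 3 4 -1")) ; => 7
```

Turning on (set! *warn-on-reflection* true) before loading code like this is a cheap way to check that no reflective calls remain.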

How to go about composing core functions, rather than using imperative style?

I have translated this code, the snippet below, from Python to Clojure. I replaced Python's while construct with Clojure's loop-recur here. But this doesn't look idiomatic.
(loop [d 2 [n & more] (list 256)]
(if (> n 1)
(recur (inc d)
(loop [x n sublist more]
(if (= (rem x d) 0)
(recur (/ x d) (conj sublist d))
(conj sublist x))))
(sort more)))
This routine gives me (3 3 31), that is prime factors of 279. For 256, it gives, (2 2 2 2 2 2 2 2), that means, 2^8.
Moreover, it performs worse for large values, say 987654123987546 instead of 279; whereas Python's counterpart works like charm.
How to start composing core functions, rather than translating imperative code as is? And specifically, how to improve this bit?
Thanks.
[Edited]
Here is the python code, I referred above,
def prime_factors(n):
    factors = []
    d = 2
    while n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d  # integer division, so n stays an int on Python 3
        d = d + 1
    return factors
A straight translation of the Python code in Clojure would be:
(defn prime-factors [n]
  (let [n (atom n)        ;; The Python code makes use of mutability which
        factors (atom []) ;; isn't idiomatic in Clojure, but can be emulated
        d (atom 2)]       ;; using atoms
    (loop []
      (when (< 1 @n)
        (loop []
          (when (== (rem @n @d) 0)
            (swap! factors conj @d)
            (swap! n quot @d)
            (recur)))
        (swap! d inc)
        (recur)))
    @factors))
(prime-factors 279) ;; => [3 3 31]
(prime-factors 987654123987546) ;; => [2 3 41 14389 279022459]
(time (prime-factors 987654123987546)) ;; "Elapsed time: 13993.984 msecs"
;; same performance on my machine
;; as the Rosetta Code solution
You can improve this code to make it more idiomatic:
from nested loops to a single loop:
    (loop []
      (cond
        (<= @n 1) @factors
        (not= (rem @n @d) 0) (do (swap! d inc)
                                 (recur))
        :else (do (swap! factors conj @d)
                  (swap! n quot @d)
                  (recur))))))
get rid of the atoms:
(defn prime-factors [n]
(loop [n n
factors []
d 2]
(cond
(<= n 1) factors
(not= (rem n d) 0) (recur n factors (inc d))
:else (recur (quot n d) (conj factors d) d))))
replace == 0 by zero?:
(not (zero? (rem n d))) (recur n factors (inc d))
You can also overhaul it completely to make a lazy version of it:
(defn prime-factors [n]
  ((fn step [n d]
     (lazy-seq
       (when (< 1 n)
         (cond
           (zero? (rem n d)) (cons d (step (quot n d) d))
           :else (step n (inc d))))))
   n 2))
I planned to have a section on optimization here, but I'm no specialist. The only thing I can say is that you can trivially make this code faster by interrupting the loop when d is greater than the square root of n:
(defn prime-factors [n]
(if (< 1 n)
(loop [n n
factors []
d 2]
(let [q (quot n d)]
(cond
(< q d) (conj factors n)
(zero? (rem n d)) (recur q (conj factors d) d)
:else (recur n factors (inc d)))))
[]))
(time (prime-factors 987654123987546)) ;; "Elapsed time: 7.124 msecs"
Not every loop unrolls cleanly into an elegant "functional" decomposition.
The Rosetta Code solution suggested by @edbond is pretty simple and concise; I would say it's idiomatic since no obvious "functional" solution is apparent. That solution runs noticeably faster on my machine than your Python version for 987654123987546.
More generally, if you're looking to expand your understanding of functional idioms, Bedra and Halloway's "Programming Clojure" (pp.90-95) presents an excellent comparison of different versions of the Fibonacci sequence, using loop, lazy seqs, and an elegant "functional" version. Chouser and Fogus's "Joy of Clojure" (MEAP version) also has a nice section on function composition.
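As one possible illustration of leaning on core sequence functions instead of an explicit loop, here is a sketch that builds the factorization from iterate, take-while and filter. The names smallest-factor and prime-factors-composed are hypothetical, and the approach restarts the search at 2 for every factor, so it trades a little speed for readability.

```clojure
(defn smallest-factor
  "Smallest d >= 2 dividing n, or n itself when no d with d*d <= n divides it."
  [n]
  (or (first (filter #(zero? (rem n %))
                     (take-while #(<= (* % %) n) (iterate inc 2))))
      n))

(defn prime-factors-composed
  "Lazy seq of the prime factors of n in ascending order (hypothetical name)."
  [n]
  (lazy-seq
    (when (< 1 n)
      (let [d (smallest-factor n)]
        (cons d (prime-factors-composed (quot n d)))))))

(prime-factors-composed 279) ; => (3 3 31)
(prime-factors-composed 256) ; => (2 2 2 2 2 2 2 2)
```

Because candidates stop at the square root of n, this keeps the same cutoff idea as the optimized loop version above.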

map part of the vector efficiently in clojure

I wonder how this can be done in Clojure idiomatically and efficiently:
1) Given a vector containing n integers in it: [A0 A1 A2 A3 ... An]
2) Increase the last x items by 1 (let's say x is 100) so the vector will become: [A0 A1 A2 A3 ... (An-99 + 1) (An-98 + 1)... (An-1 + 1) (An + 1)]
One naive implementation looks like:
(defn inc-last [x nums]
(let [n (count nums)]
(map #(if (>= % (- n x)) (inc %2) %2)
(range n)
nums)))
(inc-last 2 [1 2 3 4])
;=> [1 2 4 5]
In this implementation, you basically map the entire vector to another vector, examining each item to see if it needs to be increased.
However, this is an O(n) operation while I only want to change the last x items in the vector. Ideally, this should be done in O(x) instead of O(n).
I am considering using some functions like split-at/concat to implement it like below:
(defn inc-last [x nums]
  (let [[nums1 nums2] (split-at (- (count nums) x) nums)]
    (concat nums1 (map inc nums2))))
However, I am not sure if this implementation is O(n) or O(x). I am new to Clojure and not really sure what the time complexity will be for operations like concat/split-at on persistent data structures in Clojure.
So my questions are:
1) What is the time complexity of the second implementation?
2) If it is still O(n), is there any idiomatic and efficient implementation that takes only O(x) in Clojure for solving this problem?
Any comment is appreciated. Thanks.
Update:
noisesmith's answer told me that split-at will convert the vector into a list, which was a fact I had not realised previously. Since I will do random access on the result (call nth after processing the vector), I would like an efficient solution (O(x) time) that keeps the vector instead of a list; otherwise nth will slow down my program as well.
concat and split-at both turn the input into a seq, effectively a linked-list representation, so walking it costs O(n) time in the length of the input. Here is how to do it with a vector so that only the changed items are touched (note that in this code n is the number of items to increment and x is the vector, the reverse of the question's naming).
user> (defn inc-last-n
[n x]
(let [count (count x)
update (fn [x i] (update-in x [i] inc))]
(reduce update x (range (- count n) count))))
#'user/inc-last-n
user> (inc-last-n 3 [0 1 2 3 4 5 6])
[0 1 2 3 5 6 7]
This will fail on input that is not associative (like seq / lazy-seq) because there is no O(1) access time in non-associative types.
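Another way to keep a vector while touching only the tail is subvec. The following is a sketch under the assumption that nums is a persistent vector; inc-last-subvec is a hypothetical name, and as far as I understand the prefix is shared rather than copied.

```clojure
(defn inc-last-subvec
  "Increments the last x items of vector nums (hypothetical name).
   Assumes nums is a vector, since subvec requires one."
  [x nums]
  (let [k (max 0 (- (count nums) x))]   ; index where the tail starts
    (into (subvec nums 0 k)             ; untouched prefix, shares structure
          (map inc (subvec nums k)))))  ; only the tail is walked

(inc-last-subvec 2 [1 2 3 4]) ; => [1 2 4 5]
```

The result still supports nth and get like any other vector; wrap it in vec if a plain persistent vector is needed.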
inc-last is an implementation using a transient, which gives you a vector that can be modified "in place" in constant time; persistent! then turns it back into a persistent vector, also in constant time, so the updates take O(x). The original implementation used an imperative doseq loop but, as mentioned in the comments, transient operations can return a new object, so it's better to keep doing things in a functional way.
I added a doall to the call to inc-last-2 since it returns a lazy seq, whereas inc-last and inc-last-3 return vectors, so the doall is needed to compare them fairly.
According to some quick tests I made, inc-last and inc-last-3 don't actually differ much in performance, not even for huge vectors (10000000 elements). For the inc-last-2 implementation, though, there's quite a difference: even for a vector of 1000 elements, modifying only the last 10, it's ~100x slower. For smaller vectors, or when n is close to (count nums), the difference is not that large.
(Thanks to Michał Marczyk for his useful comments)
(def x (vec (range 1000)))
(defn inc-last [n x]
(let [x (transient x)
l (count x)]
(->>
(range (- l n) l)
(reduce #(assoc! %1 %2 (inc (%1 %2))) x)
persistent!)))
(defn inc-last-2 [x nums]
(let [n (count nums)]
(map #(if (>= % (- n x)) (inc %2) %2)
(range n)
nums)))
(defn inc-last-3 [n x]
(let [l (count x)]
(reduce #(assoc %1 %2 (inc (%1 %2))) x (range (- l n) l))))
(time
(dotimes [i 100]
(inc-last 50 x)))
(time
(dotimes [i 100]
(doall (inc-last-2 10 x))))
(time
(dotimes [i 100]
(inc-last-3 50 x)))
;=> "Elapsed time: 49.7965 msecs"
;=> "Elapsed time: 1751.964501 msecs"
;=> "Elapsed time: 67.651 msecs"

Performance of function in Clojure 1.3

I was wondering if someone could help me with the performance of this code snippet in Clojure 1.3. I am trying to implement a simple function that takes two vectors and does a sum of products.
So let's say the vectors are X (size 10,000 elements) and B (size 3 elements), and the sum of products are stored in a vector Y, mathematically it looks like this:
Y0 = B0*X2 + B1*X1 + B2*X0
Y1 = B0*X3 + B1*X2 + B2*X1
Y2 = B0*X4 + B1*X3 + B2*X2
and so on ...
For this example, the size of Y will end up being 9997, which corresponds to (10,000 - 3). I've set up the function to accept any size of X and B.
Here's the code: It basically takes (count b) elements at a time from X, reverses it, maps * onto B and sums the contents of the resulting sequence to produce an element of Y.
(defn filt [b-vec x-vec]
(loop [n 0 sig x-vec result []]
(if (= n (- (count x-vec) (count b-vec)))
result
(recur (inc n) (rest sig) (conj result (->> sig
(take (count b-vec))
(reverse)
(map * b-vec)
(apply +)))))))
Upon letting X be (vec (range 1 10001)) and B being [1 2 3], this function takes approximately 6 seconds to run. I was hoping someone could suggest improvements to the run time, whether it be algorithmic, or perhaps a language detail I might be abusing.
Thanks!
P.S. I have done (set! *warn-on-reflection* true) but don't get any reflection warning messages.
You are calling count many times unnecessarily. The code below calculates each count only once:
(defn filt [b-vec x-vec]
(let [bc (count b-vec) xc (count x-vec)]
(loop [n 0 sig x-vec result []]
(if (= n (- xc bc))
result
(recur (inc n) (rest sig) (conj result (->> sig
(take bc)
(reverse)
(map * b-vec)
(apply +))))))))
(time (def b (filt [1 2 3] (range 10000))))
=> "Elapsed time: 50.892536 msecs"
If you really want top performance for this kind of calculation, you should use arrays rather than vectors. Arrays have a number of performance advantages:
They support O(1) indexed lookup and writes - marginally better than vectors which are O(log32 n)
They are mutable, so you don't need to construct new arrays all the time - you can just create a single array to serve as the output buffer
They are represented as Java arrays under the hood, so benefit from the various array optimisations built into the JVM
You can use primitive arrays (e.g. of Java doubles) which are much faster than if you use boxed number objects
Code would be something like:
(defn filt [^doubles b-arr
            ^doubles x-arr]
  (let [bc (alength b-arr)
        xc (alength x-arr)
        rc (inc (- xc bc))
        ^doubles result (double-array rc)]
    (dotimes [i rc]
      (dotimes [j bc]
        ;; walk b backwards so that Y_i = B0*X(i+bc-1) + ... + B(bc-1)*X(i)
        (aset result i (+ (aget result i)
                          (* (aget x-arr (+ i j))
                             (aget b-arr (- bc 1 j)))))))
    result))
To follow on to Ankur's excellent answer, you can also avoid repeated calls to the reverse function, which gets us even a little more performance.
(defn filt [b-vec x-vec]
(let [bc (count b-vec) xc (count x-vec) bb-vec (reverse b-vec)]
(loop [n 0 sig x-vec result []]
(if (= n (- xc bc))
result
(recur (inc n) (rest sig) (conj result (->> sig
(take bc)
(map * bb-vec)
(apply +))))))))
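The same sliding-window idea can also be written with partition, which hands out the windows directly. This is only a sketch: filt-windows is a hypothetical name, and unlike the loop versions above it also emits the final window, so it returns one more element than they do.

```clojure
(defn filt-windows
  "Sum of products of b-vec (reversed) against each window of x-vec."
  [b-vec x-vec]
  (let [rb (reverse b-vec)] ; reverse the coefficients once, up front
    (mapv (fn [window] (reduce + (map * rb window)))
          (partition (count b-vec) 1 x-vec))))

;; Y0 = B0*X2 + B1*X1 + B2*X0 = 1*3 + 2*2 + 3*1 = 10
(filt-windows [1 2 3] [1 2 3 4 5]) ; => [10 16 22]
```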

two methods of composing functions, how different in efficiency?

Let f transform one value to another, then I'm writing a function that repeats the transformation n times.
I have come up with two different ways:
One is the obvious way that literally applies the function n times, so repeat(f, 4) means x → f(f(f(f(x))))
The other way is inspired by the fast method for powering, which means dividing the problem into two problems that are half as large whenever n is even. So repeat(f, 4) means x → g(g(x)) where g(x) = f(f(x))
At first I thought the second method wouldn't improve efficiency that much. At the end of the day, we would still need to apply f n times, wouldn't we? In the above example, g would still be translated into f o f without any further simplification, right?
However, when I tried out the methods, the latter method was noticeably faster.
;; computes the composite of two functions
(define (compose f g)
(lambda (x) (f (g x))))
;; identity function
(define (id x) x)
;; repeats the application of a function, naive way
(define (repeat1 f n)
(define (iter k acc)
(if (= k 0)
acc
(iter (- k 1) (compose f acc))))
(iter n id))
;; repeats the application of a function, divide n conquer way
(define (repeat2 f n)
(define (iter f k acc)
(cond ((= k 0) acc)
((even? k) (iter (compose f f) (/ k 2) acc))
(else (iter f (- k 1) (compose f acc)))))
(iter f n id))
;; increment function used for testing
(define (inc x) (+ x 1))
In fact, ((repeat2 inc 1000000) 0) was much faster than ((repeat1 inc 1000000) 0). My question is: in what respect was the second method more efficient than the first? Did re-using the same function object save storage and reduce the time spent creating new objects?
After all, the application has to be repeated n times, or saying it another way, x→((x+1)+1) cannot be automatically reduced to x→(x+2), right?
I'm running on DrScheme 4.2.1.
Thank you very much.
You're right that both versions do the same number of calls to inc -- but there's more overhead than that in your code. Specifically, the first version creates N closures, whereas the second one creates only log(N) closures -- and if the closure creation is most of the work then you'll see a big difference in performance.
There are three things that you can use to see this in more detail:
Use DrScheme's time special form to measure the speed. In addition to the time that it took to perform some computation, it will also tell you how much time was spent in GC. You will see that the first version is doing some GC work, while the second doesn't. (Well, it does, but it's so little that it will probably not show.)
Your inc function is doing so little that you're measuring only the looping overhead. For example, when I use this bad version:
(define (slow-inc x)
(define (plus1 x)
(/ (if (< (random 10) 5)
(* (+ x 1) 2)
(+ (* x 2) 2))
2))
(- (plus1 (plus1 (plus1 x))) 2))
the difference between the two uses drops from a factor of ~11 to 1.6.
Finally, try this version out:
(define (repeat3 f n)
(lambda (x)
(define (iter n x)
(if (zero? n) x (iter (sub1 n) (f x))))
(iter n x)))
It doesn't do any compositions, and it works at roughly the same speed as your second version.
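The closure-counting argument can be checked with a small experiment. Below is a sketch that ports the two Scheme functions to Clojure (the language used on the rest of this page) and counts both the compositions and the calls to the repeated function. compose*, counted-inc and measure are hypothetical helpers added for the measurement; they are not part of the original code.

```clojure
(def composes (atom 0)) ; how many closures compose* has built
(def calls (atom 0))    ; how many times the repeated function ran

(defn compose* [f g]
  (swap! composes inc)
  (fn [x] (f (g x))))

(defn counted-inc [x]
  (swap! calls inc)
  (inc x))

(defn repeat1 [f n] ; naive: one composition per step
  (loop [k n acc identity]
    (if (zero? k)
      acc
      (recur (dec k) (compose* f acc)))))

(defn repeat2 [f n] ; halves k whenever it is even
  (loop [f f k n acc identity]
    (cond
      (zero? k) acc
      (even? k) (recur (compose* f f) (quot k 2) acc)
      :else     (recur f (dec k) (compose* f acc)))))

(defn measure [repeat-fn n]
  (reset! composes 0)
  (reset! calls 0)
  (let [result ((repeat-fn counted-inc n) 0)]
    {:result result :composes @composes :calls @calls}))

(measure repeat1 1000) ; => {:result 1000, :composes 1000, :calls 1000}
(measure repeat2 1000) ; => {:result 1000, :composes 15, :calls 1000}
```

Both strategies call the function 1000 times; only the number of closures built differs, which matches the explanation above.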
The first method calls compose n times to build the result, so constructing the composed function is O(n). The second method halves k whenever it is even, so much of the time the size of the problem is halved rather than merely decreased by 1, and the result is built with only O(log(n)) calls to compose. Applying the resulting function still invokes f n times in both cases, so the saving is in building the closures, not in the number of applications.
As Martinho Fernandez suggested, the Wikipedia article on exponentiation by squaring explains it very clearly.

Resources