I am performing element-wise operations on two vectors of about 50,000 elements each, and performance is unsatisfactory (a few seconds). Are there any obvious performance improvements to be made, such as using a different data structure?
(defn boolean-compare
"Sums 1 for each mismatched pair and 0 for each match (inputs are 0s and 1s)"
[proposal-img data-img]
(sum
(map
#(Math/abs (- (first %) (second %)))
(partition 2 (interleave proposal-img data-img)))))
Try this:
(apply + (map bit-xor proposal-img data-img))
Some notes:
mapping a function to several collections uses an element from each as the arguments to the function - no need to interleave and partition for this.
If your data is 1's and 0's, then xor will be faster than absolute difference
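As a small illustration of the first note (the sample vectors a and b are made up for this sketch), map called with two collections passes one element from each to the function, which is all the interleave/partition step was achieving:

```clojure
;; Hypothetical sample data: two short 0/1 vectors.
(def a [1 0 1 1 0])
(def b [1 1 0 1 0])

;; map walks both collections in step, calling (bit-xor x y) on each pair.
(def diffs (map bit-xor a b))          ; 1 wherever the elements differ

;; Summing the xor results counts the mismatching positions.
(def mismatches (reduce + diffs))

;; The same count via the original pairing approach, for comparison.
(def mismatches-slow
  (reduce + (map #(Math/abs (long (- (first %) (second %))))
                 (partition 2 (interleave a b)))))
```

Both mismatches and mismatches-slow should come out the same; the first simply skips building the intermediate pairs.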
Timed example:
(def data-img (repeatedly 50000 #(rand-int 2)))
(def proposal-img (repeatedly 50000 #(rand-int 2)))
(def sum (partial apply +))
After warming up the JVM...
(time (boolean-compare proposal-img data-img))
;=> "Elapsed time: 528.731093 msecs"
;=> 24802
(time (apply + (map bit-xor proposal-img data-img)))
;=> "Elapsed time: 22.481255 msecs"
;=> 24802
You should look at adopting core.matrix if you are interested in good performance for large vector operations.
In particular, the vectorz-clj library (a core.matrix implementation) has some very fast implementations for most common vector operations with double values.
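(use 'clojure.core.matrix)
(set-current-implementation :vectorz) ;; requires vectorz-clj on the classpath
(def v1 (array (repeatedly 50000 #(rand-int 2))))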
(def v2 (array (repeatedly 50000 #(rand-int 2))))
(time (let [d (sub v2 v1)] ;; take difference of two vectors
(.abs d) ;; calculate absolute value (mutate d)
(esum d))) ;; sum elements and return result
=> "Elapsed time: 0.949985 msecs"
=> 24980.0
i.e. under 20ns per pair of elements - that's pretty quick: you'd be hard pressed to beat that without resorting to low-level array-fiddling code.
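For comparison, the "low-level array-fiddling" alternative might look roughly like the sketch below. It assumes the images are stored as primitive long arrays holding only 0s and 1s; count-mismatches is a name invented for this example and is not part of core.matrix or vectorz-clj.

```clojure
(defn count-mismatches
  "Counts positions where two equal-length 0/1 long arrays differ."
  ^long [^longs a ^longs b]
  ;; areduce loops over the indices of a with a primitive accumulator.
  (areduce a i acc 0
           (+ acc (bit-xor (aget a i) (aget b i)))))

(count-mismatches (long-array [1 0 1 1 0])
                  (long-array [1 1 0 1 0]))
;=> 2
```

Whether this beats the library version would need to be measured on your own data; it mainly trades readability for avoiding boxed numbers.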
I'm new to Clojure. I have the following code, which creates an infinite lazy sequence of numbers:
(defn generator [seed factor]
(drop 1 (reductions
(fn [acc _] (mod (* acc factor) 2147483647))
seed
; using dummy infinite seq to keep the reductions going
(repeat 1))))
Each number in the sequence is dependent on the previous calculation. I'm using reductions because I need all the intermediate results.
I then instantiate two generators like so:
(def gen-a (generator 59 16807))
(def gen-b (generator 393 48271))
I then want to compare n consecutive results of these sequences, for large n, and return the number of times they are equal.
At first I did something like:
(defn run []
(->> (interleave gen-a gen-b)
(partition 2)
(take 40000000)
(filter #(apply = %))
(count)))
It was taking far too long and I saw the program's memory usage spike to about 4GB. With some printlns I saw that after about 10 million iterations it got really slow, so I was thinking that maybe count needed to store the entire sequence in memory, so I changed it to use reduce:
(defn run-2 []
(reduce
(fn [acc [a b]]
(if (= a b)
(inc acc)
acc))
0
(take 40000000 (partition 2 (interleave gen-a gen-b)))))
Still, it was allocating a lot of memory and slowing down significantly after the first couple of millions. I'm pretty sure that it's storing the entire lazy sequence in memory but I'm not sure why, so I tried to manually throw away the head:
(defn run-3 []
(loop [xs (take 40000000 (partition 2 (interleave gen-a gen-b)))
total 0]
(cond
(empty? xs) total
(apply = (first xs)) (recur (rest xs) (inc total))
:else (recur (rest xs) total))))
Again, same results. This stumped me because I'm reading that all of the functions I'm using to create my xs sequence are lazy, and since I'm only using the current item I'm expecting it to use constant memory.
Coming from a Python background I'm basically trying to emulate Python Generators. I'm probably missing something obvious, so I'd really appreciate some pointers. Thanks!
Generators are not (lazy) sequences.
You are holding on to the head here:
(def gen-a (generator 59 16807))
(def gen-b (generator 393 48271))
gen-a and gen-b are global vars referring to the head of a sequence, so nothing that has been realized can be garbage collected.
You probably want something like:
(defn run []
(->> (interleave (generator 59 16807) (generator 393 48271))
(partition 2)
(take 40000000)
(filter #(apply = %))
(count)))
Alternatively, define gen-a and gen-b as functions:
(defn gen-a
[]
(generator 59 16807))
...
(defn run []
(->> (interleave (gen-a) (gen-b))
(partition 2)
(take 40000000)
(filter #(apply = %))
(count)))
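A possible variation on the same idea, offered as a sketch rather than the answer's own code: iterate can replace reductions over a dummy sequence, and pairing the two sequences with map = avoids the interleave/partition step. count-matches is a name invented here.

```clojure
(defn generator
  "Infinite lazy seq: each value is the previous one times factor, mod 2147483647."
  [seed factor]
  ;; iterate yields seed, (f seed), (f (f seed)), ...; rest drops the seed itself.
  (rest (iterate #(mod (* % factor) 2147483647) seed)))

(defn count-matches
  "Counts how often the two generators agree within the first n values."
  [n]
  ;; Both sequences are created inside the function, so no var holds their heads.
  (->> (map = (generator 59 16807) (generator 393 48271))
       (take n)
       (filter true?)
       count))
```

Calling (count-matches 40000000) then walks both sequences without any global reference to their heads.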
You can get Python-style generator functions in Clojure using the Tupelo library. Just use lazy-gen and yield like so:
(ns tst.demo.core
(:use tupelo.test)
(:require
[tupelo.core :as t] ))
(defn rand-gen
[seed factor]
(t/lazy-gen
(loop [acc seed]
(let [next (mod (* acc factor) 2147483647)]
(t/yield next)
(recur next)))))
(defn run2 [num-rand]
(->> (interleave
; restrict to [0..99] to simulate bad rand #'s
(map #(mod % 100) (rand-gen 59 16807))
(map #(mod % 100) (rand-gen 393 48271)))
(partition 2)
(take num-rand)
(filter #(apply = %))
(count)))
(t/spyx (time (run2 1e5))) ; expect ~1% will overlap => 1e3
(t/spyx (time (run2 1e6))) ; expect ~1% will overlap => 1e4
(t/spyx (time (run2 1e7))) ; expect ~1% will overlap => 1e5
with result:
"Elapsed time: 409.697922 msecs" (time (run2 100000.0)) => 1025
"Elapsed time: 3250.592798 msecs" (time (run2 1000000.0)) => 9970
"Elapsed time: 32995.194574 msecs" (time (run2 1.0E7)) => 100068
Rather than using reductions, you could build a lazy sequence directly. This answer uses lazy-cons from the Tupelo library (you could also use lazy-seq from clojure.core).
(ns tst.demo.core
(:use tupelo.test)
(:require
[tupelo.core :as t] ))
(defn rand-gen
[seed factor]
(let [next (mod (* seed factor) 2147483647)]
(t/lazy-cons next (rand-gen next factor))))
(defn run2 [num-rand]
(->> (interleave
; restrict to [0..99] to simulate bad rand #'s
(map #(mod % 100) (rand-gen 59 16807))
(map #(mod % 100) (rand-gen 393 48271)))
(partition 2)
(take num-rand)
(filter #(apply = %))
(count)))
(t/spyx (time (run2 1e5))) ; expect ~1% will overlap => 1e3
(t/spyx (time (run2 1e6))) ; expect ~1% will overlap => 1e4
(t/spyx (time (run2 1e7))) ; expect ~1% will overlap => 1e5
with results:
"Elapsed time: 90.42 msecs" (time (run2 100000.0)) => 1025
"Elapsed time: 862.60 msecs" (time (run2 1000000.0)) => 9970
"Elapsed time: 8474.25 msecs" (time (run2 1.0E7)) => 100068
Note that the execution times are about 4x faster, since we have cut out the generator function stuff that we weren't really using anyway.
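For readers who would rather not add a dependency, the clojure.core equivalent mentioned above could be sketched as follows. rand-gen-core is a name chosen for this example to avoid clashing with rand-gen; it is not from Tupelo.

```clojure
(defn rand-gen-core
  "Lazy seq of pseudo-random numbers built directly with lazy-seq and cons."
  [seed factor]
  ;; lazy-seq defers the body until the sequence is first consumed.
  (lazy-seq
    (let [nxt (mod (* seed factor) 2147483647)]
      (cons nxt (rand-gen-core nxt factor)))))

(first (rand-gen-core 59 16807))
;=> 991613
```

It can be dropped into run2 in place of rand-gen; the timings above were measured with the Tupelo version, so they may not carry over exactly.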
Below, I have 2 functions computing the sum of squares of their arguments. The first one is nice and functional, but 20x slower than the second one. I presume that the r/map is not taking advantage of aget to retrieve elements from the double-array, whereas I'm explicitly doing this in function 2.
Is there any way I can further typehint or help r/map r/fold to perform faster?
(defn sum-of-squares
"Given a vector v, compute the sum of the squares of elements."
^double [^doubles v]
(r/fold + (r/map #(* % %) v)))
(defn sum-of-squares2
"This is much faster than above. Post to stack-overflow to see."
^double [^doubles v]
(loop [val 0.0
i (dec (alength v))]
(if (neg? i)
val
(let [x (aget v i)]
(recur (+ val (* x x)) (dec i))))))
(def a (double-array (range 10)))
(quick-bench (sum-of-squares a))
800 ns
(quick-bench (sum-of-squares2 a))
40 ns
Before running the experiments I added the following line to project.clj:
:jvm-opts ^:replace [] ; Makes measurements more accurate
Basic measurements:
(def a (double-array (range 1000000))) ; 10 is too small for performance measurements
(quick-bench (sum-of-squares a)) ; ... Execution time mean : 27.617748 ms ...
(quick-bench (sum-of-squares2 a)) ; ... Execution time mean : 1.259175 ms ...
This is more or less consistent with the time difference in the question. Let's try not using Java arrays (which are not really idiomatic Clojure):
(def b (mapv (partial * 1.0) (range 1000000))) ; Persistent vector
(quick-bench (sum-of-squares b)) ; ... Execution time mean : 14.808644 ms ...
Almost 2 times faster. Now let's remove type hints:
(defn sum-of-squares3
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold + (r/map #(* % %) v)))
(quick-bench (sum-of-squares3 a)) ; Execution time mean : 30.392206 ms
(quick-bench (sum-of-squares3 b)) ; Execution time mean : 15.583379 ms
Execution time increased only marginally compared to the version with type hints. By the way, a version using transducers has very similar performance and is much cleaner:
(defn sum-of-squares3 [v]
(transduce (map #(* % %)) + v))
Now, about additional type hinting. We can indeed optimize the first sum-of-squares implementation:
(defn square ^double [^double x] (* x x))
(defn sum-of-squares4
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold + (r/map square v)))
(quick-bench (sum-of-squares4 b)) ; ... Execution time mean : 12.891831 ms ...
(defn pl
(^double [] 0.0)
(^double [^double x] (+ x))
(^double [^double x ^double y] (+ x y)))
(defn sum-of-squares5
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold pl (r/map square v)))
(quick-bench (sum-of-squares5 b)) ; ... Execution time mean : 9.441748 ms ...
Note #1: type hints on the arguments and return value of sum-of-squares4 and sum-of-squares5 give no additional performance benefit.
Note #2: It's generally bad practice to start with optimizations. The straightforward version (apply + (map square v)) will perform well enough for most situations. sum-of-squares2 is very far from idiomatic and uses essentially no Clojure concepts. If this is really performance-critical code, it is better to implement it in Java and use interop; the code will be much cleaner despite involving two languages. You could even implement it in unmanaged code (C, C++) and use JNI (not really maintainable, but if properly implemented it can give the best possible performance).
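As a quick sanity check of Note #2, the straightforward version and the transducer form can be compared on a small input. This is only a sketch; the names sum-of-squares-simple, sum-of-squares-xf and small are invented here, and the explicit 0.0 initial value is an addition.

```clojure
(defn square ^double [^double x] (* x x))

(defn sum-of-squares-simple
  "The straightforward version from Note #2."
  [v]
  (apply + (map square v)))

(defn sum-of-squares-xf
  "Transducer form, with an explicit 0.0 as the initial value."
  [v]
  (transduce (map square) + 0.0 v))

(def small (mapv double (range 10)))   ; [0.0 1.0 ... 9.0]
(sum-of-squares-simple small) ;=> 285.0
(sum-of-squares-xf small)     ;=> 285.0
```

Starting from a version like this and benchmarking before optimizing keeps the faster variants above as a fallback rather than a default.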
Why not use areduce:
(defn sum-of-squares3 ^double [^doubles v]
(areduce v idx ret 0.0
(let [item (aget v idx)]
(+ ret (* item item)))))
On my machine running:
(criterium/bench (sum-of-squares3 (double-array (range 100000))))
Gives a mean execution time of 1.809103 ms, your sum-of-squares2 executes the same calculation in 1.455775 ms. I think this version using areduce is more idiomatic than your version.
To squeeze out a little more performance you can try using unchecked math (unchecked-add and unchecked-multiply). But beware: you need to be sure that your calculation cannot overflow:
(defn sum-of-squares4 ^double [^doubles v]
(areduce v idx ret 0.0
(let [item (aget v idx)]
(unchecked-add ret (unchecked-multiply item item)))))
Running the same benchmark gives a mean execution time of 1.144197 ms. Your sum-of-squares2 can also benefit from unchecked math with a 1.126001 ms mean execution time.
I wonder how this can be done in Clojure idiomatically and efficiently:
1) Given a vector containing n integers in it: [A0 A1 A2 A3 ... An]
2) Increase the last x items by 1 (let's say x is 100) so the vector will become: [A0 A1 A2 A3 ... (An-99 + 1) (An-98 + 1)... (An-1 + 1) (An + 1)]
One naive implementation looks like:
(defn inc-last [x nums]
(let [n (count nums)]
(map #(if (>= % (- n x)) (inc %2) %2)
(range n)
nums)))
(inc-last 2 [1 2 3 4])
;=> (1 2 4 5)
In this implementation, you basically map the entire vector to a new sequence, examining each item to see whether it needs to be increased.
However, this is an O(n) operation while I only want to change the last x items in the vector. Ideally, this should be done in O(x) instead of O(n).
I am considering using some functions like split-at/concat to implement it like below:
(defn inc-last [x nums]
(let [[nums1 nums2] (split-at (- (count nums) x) nums)]
(concat nums1 (map inc nums2))))
However, I am not sure if this implementation is O(n) or O(x). I am new to Clojure and not really sure what the time complexity will be for operations like concat/split-at on persistent data structures in Clojure.
So my questions are:
1) What the time complexity here in second implementation?
2) If it is still O(n), is there any idiomatic and efficient implementation that takes only O(x) in Clojure for solving this problem?
Any comment is appreciated. Thanks.
Update:
noisesmith's answer told me that split-at will convert the vector into a list, which I had not realised previously. Since I will do random access on the result (calling nth after processing the vector), I would like an efficient solution (O(x) time) that keeps the result a vector rather than a list; otherwise nth will slow down my program as well.
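concat and split-at both turn the input into a seq, effectively a linked-list representation, so the whole input gets walked: the time is proportional to the size of the vector. Here is how to do it with a vector in time proportional to the number of items being incremented. Note that in the code below n is the number of trailing items and x is the vector, the reverse of the question's naming.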
user> (defn inc-last-n
[n x]
(let [count (count x)
update (fn [x i] (update-in x [i] inc))]
(reduce update x (range (- count n) count))))
#'user/inc-last-n
user> (inc-last-n 3 [0 1 2 3 4 5 6])
[0 1 2 3 5 6 7]
This will fail on input that is not associative (like seq / lazy-seq) because there is no O(1) access time in non-associative types.
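To see the seq conversion concretely, and one possible way to stay with vectors without indexing inside a reduce, here is a hedged sketch. inc-last-subvec is a name invented for this example; it assumes nums is a vector and x is no larger than (count nums).

```clojure
;; split-at hands back seqs, not vectors, so fast indexed access is lost.
(def parts (split-at 2 [1 2 3 4]))
(vector? (first parts)) ;=> false

(defn inc-last-subvec
  "Increments the last x items of the vector nums, returning a vector."
  [x nums]
  (let [k (- (count nums) x)]
    ;; subvec is O(1); into then conjoins the x incremented items onto the prefix.
    (into (subvec nums 0 k) (map inc) (subvec nums k))))

(inc-last-subvec 2 [1 2 3 4]) ;=> [1 2 4 5]
```

The work done should be roughly proportional to x rather than to the size of the vector, though that is worth confirming with a benchmark on your own data.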
inc-last is an implementation using a transient, which lets you obtain a modifiable "in place" vector in constant time and turn it back into a persistent vector with persistent!, also in constant time, so the updates can be made in O(x). The original implementation used an imperative doseq loop but, as mentioned in the comments, transient operations can return a new object, so it's better to keep doing things in a functional way.
I added a doall to the call to inc-last-2 since it returns a lazy seq, whereas inc-last and inc-last-3 return vectors; the doall is needed to compare them fairly.
According to some quick tests I made, inc-last and inc-last-3 don't actually differ much in performance, not even for huge vectors (10000000 elements). The inc-last-2 implementation, though, shows quite a difference even for a vector of 1000 elements: modifying only the last 10, it's ~100x slower. For smaller vectors, or when n is close to (count nums), the difference is not that large.
(Thanks to Michał Marczyk for his useful comments)
(def x (vec (range 1000)))
(defn inc-last [n x]
(let [x (transient x)
l (count x)]
(->>
(range (- l n) l)
(reduce #(assoc! %1 %2 (inc (%1 %2))) x)
persistent!)))
(defn inc-last-2 [x nums]
(let [n (count nums)]
(map #(if (>= % (- n x)) (inc %2) %2)
(range n)
nums)))
(defn inc-last-3 [n x]
(let [l (count x)]
(reduce #(assoc %1 %2 (inc (%1 %2))) x (range (- l n) l))))
(time
(dotimes [i 100]
(inc-last 50 x)))
(time
(dotimes [i 100]
(doall (inc-last-2 10 x))))
(time
(dotimes [i 100]
(inc-last-3 50 x)))
;=> "Elapsed time: 49.7965 msecs"
;=> "Elapsed time: 1751.964501 msecs"
;=> "Elapsed time: 67.651 msecs"
This seems slow:
(time (doall (map + (range 1000000) (range 1000000))))
"Elapsed time: 13951.664454 msecs"
How to do it faster?
For starters, range does not make an array, it makes a lazy-seq.
The fastest way to add two collections of numbers is probably going to involve having them in arrays first, and doing an iterative loop instead of a map.
user> (time (let [a (int-array (range 1000000))
b (int-array (range 1000000))]
(dotimes [i 1000000]
(aset a i (+ (aget b i) (aget a i))))
a))
"Elapsed time: 771.100395 msecs"
#<int[] [I#4233eba0>
user>
Note that this still includes the overhead of creating and realizing the lazy seqs from the two range calls; in practice you would likely already have that data constructed before getting to the summation step.
Unless this is a performance bottleneck in your code, doing things this way would imply you shouldn't be using Clojure in the first place. The advantage of using Clojure is that you get high-level immutable data structures, which lead to referentially transparent and parallelizable code. Once you drop down to raw JVM types like arrays, you lose these advantages (in exchange for better performance).
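If arrays are warranted, a middle ground is amap from clojure.core, which returns a fresh array instead of mutating an input. This is a sketch assuming primitive long arrays; add-arrays is a name invented for this example.

```clojure
(defn add-arrays
  "Returns a new long array holding the element-wise sum of a and b."
  ^longs [^longs a ^longs b]
  ;; amap clones a into ret, then sets ret[i] to the expression for every index.
  (amap a i ret (+ (aget a i) (aget b i))))

(vec (add-arrays (long-array [1 2 3]) (long-array [10 20 30])))
;=> [11 22 33]
```

Because neither input is modified, callers can still treat the function as pure, at the cost of one extra array allocation.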
You might be interested in Prismatic's "open-source array processing library HipHip, which combines Clojure's expressiveness with the fastest math Java has to offer".
I just had a quick go with it and it does seem to offer a nice compromise between expressiveness and performance:
Note: I'm using Criterium to benchmark this as it reduces some of the problems with benchmarking on the JVM.
(require '[criterium.core :refer [quick-bench]])
(quick-bench (doall (map + (range 1000000) (range 1000000))))
;=> "Execution time mean : 791.955406 ms"
(require '[hiphip.int :as h])
(quick-bench (h/amap [x (h/amake [i 1000000] i)
y (h/amake [i 1000000] i)]
(+ x y)))
;=> "Execution time mean : 20.540645 ms"
I was wondering if someone could help me with the performance of this code snippet in Clojure 1.3. I am trying to implement a simple function that takes two vectors and does a sum of products.
So let's say the vectors are X (size 10,000 elements) and B (size 3 elements), and the sum of products are stored in a vector Y, mathematically it looks like this:
Y0 = B0*X2 + B1*X1 + B2*X0
Y1 = B0*X3 + B1*X2 + B2*X1
Y2 = B0*X4 + B1*X3 + B2*X2
and so on ...
For this example, the size of Y will end up being 9997, which corresponds to (10,000 - 3). I've set up the function to accept any size of X and B.
Here's the code: It basically takes (count b) elements at a time from X, reverses it, maps * onto B and sums the contents of the resulting sequence to produce an element of Y.
(defn filt [b-vec x-vec]
(loop [n 0 sig x-vec result []]
(if (= n (- (count x-vec) (count b-vec)))
result
(recur (inc n) (rest sig) (conj result (->> sig
(take (count b-vec))
(reverse)
(map * b-vec)
(apply +)))))))
Upon letting X be (vec (range 1 10001)) and B being [1 2 3], this function takes approximately 6 seconds to run. I was hoping someone could suggest improvements to the run time, whether it be algorithmic, or perhaps a language detail I might be abusing.
Thanks!
P.S. I have done (set! *warn-on-reflection* true) but don't get any reflection warning messages.
You are calling count many times unnecessarily. The code below calculates each count only once:
(defn filt [b-vec x-vec]
(let [bc (count b-vec) xc (count x-vec)]
(loop [n 0 sig x-vec result []]
(if (= n (- xc bc))
result
(recur (inc n) (rest sig) (conj result (->> sig
(take bc)
(reverse)
(map * b-vec)
(apply +))))))))
(time (def b (filt [1 2 3] (range 10000))))
=> "Elapsed time: 50.892536 msecs"
If you really want top performance for this kind of calculation, you should use arrays rather than vectors. Arrays have a number of performance advantages:
They support O(1) indexed lookup and writes - marginally better than vectors which are O(log32 n)
They are mutable, so you don't need to construct new arrays all the time - you can just create a single array to serve as the output buffer
They are represented as Java arrays under the hood, so benefit from the various array optimisations built into the JVM
You can use primitive arrays (e.g. of Java doubles) which are much faster than if you use boxed number objects
Code would be something like:
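(defn filt [^doubles b-arr
^doubles x-arr]
(let [bc (alength b-arr) ;; alength avoids the generic count path for arrays
xc (alength x-arr)
rc (inc (- xc bc)) ;; number of full windows (one more than the loop version in the question produces)
result ^doubles (double-array rc)] ;; output buffer, initialised to 0.0
(dotimes [i rc]
(dotimes [j bc]
;; b is applied in reverse order, matching Y0 = B0*X2 + B1*X1 + B2*X0
(aset result i (+ (aget result i) (* (aget x-arr (+ i j)) (aget b-arr (- bc 1 j)))))))
result))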
To follow on to Ankur's excellent answer, you can also avoid repeated calls to the reverse function, which gets us even a little more performance.
(defn filt [b-vec x-vec]
(let [bc (count b-vec) xc (count x-vec) bb-vec (reverse b-vec)]
(loop [n 0 sig x-vec result []]
(if (= n (- xc bc))
result
(recur (inc n) (rest sig) (conj result (->> sig
(take bc)
(map * bb-vec)
(apply +))))))))
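The "take (count b) elements at a time" step can also be expressed with partition and a step of 1, which removes the explicit loop. This is only a sketch under the assumption that every full window is wanted: it returns one more element than the loop versions above, which stop one window early. filt-windows is a name invented for this example.

```clojure
(defn filt-windows
  "Sum of products of b (reversed) with each full sliding window of x."
  [b-vec x-vec]
  (let [rb (reverse b-vec)]   ; reverse the coefficients once, up front
    ;; (partition n 1 coll) yields every full sliding window of n elements.
    (mapv (fn [window] (reduce + (map * rb window)))
          (partition (count b-vec) 1 x-vec))))

(filt-windows [1 2 3] [1 2 3 4 5])
;=> [10 16 22]
```

The first result matches the definition in the question: Y0 = B0*X2 + B1*X1 + B2*X0 = 1*3 + 2*2 + 3*1 = 10.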