I'm working on a Commodore 64 emulator as a fun project with functional programming. My goal was to write the entire thing functionally and as purely as possible. I was looking at using a hash table as my memory store, but the performance of immutable hashes compared to mutable ones seems prohibitive. I liked the idea of a hash table as a kind of sparse array of memory, since in many cases memory won't actually be instantiated. I'd be fine using a vector as well, but there doesn't seem to be a functional version of vector-set.
(define (immut-hash [c (hash)] [r 10000000])
  (when (> r 0)
    (immut-hash (hash-set c (random #xffff) (random #xff)) (- r 1))))

(define (mut-hash [c (make-hash)] [r 10000000])
  (when (> r 0)
    (hash-set! c (random #xffff) (random #xff))
    (mut-hash c (- r 1))))
(time (immut-hash)) is much worse than (time (mut-hash)) as a simulation of a bunch of memory pokes, and it puts the emulator beyond my MacBook Pro's ability to keep up with a C64 clock rate.
(a) Is there any better approach to improve the performance of the mutable hashes in this case?
(b) If not, is there another functional approach people would suggest?
Note - I know this isn't likely the right solution for absolute performance. Like I said: learning.
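The sparse-memory idea in the question is language-independent; here is a minimal Python sketch of a persistently-updated sparse memory (the `poke` helper is hypothetical, not part of any emulator API), where an update returns a new map and leaves the old one untouched:

```python
def poke(mem, addr, value):
    """Return a new sparse memory with addr set; the original dict is unchanged."""
    new_mem = dict(mem)            # full copy here is O(n); persistent tries
    new_mem[addr] = value & 0xFF   # (like Racket's immutable hashes) make it O(log n)
    return new_mem

ram0 = {}                          # never-touched addresses are simply absent
ram1 = poke(ram0, 0xD020, 0x0E)

assert ram0 == {}                  # the old snapshot is untouched
assert ram1[0xD020] == 0x0E
assert ram1.get(0x0400, 0) == 0    # unset addresses read as a default
```

The copy-on-write dict makes the cost explicit; Racket's immutable hashes give the same semantics with a much cheaper structural-sharing update.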
I know this is an old discussion, but it is the top search hit for the performance of Racket's hash-set (i.e. the immutable, functional way of setting a hash key-value pair). Since 2019, when this question was posted and answered, the underlying Racket engine has changed to use Chez Scheme, and the performance ratios have also changed significantly.
Rerunning the above tests (I've included mutable vector operations as well, since the OP mentioned it):
#lang racket
(define (immut-hash [c (hash)] [r 10000000])
  (when (> r 0)
    (immut-hash (hash-set c (random #xffff) (random #xff)) (- r 1))))

(define (mut-hash [c (make-hash)] [r 10000000])
  (when (> r 0)
    (hash-set! c (random #xffff) (random #xff))
    (mut-hash c (- r 1))))

(define (mut-vec [c (make-vector 65536)] [r 10000000])
  (when (> r 0)
    (vector-set! c (random #xffff) (random #xff))
    (mut-vec c (- r 1))))
(time (immut-hash (hash)))
(time (immut-hash (hasheq)))
(time (mut-hash (make-hash)))
(time (mut-hash (make-hasheq)))
(time (mut-vec))
produces the following results:
cpu time: 4024 real time: 4409 gc time: 198
cpu time: 3991 real time: 4334 gc time: 188
cpu time: 2532 real time: 2631 gc time: 17
cpu time: 2432 real time: 2524 gc time: 21
cpu time: 1985 real time: 2173 gc time: 11
Conclusions from the year 2021 (using Racket's new Chez Scheme 8.x engine):
The performance degradation from using hash/make-hash instead of hasheq/make-hasheq has essentially been eliminated.
The performance degradation from using immutable hashes instead of mutable hashes has gone from over 4x to less than 2x.
The worst case scenario (immutable hash) is now only 2x worse than the best case scenario (mutable vectors).
If you know that the keys of your hash will be fixnums, you could use hasheq (or make-hasheq) instead of hash (or make-hash). This gives better performance, at least for the Racket 7.4 3m variant on my MacBook Pro.
#lang racket
(define (immut-hash [c (hash)] [r 10000000])
  (when (> r 0)
    (immut-hash (hash-set c (random #xffff) (random #xff)) (- r 1))))

(define (mut-hash [c (make-hash)] [r 10000000])
  (when (> r 0)
    (hash-set! c (random #xffff) (random #xff))
    (mut-hash c (- r 1))))
(time (immut-hash (hash)))
(time (immut-hash (hasheq)))
(time (mut-hash (make-hash)))
(time (mut-hash (make-hasheq)))
Here are the results:
cpu time: 9383 real time: 9447 gc time: 3181
cpu time: 6644 real time: 6658 gc time: 1105
cpu time: 2220 real time: 2225 gc time: 0
cpu time: 1647 real time: 1654 gc time: 0
There's a recent thread about the performance of immutable hashes. Jon compared the performance of immutable hashes implemented as a Patricia trie vs. a hash array mapped trie (HAMT), the hash type (eq? vs. equal?), and the insertion order. You might want to take a look at the results.
In the Clojure documentation on type hinting, there is the following example of how type hinting and coercions can make code run much faster:
(defn foo [n]
  (loop [i 0]
    (if (< i n)
      (recur (inc i))
      i)))
(time (foo 100000))
"Elapsed time: 0.391 msecs"
100000
(defn foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (inc i))
        i))))
(time (foo2 100000))
"Elapsed time: 0.084 msecs"
100000
If you run this code with (set! *warn-on-reflection* true), it doesn't show a reflection warning. Is it up to programmer trial-and-error to see where these kinds of adornments make a performance difference? Or is there a tool that indicates the problematic areas?
Well, you can estimate this pretty well just by thinking about which parts of the code get hit often.
Or you could use a normal profiler of some sort. I would recommend VisualVM, which you can get to work with Clojure. Then you just place hints in the methods you see taking most of the time (the profiler will also show you calls to java.lang.reflect.Method; if that gets called a lot, you should consider using type hints).
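The profile-first workflow described here is not Clojure-specific; a minimal Python sketch with the standard-library profiler (the `hot` function is just a stand-in workload) shows the idea of finding the methods that take most of the time before adding any hints:

```python
import cProfile
import io
import pstats

def hot(n):
    # Stand-in for a hot inner loop worth optimizing.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = hot(100_000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)

assert result == 333328333350000   # sum of i^2 for i < 100000
assert "hot" in report.getvalue()  # the profiler names the hot function
```

Once the report names the hot function, that is where the hinting (or rewriting) effort pays off.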
Below, I have two functions computing the sum of squares of their arguments. The first one is nice and functional, but 20x slower than the second. I presume that r/map is not taking advantage of aget to retrieve elements from the double array, whereas I'm doing this explicitly in the second function.
Is there any way I can further typehint or help r/map r/fold to perform faster?
(require '[clojure.core.reducers :as r])

(defn sum-of-squares
  "Given a vector v, compute the sum of the squares of elements."
  ^double [^doubles v]
  (r/fold + (r/map #(* % %) v)))
(defn sum-of-squares2
  "This is much faster than above. Post to stack-overflow to see."
  ^double [^doubles v]
  (loop [val 0.0
         i (dec (alength v))]
    (if (neg? i)
      val
      (let [x (aget v i)]
        (recur (+ val (* x x)) (dec i))))))
(def a (double-array (range 10)))
(quick-bench (sum-of-squares a))
800 ns
(quick-bench (sum-of-squares2 a))
40 ns
Before the experiments I added the following line in project.clj:
:jvm-opts ^:replace [] ; Makes measurements more accurate
Basic measurements:
(def a (double-array (range 1000000))) ; 10 is too small for performance measurements
(quick-bench (sum-of-squares a)) ; ... Execution time mean : 27.617748 ms ...
(quick-bench (sum-of-squares2 a)) ; ... Execution time mean : 1.259175 ms ...
This is more or less consistent with the time difference in the question. Let's try not using Java arrays (which are not really idiomatic for Clojure):
(def b (mapv (partial * 1.0) (range 1000000))) ; Persistent vector
(quick-bench (sum-of-squares b)) ; ... Execution time mean : 14.808644 ms ...
Almost 2 times faster. Now let's remove type hints:
(defn sum-of-squares3
  "Given a vector v, compute the sum of the squares of elements."
  [v]
  (r/fold + (r/map #(* % %) v)))
(quick-bench (sum-of-squares3 a)) ; Execution time mean : 30.392206 ms
(quick-bench (sum-of-squares3 b)) ; Execution time mean : 15.583379 ms
Execution time increased only marginally compared to the version with type hints. By the way, the version with transducers has very similar performance and is much cleaner:
(defn sum-of-squares3 [v]
  (transduce (map #(* % %)) + v))
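The transducer version is fast because the mapping step is fused into the fold, so no intermediate collection is built. The same effect, sketched in Python with a generator expression (illustrative only, not Clojure semantics):

```python
v = [float(x) for x in range(10)]

# Eager pipeline: map builds a full intermediate list before summing.
eager = sum(list(map(lambda x: x * x, v)))

# Fused pipeline: the generator feeds squares straight into sum,
# analogous to a transducer fusing the map into the reduce.
fused = sum(x * x for x in v)

assert eager == fused == 285.0
```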
Now about additional type hinting. We can indeed optimize first sum-of-squares implementation:
(defn square ^double [^double x] (* x x))
(defn sum-of-squares4
  "Given a vector v, compute the sum of the squares of elements."
  [v]
  (r/fold + (r/map square v)))
(quick-bench (sum-of-squares4 b)) ; ... Execution time mean : 12.891831 ms ...
(defn pl
  (^double [] 0.0)
  (^double [^double x] (+ x))
  (^double [^double x ^double y] (+ x y)))
(defn sum-of-squares5
  "Given a vector v, compute the sum of the squares of elements."
  [v]
  (r/fold pl (r/map square v)))
(quick-bench (sum-of-squares5 b)) ; ... Execution time mean : 9.441748 ms ...
Note #1: type hints on arguments and return value of sum-of-squares4 and sum-of-squares5 have no additional performance benefits.
Note #2: It's generally bad practice to start with optimizations. The straightforward version (apply + (map square v)) will have good enough performance for most situations. sum-of-squares2 is very far from idiomatic and uses literally no Clojure concepts. If this really is performance-critical code, it's better to implement it in Java and use interop; the code will be much cleaner despite having two languages. Or even implement it in unmanaged code (C, C++) and use JNI (not really maintainable, but if properly implemented it can give the best possible performance).
Why not use areduce:
(defn sum-of-squares3 ^double [^doubles v]
  (areduce v idx ret 0.0
    (let [item (aget v idx)]
      (+ ret (* item item)))))
On my machine running:
(criterium/bench (sum-of-squares3 (double-array (range 100000))))
Gives a mean execution time of 1.809103 ms; your sum-of-squares2 executes the same calculation in 1.455775 ms. I think this version using areduce is more idiomatic than yours.
For squeezing out a little more performance you can try unchecked math (unchecked-add and unchecked-multiply). But beware: you need to be sure that your calculation cannot overflow:
(defn sum-of-squares4 ^double [^doubles v]
  (areduce v idx ret 0.0
    (let [item (aget v idx)]
      (unchecked-add ret (unchecked-multiply item item)))))
Running the same benchmark gives a mean execution time of 1.144197 ms. Your sum-of-squares2 can also benefit from unchecked math with a 1.126001 ms mean execution time.
Closed. This question needs details or clarity. It is not currently accepting answers. (Closed 8 years ago.)
I found an interesting thing: argument passing may deserve consideration, especially when running time matters. The code is below.
(define (collatz-num n)
  (define (collatz-iter n m)
    (cond
      ((= n 1)
       m)
      ((even? n)
       (collatz-iter (/ n 2) (+ m 1)))
      (else
       (collatz-iter (+ (* 3 n) 1) (+ m 1)))))
  (collatz-iter n 1))
(define (collatz-iter n m)
  (cond
    ((= n 1)
     m)
    ((even? n)
     (collatz-iter (/ n 2) (+ m 1)))
    (else
     (collatz-iter (+ (* 3 n) 1) (+ m 1)))))
(define (euler14 n limit)
  (define (help-iter m len n limit)
    (let ((collatz (collatz-iter n 1)))
      (cond
        ((> n limit)
         (list m len))
        ((> collatz len)
         (help-iter n collatz (+ n 2) limit))
        (else
         (help-iter m len (+ n 2) limit)))))
  (help-iter 0 0 n limit))
For collatz-iter:
> (time (euler14 1 1000000))
cpu time: 1596 real time: 1596 gc time: 0
For collatz-num:
> (time (euler14 1 1000000))
cpu time: 1787 real time: 1789 gc time: 0
My questions:
How big is the cost of passing arguments in Scheme?
In euler14, I pass limit as an argument to help-iter; will that save some time? I have seen somewhere that free variables have a cost.
Maybe I am being too stingy about this.
Again: this is very implementation-specific!
I tested your code, and since it consumes memory and does a lot of computation, the order in which I ran the two versions interfered with the second result. So I put the two tests in separate files and ran each 40 times, comparing average running times. The difference was num: 1059.75 ms vs. iter: 1018.85 ms, i.e. 4% on average, but it might as well be 12% when picking two individual samples. I'd guess the running time of the same program can differ by more than 4% between runs on average, so the difference between these is irrelevant in a single run.
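Averaging several separate runs, as done above, is the standard guard against this kind of interference; a minimal Python sketch of the same methodology (the workload is an arbitrary stand-in):

```python
import timeit

def work():
    # Arbitrary stand-in workload.
    return sum(i * i for i in range(10_000))

# Run the workload several times and average, instead of trusting one sample.
runs = timeit.repeat(work, number=1, repeat=5)
avg = sum(runs) / len(runs)

assert work() == 333283335000  # sum of i^2 for i < 10000
assert min(runs) <= avg <= max(runs)
```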
You have an extra application in your code, so to check how much impact an argument has, I made this little test. The arguments are used in the base case so that Scheme won't optimize them away:
(define (three-params x y z)
  (if (zero? x)
      (cons y z)
      (three-params (- x 1) x x)))

(define (two-params x y)
  (if (zero? x)
      (cons x y)
      (two-params (- x 1) x)))

(define (one-param x)
  (if (zero? x)
      (cons x x)
      (one-param (- x 1))))
(define numtimes 100000000)
(time (one-param numtimes))
(time (two-params numtimes #f))
(time (three-params numtimes #f #f))
Comment out two of the last three lines and make three files. Compile them if you can.
Then take the average of several runs. I chose 50, and in Ikarus I get the following averages:
        avg ms    diff ms
one     402.40
two     417.82    15.42
three   433.14    15.32
two2    551.38    133.56   (with an extra addition compared to two)
Just looking at the 50 results I see that the times overlap between one, two, and three, but statistically it looks like each extra argument costs about 0.15 ns. if, zero?, -, cons, and GC cost a lot more, even though they are primitives. Neither extra applications nor extra arguments should be your concern. Sometimes a different implementation optimizes your code differently, so changing from Racket to Ikarus, Gambit, or Chicken might improve your code in production.
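The shape of that experiment translates directly to other languages; a Python sketch (small n, since Python lacks tail-call elimination, and any timings here are illustrative only):

```python
import timeit

# Same trick as above: the base case uses the extra arguments
# so a clever compiler cannot drop them.
def one_param(x):
    return (x, x) if x == 0 else one_param(x - 1)

def two_params(x, y):
    return (x, y) if x == 0 else two_params(x - 1, x)

def three_params(x, y, z):
    return (y, z) if x == 0 else three_params(x - 1, x, x)

n = 500  # stay well under Python's default recursion limit

t1 = timeit.timeit(lambda: one_param(n), number=1000)
t2 = timeit.timeit(lambda: two_params(n, None), number=1000)
t3 = timeit.timeit(lambda: three_params(n, None, None), number=1000)

# The base-case results mirror the Scheme version.
assert one_param(n) == (0, 0)
assert two_params(n, None) == (0, 1)
assert three_params(n, None, None) == (1, 1)
```

As in the Scheme measurements, the per-argument difference between t1, t2, and t3 is tiny compared to run-to-run noise, which is the point of the original answer.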
This seems slow:
(time (doall (map + (range 1000000) (range 1000000))))
"Elapsed time: 13951.664454 msecs"
How to do it faster?
For starters, range does not make an array, it makes a lazy-seq.
The fastest way to add two collections of numbers is probably going to involve having them in arrays first, and doing an iterative loop instead of a map.
user> (time (let [a (int-array (range 1000000))
b (int-array (range 1000000))]
(dotimes [i 1000000]
(aset a i (+ (aget b i) (aget a i))))
a))
"Elapsed time: 771.100395 msecs"
#<int[] [I#4233eba0>
user>
Note this still has the overhead of creating and realizing the lazy seqs from the two range calls; in actual use you would likely already have that data constructed before getting to the summation step.
Unless this is a performance bottleneck in your code, doing things this way would imply you shouldn't be using Clojure in the first place. The advantage of using Clojure is that you get high-level immutable data structures, which lead to referentially transparent and parallelizable code. Once you drop down to raw JVM types like arrays, you lose these advantages (in exchange for better performance).
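The same trade-off exists in most high-level languages; a small Python sketch contrasting the high-level pipeline with an in-place mutable-array loop (sizes reduced for brevity):

```python
import array

n = 100_000
a = list(range(n))
b = list(range(n))

# High-level version: allocates a brand-new result list.
summed = list(map(lambda x, y: x + y, a, b))

# Low-level version: preconstructed typed arrays, mutated in place.
aa = array.array('l', range(n))
ba = array.array('l', range(n))
for i in range(n):
    aa[i] += ba[i]

assert summed[123] == aa[123] == 246
assert summed[n - 1] == 2 * (n - 1)
```

The mutable version trades away the ability to share or reuse the original data, which is exactly the loss of referential transparency the answer describes.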
You might be interested in Prismatic's open-source array-processing library HipHip, which "combines Clojure's expressiveness with the fastest math Java has to offer".
I just had a quick go with it and it does seem to offer a nice compromise between expressiveness and performance:
Note: I'm using Criterium to benchmark this as it reduces some of the problems with benchmarking on the JVM.
(require '[criterium.core :refer [quick-bench]])
(quick-bench (doall (map + (range 1000000) (range 1000000))))
;=> "Execution time mean : 791.955406 ms"
(require '[hiphip.int :as h])
(quick-bench (h/amap [x (h/amake [i 1000000] i)
                      y (h/amake [i 1000000] i)]
               (+ x y)))
;=> "Execution time mean : 20.540645 ms"
I am performing element-wise operations on two vectors on the order of 50,000 elements in size, and I'm getting unsatisfactory performance (a few seconds). Are there any obvious improvements to be made, such as using a different data structure?
(defn boolean-compare
  "Sum 1 for each mismatched pair, 0 otherwise"
  [proposal-img data-img]
  (sum
    (map
      #(Math/abs (- (first %) (second %)))
      (partition 2 (interleave proposal-img data-img)))))
Try this:
(apply + (map bit-xor proposal-img data-img))
Some notes:
Mapping a function over several collections uses an element from each as the arguments to the function - no need to interleave and partition for this.
If your data is 1s and 0s, then xor will be faster than absolute difference.
Timed example:
(def data-img (repeatedly 50000 #(rand-int 2)))
(def proposal-img (repeatedly 50000 #(rand-int 2)))
(def sum (partial apply +))
After warming up the JVM...
(time (boolean-compare proposal-img data-img))
;=> "Elapsed time: 528.731093 msecs"
;=> 24802
(time (apply + (map bit-xor proposal-img data-img)))
;=> "Elapsed time: 22.481255 msecs"
;=> 24802
You should look at adopting core.matrix if you are interested in good performance for large vector operations.
In particular, the vectorz-clj library (a core.matrix implementation) has some very fast implementations for most common vector operations with double values.
(require '[clojure.core.matrix :refer [array sub esum]])

(def v1 (array (repeatedly 50000 #(rand-int 2))))
(def v2 (array (repeatedly 50000 #(rand-int 2))))

(time (let [d (sub v2 v1)] ;; take the difference of the two vectors
        (.abs d)           ;; calculate absolute value (mutates d)
        (esum d)))         ;; sum the elements and return the result
=> "Elapsed time: 0.949985 msecs"
=> 24980.0
i.e. under 20 ns per pair of elements - that's pretty quick: you'd be hard-pressed to beat that without resorting to low-level array-fiddling code.