(defun billion-test ()
(setq i 0)
(loop while (< i 100) do
(setq i (+ i 1))))
(billion-test)
(print "done")
I have the above Lisp code that simply loops to one billion. The problem is it is really
slow. Slower than any trivial program I have ever written. These are the times it took to
run with the intepreters I have(gcl and clisp).
Compiled Uncompiled
GNU Common Lisp(gcl) 270-300s 900-960s
Clisp 280-300s 960-1647s
I used this Python code to time the time for Clisp and an approximated using the system time
with gcl since you can't run it from the command prompt.
import sys
import time
import os
start=time.time()
os.system(" ".join(sys.argv[1:]))
stop=time.time()
print "\n%.4f seconds\n"%(stop-start)
Here are the comparison with while loops from other languages:
Kawa scheme 220.3350s
Petite chez 112.827s
C# 1.9130s
Ruby 31.045s
Python 116.8600s 113.7090s(optimized)
C 2.8240s 0.0150s(optimized)
lua 84.6970s
My assumption is that loop while <condition> do is the Lisp equivalent of a while
loop. I have some doubts about those 1647s(25+ min), I was watching something at that
time and it might have slowed the execution, but by almost 800s? I don't know.
These results are hard to believe. According to Norvig Lisp
is 3 to 85 times faster than Python. Judging from what I got, the most logical
explanation for such slow execution is that Clisp and gcl in Windows have some kind
of bug that slows down large iterations. How, you ask, I don't know?
Sooo, my question is, why so slow?
Is anybody else getting something like this?
UPDATE 1:
I ran Joswigs's program and got these results:
compiled uncompiled
gcl 0.8s 12mins
clisp 5mins 18mins
gcl compiled the program fine, clisp however gave this warning:
;; Compiling file C:\mine\.cl\test.cl ...
WARNING: in BILLION-TEST in lines 1..8 : FIXNUM-SAFETY is not a
valid OPTIMIZE quality.
0 errors, 1 warning
;; Wrote file C:\mine\.cl\test.fas
;; clisp
[2]> (type-of 1000000000)
(INTEGER (16777215))
;;gcl
(type-of 1000000000)
FIXNUM
Guess that could be the reason it took more than a minute.
UPDATE 2:
I thought I would give it another try with another implementation just to confirm
that it's really the bignum comparison that slowing it down. I obtained sbcl
for windows and ran the program again:
* (print most-positive-fixnum)
536870911
* (compile-file "count-to-billion.cl")
; compiling file "C:/mine/.cl/count-to-billion.cl"
(written 09 OCT 2013 04:28:24 PM):
; compiling (DEFUN BILLION-TEST ...)
; file: C:/mine/.cl/count-to-billion.cl
; in: DEFUN BILLION-TEST
; (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0) (FIXNUM-SAFETY 0))
;
; caught WARNING:
; Ignoring unknown optimization quality FIXNUM-SAFETY in:
; (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0) (FIXNUM-SAFETY 0))
* (load "count-to-billion")
I wish I could tell you how long it took but I never saw the end of it. I waited for
2 hours, watched an episode of Vampire Diaries(hehe) and it still hadn't finished.
I was expecting it to be faster than Clisp since its MOST-POSITIVE-FIXNUM is, well, more
positive. I'm vouching for the slow implementation point because only gcl could pull
off the less than one minute run.
Running Rörd's code with gcl:
(time (loop with i = 0 while (< i 1000000000) do (incf i)))
gcl with Rords's code:
>(load "count-to-billion.cl")
Loading count-to-billion.cl
real-time : 595.667 secs
run time : 595.667 secs
>(compile-file "count-to-billion.cl")
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling count-to-billion.cl.
#p"count-to-billion.o"
>(load "count-to-billion")
Loading count-to-billion.o
real time : 575.567 secs
run time : 575.567 secs
start address -T 1020e400 Finished loading count-to-billion.o
48
UPDATE 3:
This is the last one, I promise. I tried Rords other code:
(defun billion-test ()
(loop with i fixnum = 0
while (< i 1000000000) do (incf i)))
and suprisingly, it runs as fast Joswig's the difference being the keywords fixnum and
with:
gcl's output:
real time : 0.850 secs
run time : 0.850 secs
sbcl's output(ran for about half a second and spat this out):
debugger invoked on a TYPE-ERROR in thread
#<THREAD "main thread" RUNNING {23FC3A39}>:
The value 536870912 is not of type FIXNUM.
clisp's output:
Real time: 302.82532 sec.
Run time: 286.35544 sec.
Space: 11798673420 Bytes
GC: 21413, GC time: 64.47521 sec.
NIL
start up time
undeclared variable
global variable
no type declarations
compiler not told to optimize
on 32bit machines/implementations 1000000000 is possibly not a fixnum, see the variable MOST-POSITIVE-FIXNUM
possibly < comparison with a bignum on 32bit machines -> better to count to 0
slow implementation
A 64bit Common Lisp should have larger fixnums and we can use simple fixnum computations.
On a 64bit LispWorks on a MacBook Air laptop with 2 Ghz Intel i7 I get unoptimized code to run in slightly under 2 seconds. If we add declarations, it gets a bit faster.
(defun billion-test ()
(let ((i 0))
(declare (fixnum i)
(optimize (speed 3) (safety 0) (debug 0))
(inline +))
(loop while (< i 1000000000) do
(setq i (+ i 1)))))
CL-USER 7 > (time (billion-test))
Timing the evaluation of (BILLION-TEST)
User time = 0.973
System time = 0.002
Elapsed time = 0.958
Allocation = 154384 bytes
0 Page faults
NIL
A 64bit SBCL needs 0.3 seconds. So it is even faster.
With GCL you should be able to get better results on a 32bit machine. Here I use GCL on a 32bit ARM processor (Samsung Exynos 5410). A billion is with GCL on the ARM machine still a fixnum.
>(type-of 1000000000)
FIXNUM
>(defun billion-test ()
(let ((i 0))
(declare (fixnum i)
(optimize (speed 3) (safety 0) (debug 0))
(inline +))
(loop while (< i 1000000000) do
(setq i (+ i 1)))))
BILLION-TEST
>(compile *)
Compiling /tmp/gazonk_23351_0.lsp.
Warning:
The OPTIMIZE quality DEBUG is unknown.
End of Pass 1.
End of Pass 2.
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_23351_0.lsp.
Loading /tmp/gazonk_23351_0.o
start address -T 0x7a36f0 Finished loading /tmp/gazonk_23351_0.o
#<compiled-function BILLION-TEST>
NIL
NIL
Now you can see that GCL is also quite fast, even on a slower ARM processor:
>(time (billion-test))
real time : 0.639 secs
run-gbc time : 0.639 secs
child run time : 0.000 secs
gbc time : 0.000 secs
NIL
Related
I am optimizing a program I've been working on, and have hit a wall. The function julia-subrect maps over for-each-pixel a large number of times. I've optimized for-each-pixel to have a ~16x speedup. However, my optimized version of julia-subrect shows no evidence of this. Here are my benchmarks and relevant code:
; ======== Old `for-each-pixel` ========
;(bench (julia/for-each-pixel (->Complex rc ic) max-itrs radius r-min x-step y-step [xt yt])))
;Evaluation count : 3825300 in 60 samples of 63755 calls.
;Execution time mean : 16.018466 µs
; ======== New `for-each-pixel`. optimized 16x. ========
;(bench (julia/for-each-pixel-opt [rc ic] [max-itrs radius r-min] [x-step y-step] [xt yt])))
;Evaluation count : 59542860 in 60 samples of 992381 calls.
;Execution time mean : 1.038955 µs
(defn julia-subrect [^Long start-x ^Long start-y ^Long end-x ^Long end-y ^Long total-width ^Long total-height ^Complex constant ^Long max-itrs]
(let [grid (for [y (range start-y end-y)]
(vec (for [x (range start-x end-x)]
[x y])))
radius (calculate-r constant)
r-min (- radius)
r-max radius
x-step (/ (Math/abs (- r-max r-min)) total-width)
y-step (/ (Math/abs (- r-max r-min)) total-height)
; Uses old implementation of `for-each-pixel`
calculate-pixel (partial for-each-pixel constant max-itrs radius r-min x-step y-step)
for-each-row (fn [r] (map calculate-pixel r))]
(map for-each-row grid)))
; ======== Old `julia-subrect` ========
;(bench (doall (julia/julia-subrect start-x start-y end-x end-y total-width total-height c max-itrs))))
;Evaluation count : 22080 in 60 samples of 368 calls.
;Execution time mean : 2.746852 ms
(defn julia-subrect-opt [[^long start-x ^long start-y ^long end-x ^long end-y] [^double rc ^double ic] total-width total-height max-itrs ]
(let [grid (for [y (range start-y end-y)]
(vec (for [x (range start-x end-x)]
[x y])))
radius (calculate-r-opt rc ic)
r-min (- radius)
r-max radius
x-step (/ (Math/abs (- r-max r-min)) total-width)
y-step (/ (Math/abs (- r-max r-min)) total-height)
;Uses new implementation of `for-each-pixel`
calculate-pixel (fn [px] (for-each-pixel-opt [rc ic] [max-itrs radius r-min] [x-step y-step] px))
for-each-row (fn [r] (map calculate-pixel r))]
(map for-each-row grid)))
; ======== New `julia-subrect`, but no speedup ========
;(bench (doall (julia/julia-subrect-opt [start-x start-y end-x end-y] [rc ic] total-width total-height max-itrs))))
;Evaluation count : 21720 in 60 samples of 362 calls.
;Execution time mean : 2.831553 ms
Here is a gist containing source code for all the functions I've specified:
https://gist.github.com/johnmarinelli/adc5533c19fb0b6d74cf4ef04ae55ee6
So, can anyone tell me why julia-subrect is showing no signs of speedup? Also, I'm still new to clojure so bear with me if the code is unidiomatic/ugly. Right now, I'm focusing on making the program run quicker.
As a general guideline:
profile!
actually get around to profiling, like for real ;-)
remove reflection (looks like you did this)
split the operations into easy to think about functions
remove laziness (transducers should be the last step in this part)
combine steps using loop/recur to make your code impossible to figure out and slightly faster (this is the last step for a reason)
Specifically thinking about the code you posted:
At a glance, it looks like this function will spend much of it's time generating a lazy list of value in the for loop which are then immediately realized (evaluated to no longer be lazy) so the time spent generating that structure is wasted. You may consider changing this to produce vectors directly, mapv is useful for this.
The second part is the call to map in for-each-row which will produce a lot of intermediate data structures. For that one you may consider using a non-lazy expression like mapv or loop/recur.
It looks like you have done steps 2-4 already, and there is no obvious reason for you to skip to step seven. I'd spend the next couple hours on limiting laziness and if you have to, learning about transducers.
I've started using the clojure core.async library. I found the concepts of CSP, channels, go blocks really easy to use. However, I'm not sure if I'm using them right. I've got the following code -
(def x-ch (chan))
(def y-ch (chan))
(def w1-ch (chan))
(def w2-ch (chan))
; they all return matrices
(go (>! x-ch (Mat/* x (map #(/ 1.0 %) (max-fold x)))))
(go (>! y-ch (Mat/* y (map #(/ 1.0 %) (max-fold y)))))
(go (>! w1-ch (gen-matrix 200 300)))
(go (>! w2-ch (gen-matrix 300 100)))
(let [x1 (<!! (go (<! x-ch)))
y1 (<!! (go (<! y-ch)))
w1 (<!! (go (<! w1-ch)))
w2 (<!! (go (<! w2-ch)))]
;; do stuff w/ x1 y1 w1 w2
)
I've got predefined (matrix) vectors in symbols x and y. I need to modify both vectors before I use them. Those vectors are pretty large. I also need to generate two random matrices. Since go macro starts the computation asyncronously, I split all four computation tasks into separate go blocks and put the consequent result into channels. Then I've got a let block where I take values from the channels and store them into symbols. They are all using blocking <!! take functions since they're on the main thread.
What I'm trying to do basically is speed up my computation time by splitting program fragments into async processes. Is this the right way to do it?
For this kind of processing, future may be slightly more adequate.
The example from the link is simple to grasp:
(def f
(future
(Thread/sleep 10000)
(println "done")
100))
The processing, the future block is started immediately, so the above does start a thread, wait for 10s and prints "done" when finished.
When you need the value you can just use:
(deref f)
; or #f
Which will block and return the value of the code block of the future.
In the same example, if you call deref before the 10 seconds have gone, the call will block until the computation is finished.
In your example, since you are just waiting for computations to finish, and are not so much concern about messages and interactions between the channel participants future is what I would recommend. So:
(future
(Mat/* x (map #(/ 1.0 %) (max-fold x))))
go blocks return a channel with the result of the expression, so you don't need to create intermediate channels for their results. The code below lets you kick off all 4 calculations at the same time, and then block on the values until they return. If you don't need some of the results straight away, you could block on the value only when you actually use it.
(let [x1-ch (go (Mat/* x (map #(/ 1.0 %) (max-fold x))))
y1-ch (go (Mat/* y (map #(/ 1.0 %) (max-fold y))))
w1-ch (go (gen-matrix 200 300))
w2-ch (go (gen-matrix 300 100))
x1 (<!! x1-ch)
y1 (<!! y1-ch)
w1 (<!! w1-ch)
w2 (<!! w2-ch)]
;; do stuff w/ x1 y1 w1 w2
)
If you're looking to speed up your program more generally by running code in parallel, then you could look at using Clojure's Reducers, or Aphyr's Tesser. These work by splitting up the work on a single computation into parallelisable parts, then combining them together. These will efficiently run the work over as many cores as your computer has. If you run each of your computations with a future or in a go block, then each computation will run on a single thread, some may finish before others and those cores will be idle.
Both the below functions go from 2 to (sqrt n), and both stop as soon as it is detected if n is non-prime
(defn is-prime-for? [n]
(empty? (for [i (range 2 (math/sqrt (inc n)))
:when (= 0 (rem n i))]
i)))
(defn is-prime-loop? [n]
(loop [i 2]
(cond (> i (math/sqrt (inc n))) true
(zero? (rem n i)) false
:else (recur (inc i)))))
Then why do we see the drastic performance difference b/n them? the "loop" version takes almost 4 times as much time (on my desktop)
project-euler.prob010> (time (dorun (map is-prime-for? (range 200000))))
"Elapsed time: 3267.613099 msecs"
;; => nil
project-euler.prob010> (time (dorun (map is-prime-loop? (range 200000))))
"Elapsed time: 12961.190032 msecs"
;; => nil
Such micro-benchmarks are usually pointless since they don't account for a variety of factors that can influence performance of a specific piece of code (e.g. JVM warmup, optimizations, ...). You should use a benchmarking library like criterium if you want to attain reliable results.
That being said, your two versions have a few major differences that will be reflected in the results:
for creates a lazy sequence whose maintenance cost is higher than what is done in loop/recur.
the loop version calculates (Math/sqrt (inc n)) on every iteration, the for version only once.
zero? has one level of indirection more than (= 0 ...).
Obviously, the compiler might be able to optimize these away, but there are many more factors that can change the outcome (Java version, OpenJDK vs. Oracle, Clojure version, ...). So, here the results of my benchmark run using Clojure 1.6.0 on Oracle JDK 1.7.0_67:
(criterium.core/quick-bench (mapv is-prime-for? (range 200000)))
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 1.942423 sec
Execution time std-deviation : 36.768207 ms
Execution time lower quantile : 1.912171 sec ( 2.5%)
Execution time upper quantile : 1.984463 sec (97.5%)
Overhead used : 8.986692 ns
(criterium.core/quick-bench (mapv is-prime-loop? (range 200000)))
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 724.077492 ms
Execution time std-deviation : 5.695680 ms
Execution time lower quantile : 716.547992 ms ( 2.5%)
Execution time upper quantile : 730.173992 ms (97.5%)
Overhead used : 8.986692 ns
So, on my machine, the loop version is about 3x faster than the for one.
I've been playing around with the Is Clojure is Still Fast? (and prequel Clojure is Fast) code. It seemed unfortunate that inlining the differential equation (f) is one of the steps taken to improving performance. The cleanest/fastest thing I've been able to come up without doing this is the following:
; As in the referenced posts, for giving a rough measure of cycles/iteration (I know this is a very rough
; estimate...)
(def cpuspeed 3.6) ;; My computer runs at 3.6 GHz
(defmacro cyclesperit [expr its]
`(let [start# (. System (nanoTime))
ret# ( ~#expr (/ 1.0 ~its) ~its )
finish# (. System (nanoTime))]
(println (int (/ (* cpuspeed (- finish# start#)) ~its)))))
;; My solution
(defn f [^double t ^double y] (- t y))
(defn mysolveit [^double t0 ^double y0 ^double h ^long its]
(if (> its 0)
(let [t1 (+ t0 h)
y1 (+ y0 (* h (f t0 y0)))]
(recur t1 y1 h (dec its)))
[t0 y0 h its]))
; => 50-55 cycles/it
; The fastest solution presented by the author (John Aspden) is
(defn faster-solveit [^double t0 ^double y0 ^double h ^long its]
(if (> its 0)
(let [t1 (+ t0 h)
y1 (+ y0 (* h (- t0 y0)))]
(recur t1 y1 h (dec its)))
[t0 y0 h its]))
; => 25-30 cycles/it
The type hinting in my solution helps quite a bit (it's 224 cycles/it without type hinting on either f or solveit), but it's still nearly 2x slower than the inlined version. Ultimately this performance is still pretty decent, but this hit is unfortunate.
Why is there such a performance hit for this? Is there a way around it? Are there plans to find ways of improvingthis? As pointed out by John in the original post, it seems funny/unfortunate for function calls to be inefficient in a functional language.
Note: I'm running Clojure 1.5 and have :jvm-opts ^:replace [] in a project.clj file so that I can use lein exec/run without it slowing things down (and it will if you don't do this I discovered...)
Benchmarking in the presence of a JIT compiler is tricky; you really must allow for a warm-up period, but then you also can't just run it all in a loop, since it may then be proved a no-op and optimized away. In Clojure, the usual solution is to use Hugo Duncan's Criterium.
Running a Criterium benchmark for (solveit 0.0 1.0 (/ 1.0 1000000) 1000000) for both versions of solveit results in pretty much exactly the same timings on my machine (mysolveit ~3.44 ms, faster-solveit ~3.45 ms). That's in a 64-bit JVM run with -XX:+UseConcMarkSweepGC, using Criterium 0.4.2 (criterium.core/bench). Presumably HotSpot just inlines f. In any case, there's no performance hit at all.
Adding to the already good answers, the JVM JIT most often does inline the primitive function calls when warmed up, and in this case, when you bench it with a warmed JIT you see the same results. Just wanted to say Clojure also has an inlining feature though for cases where that yields benefits.
(defn f
{:inline-arities #{2}
:inline (fn [t y] `(- (double ~t) (double ~y)))}
^double [^double t ^double y]
(- t y))
Now Clojure will compile away the calls to f, inlining the function at compile time. Whereas the JIT will inline the function at runtime as needed otherwise.
Also note that I added a ^double type hint to the return of f, if you don't do that, it gets compiled to return Object, and a cast needs to be added, I'm not sure if that really affects performance much, but if you want a fully primitive function that takes primitives and return primitives you need to type hint the return as well.
Is there a good algorithm to calculate the cartesian product of three seqs concurrently in Clojure?
I'm working on a small hobby project in Clojure, mainly as a means to learn the language, and its concurrency features. In my project, I need to calculate the cartesian product of three seqs (and do something with the results).
I found the cartesian-product function in clojure.contrib.combinatorics, which works pretty well. However, the calculation of the cartesian product turns out to be the bottleneck of the program. Therefore, I'd like to perform the calculation concurrently.
Now, for the map function, there's a convenient pmap alternative that magically makes the thing concurrent. Which is cool :). Unfortunately, such a thing doesn't exist for cartesian-product. I've looked at the source code, but I can't find an easy way to make it concurrent myself.
Also, I've tried to implement an algorithm myself using map, but I guess my algorithmic skills aren't what they used to be. I managed to come up with something ugly for two seqs, but three was definitely a bridge too far.
So, does anyone know of an algorithm that's already concurrent, or one that I can parallelize myself?
EDIT
Put another way, what I'm really trying to achieve, is to achieve something similar to this Java code:
for (ClassA a : someExpensiveComputation()) {
for (ClassB b : someOtherExpensiveComputation()) {
for (ClassC c : andAnotherOne()) {
// Do something interesting with a, b and c
}
}
}
If the logic you're using to process the Cartesian product isn't somehow inherently sequential, then maybe you could just split your inputs into halves (perhaps splitting each input seq in two), calculate 8 separate Cartesian products (first-half x first-half x first-half, first-half x first-half x second-half, ...), process them and then combine the results. I'd expect this to give you quite a boost already. As for tweaking the performance of the Cartesian product building itself, I'm no expert, but I do have some ideas & observations (one needs to calculate a cross product for Project Euler sometimes), so I've tried to summarise them below.
First of all, I find the c.c.combinatorics function a bit strange in the performance department. The comments say it's taken from Knuth, I believe, so perhaps one of the following obtains: (1) it would be very performant with vectors, but the cost of vectorising the input sequences kills its performance for other sequence types; (2) this style of programming doesn't necessarily perform well in Clojure in general; (3) the cumulative overhead incurred due to some design choice (like having that local function) is large; (4) I'm missing something really important. So, while I wouldn't like to dismiss the possibility that it might be a great function to use for some use cases (determined by the total number of seqs involved, the number of elements in each seq etc.), in all my (unscientific) measurements a simple for seems to fare better.
Then there are two functions of mine, one of which is comparable to for (somewhat slower in the more interesting tests, I think, though it seems to be actually somewhat faster in others... can't say I feel prepared to make a fully educated comparison), the other apparently faster with a long initial input sequence, as it's a restricted functionality parallel version of the first one. (Details follow below.) So, timings first (do throw in the occasional (System/gc) if you care to repeat them):
;; a couple warm-up runs ellided
user> (time (last (doall (pcross (range 100) (range 100) (range 100)))))
"Elapsed time: 1130.751258 msecs"
(99 99 99)
user> (time (last (doall (cross (range 100) (range 100) (range 100)))))
"Elapsed time: 2428.642741 msecs"
(99 99 99)
user> (require '[clojure.contrib.combinatorics :as comb])
nil
user> (time (last (doall (comb/cartesian-product (range 100) (range 100) (range 100)))))
"Elapsed time: 7423.131008 msecs"
(99 99 99)
;; a second time, as no warm-up was performed earlier...
user> (time (last (doall (comb/cartesian-product (range 100) (range 100) (range 100)))))
"Elapsed time: 6596.631127 msecs"
(99 99 99)
;; umm... is syntax-quote that expensive?
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
`(~x ~x ~x)))))
"Elapsed time: 11029.038047 msecs"
(99 99 99)
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
(list x y z)))))
"Elapsed time: 2597.533138 msecs"
(99 99 99)
;; one more time...
user> (time (last (doall (for [x (range 100)
y (range 100)
z (range 100)]
(list x y z)))))
"Elapsed time: 2179.69127 msecs"
(99 99 99)
And now the function definitions:
(defn cross [& seqs]
(when seqs
(if-let [s (first seqs)]
(if-let [ss (next seqs)]
(for [x s
ys (apply cross ss)]
(cons x ys))
(map list s)))))
(defn pcross [s1 s2 s3]
(when (and (first s1)
(first s2)
(first s3))
(let [l1 (count s1)
[half1 half2] (split-at (quot l1 2) s1)
s2xs3 (cross s2 s3)
f1 (future (for [x half1 yz s2xs3] (cons x yz)))
f2 (future (for [x half2 yz s2xs3] (cons x yz)))]
(concat #f1 #f2))))
I believe that all versions produce the same results. pcross could be extended to handle more sequences or be more sophisticated in the way it splits its workload, but that's what I came up with as a first approximation... If you do test this out with your programme (perhaps adapting it to your needs, of course), I'd be very curious to know the results.
'clojure.contrib.combinatorics has a cartesian-product function.
It returns a lazy sequence and can cross any number of sequences.