How can I make this clojure code run faster? - performance

I have a version implemented in Lisp(SBCL) which runs under 0.001 seconds with 12 samples. However this version(in clojure) takes more than 1.1 secs.
What should I do for making this code run as fast as original Lisp version?
To make it sure, my numbers are not including times for starting repl and others. And is from time function of both sbcl and clojure.(Yes, my laptop is rather old atom based one)
And this application is/will be used in the repl, not as compiled in single app, so running thousand times before benchmarking seems not meaningful.
Oh, the fbars are like this: [[10.0 10.5 9.8 10.1] [10.1 10.8 10.1 10.7] ... ], which is
Open-High-Low-Close price bars for the stocks.
(defn- build-new-smpl [fmx fmn h l c o]
(let [fmax (max fmx h)
fmin (min fmn l)
fc (/ (+ c fmax fmin) 3.0)
fcd (Math/round (* (- fc o) 1000))
frd (Math/round (* (- (* 2.0 c) fmax fmin) 1000))]
(if (and (> fcd 0) (> frd 0))
[1 fmax fmin]
(if (and (< fcd 0) (< frd 0))
[-1 fmax fmin]
[0 fmax fmin]))))
(defn binary-smpls-using [fbars]
(let [fopen (first (first fbars))]
(loop [fbars fbars, smpls [], fmax fopen, fmin fopen]
(if (> (count fbars) 0)
(let [bar (first fbars)
[_ h l c _] bar
[nsmpl fmx fmn] (build-new-smpl fmax fmin h l c fopen)]
(recur (rest fbars) (conj smpls nsmpl) fmx fmn))
smpls))))
================================================
Thank you. I managed to make the differences for 1000 iteration as 0.5 secs (1.3 secs on SBCL and 1.8 on Clojure). Major factor is I should have created fbars as not lazy but as concrete(?) vector or array, and this solves my problem.

You need to use a proper benchmarking library; the standard Clojure solution is Hugo Duncan's Criterium.
The reason is that code on the JVM starts running in interpreted mode and then eventually gets compiled by the JIT compiler; it's the steady-state behaviour after JIT compilation that you want to benchmark and not behaviour during the profiling stage. This is, however, quite tricky, since the JIT compiler optimizes no-ops away where it can see them, so you need to make sure your code causes side effects that won't be optimized away, but then you still need to run it in a loop to obtain meaningful results etc. -- quick and dirty solutions just don't cut it. (See the Elliptic Group, Inc. Java benchmarking article, also linked to by Criterium's README, for an extended discussion of the issues involved.)
Cycling the two samples you listed in a vector of length 1000 results in a timing of ~327 µs in a Criterium benchmark on my machine:
(require '[criterium.core :as c])
(def v (vec (take 1000 (cycle [[10.0 10.5 9.8 10.1] [10.1 10.8 10.1 10.7]]))))
(c/bench (binary-smpls-using v))
WARNING: Final GC required 4.480116525558204 % of runtime
Evaluation count : 184320 in 60 samples of 3072 calls.
Execution time mean : 327.171892 µs
Execution time std-deviation : 3.129050 µs
Execution time lower quantile : 322.731261 µs ( 2.5%)
Execution time upper quantile : 333.117724 µs (97.5%)
Overhead used : 1.900032 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
A really good benchmark would actually involve an interesting dataset (all different samples, preferably coming from the real world).

When I run this with 1000 samples I get an answer in 46 ms, though here are some common tips on making clojure code faster:
turn on reflection warnings:
(set! warn-on-reflection true)
add type hints until the reflection warnings go away, which isn't a problem here
make it a lazy sequence so you don't have to construct huge sequences in ram, which results in lots of GC overhead. (in some cases this adds overhead though I think it's a decent idea here)
in cases like this see if incanter can do the job for you (though that may be cheating)
time the creation of the result excluding the time it takes the repl to print it.

Related

Clojure comp doesn't tail-call-optimize (can create StackOverflow exception)

I got stuck on a Clojure program handling a very large amount of data (image data). When the image was larger than 128x128, the program would crash with a StackOverflow exception. Because it worked for smaller images, I knew it wasn't an infinite loop.
There were lots of possible causes of high memory usage, so I spent time digging around. Making sure I was using lazy sequences correctly, making sure to use recur as appropriate, etc. The turning point came when I realized that this:
at clojure.core$comp$fn__5792.invoke(core.clj:2569)
at clojure.core$comp$fn__5792.invoke(core.clj:2569)
at clojure.core$comp$fn__5792.invoke(core.clj:2569)
referred to the comp function.
So I looked at where I was using it:
(defn pipe [val funcs]
((apply comp funcs) val))
(pipe the-image-vec
(map
(fn [index] (fn [image-vec] ( ... )))
(range image-size)))
I was doing per-pixel operations, mapping a function to each pixel to process it. Interestingly, comp doesn't appear to benefit from tail-call optimization, or do any kind of sequential application of functions. It seems that it was just composing things in the basic way, which when there are 65k functions, understandably overflows the stack. Here's the fixed version:
(defn pipe [val funcs]
(cond
(= (count funcs) 0) val
:else (recur ((first funcs) val) (rest funcs))))
recur ensures the recursion gets tail-call optimized, preventing a stack buildup.
If anybody can explain why comp works this way (or rather, doesn't work this way), I'd love to be enlightened.
First, a more straightforward MCVE:
(def fs (repeat 1e6 identity))
((apply comp fs) 99)
#<StackOverflowError...
But why does this happen? If you look at the (abridged) comp source:
(defn comp
([f g]
(fn
([x] (f (g x)))
([f g & fs]
(reduce1 comp (list* f g fs))))
You can see that the whole thing is basically just 2 parts:
The first argument overload that does the main work of wrapping each composed function call in another function.
Reducing over the functions using comp.
I believe the first point is the problem. comp works by taking the list of functions and continually wrapping each set of calls in functions. Eventually, this will exhaust the stack space if you try to compose too many functions, as it ends up creating a massive function that's wrapping many other functions.
So, why can TCO not help here? Because unlike most StackOverflowErrors, recursion is not the problem here. The recursive calls only ever reach one frame deep in the variardic case at the bottom. The problem is the building up of a massive function, which can't simply be optimizated away.
Why were you able to "fix" it though? Because you have access to val, so you're able to evaluate the functions as you go instead of building up one function to call later. comp was written using a simple implementation that works fine for most cases, but fails for extreme cases like the one you presented. It's fairly trivial to write a specialized version that handles massive collections though:
(defn safe-comp [& fs]
(fn [value]
(reduce (fn [acc f]
(f acc))
value
(reverse fs))))
Of course note though, this doesn't handle multiple arities like the core version does.
Honestly though, in 3 and a bit years of using Clojure, I've never once written (apply comp ...). While it's certainly possible you have experienced a case I've never needed to deal with, it's more likely that you're using the wrong tool for the job here. When this code is complete, post it on Code Review and we may be able to suggest better ways of accomplishing what you're trying to do.

Why does this Clojure Reducers r/fold provide no perf benefit?

I'm wondering why the code below provides no speedup in the case of r/fold? Am I misunderstanding something about reducers?
I'm running it on a pretty slow (although with 2 cores) Ubuntu 12.04 dev box, both through emacs and lein run, each with the same results.
(require '[clojure.core.reducers :as r])
(.. Runtime getRuntime availableProcessors)
;; 2
(let
[n 80000000
vs #(range n)]
(time (reduce + (vs)))
(time (r/fold + (vs)))
"Elapsed time: 26076.434324 msecs"
"Elapsed time: 25500.234034 msecs"
Thanks.
You are folding over a seq. Parallel fold only happens on persistent vectors and maps right now.
There are also all sorts of reasons why this kind of perf testing is inferior to using something like Criterium, but that's probably a separate discussion. (Some of the reasons are garbage collection, JVM warmup and inlining, funky default jvm settings on both Emacs and lein, boxed and checked math, etc.)
Still wrong for many of the above reasons but slightly more useful comparison:
(require '[clojure.core.reducers :as r])
(def v (vec (range 800000)))
(dotimes [_ 100] (time (reduce + v)))
(dotimes [_ 100] (time (r/fold + v)))
Watch the best times from each of the last 2 runs.

Efficiency: recursion vs loop

This is just curiosity on my part, but what is more efficient, recursion or a loop?
Given two functions (using common lisp):
(defun factorial_recursion (x)
(if (> x 0)
(* x (factorial_recursion (decf x)))
1))
and
(defun factorial_loop (x)
(loop for i from 1 to x for result = 1 then
(* result i) finally
(return result)))
Which is more efficient?
I don't even have to read your code.
Loop is more efficient for factorials. When you do recursion, you have up to x function calls on the stack.
You almost never use recursion for performance reasons. You use recursion to make the problem more simple.
Mu.
Seriously now, it doesn't matter. Not for examples this size. They both have the same complexity. If your code is not fast enough for you, this is probably one of the last places you'd look at.
Now, if you really want to know which is faster, measure them. On SBCL, you can call each function in a loop and measure the time. Since you have two simple functions, time is enough. If your program was more complicated, a profiler would be more useful. Hint: if you don't need a profiler for your measurements, you probably don't need to worry about performance.
On my machine (SBCL 64 bit), I ran your functions and got this:
CL-USER> (time (loop repeat 1000 do (factorial_recursion 1000)))
Evaluation took:
0.540 seconds of real time
0.536034 seconds of total run time (0.496031 user, 0.040003 system)
[ Run times consist of 0.096 seconds GC time, and 0.441 seconds non-GC time. ]
99.26% CPU
1,006,632,438 processor cycles
511,315,904 bytes consed
NIL
CL-USER> (time (loop repeat 1000 do (factorial_loop 1000)))
Evaluation took:
0.485 seconds of real time
0.488030 seconds of total run time (0.488030 user, 0.000000 system)
[ Run times consist of 0.072 seconds GC time, and 0.417 seconds non-GC time. ]
100.62% CPU
902,043,247 processor cycles
511,322,400 bytes consed
NIL
After putting your functions in a file with (declaim (optimize speed)) at the top, the recursion time dropped to 504 milliseconds and the loop time dropped to 475 milliseconds.
And if you really want to know what's going on, try dissasemble on your functions and see what's in there.
Again, this looks like a non-issue to me. Personally, I try to use Common Lisp like a scripting language for prototyping, then profile and optimize the parts that are slow. Getting from 500ms to 475ms is nothing. For instance, in some personal code, I got a couple of orders of magnitude speedup by simply adding an element type to an array (thus making the array storage 64 times smaller in my case). Sure, in theory it would have been faster to reuse that array (after making it smaller) and not allocate it over and over. But simply adding :element-type bit to it was enough for my situation - more changes would have required more time for very little extra benefit. Maybe I'm sloppy, but 'fast' and 'slow' don't mean much to me. I prefer 'fast enough' and 'too slow'. Both your functions are 'fast enough' in most cases (or both are 'too slow' in some cases) so there's no real difference between them.
If you can write recursive functions in such a way that the recursive call is the very last thing done (and the function is thus tail-recursive) and the language and compiler/interpreter you are using supports tail recursion, then the recursive function can (usually) be optimised into code that is really iterative, and is as fast as an iterative version of the same function.
Sam I Am is correct though, iterative functions are usually faster than their recursive counterparts. If a recursive function is to be as fast as an iterative function that does the same thing, you have to rely on the optimiser.
The reason for this is that a function call is much more expensive than a jump, plus you consume stack space, a (very) finite resource.
The function you give is not tail recursive because you call factorial_recursion and then you multiply it by x. An example of a tail-recursive version would be
(defun factorial-recursion-assist (x cur)
(if (> x 1)
(factorial-recursion-assist (- x 1) (+ cur (* (- x 1) x)))
cur))
(defun factorial-recursion (x)
(factorial-recursion-assist x 1))
(print (factorial-recursion 4))
Here's a tail-recursive factorial (I think):
(defun fact (x)
(funcall (alambda (i ret)
(cond ((> i 1)
(self (1- i) (* ret i)))
(t
ret)))
x 1))

Recency map in clojure using Newtonian cooling

I am building a system in Clojure that consumes events in real time and acts on them according to how many similar messages were received recently. I would like to implement this using a recency score based on Newtonian cooling.
In other words, when an event arrives, I want to be able to assign it a score between 1.0 (never happened before, or "ambient temperature" in Newton's equation) and 10.0 (hot hot hot, has occurred several times in the past minute).
I have a vague idea of what this data structure looks like - every "event type" is a map key, and every map value should contain some set of timestamps for previous events, and perhaps a running average of the current "heat" for that event type, but I can't quite figure out how to start implementing beyond that. Specifically I'm having trouble figuring out how to go from Newton's actual equation, which is very generic, and apply it to this specific scenario.
Does anyone have any pointers? Could someone suggest a simpler "recency score algorithm" to get me started, that could be replaced by Newtonian cooling down the road?
EDIT: Here is some clojure code! It refers to the events as letters but could obviously be repurposed to take any other sort of object.
(ns heater.core
(:require [clojure.contrib.generic.math-functions :as math]))
(def letter-recency-map (ref {}))
(def MIN-TEMP 1.0)
(def MAX-TEMP 10.0)
;; Cooling time is 15 seconds
(def COOLING-TIME 15000)
;; Events required to reach max heat
(def EVENTS-TO-HEAT 5.0)
(defn temp-since [t since now]
(+
MIN-TEMP
(*
(math/exp (/
(- (- now since))
COOLING-TIME))
(- t MIN-TEMP))))
(defn temp-post-event [temp-pre-event]
(+ temp-pre-event
(/
(- MAX-TEMP temp-pre-event)
EVENTS-TO-HEAT)))
(defn get-letter-heat [letter]
(dosync
(let [heat-record (get (ensure letter-recency-map) letter)]
(if (= heat-record nil)
(do
(alter letter-recency-map conj {letter {:time (System/currentTimeMillis) :heat 1.0}})
MIN-TEMP)
(let [now (System/currentTimeMillis)
new-temp-cooled (temp-since (:heat heat-record) (:time heat-record) now)
new-temp-event (temp-post-event new-temp-cooled)]
(alter letter-recency-map conj {letter {:time now :heat new-temp-event}})
new-temp-event)))))
In absence of any events, the solution to the cooling equation is exponential decay. Say T_0 is the temperature at the beginning of a cooling period, and dt is the timestep (computed from system time or whatever) since you evaluated the temperature to be T_0:
T_no_events(dt) = T_min + (T_0 - T_min)*exp(- dt / t_cooling)
Since your events are discrete impulses, and you have a maximum temperature, you want a given ratio per event:
T_post_event = T_pre_event + (T_max - T_pre_event) / num_events_to_heat
Some notes:
t_cooling is the time for things to cool off by a factor of 1/e = 1/(2.718...).
num_events_to_heat is the number of events required to have an effect comparable to T_max. It should probably be a moderately large positive value (say 5.0 or more?). Note that if num_events_to_heat==1.0, every event will reset the temperature to T_max, which isn't very interesting, so the value should at the very least be greater than one.
Theoretically, both heating and cooling should never quite reach the maximum and minimum temperatures, respectively (assuming parameters are set as above, and that you're starting somewhere in between). In practice, though, the exponential nature of the process should get close enough as makes no difference...
To implement this, you only need to store the timestamp and temperature of the last update. When you receive an event, do a cooling step, then a heating event, and update with the new temperature and timestamp.
Note that read-only queries don't require an update: you can just compute the cooling since the last update.

Generating list of million random elements

How to efficiently generate a list of million random elements in scheme? The following code hits maximum recursion depth with 0.1 million itself.
(unfold (lambda(x)(= x 1000000)) (lambda(x)(random 1000)) (lambda(x)(+ x 1)) 0)
It really depends on the system you're using, but here's a common way to do that in plain scheme:
(let loop ([n 1000000] [r '()])
(if (zero? n)
r
(loop (- n 1) (cons (random 1000) r))))
One note about running this code as is: if you just type it into a REPL, it will lead to printing the resulting list, and that will usually involve using much more memory than the list holds. So it's better to do something like
(define l ...same...)
There are many other tools that can be used to varying degrees of convenience. unfold is one of them, and another is for loops as can be found in PLT Scheme:
(for/list ([i (in-range 1000000)]) (random 1000))
I don't know much scheme but couldn't you just use tail-recursion (which is really just looping) instead of unfold (or any other higher-order function)?
Use the do-loop-construct as described here.
Some one correct me if I am wrong but the Fakrudeen's code should end up being optimized away since it is tail recursive. Or it should be with a proper implementation of unfold. It should never reach a maximum recursion depth.
What version of scheme are you using Fakrudeen?
DrScheme does not choke on a mere million random numbers.
Taking Chicken-Scheme as implementation, here is a try with some results.
(use srfi-1)
(use extras)
(time (unfold (lambda(x)(= x 1000000))
(lambda(x)(random 1000))
(lambda(x)(+ x 1)) 0))
(time (let loop ([n 1000000] [r '()])
(if (zero? n)
r
(loop (- n 1) (cons (random 1000) r)))))
(define (range min max body)
(let loop ((current min) (ret '()))
(if (= current max)
ret
(loop (+ current 1) (cons (body current ret) ret)))))
(time (range 0 1000000 (lambda params (random 1000))))
The results are here with csc -O3 t.scm
0.331s CPU time, 0.17s GC time (major), 12/660 GCs (major/minor)
0.107s CPU time, 0.02s GC time (major), 1/290 GCs (major/minor)
0.124s CPU time, 0.022s GC time (major), 1/320 GCs (major/minor)
As you can see, the version of the author is much more slowlier than using plain tail recursive calls. It's hard to say why the unfold call is much more slowlier but I'd guess that it's because it taking a lot more time doing function calls.
The 2 other versions are quite similar. My version is almost the same thing with the exception that I'm creating a high order function that can be reused.
Unlike the plain loop, it could be reused to create a range of function. The position and current list is sent to the function in case they are needed.
The higher order version is probably the best way to do even if it takes a bit more time to execute. It is probably also because of the function calls. It could be optimized by removing parameters and it will get almost as fast as the named let.
The advantage of the higher order version is that the user doesn't have to write the loop itself and can be used with an abstract lambda function.
Edit
Looking at this specific case. Ef we are to create a million of element ranged between 0 and 999, we could possibly create a fixed length vector of a million and with values from 0 to 999 in it. Shuffle the thing back after. Then the whole random process would depend on the shuffle function which should not have to create new memory swapping values might get faster than generating random numbers. That said, the shuffle method somewhat still rely on random.
Edit 2
Unless you really need a list, you could get away with a vector instead.
Here is my second implementation with vector-map
(time (vector-map (lambda (x y) (random 1000)) (make-vector 1000000)))
# 0.07s CPU time, 0/262 GCs (major/minor)
As you can see, it is terribly faster than using a list.
Edit 3 fun
(define-syntax bigint
(er-macro-transformer
(lambda (exp rename compare)
(let ((lst (map (lambda (x) (random 1000)) (make-list (cadr exp)))))
(cons 'list lst)))))
100000
0.004s CPU time, 0/8888 GCs (major/minor)
It's probably not a good idea to use this but I felt it might be interesting. Since it's a macro, it will get executed at compile time. The compile time will be huge, but as you can see, the speed improvement is also huge. Unfortunately using chicken, I couldn't get it to build a list of a million. My guess is that the type it might use to build the list is overflowing and accessing invalid memory.
To answer the question in the comments:
I'm not a Scheme professional. I'm pretty new to it too and as I understand, the named loop or the high order function should be the way to go. The high order function is good because it's reusable. You could define a
(define (make-random-list quantity maxran)
...)
Then thats the other interesting part, since scheme is all about high order functions. You could then replace the implementation of make-random-list with anything you like. If you need some compile time execution, define the macro otherwise use a function. All that really matters is to be able to reuse it. It has to be fast and not use memory.
Common sense tells you that doing less execution it will be faster, tail recursive calls aren't suppose to consume memory. And when you're not sure, you can hide implementation into a function that can be optimized later.
MIT Scheme limits a computation's stack. Given the size of your problem, you are likely running out of stack size. Fortunately, you can provide a command-line option to change the stack size. Try:
$ mit-scheme --stack <number-of-1024-word-blocks>
There are other command-line options, check out mit-scheme --help
Note that MIT Scheme, in my experience, is one of the few schemes that has a limited stack size. This explains why trying your code in others Schemes will often succeed.
As to your question of efficiency. The routine unfold is probably not implemented with a tail-recursive/iterative algorithm. Here is a tail recursive version with a tail recursive version of 'list reverse in-place':
(define (unfold stop value incr n0)
(let collecting ((n n0) (l '()))
(if (stop n)
(reverse! l)
(collecting (incr n) (cons (value n) l)))))
(define (reverse! list)
(let reving ((list list) (rslt '()))
(if (null? list)
rslt
(let ((rest (cdr list)))
(set-cdr! list rslt)
(reving rest list)))))
Note:
$ mit-scheme --version
MIT/GNU Scheme microcode 15.3
Copyright (C) 2011 Massachusetts Institute of Technology
This is free software; see the source for copying conditions. There is NO warranty; not even
for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Image saved on Tuesday November 8, 2011 at 10:45:46 PM
Release 9.1.1 || Microcode 15.3 || Runtime 15.7 || SF 4.41 || LIAR/x86-64 4.118 || Edwin 3.116
Moriturus te saluto.

Resources