Why does this Clojure Reducers r/fold provide no perf benefit?

Why does this Clojure Reducers r/fold provide no perf benefit? - performance

I'm wondering why the code below provides no speedup in the case of r/fold? Am I misunderstanding something about reducers?
I'm running it on a pretty slow (although with 2 cores) Ubuntu 12.04 dev box, both through emacs and lein run, each with the same results.
(require '[clojure.core.reducers :as r])
(.. Runtime getRuntime availableProcessors)
;; 2
(let
[n 80000000
vs #(range n)]
(time (reduce + (vs)))
(time (r/fold + (vs)))
"Elapsed time: 26076.434324 msecs"
"Elapsed time: 25500.234034 msecs"
Thanks.

You are folding over a seq. Parallel fold only happens on persistent vectors and maps right now.
There are also all sorts of reasons why this kind of perf testing is inferior to using something like Criterium, but that's probably a separate discussion. (Some of the reasons are garbage collection, JVM warmup and inlining, funky default jvm settings on both Emacs and lein, boxed and checked math, etc.)
Still wrong for many of the above reasons but slightly more useful comparison:
(require '[clojure.core.reducers :as r])
(def v (vec (range 800000)))
(dotimes [_ 100] (time (reduce + v)))
(dotimes [_ 100] (time (r/fold + v)))
Watch the best times from each of the last 2 runs.

Related

How can I make this clojure code run faster?

I have a version implemented in Lisp(SBCL) which runs under 0.001 seconds with 12 samples. However this version(in clojure) takes more than 1.1 secs.
What should I do for making this code run as fast as original Lisp version?
To make it sure, my numbers are not including times for starting repl and others. And is from time function of both sbcl and clojure.(Yes, my laptop is rather old atom based one)
And this application is/will be used in the repl, not as compiled in single app, so running thousand times before benchmarking seems not meaningful.
Oh, the fbars are like this: [[10.0 10.5 9.8 10.1] [10.1 10.8 10.1 10.7] ... ], which is
Open-High-Low-Close price bars for the stocks.
(defn- build-new-smpl [fmx fmn h l c o]
(let [fmax (max fmx h)
fmin (min fmn l)
fc (/ (+ c fmax fmin) 3.0)
fcd (Math/round (* (- fc o) 1000))
frd (Math/round (* (- (* 2.0 c) fmax fmin) 1000))]
(if (and (> fcd 0) (> frd 0))
[1 fmax fmin]
(if (and (< fcd 0) (< frd 0))
[-1 fmax fmin]
[0 fmax fmin]))))
(defn binary-smpls-using [fbars]
(let [fopen (first (first fbars))]
(loop [fbars fbars, smpls [], fmax fopen, fmin fopen]
(if (> (count fbars) 0)
(let [bar (first fbars)
[_ h l c _] bar
[nsmpl fmx fmn] (build-new-smpl fmax fmin h l c fopen)]
(recur (rest fbars) (conj smpls nsmpl) fmx fmn))
smpls))))
================================================
Thank you. I managed to make the differences for 1000 iteration as 0.5 secs (1.3 secs on SBCL and 1.8 on Clojure). Major factor is I should have created fbars as not lazy but as concrete(?) vector or array, and this solves my problem.

You need to use a proper benchmarking library; the standard Clojure solution is Hugo Duncan's Criterium.
The reason is that code on the JVM starts running in interpreted mode and then eventually gets compiled by the JIT compiler; it's the steady-state behaviour after JIT compilation that you want to benchmark and not behaviour during the profiling stage. This is, however, quite tricky, since the JIT compiler optimizes no-ops away where it can see them, so you need to make sure your code causes side effects that won't be optimized away, but then you still need to run it in a loop to obtain meaningful results etc. -- quick and dirty solutions just don't cut it. (See the Elliptic Group, Inc. Java benchmarking article, also linked to by Criterium's README, for an extended discussion of the issues involved.)
Cycling the two samples you listed in a vector of length 1000 results in a timing of ~327 µs in a Criterium benchmark on my machine:
(require '[criterium.core :as c])
(def v (vec (take 1000 (cycle [[10.0 10.5 9.8 10.1] [10.1 10.8 10.1 10.7]]))))
(c/bench (binary-smpls-using v))
WARNING: Final GC required 4.480116525558204 % of runtime
Evaluation count : 184320 in 60 samples of 3072 calls.
Execution time mean : 327.171892 µs
Execution time std-deviation : 3.129050 µs
Execution time lower quantile : 322.731261 µs ( 2.5%)
Execution time upper quantile : 333.117724 µs (97.5%)
Overhead used : 1.900032 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
A really good benchmark would actually involve an interesting dataset (all different samples, preferably coming from the real world).

When I run this with 1000 samples I get an answer in 46 ms, though here are some common tips on making clojure code faster:
turn on reflection warnings:
(set! warn-on-reflection true)
add type hints until the reflection warnings go away, which isn't a problem here
make it a lazy sequence so you don't have to construct huge sequences in ram, which results in lots of GC overhead. (in some cases this adds overhead though I think it's a decent idea here)
in cases like this see if incanter can do the job for you (though that may be cheating)
time the creation of the result excluding the time it takes the repl to print it.

Code works in stk-simply but not in mit-scheme

I have been reading from the courses CS61A (Spring 2011) from Berkeley Opencourseware and MIT 6.001 from OCW. One uses STk (called as stk-simply) and the other uses mit-scheme as programming language for the lectures.
I just wrote a simple square root procedure using Heron's method. The file was saved as sqrt.scm
;; setting an accuracy or tolerance to deviation between the actual and the expected values
(define tolerance 0.001)
;; gives average of two numbers
(define (average x y)
(/ (+ x y) 2))
;; gives the absolute values of any number
(define (abs x)
(if (< x 0) (- x) x))
;; gives the square of a number
(define (square x) (* x x))
;; tests whether the guess is good enough by checking if the difference between square of the guess
;; and the number is tolerable
(define (good-enuf? guess x)
(< (abs (- (square guess) x)) tolerance))
;; improves the guess by averaging guess the number divided by the guess
(define (improve guess x)
(average (/ x guess) guess))
;; when a tried guess does not pass the good-enuf test then the tries the improved guess
(define (try guess x)
(if (good-enuf? guess x)
guess
(try (improve guess x) x)))
;; gives back square root of number by starting guess with 1 and then improving the guess until good-enuf
(define (sqr-root x)
(try 1 x))
This works fine in STk.
sam#Panzer:~/code/src/scheme$ sudo stk-simply
[sudo] password for sam:
Welcome to the STk interpreter version 4.0.1-ucb1.3.6 [Linux-2.6.16-1.2108_FC4-i686]
Copyright (c) 1993-1999 Erick Gallesio - I3S - CNRS / ESSI <eg#unice.fr>
Modifications by UCB EECS Instructional Support Group
Questions, comments, or bug reports to <inst#EECS.Berkeley.EDU>.
STk> (load "sqrt.scm")
okay
STk> (sqr-root 25)
5.00002317825395
STk>
But not in scheme.
sam#Panzer:~/code/src/scheme$ sudo scheme
[sudo] password for sam:
MIT/GNU Scheme running under GNU/Linux
Type `^C' (control-C) followed by `H' to obtain information about interrupts.
Copyright (C) 2011 Massachusetts Institute of Technology
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Image saved on Thursday October 27, 2011 at 7:44:21 PM
Release 9.1 || Microcode 15.3 || Runtime 15.7 || SF 4.41 || LIAR/i386 4.118 || Edwin 3.116
1 ]=> (load "sqrt.scm")
;Loading "sqrt.scm"... done
;Value: sqr-root
1 ]=> (sqr-root 25)
;Value: 1853024483819137/370603178776909
1 ]=>
I checked the manuals but I was unable to find the cause. Can anyone let me know why is this? Or is there an error in the code which I am ignoring?. I am a beginner in both scheme & STk.

Try this, now the code should return the same results in both interpreters (ignoring tiny rounding differences):
(define (sqr-root x)
(exact->inexact (try 1 x)))
The answer was the same all along, it's just that MIT Scheme is producing an exact result by default, whereas STk is returning an inexact value. With the above code in place, we convert the result to an inexact number after performing the calculation. Alternatively, we could perform the conversion from the beginning (possibly losing some precision in the process):
(define (sqr-root x)
(try 1 (exact->inexact x)))
This quote explains the observed behavior:
Scheme numbers are either exact or inexact. A number is exact if it was written as an exact constant or was derived from exact numbers using only exact operations. A number is inexact if it was written as an inexact constant, if it was derived using inexact ingredients, or if it was derived using inexact operations. Thus inexactness is a contagious property of a number.

You'd have both answers look the same if in MIT Scheme you had written:
(sqr-root 25.)
since 'inexactness' is 'sticky'

How to map a function over a list in parallel in racket?

The question title says it all, really: What is the best way map a function over a list in parallel in racket? Thanks.

If you mean over multiple processor cores, then the most general approach is to use Places.
Places enable the development of parallel programs that take advantage of machines with multiple processors, cores, or hardware threads.
A place is a parallel task that is effectively a separate instance of the Racket virtual machine. Places communicate through place channels, which are endpoints for a two-way buffered communication.
You might be able to use the other parallelization technique, Futures, but the conditions for it to work are relatively limited, for example floating point operations, as described here.
EDIT: In response to the comment:
Is there an implementation of parallel-map using places somewhere?
First, I should back up. You might not need Places. You can get concurrency using Racket threads. For example, here's a map/thread:
#lang racket
(define (map/thread f xs)
;; Make one channel for each element of xs.
(define cs (for/list ([x xs])
(make-channel)))
;; Make one thread for each elemnet of xs.
;; Each thread calls (f x) and puts the result to its channel.
(for ([x xs]
[c cs])
(thread (thunk (channel-put c (f x)))))
;; Get the result from each channel.
;; Note: This will block on each channel if not yet ready.
(for/list ([c cs])
(channel-get c)))
;; Use:
(define xs '(1 2 3 4 5))
(map add1 xs)
(map/thread add1 xs)
If the work being done involves blocking, e.g. I/O requests, this will give you "parallelism" in the sense of not being stuck on I/O. However Racket threads are "green" threads so only one at a time will be using the CPU.
If you truly need parallel use of multiple CPU cores, then you would need Futures or Places.
Due to the way Places are implemented --- effectively as multiple instances of Racket --- I don't immediately see how to write a generic map/place. For examples of using places in a "bespoke" way, see:
Places in the Guide
Places in the Reference

I don't know in Racket, but I implemented a version in SISC Scheme.
(define (map-parallel fn lst)
(call-with-values (lambda ()
(apply parallel
(map (lambda (e)
(delay (fn e)))
lst)))
list))
Only parallel is not R5RS.
Example of use:
Using a regular map:
(time (map (lambda (a)
(begin
(sleep 1000)
(+ 1 a)))
'(1 2 3 4 5)))
=> ((2 3 4 5 6) (5000 ms))
Using the map-parallel:
(time (map-parallel (lambda (a)
(begin
(sleep 1000)
(+ 1 a)))
'(1 2 3 4 5)))
=> ((2 3 4 5 6) (1000 ms))

Efficiency: recursion vs loop

This is just curiosity on my part, but what is more efficient, recursion or a loop?
Given two functions (using common lisp):
(defun factorial_recursion (x)
(if (> x 0)
(* x (factorial_recursion (decf x)))
1))
and
(defun factorial_loop (x)
(loop for i from 1 to x for result = 1 then
(* result i) finally
(return result)))
Which is more efficient?

I don't even have to read your code.
Loop is more efficient for factorials. When you do recursion, you have up to x function calls on the stack.
You almost never use recursion for performance reasons. You use recursion to make the problem more simple.

Mu.
Seriously now, it doesn't matter. Not for examples this size. They both have the same complexity. If your code is not fast enough for you, this is probably one of the last places you'd look at.
Now, if you really want to know which is faster, measure them. On SBCL, you can call each function in a loop and measure the time. Since you have two simple functions, time is enough. If your program was more complicated, a profiler would be more useful. Hint: if you don't need a profiler for your measurements, you probably don't need to worry about performance.
On my machine (SBCL 64 bit), I ran your functions and got this:
CL-USER> (time (loop repeat 1000 do (factorial_recursion 1000)))
Evaluation took:
0.540 seconds of real time
0.536034 seconds of total run time (0.496031 user, 0.040003 system)
[ Run times consist of 0.096 seconds GC time, and 0.441 seconds non-GC time. ]
99.26% CPU
1,006,632,438 processor cycles
511,315,904 bytes consed
NIL
CL-USER> (time (loop repeat 1000 do (factorial_loop 1000)))
Evaluation took:
0.485 seconds of real time
0.488030 seconds of total run time (0.488030 user, 0.000000 system)
[ Run times consist of 0.072 seconds GC time, and 0.417 seconds non-GC time. ]
100.62% CPU
902,043,247 processor cycles
511,322,400 bytes consed
NIL
After putting your functions in a file with (declaim (optimize speed)) at the top, the recursion time dropped to 504 milliseconds and the loop time dropped to 475 milliseconds.
And if you really want to know what's going on, try dissasemble on your functions and see what's in there.
Again, this looks like a non-issue to me. Personally, I try to use Common Lisp like a scripting language for prototyping, then profile and optimize the parts that are slow. Getting from 500ms to 475ms is nothing. For instance, in some personal code, I got a couple of orders of magnitude speedup by simply adding an element type to an array (thus making the array storage 64 times smaller in my case). Sure, in theory it would have been faster to reuse that array (after making it smaller) and not allocate it over and over. But simply adding :element-type bit to it was enough for my situation - more changes would have required more time for very little extra benefit. Maybe I'm sloppy, but 'fast' and 'slow' don't mean much to me. I prefer 'fast enough' and 'too slow'. Both your functions are 'fast enough' in most cases (or both are 'too slow' in some cases) so there's no real difference between them.

If you can write recursive functions in such a way that the recursive call is the very last thing done (and the function is thus tail-recursive) and the language and compiler/interpreter you are using supports tail recursion, then the recursive function can (usually) be optimised into code that is really iterative, and is as fast as an iterative version of the same function.
Sam I Am is correct though, iterative functions are usually faster than their recursive counterparts. If a recursive function is to be as fast as an iterative function that does the same thing, you have to rely on the optimiser.
The reason for this is that a function call is much more expensive than a jump, plus you consume stack space, a (very) finite resource.
The function you give is not tail recursive because you call factorial_recursion and then you multiply it by x. An example of a tail-recursive version would be
(defun factorial-recursion-assist (x cur)
(if (> x 1)
(factorial-recursion-assist (- x 1) (+ cur (* (- x 1) x)))
cur))
(defun factorial-recursion (x)
(factorial-recursion-assist x 1))
(print (factorial-recursion 4))

Here's a tail-recursive factorial (I think):
(defun fact (x)
(funcall (alambda (i ret)
(cond ((> i 1)
(self (1- i) (* ret i)))
(t
ret)))
x 1))

Generating list of million random elements

How to efficiently generate a list of million random elements in scheme? The following code hits maximum recursion depth with 0.1 million itself.
(unfold (lambda(x)(= x 1000000)) (lambda(x)(random 1000)) (lambda(x)(+ x 1)) 0)

It really depends on the system you're using, but here's a common way to do that in plain scheme:
(let loop ([n 1000000] [r '()])
(if (zero? n)
r
(loop (- n 1) (cons (random 1000) r))))
One note about running this code as is: if you just type it into a REPL, it will lead to printing the resulting list, and that will usually involve using much more memory than the list holds. So it's better to do something like
(define l ...same...)
There are many other tools that can be used to varying degrees of convenience. unfold is one of them, and another is for loops as can be found in PLT Scheme:
(for/list ([i (in-range 1000000)]) (random 1000))

I don't know much scheme but couldn't you just use tail-recursion (which is really just looping) instead of unfold (or any other higher-order function)?

Use the do-loop-construct as described here.

Some one correct me if I am wrong but the Fakrudeen's code should end up being optimized away since it is tail recursive. Or it should be with a proper implementation of unfold. It should never reach a maximum recursion depth.
What version of scheme are you using Fakrudeen?
DrScheme does not choke on a mere million random numbers.

Taking Chicken-Scheme as implementation, here is a try with some results.
(use srfi-1)
(use extras)
(time (unfold (lambda(x)(= x 1000000))
(lambda(x)(random 1000))
(lambda(x)(+ x 1)) 0))
(time (let loop ([n 1000000] [r '()])
(if (zero? n)
r
(loop (- n 1) (cons (random 1000) r)))))
(define (range min max body)
(let loop ((current min) (ret '()))
(if (= current max)
ret
(loop (+ current 1) (cons (body current ret) ret)))))
(time (range 0 1000000 (lambda params (random 1000))))
The results are here with csc -O3 t.scm
0.331s CPU time, 0.17s GC time (major), 12/660 GCs (major/minor)
0.107s CPU time, 0.02s GC time (major), 1/290 GCs (major/minor)
0.124s CPU time, 0.022s GC time (major), 1/320 GCs (major/minor)
As you can see, the version of the author is much more slowlier than using plain tail recursive calls. It's hard to say why the unfold call is much more slowlier but I'd guess that it's because it taking a lot more time doing function calls.
The 2 other versions are quite similar. My version is almost the same thing with the exception that I'm creating a high order function that can be reused.
Unlike the plain loop, it could be reused to create a range of function. The position and current list is sent to the function in case they are needed.
The higher order version is probably the best way to do even if it takes a bit more time to execute. It is probably also because of the function calls. It could be optimized by removing parameters and it will get almost as fast as the named let.
The advantage of the higher order version is that the user doesn't have to write the loop itself and can be used with an abstract lambda function.
Edit
Looking at this specific case. Ef we are to create a million of element ranged between 0 and 999, we could possibly create a fixed length vector of a million and with values from 0 to 999 in it. Shuffle the thing back after. Then the whole random process would depend on the shuffle function which should not have to create new memory swapping values might get faster than generating random numbers. That said, the shuffle method somewhat still rely on random.
Edit 2
Unless you really need a list, you could get away with a vector instead.
Here is my second implementation with vector-map
(time (vector-map (lambda (x y) (random 1000)) (make-vector 1000000)))
# 0.07s CPU time, 0/262 GCs (major/minor)
As you can see, it is terribly faster than using a list.
Edit 3 fun
(define-syntax bigint
(er-macro-transformer
(lambda (exp rename compare)
(let ((lst (map (lambda (x) (random 1000)) (make-list (cadr exp)))))
(cons 'list lst)))))
100000
0.004s CPU time, 0/8888 GCs (major/minor)
It's probably not a good idea to use this but I felt it might be interesting. Since it's a macro, it will get executed at compile time. The compile time will be huge, but as you can see, the speed improvement is also huge. Unfortunately using chicken, I couldn't get it to build a list of a million. My guess is that the type it might use to build the list is overflowing and accessing invalid memory.
To answer the question in the comments:
I'm not a Scheme professional. I'm pretty new to it too and as I understand, the named loop or the high order function should be the way to go. The high order function is good because it's reusable. You could define a
(define (make-random-list quantity maxran)
...)
Then thats the other interesting part, since scheme is all about high order functions. You could then replace the implementation of make-random-list with anything you like. If you need some compile time execution, define the macro otherwise use a function. All that really matters is to be able to reuse it. It has to be fast and not use memory.
Common sense tells you that doing less execution it will be faster, tail recursive calls aren't suppose to consume memory. And when you're not sure, you can hide implementation into a function that can be optimized later.

MIT Scheme limits a computation's stack. Given the size of your problem, you are likely running out of stack size. Fortunately, you can provide a command-line option to change the stack size. Try:
$ mit-scheme --stack <number-of-1024-word-blocks>
There are other command-line options, check out mit-scheme --help
Note that MIT Scheme, in my experience, is one of the few schemes that has a limited stack size. This explains why trying your code in others Schemes will often succeed.
As to your question of efficiency. The routine unfold is probably not implemented with a tail-recursive/iterative algorithm. Here is a tail recursive version with a tail recursive version of 'list reverse in-place':
(define (unfold stop value incr n0)
(let collecting ((n n0) (l '()))
(if (stop n)
(reverse! l)
(collecting (incr n) (cons (value n) l)))))
(define (reverse! list)
(let reving ((list list) (rslt '()))
(if (null? list)
rslt
(let ((rest (cdr list)))
(set-cdr! list rslt)
(reving rest list)))))
Note:
$ mit-scheme --version
MIT/GNU Scheme microcode 15.3
Copyright (C) 2011 Massachusetts Institute of Technology
This is free software; see the source for copying conditions. There is NO warranty; not even
for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Image saved on Tuesday November 8, 2011 at 10:45:46 PM
Release 9.1.1 || Microcode 15.3 || Runtime 15.7 || SF 4.41 || LIAR/x86-64 4.118 || Edwin 3.116
Moriturus te saluto.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio