Rewriting sequential code for parallel execution - parallel-processing

What is the best idiomatic approach to rewriting sequential Common Lisp code for parallel execution?
There are some good libraries, like lparallel, that help with simple cases. For example, if we have a mapcar over a long list, we can replace it with lparallel:pmapcar, and it will do the job in most cases. Now I have a loop that takes the result of a remote JSON API call and nconcs it into a list:
(loop :for offset :from 0 :by 100
      :for result = (get-remote-data offset)
      :until (null result) :nconc result)
How can I replace it so that get-remote-data is called in parallel, without needing to change get-remote-data itself? Are there any standard, idiomatic ways? Any good reading on this topic would also help. Thanks.

I have used chanl for such a use case to set up a message queue. I
started n worker threads that made the remote calls and sent the
results to the queue. An aggregator took the results from the queue
and concatenated them.
If order is important, this might not be right. You could perhaps
pre-define a result array that is filled in at the defined separate
offsets by the workers.
EDIT: In order to get an unknown number of pages, you could use an
atomic offset counter and an atomic flag. The worker threads (from a
fixed pool) then check the flag, get the next offset from the counter,
make the remote call, finally either send the result to the queue or,
if the result is empty, flip the flag off. If the flag is flipped off
any worker thread checking it shuts itself down. When the worker
thread pool is empty, you are finished.
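The counter-and-flag scheme can be sketched roughly as follows. This is only an illustration in JavaScript, not Common Lisp with chanl: `fetchAllPages`, `getRemoteData` and `fakeRemote` are hypothetical names, and async workers stand in for threads. Because JavaScript runs them on a single thread, the shared counter and flag need no atomic operations here; with real threads they would. Results are keyed by offset so the original order can be restored at the end.

```javascript
// Fixed pool of async workers sharing an offset counter and a "keep going" flag.
async function fetchAllPages(getRemoteData, { workers = 4, step = 100 } = {}) {
  let nextOffset = 0;      // shared counter
  let running = true;      // shared flag
  const pages = new Map(); // offset -> page, so order can be restored at the end

  async function worker() {
    while (running) {
      const offset = nextOffset;
      nextOffset += step;
      const page = await getRemoteData(offset);
      if (page.length === 0) {
        running = false;   // empty result: tell every worker to stop
      } else {
        pages.set(offset, page);
      }
    }
  }

  // Finished when every worker in the pool has shut itself down.
  await Promise.all(Array.from({ length: workers }, () => worker()));

  return [...pages.keys()]
    .sort((a, b) => a - b)
    .flatMap((offset) => pages.get(offset));
}

// Hypothetical stand-in for the remote API: 250 items served 100 at a time.
const fakeRemote = async (offset) => {
  const page = [];
  for (let i = offset; i < Math.min(offset + 100, 250); i++) page.push(i);
  return page;
};

fetchAllPages(fakeRemote).then((items) => console.log(items.length)); // 250
```

Note that a few requests past the last page are still issued before the flag is seen, which is the price of not knowing the page count in advance.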

I came up with the following using lparallel:
;; Assumes a kernel already exists, e.g.
;; (setf lparallel:*kernel* (lparallel:make-kernel 8))
(defun get-datum (n)
  ;; Stand-in for the remote call: returns an empty result past 1000
  (sleep (random 2))
  (if (> n 1000)
      ()
      (list n)))
(defun get-data ()
  (let ((channel (lparallel:make-channel))
        results)
    (flet ((collect (item)
             (setq results (append item results))))
      ;; Ask lparallel to schedule the first 8 requests
      (loop for i from 0 to 70 by 10
            do (lparallel:submit-task channel #'get-datum i))
      ;; Schedule an additional request each time one returns
      ;; until we get a null result
      (loop for i from 80 by 10
            for result = (lparallel:receive-result channel)
            while result
            do (lparallel:submit-task channel #'get-datum i)
               (collect result))
      ;; Wait for the 7 requests that are still outstanding
      (loop repeat 7
            do (collect (lparallel:receive-result channel)))
      results)))

Related

How to generate random numbers in [0 ... 1.0] in Common Lisp

My understanding of Common Lisp pseudorandom number generation is that (random 1.0) will generate a fraction strictly less than 1. I would like to get numbers up to and including 1.0. Is this possible? I guess I could decide on a degree of precision, generate integers, and divide by the range, but I'd like to know if there is a more widely accepted way of doing this. Thanks.
As you say, random will generate numbers in [0,1) by default, and in general (random x) will generate random numbers in [0,x). If these were real numbers and if the distribution really is random, then the probability of getting any number is zero, so this is effectively no different than [0,1]. But they're not real numbers: they're floats, so the probability of getting any particular value is higher since there are only a finite number of floats in [0,1].
Fortunately you can express exactly what you want: CL has a bunch of constants with names like *-epsilon which are defined so that, for instance
(/= (+ 1.0f0 single-float-epsilon) 1.0f0)
and single-float-epsilon is the smallest single-float for which this is true.
Thus (random (+ 1.0f0 single-float-epsilon)) will produce random single-floats in the range [0,1], and will eventually probably turn out 1.0f0. You can test this:
(defun tsit ()
  (let ((f (+ 1.0f0 single-float-epsilon)))
    (assert (/= f 1.0f0) (f) "oops")
    (loop for i upfrom 1
          for v = (random f)
          when (= v 1.0f0)
            return (values i v))))
And for me
> (tsit)
12839205
1.0
If you use double floats it takes ... quite a lot longer ... to get 1.0d0 (and remember to use double-float-epsilon).
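For comparison, the integer-division idea mentioned in the question can be sketched like this, in JavaScript purely for illustration. `randomInclusive` and its `steps` parameter are hypothetical names. The result is uniform over a grid of steps + 1 values rather than over every float in [0, 1], which is the precision trade-off the question alludes to.

```javascript
// Uniform pick from the grid {0, 1/steps, 2/steps, ..., 1}; both ends are reachable.
// `rng` must return a number in [0, 1), like Math.random.
function randomInclusive(steps = 1000000, rng = Math.random) {
  const k = Math.floor(rng() * (steps + 1)); // integer in [0, steps]
  return k / steps;
}

console.log(randomInclusive());
```

Passing the generator in as `rng` makes the end points easy to check with a stub instead of waiting millions of draws for a 1.0.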
I have a bit of a different idea here. Instead of trying to stretch the range over an epsilon, we can work with the original range, and pick a victim number somewhere in that range which gets mapped to the range limit. We can avoid a hard-coded victim by choosing one randomly, and changing it from time to time:
(defun make-random-gen (range)
  (let ((victim nil)
        (count 1))
    (lambda ()
      (when (zerop (decf count))
        (setf count 10000
              victim (random range)))
      (let ((out (random range)))
        (if (eql out victim) range out)))))
(defun testit ()
  (loop with r = (make-random-gen 1.0)
        for x = (funcall r)
        until (eql x 1.0)
        counting t))
At the listener:
[5]> (testit)
23030093
There is a small bias here in that the victim is never equal to range. So that is to say, the range value such as 1.0 is never victim and therefore always has a certain chance of occurring. Whereas every other value can potentially take a turn at being victim, having its chance of occurring temporarily reduced to zero. That should be faintly detectable in a statistical analysis of the output in that the range value will occur slightly more often than any other value.
It would be interesting to update this approach with a correction for that, an attempt to do which is this:
(defun make-random-gen (range)
  (let ((victim nil)
        (count 1))
    (labels ((gen ()
               (when (zerop (decf count))
                 (setf count 10000
                       victim (gen)))
               (let ((out (random range)))
                 (if (eql out victim) range out))))
      #'gen)))
Now when we select victim, we recurse on our own function which can potentially select range. Whenever range is selected as victim, that value is correctly suppressed: range will not occur in the output, because out will never be eql to range.
We can justify this with the following hand-waving argument:
Let us suppose that the recursive call to gen has a slight bias in favor of range being output. But whenever that happens, range is selected as victim, which prevents it from appearing in the output of gen.
There is a kind of negative feedback which should almost entirely correct the bias.
Note: our random-number-generating lambda would be better designed if it also captured a random state object also and used that. Then the sequence it yields would be undisturbed by other uses of the pseudo-random-number generator. That's a different topic.
On a theoretical note, neither [0, 1) nor [0, 1] yields a strictly correct distribution. If we had a mathematically ideal PRNG, it would yield actual real numbers in these ranges. Since that range contains an uncountable infinity of real values (more than aleph-null of them), each one would occur with zero probability: informally, one divided by the cardinality of the continuum, which cannot be distinguished from a real zero.
What we want is the floating-point PRNG to approximate the ideal PRNG.
The problem is that each floating-point value approximates a range of real values. So this means that if we have a generator of values in the range 0.0 to 1.0, it actually represents a range of real numbers from -epsilon to 1.0 + epsilon. If we take values from this PRNG and plot a bar graph of values, each bar in the graph has to have some nonzero width. The 0.0 bar is centered on 0, and the 1.0 bar is centered on 1. The distribution of real numbers extends from the left edge of the left bar, to the right edge of the right bar.
In order to create a PRNG which mimics an even distribution of values in the 0.0 to 1.0 interval, we have to include the 0.0 and 1.0 values with half probability. So that is to say, when we collect a large number of values from the PRNG, the 0.0 and 1.0 bars of the graph should be about half as high as all the other bars.
Under these conditions, we cannot distinguish the [0, 1.0) interval from the [0, 1.0] interval because they are exactly as large. We must include the 1.0 value, at about half the usual probability to account for the above uniformity problem. If we simply exclude that value, we create a bias in the wrong direction, because the 1.0 bar in the histogram now has a zero value.
One way we could rescue the situation might be to take the 1.0-epsilon bar of the histogram and make that value 50% more likely, so that the bar is 50% taller than average. Basically, we overload that last value of the range just before 1.0 to represent everything up to and not including 1.0, requiring that value to be more likely. And then, we exclude the 1.0 value from the output. All values approaching 1.0 from the left get mapped to the extra 50% probability of 1.0 - epsilon.
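The half-height end bars can be seen in a small deterministic experiment: snap evenly spaced points of the unit interval to a grid and count how many land on each grid value. This is a sketch in JavaScript; `histogram` and its parameters are illustrative names, and a coarse grid of five values stands in for the floats.

```javascript
// Round `samples` evenly spaced points of the unit interval to the nearest
// multiple of 1/cells and count the hits per grid value.
function histogram(cells, samples) {
  const counts = new Array(cells + 1).fill(0);
  for (let k = 0; k < samples; k++) {
    const u = (k + 0.5) / samples;      // midpoint of the k-th slice of [0, 1)
    counts[Math.round(u * cells)] += 1; // nearest grid index, 0..cells
  }
  return counts;
}

console.log(histogram(4, 8000)); // counts: 1000, 2000, 2000, 2000, 1000
```

The bars for 0.0 and 1.0 each collect reals from only one side, so they come out half as tall as the interior bars.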

What are the trade-offs when moving to functional programming?

Non-Functional way:
arr = [1, 2, 3] becomes arr = [1, 5, 3] . Here we change same array.
This is discouraged in functional programming. I know that since computers are becoming faster and faster every day and there is more memory to store, functional programming seems more feasible for better readability and clean code.
Functional way:
arr = [1, 2, 3] isn't changed arr2 = [1, 5, 3]. I see a general trend that we use more memory and time to just change one variable.
Here, we doubled our memory and the time complexity changed from O(1) to O(n).
This might be costly for bigger algorithms. Where is this compensated? Or, since we can afford costlier calculations (like when quantum computing becomes mainstream), do we just trade speed off for readability?
Functional data structures don't necessarily take up a lot more space or require more processing time. The important aspect here is that purely functional data structures are immutable, but that doesn't mean you always make a complete copy of something. In fact, the immutability is precisely the key to working efficiently.
I'll provide as an example a simple list. Suppose we have the following list: (1, 2, 3).
The head of the list is element 1. The tail of the list is (2, 3). Suppose this list is entirely immutable.
Now, we want to add an element at the start of that list. Our new list must look like this: (0, 1, 2, 3).
You can't change the existing list; it is immutable. So, we have to make a new one, right? However, note how the tail of our new list is (1, 2, 3). That's identical to the old list. So, you can just re-use that. The new list is simply the element 0 with a pointer to the start of the old list as its tail.
If our lists were mutable, this would not be safe. If you changed something in the old list (for example, replacing element 2 with a different one) the change would reflect in the new list as well. That's exactly where the danger is in mutability: concurrent access on data structures needs to be synchronized to avoid unpredictable results, and changes can have unintended side-effects. But, because that can't happen with immutable data structures, it's safe to re-use part of another structure in a new one. Sometimes you want changes in one thing to reflect in another; for example, when you remove an entry in the key set of a Map in Java, you want the mapping itself to be removed too. But in other situations mutability leads to trouble (the infamous Calendar class in Java).
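A minimal sketch of that sharing, in JavaScript with frozen objects standing in for immutable list cells (`cell`, `oldList` and `newList` are names chosen for this illustration):

```javascript
// An immutable list cell: a head value plus a reference to the rest of the list.
const cell = (head, tail) => Object.freeze({ head, tail });

const oldList = cell(1, cell(2, cell(3, null))); // (1, 2, 3)
const newList = cell(0, oldList);                // (0, 1, 2, 3)

// No elements were copied: the new list's tail *is* the old list.
console.log(newList.tail === oldList); // true
```

Prepending therefore costs one new cell regardless of how long the old list is, and the old list remains valid for anyone still holding it.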
So how can this work, if you can't change the data structure itself? How do you make a new list? Remember that if we're working purely functionally, we move away from the classical data structures with changeable pointers, and instead evaluate functions.
In functional languages, making lists is done with the cons function. cons makes a "cell" of two elements. If you want to make a list with only one element, the second one is nil. So a list with only one 3 element is:
(cons 3 nil)
If the above is a function and you ask what its head is, you get 3. Ask for the tail, you get nil. Now, the tail itself can be a function, like cons.
Our first list then is expressed as such:
(cons 1 (cons 2 (cons 3 nil)))
Ask the head of the above function and you get 1. Ask for the tail and you get (cons 2 (cons 3 nil)).
If we want to prepend 0 at the front, you just make a new function that evaluates to cons with 0 as head and the above as tail.
(cons 0 (cons 1 (cons 2 (cons 3 nil))))
Since the functions we make are immutable, our lists become immutable. Things like adding elements is a matter of making a new function that calls the old one in the right place. Traversing a list in the imperative and object-oriented way is going through pointers to get from one element to another. Traversing a list in the functional way is evaluating functions.
I like to think of data structures as this: a data structure is basically storing the result of running some algorithm in memory. It "caches" the result of computation, so we don't have to do the computation every time. Purely functional data structures model the computation itself via functions.
This in fact means that it can be quite memory efficient because a lot of data copying can be avoided. And with an increasing focus on parallelization in processing, immutable data structures can be very useful.
EDIT
Given the additional questions in the comments, I'll add a bit to the above to the best of my abilities.
What about my example? Is it something like cons(1 fn) and that function can be cons(2 fn2) where fn2 is cons(3 nil) and in some other case cons(5 fn2)?
The cons function is best compared to a single-linked list. As you might imagine, if you're given a list composed of cons cells, what you're getting is the head and thus random access to some index isn't possible. In your array you can just call arr[1] and get the second item (since it's 0-indexed) in the array, in constant time. If you state something like val list = (cons 1 (cons 2 (cons 3 nil))) you can't just ask the second item without traversing it, because list is now actually a function you evaluate. So access requires linear time, and access to the last element will take longer than access to the head element. Also, given that it's equivalent to a single-linked list, traversal can only be in one direction. So the behavior and performance is more like that of a single-linked list than of, say, an arraylist or array.
Purely functional data structures don't necessarily provide better performance for some operations such as indexed access. A "classic" data structure may have O(1) for some operation where a functional one may have O(log n) for the same one. That's a trade-off; functional data structures aren't a silver bullet, just like object-orientation wasn't. You use them where they make sense. If you're always going to traverse a whole list or part of it and want to be capable of safe parallel access, a structure composed of cons cells works perfectly fine. In functional programming, you'd often traverse a structure using recursive calls where in imperative programming you'd use a for loop.
There are of course many other functional data structures, some of which come much closer to modeling an array that allows random access and updates. But they're typically a lot more complex than the simple example above. There are of course advantages: parallel computation can be trivially easy thanks to immutability; memoization allows us to cache the results of function calls based on inputs, since a purely functional approach always yields the same result for the same input.
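As a small illustration of the memoization point, here is a sketch in JavaScript. `memoize`, `slowSquare` and the `calls` counter are hypothetical names used only to show that the underlying function runs once per distinct input; the caching is only safe because the wrapped function is pure.

```javascript
// Wrap a pure one-argument function so each distinct input is computed once.
function memoize(fn) {
  const cache = new Map();
  return (arg) => {
    if (!cache.has(arg)) cache.set(arg, fn(arg));
    return cache.get(arg);
  };
}

let calls = 0; // only here to observe how often the real work happens
const slowSquare = (n) => { calls += 1; return n * n; };
const fastSquare = memoize(slowSquare);

console.log(fastSquare(12), fastSquare(12), calls); // 144 144 1
```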
What are we actually storing underneath? If we need to traverse a list, we need a mechanism to point to next elements right? or If I think a bit, I feel like it is irrelevant question to traverse a list since whenever a list is required it should probably be reconstructed everytime?
We store data structures containing functions. What is a cons? A simple structure consisting of two elements: a head and tail. It's just pointers underneath. In an object-oriented language like Java, you could model it as a class Cons that contains two final fields head and tail assigned on construction (immutable) and has corresponding methods to fetch these. This in a LISP variant
(cons 1 (cons 2 nil))
would be equivalent to
new Cons(1, new Cons(2, null))
in Java.
The big difference in functional languages is that functions are first-class types. They can be passed around and assigned to variables just like object references. You can compose functions. I could just as easily do this in a functional language
val list = (cons 1 (max 2 3))
and if I ask list.head I get 1, if I ask list.tail I get (max 2 3) and evaluating that just gives me 3. You compose functions. Think of it as modeling behavior instead of data. Which brings us to
Could you elaborate "Purely functional data structures model the computation itself via functions."?
Calling list.tail on our above list returns something that can be evaluated and then returns a value. In other words, it returns a function. If I call list.tail in that example it returns (max 2 3), clearly a function. Evaluating it yields 3 as that's the highest number of the arguments. In this example
(cons 1 (cons 2 nil))
calling tail evaluates to a new cons (the (cons 2 nil) one) which in turn can be used.
Suppose we want a sum of all the elements in our list. In Java, before the introduction of lambdas, if you had an array int[] array = new int[] {1, 2, 3} you'd do something like
int sum = 0;
for (int i = 0; i < array.length; ++i) {
sum += array[i];
}
In a functional language it would be something like (simplified pseudo-code)
(define sum (arg)
(eq arg nil
(0)
(+ arg.head (sum arg.tail))
)
)
This uses prefix notation like we've used with our cons so far. So a + b is written as (+ a b). define lets us define a function, with as arguments the name (sum), a list of arguments for the function ((arg)), and then the actual function body (the rest).
The function body consists of an eq function which we'll define as comparing its first two arguments (arg and nil) and if they're equal it evaluates to its next argument ((0) in this case), otherwise to the argument after that (the sum). So think of it as (eq arg1 arg2 true false) with true and false whatever you want (a value, a function...).
The recursion bit then comes in the sum (+ arg.head (sum arg.tail)). We're stating that we take the addition of the head of the argument with a recursive call to the sum function itself on the tail. Suppose we do this:
val list = (cons 1 (cons 2 (cons 3 nil)))
(sum list)
Mentally step through what that last line would do to see how it evaluates to the sum of all the elements in list.
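For readers who want to run the stepping exercise, here is a rough JavaScript rendering of the same idea. `pair` and `sumList` are names picked for this sketch, with plain objects playing the role of cons cells and null playing the role of nil.

```javascript
const pair = (head, tail) => ({ head, tail });

// The empty list sums to 0; otherwise add the head to the sum of the tail.
const sumList = (list) => (list === null ? 0 : list.head + sumList(list.tail));

const numbers = pair(1, pair(2, pair(3, null)));

// sumList(numbers)
// = 1 + sumList((2 3))
// = 1 + (2 + sumList((3)))
// = 1 + (2 + (3 + sumList(nil)))
// = 1 + (2 + (3 + 0))
console.log(sumList(numbers)); // 6
```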
Note, now, how sum is a function. In the Java example we had some data structure and then iterated over it, performing access on it, to create our sum. In the functional example the evaluation is the computation. A useful aspect of this is that sum as a function could be passed around and evaluated only when it's actually needed. That is lazy evaluation.
Another example of how data structures and algorithms are actually the same thing in a different form. Take a set. A set can contain only one instance of an element, for some definition of equality of elements. For something like integers it's simple; if they are the same value (like 1 == 1) they're equal. For objects, however, we typically have some equality check (like equals() in Java). So how can you know whether a set already contains an element? You go over each element in the set and check if it is equal to the one you're looking for.
A hash set, however, computes some hash function for each element and places elements with the same hash in a corresponding bucket. For a good hash function there will rarely be more than one element in a bucket. If you now provide some element and want to check if it's in the set, the actions are:
Get the hash of the provided element (typically takes constant time).
Find the hash bucket in the set for that hash (again should take constant time).
Check if there's an element in that bucket which is equal to the given element.
The requirement is that two equal elements must have the same hash.
So now you can check if something is in the set in constant time. The reason being that our data structure itself has stored some computation information: the hashes. If you store each element in a bucket corresponding to its hash, we have put some computation result in the data structure itself. This saves time later if we want to check whether the set contains an element. In that way, data structures are actually computations frozen in memory. Instead of doing the entire computation every time, we've done some work up-front and re-use those results.
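The three lookup steps above can be sketched as a toy hash set for strings. This is a simplified JavaScript illustration, not how any particular library implements it; `TinyHashSet` and `hashOf` are hypothetical names, and the hash function is deliberately simple.

```javascript
// A tiny hash set for strings. Equal strings always produce the same hash,
// which is the requirement stated above.
class TinyHashSet {
  constructor(bucketCount = 16) {
    this.buckets = Array.from({ length: bucketCount }, () => []);
  }
  hashOf(text) {
    let h = 0;
    for (const ch of text) h = (h * 31 + ch.codePointAt(0)) >>> 0;
    return h;
  }
  bucketFor(text) {
    return this.buckets[this.hashOf(text) % this.buckets.length];
  }
  add(text) {
    const bucket = this.bucketFor(text);
    if (!bucket.includes(text)) bucket.push(text); // the work done up front
  }
  has(text) {
    // 1. hash the element, 2. find its bucket, 3. compare within that bucket only
    return this.bucketFor(text).includes(text);
  }
}

const seen = new TinyHashSet();
seen.add("lisp");
seen.add("scala");
console.log(seen.has("lisp"), seen.has("java")); // true false
```

The stored bucket position is the "frozen computation": `has` never looks at elements outside one bucket.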
When you think of data structures and algorithms as being analogous in this way, it becomes clearer how functions can model the same thing.
Make sure to check out the classic book "Structure and Interpretation of Computer Programs" (often abbreviated as SICP). It'll give you a lot more insight. You can read it for free here: https://mitpress.mit.edu/sicp/full-text/book/book.html
This is a really broad question with a lot of room for opinionated answers, but G_H provides a really nice breakdown of some of the differences
Could you elaborate "Purely functional data structures model the computation itself via functions."?
This is one of my favourite topics, so I'm happy to share an example in JavaScript because it will allow you to run the code here in the browser and see the answer for yourself.
Below you will see a linked list implemented using functions. I use a couple of Numbers for example data and I use a String so that I can log something to the console for you to see, but other than that, it's just functions – no fancy objects, no arrays, no other custom stuff.
const cons = (x,y) => f => f(x,y)
const head = f => f((x,y) => x)
const tail = f => f((x,y) => y)
const nil = () => {}
const isEmpty = x => x === nil
const comp = f => g => x => f(g(x))
const reduce = f => y => xs =>
  isEmpty(xs) ? y : reduce (f) (f (y,head(xs))) (tail(xs))
const reverse = xs =>
  reduce ((acc,x) => cons(x,acc)) (nil) (xs)
const map = f =>
  comp (reverse) (reduce ((acc, x) => (cons(f(x), acc))) (nil))
// this function is required so we can visualise the data
// it effectively converts a linked-list of functions to readable strings
const list2str = xs =>
  isEmpty(xs) ? 'nil' : `(${head(xs)} . ${list2str(tail(xs))})`
// example input data
const xs = cons(1, cons(2, cons(3, cons(4, nil))))
// example derived data
const ys = map (x => x * x) (xs)
console.log(list2str(xs))
// (1 . (2 . (3 . (4 . nil))))
console.log(list2str(ys))
// (1 . (4 . (9 . (16 . nil))))
Of course this isn't of practical use in real-world JavaScript, but that's beside the point. It's just showing you how functions alone could be used to represent complex data structures.
Here's another example of implementing rational numbers using nothing but functions and numbers – again, we're only using strings so we can convert the functional structure to a visual representation we can understand in the console. This exact scenario is examined thoroughly in the SICP book that G_H mentions.
We even implement our higher-order data rat using cons. This shows how functional data structures can easily be made up of (composed of) other functional data structures
const cons = (x,y) => f => f(x,y)
const head = f => f((x,y) => x)
const tail = f => f((x,y) => y)
const mod = y => x =>
  y > x ? x : mod (y) (x - y)
const gcd = (x,y) =>
  y === 0 ? x : gcd(y, mod (y) (x))
const rat = (n,d) =>
  (g => cons(n/g, d/g)) (gcd(n,d))
const numer = head
const denom = tail
const ratAdd = (x,y) =>
  rat(numer(x) * denom(y) + numer(y) * denom(x),
      denom(x) * denom(y))
const rat2str = r => `${numer(r)}/${denom(r)}`
// example complex data
let x = rat(1,2)
let y = rat(1,4)
console.log(rat2str(x)) // 1/2
console.log(rat2str(y)) // 1/4
console.log(rat2str(ratAdd(x,y))) // 3/4

Why is this Clojure micro benchmark so slow?

There was a previous question, answered successfully, on comparing the speed of Clojure to Scala, but applying those same techniques to the following code still leaves it over 25 times slower than the equivalent Scala code. This is comparing Clojure 1.6.0 with Leiningen 2.5.0 on Java 1.8.0_40 to Scala 2.11.6.
The comparisons are made not using the REPL but using the Leiningen "run" command and run at about the same speed when run directly from java after producing a standalone '.jar' file using the Leiningen "uberjar" command.
The micro benchmark tests the speed of doing bit manipulations inside an array, which is typical of some low level types of tasks such as encryption or compression or in primes sieving. To get a reasonable measurement interval and to avoid JIT overheads spoiling the results, the benchmark runs the same loop 1000 times.
The Clojure code is as follows:
(ns test-cljr-speed.core
(:gen-class))
(set! *unchecked-math* true)
(set! *warn-on-reflection* true)
(defn testspeed
"test array bit manipulating tight loop speeds."
[]
(let [lps 1000,
len (bit-shift-left 1 12),
bits ^int (int (bit-shift-left 1 17))]
(let [buf ^ints(int-array len)]
(letfn [(doit []
(loop [i ^int (int 0)]
(if (< i bits)
(let [w ^int (int (bit-shift-right i 5))]
(do
(aset-int ^ints buf w ^int (int (bit-or ^int (aget ^ints buf w)
^long (bit-shift-left 1 ^long (bit-and i 31)))))
(recur (inc i)))))))]
(dorun lps (repeatedly doit))))))
(defn -main
"runs test."
[& args]
(let [strt (System/nanoTime),
cnt (testspeed),
stop (System/nanoTime)]
(println "Took " (long (/ (- stop strt) 1000000)) " milliseconds.")))
Which produces the following output:
Took 9342 milliseconds.
I believe the problem to be related to reflection accessing the buffer array, but have applied all sorts of type hints as recommended and can't seem to find it.
Comparable Scala code is as follows:
object Main extends App {
def testspeed() = {
val lps = 1000
val len = 1 << 12
val bits = 1 << 17
val buf = new Array[Int](len)
def doit() = {
def set1(i: Int): Unit =
if (i < bits) {
buf(i >> 5) |= 1 << (i & 31)
set1(i + 1)
}
set1(0)
}
(0 until lps).foreach { _ => doit() }
}
val strt = System.nanoTime()
val cnt = testspeed()
val stop = System.nanoTime()
println(s"Took ${(stop - strt) / 1000000} milliseconds.")
}
Which produces the following output:
Took 365 milliseconds.
Doing the same job, it is over 25 times as fast!!!
I have turned on the warn-on-reflection flag and there doesn't seem to be any Java reflection going on where more hinting would help. Perhaps I am not turning on some optimization settings properly (perhaps set in the project file for Leiningen?) as they are hard to dig out on the Internet; for Scala I have turned off all debugging output and enabled the compiler "optimize" flag, which makes some improvement.
My question is "Is there something that can be done for this type of application that will make Clojure run at a speed more comparable to the Scala speed?".
To short circuit any false speculation, yes, the array is indeed being filled with all binary ones a multiple of times as determined by another series of tests, and no, Scala is not optimizing away all but one loop.
I am not interested in discussions on the comparative merits of the two languages, but only how one can produce reasonably elegant Clojure code to do the same job at least ten times faster on a bit by bit basis (not a simple array fill operation, as the linear fill is just representative of more complex tasks such as prime number culling).
Using a Java BitSet does not have the problem (but not all algorithms are suited to only a set of booleans), nor likely does using a Java Integer array and Java class methods to access it, but one should be able to use the Clojure "native" array types without these sorts of performance problems.
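For readers who want to see the bit-by-bit fill in isolation, here is a rough JavaScript port of the inner loop from the Scala version. It is not a benchmark and says nothing about Clojure or Scala timings; `fillBits` and `words` are names chosen for this sketch, which only checks the "all binary ones" outcome described in the question.

```javascript
// Set every bit of an Int32Array, one bit per iteration.
// With 4096 words that is 2^17 iterations, as in the benchmark.
function fillBits(buf) {
  const bits = buf.length * 32;
  for (let i = 0; i < bits; i++) {
    buf[i >> 5] |= 1 << (i & 31); // word index = i / 32, bit index = i % 32
  }
  return buf;
}

const words = fillBits(new Int32Array(1 << 12));
// Every word should now be all ones, which is -1 as a signed 32-bit integer.
console.log(words.every((w) => w === -1)); // true
```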
First off, your type hints are not affecting the execution time of the Clojure code, and on my machine the updated version is not an improvement:
user=> (time (testspeed))
"Elapsed time: 6256.075155 msecs"
nil
user=> (time (testspeedx))
"Elapsed time: 6371.968782 msecs"
nil
You are doing a number of type hints that are not needed, and stripping them all away actually makes the code faster:
(defn testspeed-unhinted
"test array bit manipulating tight loop speeds."
[]
(let [lps 1000,
len (bit-shift-left 1 12),
bits (bit-shift-left 1 17)]
(let [buf (int-array len)]
(letfn [(doit []
(loop [i (int 0)]
(if (< i bits)
(let [w (bit-shift-right i 5)]
(do
(aset buf w (bit-or (aget buf w)
(bit-shift-left 1 (bit-and i 31))))
(recur (inc i)))))))]
(dorun lps (repeatedly doit))))))
user=> (time (testspeed-unhinted))
"Elapsed time: 270.652953 msecs"
It occurred to me that coercing i to int on the recur would potentially speed up the code, but it actually slows it down. With that in mind, I decided to try removing ints from the code entirely and see what the result was performance wise:
(defn testspeed-unhinted-longs
"test array bit manipulating tight loop speeds."
[]
(let [lps 1000,
len (bit-shift-left 1 12),
bits (bit-shift-left 1 17)]
(let [buf (long-array len)]
(letfn [(doit []
(loop [i 0]
(if (< i bits)
(let [w (bit-shift-right i 5)]
(do
(aset buf w (bit-or (aget buf w)
(bit-shift-left 1 (bit-and i 31))))
(recur (inc i)))))))]
(dorun lps (repeatedly doit))))))
user=> (time (testspeed-unhinted-longs))
"Elapsed time: 221.025048 msecs"
The performance gain was relatively small, so I used the criterium lib to get accurate microbenchmarks for the difference:
user=> (crit/bench (testspeed-unhinted))
WARNING: Final GC required 2.2835076167941852 % of runtime
Evaluation count : 240 in 60 samples of 4 calls.
Execution time mean : 260.877321 ms
Execution time std-deviation : 18.168141 ms
Execution time lower quantile : 251.952111 ms ( 2.5%)
Execution time upper quantile : 321.995872 ms (97.5%)
Overhead used : 15.568045 ns
Found 8 outliers in 60 samples (13.3333 %)
low-severe 1 (1.6667 %)
low-mild 7 (11.6667 %)
Variance from outliers : 51.8061 % Variance is severely inflated by outliers
nil
user=> (crit/bench (testspeed-unhinted-longs))
Evaluation count : 300 in 60 samples of 5 calls.
Execution time mean : 232.078704 ms
Execution time std-deviation : 24.828378 ms
Execution time lower quantile : 219.615718 ms ( 2.5%)
Execution time upper quantile : 297.456135 ms (97.5%)
Overhead used : 15.568045 ns
Found 11 outliers in 60 samples (18.3333 %)
low-severe 2 (3.3333 %)
low-mild 9 (15.0000 %)
Variance from outliers : 72.1097 % Variance is severely inflated by outliers
nil
So the final result is, you can get a huge speedup by removing your type hints (since everything critical in the code is already totally unambiguous type wise), and you can get a small improvement on top of that by switching from int to long (at least on my 64 bit intel machine).
I'll just answer my own question to help others that may be fighting this same issue:
After perusing another question's answer, I accidentally stumbled on the problem: "aset" is fine; "aset-int" (and all the other specialized forms of "aset-?") is not and no amount of type hinting helps.
In the following code for the test procedure, edited as per @noisesmith's answer, all I changed was to use "long-array" ("int-array" also works, just not quite as fast), to use "aset" instead of "aset-long" (or "aset-int" for "int-array"), and to eliminate all type hints:
(set! *unchecked-math* true)
(defn testspeed
"test array bit manipulating tight loop speeds."
[]
(let [lps 1000,
len (bit-shift-left 1 11),
bits (bit-shift-left 1 17),
buf (long-array len)]
(letfn [(doit []
(loop [i (int 0)]
(if (< i bits)
(let [w (bit-shift-right i 6)]
(do
(aset buf w (bit-or (aget buf w)
(bit-shift-left 1 (bit-and i 63))))
(recur (inc i)))))))]
(dorun lps (repeatedly doit)))))
With the result that it produces the following output:
Took 395 milliseconds.
With "aset-long" instead of "aset", the output is:
Took 7424 milliseconds.
for a speed-up of almost 19 times.
Now this is just very slightly slower than the Scala code using an Int array (which is faster for Scala than using a Long array), but that is somewhat understandable, as Clojure does not have read/modify/write primitives such as "|=", and it seems that the compiler is not smart enough to see that a read/modify/write operation is what is implied in the above code.
However, being only a few percent slower is completely acceptable and means that, for this type of application, performance is not the criterion for choosing between Scala and Clojure.
This solution doesn't make sense, as the specialized versions of "aset-?" should really just be calling through to the overloaded cases of "aset", but it seems there is a problem/bug affecting their performance, at least with the current version 1.6.0.
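As a quick sanity check of what the loop body above computes, here is a minimal sketch that pulls the bit-setting loop into a standalone function. The name set-bits! and its parameters are hypothetical and not part of the original code. The ^longs hint is needed here only because buf arrives as a function argument; in the code above the compiler can see that buf is the result of (long-array len).

```clojure
(defn set-bits!
  "Sets bits 0 (inclusive) through n (exclusive) in buf, a long-array
  used as a packed bit set, and returns buf. Hypothetical helper."
  [^longs buf ^long n]
  (loop [i 0]
    (when (< i n)
      (let [w (bit-shift-right i 6)]            ; index of the 64-bit word
        ;; read/modify/write with plain aset, as in the answer above
        (aset buf w (bit-or (aget buf w)
                            (bit-shift-left 1 (bit-and i 63))))
        (recur (inc i)))))
  buf)
```

Setting the first 70 bits of a two-word array should fill the first word completely and set the low six bits of the second.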

Eval times for this function alternate b/w 85 nanosec and 10 sec (!?)

Objective
I'm trying to figure out why a function I've created, items-staged-f, has both such strangely long and short evaluation times.
Strange, you say?
I say "strange" because:
(time (items-staged-f)) yields 1.313 msecs
(time (items-staged-f)) a second time yields 0.035 msecs (which is unsurprising, because the result is a lazy sequence and it must have been memoized)
The Criterium benchmarking system reports it taking 85.149767 ns (which is unsurprising)
And yet...
The time it takes to actually evaluate (items-staged-f) in the REPL is around 10 seconds. This is even before it prints anything. I was originally thinking that it takes that long likely because it's preparing to print to the REPL, because it's a long and complex data structure (nested maps and vectors in a lazy sequence), but it's just strange that the result wouldn't even start printing out until 10 seconds later when it (supposedly) takes 85 nanoseconds. Could it be that it's pre-calculating how to print the data structure?
(time (last (items-staged-f))) yields 10498.16 msecs (although this varies up to around 20 seconds), possibly for the same reason above.
And now for the code...
The goal of the function items-staged-f is to visualize what needs to be done in order to make some necessary changes to inventory items in an accounting database.
Unfamiliar functions referenced within items-staged-f may be found below.
(defn items-staged-f []
  (let [items-0 (lazy-seq (items-staged :items))
        both-types? #(in? % (group+line-items))
        items-from-group #(get items-0 %)
        replace-subgroups
        (fn [[g-item l-items :as group]]
          (let [items-in-both
                (->> l-items
                     (map :item)
                     (filter both-types?))]
            (->> (concat
                  (remove #(in? (:item %) items-in-both) l-items)
                  (mapcat items-from-group items-in-both))
                 (into [])
                 (assoc group 1))))
        replaced (map replace-subgroups items-0)]
    replaced))
items-staged is a function which outputs the original data which items-staged-f manipulates. (items-staged :items) outputs a map with string-keys (group items) whose values are vectors of maps (lists of sub-items):
{"786M" ; this is a group item
 ; below are the sub-items of the above group item
 [{:description "Signature Collection Item", :item "4X1"}
  {:description "Cookies, Inc. Paper Wrapped", :item "65G7"}
  {:description "MyChocolate 5 oz.", :item "21F"}]}
Note that the output of items-staged-f is almost identical in structure to that of items-staged, except it is a lazy sequence of vectors instead of a hash-map with hash-map-entries, as would be expected by calling the map function on a hash-map.
in? is a predicate which checks if an object is in a given collection. For example, (in? 1 [1 2 3]) evaluates to true.
group+line-items is a function which outputs a lazy sequence of certain duplicate items I wish to eliminate. For example, (group+line-items) evaluates to: ("428X" "41SF" "6998" "75D22")
Notes
VisualVM 1.3.8 is saying that clojure.lang.Reflector.getMethods() clocks in at 28700 ms (51.3%), clojure.lang.LineNumberingPushbackReader.read() (is this because of the output in the REPL?) at 9000 ms (16.2%), and clojure.lang.RT.nthFrom() at 7800 ms (13.9%).
However, when I evaluate each element of the lazy sequence individually in the REPL with (nth (items-staged-f) n), only clojure.lang.LineNumberingPushbackReader.read() ever goes up. The invocations go up in increments of 32, which is the lazy-seq chunking size. Time elapsed for other methods/functions is negligible.
One other consideration is that items-staged is a function which ultimately draws its data from an Excel file (read via Apache POI). However, the raw data from the Excel file is stored as a var, so that shouldn't be an issue because it would only calculate once before being memoized (I think).
Thanks for your help!
Addendum
Once I used doall to force realization on the lazy sequence (which I thought was being realized), Criterium now says the function takes 11.370356 sec to evaluate, which unfortunately makes sense. I'll repost once I refactor.
Lazy sequences by definition calculate their elements only when required. Printing to the REPL or requesting the last element both force realization. Timing the function call that merely produces the lazy sequence does not.
(defn slow-and-lazy [] (map #(do (Thread/sleep 1000) (inc %)) (range 10)))
user=> (time (slow-and-lazy))
"Elapsed time: 0.837002 msecs"
(1 2 3 4 5 6 7 8 9 10) ; printed 10 seconds later
user=> (time (doall (slow-and-lazy)))
"Elapsed time: 10000.205709 msecs"
(1 2 3 4 5 6 7 8 9 10)
In the case of (time (slow-and-lazy)), slow-and-lazy quickly returns an unrealized lazy sequence and time finishes, printing the elapsed time and passing the unrealized result along to the REPL. Then the REPL attempts to print the sequence. In order to do so, it must realize the sequence.
That having been said, 10 seconds is an eternity for a computer, so this does warrant examination/profiling. I would suggest refactoring your code into smaller self-contained functions. In particular, the data should be passed in as arguments. Once you nail down the bottleneck (time with doall to force realization!), then consider posting a new question. Without being able to tell exactly what's going on with this code or whether IO in items-staged is the true bottleneck, there still seems to be room for improvement.
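One possible shape for that refactoring is sketched below. It is an illustration under assumptions, not a diagnosis: it assumes the items are a map from group key to a vector of sub-item maps (as in the sample data in the question), and that the result of group+line-items can be turned into a set once and passed in. The names stage-items, items-by-group and both-types are hypothetical, and replace-subgroups here is a standalone variant of the inner helper. Whether this touches the real bottleneck is something only profiling can tell.

```clojure
(defn replace-subgroups
  "Replaces sub-items that are themselves group keys with that group's
  sub-items. All data is passed in as arguments (hypothetical variant)."
  [items-by-group both-types [g-item l-items]]
  (let [in-both     (filter both-types (map :item l-items))
        in-both-set (set in-both)]              ; set lookup instead of a linear scan
    [g-item
     (into []
           (concat (remove #(in-both-set (:item %)) l-items)
                   (mapcat items-by-group in-both)))]))

(defn stage-items
  "Eagerly applies replace-subgroups to every group. mapv realizes the
  whole result, so timing this call measures the actual work."
  [items-by-group both-types]
  (mapv #(replace-subgroups items-by-group both-types %) items-by-group))
```

With this layout, something like (time (stage-items (items-staged :items) (set (group+line-items)))) would time the transformation separately from loading the data, and each argument can be timed on its own as well.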

caching previous return values of procedures in scheme

In Chapter 16 of "The Seasoned Schemer", the authors define a recursive procedure "depth", which returns 'pizza nested in n lists, e.g. (depth 3) is (((pizza))). They then improve it as "depthM", which caches its return values using set! in the lists Ns and Rs, which together form a lookup table, so you don't have to recurse all the way down if you reach a return value you've seen before. E.g., if I've already computed (depthM 8), when I later compute (depthM 9), I just look up the return value of (depthM 8) and cons it onto null, instead of recursing all the way down to (depthM 0).
But then they move the Ns and Rs inside the procedure, and initialize them to null with "let". Why doesn't this completely defeat the point of caching the return values? From a bit of experimentation, it appears that the Ns and Rs are being reinitialized on every call to "depthM".
Am I misunderstanding their point?
I guess my question is really this: Is there a way in Scheme to have lexically-scoped variables preserve their values in between calls to a procedure, like you can do in Perl 5.10 with "state" variables?
Duh. Not having read the Seasoned Schemer, I cannot comment on the memoization issue, unless you give some source code here. However, regarding the question of whether there is a way to have lexically scoped variables keep their state between function calls... This is a feature of the Scheme language called "closures". Consider the following example:
(define counter
  (let ((number 0))
    (lambda ()
      (let ((result number))
        (set! number (+ number 1))
        result))))
This piece of code defines a function called counter, which uses a lexical variable (number) to keep track of its state. Each time you call the function, you will get a different number in return:
> (counter)
0
> (counter)
1
and so on. The important point here is that the function generated by evaluating the lambda expression "closes over" all lexically visible variables from enclosing scopes (in this case only number). Because the let sits outside the lambda, number is initialized once, when counter is defined, and not on each call. This means that those variables remain valid places to read values from or write new values to.
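The same idea carries over to caching return values. The sketch below is written in Clojure rather than Scheme, and it is not the book's code: depth-m and cache are hypothetical names, and a map in an atom stands in for the Ns and Rs lists. The point it illustrates is only that the cache is created once, outside the function that is returned, so it survives between calls. In Scheme the equivalent shape would be a let wrapped around a lambda, with set! updating the closed-over variables.

```clojure
(def depth-m
  ;; The let runs once, when depth-m is defined, so the cache persists
  ;; across calls. Only the inner fn is invoked on each call.
  (let [cache (atom {})]
    (fn depth [n]
      (if-let [hit (get @cache n)]
        hit                                      ; seen before: no recursion
        (let [result (if (zero? n)
                       'pizza
                       (list (depth (dec n))))]
          (swap! cache assoc n result)           ; remember for later calls
          result)))))
```

If the let were moved inside the fn instead, a fresh empty cache would be created on every call, which matches the reinitialization described in the question.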

Resources