Performance problems with Clojure HashMap's lookup - performance

I was trying to implement a purely functional Sieve of Eratosthenes, based on this paper: https://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf
Following all the steps, I ended up with very performant Haskell code, and I tried to port it to Clojure. The problem is that the Clojure version is very slow: as slow as testing every number for divisibility. The code I ended up with was the following:
(defn- sieve2 [[x & xs] table]
  (let [reinsert (fn [table prime]
                   ; (merge-with concat table {(+ x prime) [prime]})
                   (update table (+ x prime) #(cons prime %)))] ;(vec %) prime)))]
    (if x
      (if-let [facts (get table x)]
        (recur xs (reduce reinsert (dissoc table x) facts))
        (lazy-seq (cons x (sieve2 xs (assoc table (* x x) [x])))))
      '())))
(defn real-sieve [xs] (sieve2 xs {}))
(The merge-with concat line is commented out because that was the Haskell way, but it's even slower.)
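For reference, here is a minimal Haskell sketch of the Map-based incremental sieve being ported (my reconstruction with assumed names like sieve and reinsert, not the paper's exact code): the map sends each upcoming composite to the primes that cross it off, which is exactly what the Clojure table above does.

import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- Incremental sieve: for a prime x, schedule x*x; for a composite x,
-- move each of its recorded prime factors p to the next multiple x+p.
sieve :: [Integer] -> [Integer]
sieve = go Map.empty
  where
    go _ [] = []
    go table (x:rest) =
      case Map.lookup x table of
        Nothing    -> x : go (Map.insert (x * x) [x] table) rest
        Just facts ->
          let reinsert t prime = Map.insertWith (++) (x + prime) [prime] t
          in  go (foldl' reinsert (Map.delete x table) facts) rest

primes :: [Integer]
primes = sieve [2..]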
With 30000 prime numbers, the Haskell version ran in 39ms and the Clojure one in 483ms. So I ported my Clojure version to Scala:
val primes2 = {
  def sieve(xs: Stream[Int], table: Map[Int, Vector[Int]]): Stream[Int] =
    xs match {
      case Stream() => xs
      case x #:: xs => table get x match {
        case Some(facts) =>
          sieve(xs, facts.foldLeft(table - x) { (table, prime) =>
            val key = x + prime
            val value = table.getOrElse(key, Vector()) :+ x
            table + (key -> value)
          })
        case None => x #:: sieve(xs, table + (x*x -> Vector(x)))
      }
    }
  sieve(Stream.from(2), Map())
}
And it ran in 39ms. Then I downloaded VisualVM and sampled my code, to see this:
Notice that most of the time is spent in the HashMap key lookup and assoc; those are the performance killers. Is there some problem with my code?

Trying out the OP's code, I indeed saw that the Scala implementation was taking around 30 ms, while Clojure's was about 500 ms. That was odd.
So I compared the results and found that the Scala implementation was giving me lots of even numbers as primes. After some digging I learned that there were two bugs in the Scala implementation.
The first:
val value = table.getOrElse(key, Vector()) :+ x // bug
val value = table.getOrElse(key, Vector()) :+ prime // corrected
This bug caused the evaluation to finish much quicker, since lots of non-prime numbers were included in the result.
The second bug in the Scala version is the use of Int: well before the 30000th prime is reached, an overflow occurs:
scala> 92683*92683
res1: Int = 203897 // an odd square??
So I fixed that as well, and since Scala does not have a Stream.from(Long), I had to write that too (I don't speak fluent Scala, so there might be a better way):
object Test {
  def sieve(xs: Stream[Long], table: Map[Long, Vector[Long]]): Stream[Long] =
    xs match {
      case Stream() => xs
      case x #:: xs => {
        table get x match {
          case Some(facts) =>
            sieve(xs, facts.foldLeft(table - x) { (table, prime) =>
              val key = x + prime
              val value = table.getOrElse(key, Vector()) :+ prime
              table + (key -> value)
            })
          case None => {
            x #:: sieve(xs, table + (x*x -> Vector(x)))
          }
        }
      }
    }

  def fromLong(start: Long): Stream[Long] = Stream.cons(start, fromLong(start + 1))

  def main(args: Array[String]) {
    sieve(fromLong(2), Map())
  }
}
Running this again gave me comparable elapsed times for both Scala and Clojure:
scala> Test.time {Test.sieve(Test.fromLong(2), Map()).take(30000).last}
Elapsed time: 583 msecs
res14: Long = 350377
And Clojure's version:
(time (last (take 30000 (real-sieve a))))
"Elapsed time: 536.646696 msecs"
350377
And this is, in fact, the 30000th prime!

Related

Haskell Quick Sort implementation (how to obtain the best performance?)

I am currently reading about functional data structures and algorithms, and I tried different quicksort implementations. I noticed, however, that there is a lot of variation in their performance.
Here are some selected ones that I want to discuss. All the programs were compiled with ghc program.hs (the time in parentheses is the one obtained with the -O optimization flag), and the list to be sorted was [1 .. 20000], which is already sorted and therefore the worst case, implying O(n^2) running time.
qs1 [] = []
qs1 (pivot : rest) = qs1 lower ++ [pivot] ++ qs1 upper where
  lower = [ x | x <- rest, x <= pivot ]
  upper = [ x | x <- rest, x >  pivot ]
This is the classic sort we see when we learn the language. Here, the rest list is traversed twice: first to filter the elements less than or equal to the pivot, then the elements greater than the pivot. The time taken was 6.45 (4.42) sec.
qs2 [] = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper where
  (lower, upper) = foldr
    (\x (l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
    ([], [])
    rest
I was surprised by this one. It is the same as above, except that it traverses the rest list only once. It did worse than the first, with a total of 8.11 (2.95) sec. I tried the partition function from the Data.List library, but it only did worse (much worse), with a catastrophic 10.99 (9.99) sec. I checked the implementation, and it is almost the same as my lambda function, although it relies on a utility function:
partition :: (a -> Bool) -> [a] -> ([a],[a])
partition p xs = foldr (select p) ([],[]) xs

select :: (a -> Bool) -> a -> ([a], [a]) -> ([a], [a])
select p x ~(ts,fs) | p x       = (x:ts,fs)
                    | otherwise = (ts, x:fs)
(Taken from https://hackage.haskell.org/package/base-4.14.0.0/docs/src/Data.OldList.html#partition).
qs3 [] s = s
qs3 (pivot : rest) s = qs3 lower (pivot : (qs3 upper s)) where
  lower = [ x | x <- rest, x <= pivot ]
  upper = [ x | x <- rest, x >  pivot ]
In this one, there are two novelties. First, the removal of append (++) in favour of cons (:). Second, the result is built through an accumulator and the outer call is a tail call, so it should (in principle) be faster. However, it is merely better than the first one, with a time of 6.42 (4.44) sec. In fact, given the variation from one execution to another, it is probably the same.
qs4 [] s = s
qs4 (pivot : rest) s = qs4 lower (pivot : (qs4 upper s)) where
  (lower, upper) = foldr
    (\x (l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
    ([], [])
    rest
Again, this is the same as above, except that I replaced the two list traversals with a foldr, and again the time taken increased: 8.02 (2.95) sec.
split pivot [] lower upper s = qs5 lower (pivot : qs5 upper s)
split pivot (h : t) lower upper s
  | h <= pivot = split pivot t (h : lower) upper s
  | otherwise  = split pivot t lower (h : upper) s

qs5 [] s = s
qs5 (pivot : rest) s = split pivot rest [] [] s
This one is the fastest. I recorded a stunning time of 2.82 (1.92) sec, which is almost 4 times faster than the “slowest”. It bounces between two functions that call each other: split is a recursive function that partitions the elements of the rest list sent by qs5, and it returns to qs5 when it is done partitioning.
Conclusion: what the hell is going on here? I am puzzled by all the hidden subtleties of compilation that make my expectations about the program's performance go wrong. I would humbly thank anyone who can help me untangle the pieces of this jigsaw by pointing out what is going on under the hood.
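For what it's worth, a minimal driver for reproducing such timings might look like the following (my sketch, assuming the time package that ships with GHC, not the OP's actual test harness); forcing the last element walks the whole result list, so laziness doesn't skew the measurement.

import Data.Time.Clock (getCurrentTime, diffUTCTime)

-- qs1 copied from above, just to keep the example self-contained
qs1 :: Ord a => [a] -> [a]
qs1 [] = []
qs1 (pivot : rest) = qs1 lower ++ [pivot] ++ qs1 upper where
  lower = [ x | x <- rest, x <= pivot ]
  upper = [ x | x <- rest, x >  pivot ]

main :: IO ()
main = do
  let xs = [1 .. 20000] :: [Int]   -- already sorted: the worst case
  start <- getCurrentTime
  print (last (qs1 xs))            -- forcing last walks the whole spine
  end <- getCurrentTime
  print (diffUTCTime end start)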

Haskell's `seq` evaluates arguments redundantly? [duplicate]

This question already has an answer here:
What is the monomorphism restriction?
(1 answer)
Closed 4 years ago.
If I understand the discussion here correctly, seq should not evaluate a value twice; that is, x `seq` x should evaluate x once.
Then why do I have this behaviour?
λ> :set +s
λ> let fib x = if x <= 1 then x else fib (x - 1) + fib (x - 2)
(0.01 secs, 102,600 bytes)
λ> fib 30
832040
(2.49 secs, 638,088,448 bytes)
λ> let x = fib 30 in x
832040
(2.47 secs, 638,088,792 bytes)
λ> let x = fib 30 in x `seq` x
832040
(4.95 secs, 1,276,067,128 bytes)
which is clearly evaluating it twice. Am I misunderstanding something?
EDIT: As asked by @danidiaz below, I also evaluated
λ> (\x -> x `seq` x) (fib 30)
832040
(2.51 secs, 638,087,888 bytes)
λ> let x = (fib 30) :: Int in x `seq` x
832040
(2.52 secs, 732,476,640 bytes)
which are even more surprising now.
EDIT 2: I see that this question has been marked as a duplicate of an earlier question about the monomorphism restriction. When I encountered this problem, I had no idea that it was due to that restriction, so if someone finds themselves in my position, I guess the answer to this question would be helpful.
For the first part of this answer, run :set -XNoMonomorphismRestriction in GHCi. It will be explained later.
Naively, one would expect that in Haskell let x = 5 in (x + 1,x + 2) would always be equivalent to (\x -> (x + 1, x + 2)) 5. But they have different types!
let x = 5 in (x + 1,x + 2) :: (Num a, Num b) => (a, b)
(\x -> (x + 1,x + 2)) 5 :: Num b => (b, b)
The reason is a feature of Haskell called let-bound polymorphism. Unlike lambda-bound identifiers, identifiers bound in a let can be instantiated in different ways in the body of the let. For example:
ghci> let f = id in (f True, f 'a')
(True,'a')
ghci> (\f -> (f True, f 'a')) id
*** ERROR ***
Now, you didn't give a type signature to your fib function, and the one that is deduced is something like
fib :: (Ord a, Num a) => a -> a
that will work for different instances of Num like Int, Float, etc.
But because of this, when you write x `seq` x, ghci can't be sure that the two xs are actually of the same type! And if they might be different, then they can't be shared.
That's the reason why (\x -> x `seq` x) (fib 30) does have sharing. Because the x is lambda-bound, the compiler is sure that both occurrences really are the same value. Same for let x = (fib 30) :: Int in x `seq` x because we have removed polymorphism using an explicit type.
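As a minimal illustration of that last point (the type signature here is my addition, not part of the original session): giving fib a monomorphic signature means both occurrences of x in x `seq` x are known to be the same Int, so the thunk is shared and evaluated only once.

fib :: Int -> Int   -- monomorphic signature: removes the (Ord a, Num a) polymorphism
fib x = if x <= 1 then x else fib (x - 1) + fib (x - 2)

-- With this in scope,  let x = fib 30 in x `seq` x  takes roughly the same
-- time as a single  fib 30  call in GHCi.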
There's another way out. Turning on the -XMonomorphismRestriction extension increases the amount of type defaulting, causing let expressions to be more monomorphic than one might expect. That should be enough to recover sharing in this case as well.

Is there any way to make this Haskell program faster?

Mind the following Haskell program:
-- Lambda Calculus ADT
data Term = Fun (Term -> Term) | Num !Double
instance Show Term where
  show (Num x) = "(Num "++(if fromIntegral (floor x) == x then show (floor x) else show x)++")"
  show (Fun _) = "(<function>)"
-- Lambda Calculus term application
(#) :: Term -> Term -> Term
(Fun f) # x = f x
infixl 0 #
-- We have floats as primitives for performance
float_toChurch :: Term
float_toChurch = Fun (\ (Num n) -> Fun (\f -> Fun (\x ->
  if n <= 0
    then x
    else (f # (float_toChurch # (Num (n - 1)) # f # x)))))
float_add :: Term
float_add = Fun (\ (Num x) -> Fun (\ (Num y) -> Num (x + y)))
-- Function compiled from the Lambda Calculus.
-- It sums all nats from 0 til a number.
sum_til :: Term
sum_til = (Fun(\v0->((((((float_toChurch # v0) # (Fun(\v1->(Fun(\v2->(Fun(\v3->(Fun(\v4->((v3 # v2) # (((v1 # ((float_add # v2) # (Num 1))) # v3) # v4))))))))))) # (Fun(\v1->(Fun(\v2->(Fun(\v3->v3))))))) # (Num 0)) # (Fun(\v1->(Fun(\v2->((float_add # v2) # v1)))))) # (Num 0))))
-- Testing it
main = do
  let n = 512*512*8
  print $ (sum_til # (Num n))
Since there is no fast lambda calculus evaluator around, I'm using the strategy above, compiling terms of the untyped Lambda Calculus to Haskell in order to evaluate them fast. I'm impressed with the performance: that program creates a list of numbers from 0 to 2097152 and sums them all in less than a second on my computer. That is much faster than I expected (only 4 times slower than a direct Haskell equivalent) and sufficient to be useful for my goals. Yet notice that I had to wrap functions and terms in the Fun/Num constructors in order to satisfy the type system. That boxing is probably not ideal. So my question is: is it possible to run the same Lambda Calculus program and get the same result even faster? I.e., is there any way to remove the boxing? (It doesn't need to be Haskell, either.)
I don't think you can keep Double and avoid wrapping. I think the closest you can get would be just
newtype Term = Term (Term -> Term)
But that's going to make arithmetic massively slower, I would imagine.
The only other thing I can think of is maybe trying to cache previous results to avoid recomputing them (but that could easily be slower, not faster).
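For concreteness, here is a rough sketch of what that unwrapped representation implies (my illustration, with assumed names such as church_add; nothing here is benchmarked): with no Num constructor, numbers themselves have to become Church numerals, and addition becomes Church addition instead of a single Double addition.

newtype Term = Term (Term -> Term)

-- term application, as in the original program
(#) :: Term -> Term -> Term
(Term f) # x = f x
infixl 0 #

-- Church numerals: n is the function that applies f to x n times
zero, one :: Term
zero = Term (\_ -> Term (\x -> x))
one  = Term (\f -> Term (\x -> f # x))

-- Church addition: add m n = \f x -> m f (n f x)
church_add :: Term
church_add = Term (\m -> Term (\n -> Term (\f -> Term (\x ->
               (m # f) # ((n # f) # x)))))

-- church_add # one # one  is the Church numeral for 2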
I am curious to know what on Earth you're actually using this for, though. ;-)

Functional "All except one"

How can I declare a function that takes a number and a list of numbers, and returns NONE if the number is not in the list, and otherwise returns the list without that number, wrapped as a list option ('Maybe' in Haskell)? If there is more than one such number, the function has to remove just the first of them.
all_except_one : 'a * 'a list -> 'a list option
I have no idea how to do it. :\
Code in any language is welcome; I just need a tip about the algorithm in functional style (I have to solve this problem in SML). Also, I can't use higher-order functions in my task.
What about this solution?
fun all_except_one (s, lst) =
  let
    fun helper e =
      case e of
          ([], _)      => NONE
        | (x::xs, acc) => if x = s
                          then SOME (rev acc @ xs)  (* rev keeps the kept prefix in its original order *)
                          else helper (xs, x :: acc)
  in helper (lst, []) end
The same without helper function and without tail recursion.
fun all_except_one (_, [])    = NONE
  | all_except_one (s, x::xs) =
      if x = s
      then SOME xs
      else case all_except_one (s, xs) of
               NONE    => NONE
             | SOME ys => SOME (x::ys)
How about (Haskell syntax):
allbutone n xs
  | n `elem` xs = Just (filter (/= n) xs)
  | otherwise   = Nothing
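Note that filter removes every occurrence of n; if only the first one should go, as the question specifies, one possible Haskell sketch (my addition, using break) is:

allButOne :: Eq a => a -> [a] -> Maybe [a]
allButOne n xs = case break (== n) xs of
  (_, [])             -> Nothing                 -- n does not occur
  (before, _ : after) -> Just (before ++ after)  -- drop only the first n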

Generating integers in ascending order using a set of prime numbers

I have a set of prime numbers and I have to generate integers using only those prime factors in increasing order.
For example, if the set is p = {2, 5} then my integers should be 1, 2, 4, 5, 8, 10, 16, 20, 25, …
Is there any efficient algorithm to solve this problem?
Removing a number and reinserting all its multiples (by the primes in the set) into a priority queue is wrong (in the sense of the question): it produces the correct sequence, but inefficiently so.
It is inefficient in two ways: first, it overproduces the sequence; second, each PriorityQueue operation incurs extra cost (the operations remove_top and insert are not usually both O(1), and certainly not in any list- or tree-based PriorityQueue implementation).
The efficient O(n) algorithm maintains pointers back into the sequence itself as it is being produced, to find and append the next number in O(1) time. In pseudocode:
return array h where
  h[0]=1; n=0; ps=[2,3,5, ... ];   // base primes
  is=[0 for each p in ps];         // indices back into h
  xs=[p for each p in ps]          // next multiples: xs[k]==ps[k]*h[is[k]]
  repeat:
    h[++n] := minimum xs
    for each ref (i,x,p) in (is,xs,ps):
      if( x==h[n] )
        { x := p*h[++i]; }         // advance the minimal multiple/pointer
For each minimal multiple it advances its pointer, while at the same time calculating its next multiple value. This too effectively implements a PriorityQueue, but with crucial distinctions: it is before the end point, not after; it doesn't create any additional storage except for the sequence itself; and its size is constant (just k numbers, for k base primes), whereas the size of a past-the-end PriorityQueue grows as we progress along the sequence (in the case of the Hamming sequence, based on a set of 3 primes, as n^(2/3), for n numbers of the sequence).
The classic Hamming sequence in Haskell is essentially the same algorithm:
h = 1 : map (2*) h `union` map (3*) h `union` map (5*) h

union a@(x:xs) b@(y:ys) = case compare x y of LT -> x : union xs b
                                              EQ -> x : union xs ys
                                              GT -> y : union a ys
union a []  = a   -- the empty-list cases are needed below, where union
union []  b = b   -- gets seeded with a finite list by foldi
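As a quick check, assuming the two definitions above are loaded in GHCi, the first 20 Hamming numbers come out in increasing order:

ghci> take 20 h
[1,2,3,4,5,6,8,9,10,12,15,16,18,20,24,25,27,30,32,36]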
We can generate the smooth numbers for arbitrary base primes using the foldi function (see Wikipedia) to fold lists in a tree-like fashion for efficiency, creating a fixed sized tree of comparisons:
smooth base_primes = h where   -- base_primes strictly increasing, NB!
  h = 1 : foldi g [] [map (p*) h | p <- base_primes]
  g (x:xs) ys = x : union xs ys

foldi f z []     = z
foldi f z (x:xs) = f x (foldi f z (pairs f xs))

pairs f (x:y:t) = f x y : pairs f t
pairs f t       = t
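And the {2, 5} set from the question, again assuming the definitions above are loaded in GHCi, reproduces the expected sequence:

ghci> take 20 (smooth [2,5])
[1,2,4,5,8,10,16,20,25,32,40,50,64,80,100,125,128,160,200,250]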
It is also possible to directly calculate a slice of the Hamming sequence around its nth member in O(n^(2/3)) time, by direct enumeration of the triples and assessing their values through logarithms, logval(i,j,k) = i*log 2 + j*log 3 + k*log 5. This Ideone.com test entry calculates the 1 billionth Hamming number in 0.05 seconds (down from 1.12 seconds; 2016-08-18: the main speedup is due to using Int instead of the default Integer where possible, even on 32-bit, with an additional 20% thanks to the tweak suggested by @GordonBGood, bringing the band size complexity down to O(n^(1/3))).
This is discussed some more in this answer where we also find its full attribution:
slice hi w = (c, sortBy (compare `on` fst) b) where   -- hi is a top log2 value
  lb5 = logBase 2 5 ; lb3 = logBase 2 3               -- w < 1 (NB!) is (log2 width)
  (Sum c, b) = fold                                   -- total count, the band
    [ ( Sum (i+1),                                    -- total triples w/ this j,k
        [ (r,(i,j,k)) | frac < w ] )                  -- store it, if inside the band
    | k <- [ 0 .. floor ( hi   /lb5) ], let p = fromIntegral k*lb5,
      j <- [ 0 .. floor ((hi-p)/lb3) ], let q = fromIntegral j*lb3 + p,
      let (i,frac) = pr (hi-q) ; r = hi - frac        -- r = i + q
    ]                              -- (sum . map fst &&& concat . map snd)
  pr = properFraction
This can be generalized for k base primes as well, probably running in O(n^((k-1)/k)) time.
(See this SO entry for an important later development; this answer is also interesting, as is another related answer.)
The basic idea is that 1 is a member of the set, and for each member n of the set, 2n and 5n are also members. Thus, you begin by outputting 1 and pushing 2 and 5 onto a priority queue. Then you repeatedly pop the front item of the priority queue, output it if it is different from the previous output, and push 2 times and 5 times that number onto the priority queue.
Google for "Hamming number" or "regular number" or go to A003592 to learn more.
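A compact sketch of the same idea in Haskell (my illustration, not the answerer's code), using Data.Set as the priority queue; since a set ignores repeated insertions, the duplicate check comes for free:

import qualified Data.Set as Set

-- Numbers whose prime factors all come from ps, in increasing order.
gen :: [Integer] -> [Integer]
gen ps = go (Set.singleton 1)
  where
    go queue =
      let (x, rest) = Set.deleteFindMin queue             -- smallest candidate
      in  x : go (foldr Set.insert rest [x * p | p <- ps]) -- push its multiples

-- take 10 (gen [2,5])  ==>  [1,2,4,5,8,10,16,20,25,32]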
----- ADDED LATER -----
I decided to spend a few minutes on my lunch hour to write a program to implement the algorithm described above, using the Scheme programming language. First, here is a library implementation of priority queues using the pairing heap algorithm:
(define pq-empty '())
(define pq-empty? null?)

(define (pq-first pq)
  (if (null? pq)
      (error 'pq-first "can't extract minimum from null queue")
      (car pq)))

(define (pq-merge lt? p1 p2)
  (cond ((null? p1) p2)
        ((null? p2) p1)
        ((lt? (car p2) (car p1))
          (cons (car p2) (cons p1 (cdr p2))))
        (else (cons (car p1) (cons p2 (cdr p1))))))

(define (pq-insert lt? x pq)
  (pq-merge lt? (list x) pq))

(define (pq-merge-pairs lt? ps)
  (cond ((null? ps) '())
        ((null? (cdr ps)) (car ps))
        (else (pq-merge lt? (pq-merge lt? (car ps) (cadr ps))
                            (pq-merge-pairs lt? (cddr ps))))))

(define (pq-rest lt? pq)
  (if (null? pq)
      (error 'pq-rest "can't delete minimum from null queue")
      (pq-merge-pairs lt? (cdr pq))))
Now for the algorithm. Function f takes two parameters, a list of the numbers in the set ps and the number n of items to output from the head of the output. The algorithm is slightly changed; the priority queue is initialized by pushing 1, then the extraction steps start. Variable p is the previous output value (initially 0), pq is the priority queue, and xs is the output list, which is accumulated in reverse order. Here's the code:
(define (f ps n)
  (let loop ((n n) (p 0) (pq (pq-insert < 1 pq-empty)) (xs (list)))
    (cond ((zero? n) (reverse xs))
          ((= (pq-first pq) p) (loop n p (pq-rest < pq) xs))
          (else (loop (- n 1) (pq-first pq) (update < pq ps)
                      (cons (pq-first pq) xs))))))
For those not familiar with Scheme, loop is a locally-defined function that is called recursively, and cond is the head of an if-else chain; in this case, there are three cond clauses, each clause with a predicate and consequent, with the consequent evaluated for the first clause for which the predicate is true. The predicate (zero? n) terminates the recursion and returns the output list in the proper order. The predicate (= (pq-first pq) p) indicates that the current head of the priority queue has been output previously, so it is skipped by recurring with the rest of the priority queue after the first item. Finally, the else predicate, which is always true, identifies a new number to be output, so it decrements the counter, saves the current head of the priority queue as the new previous value, updates the priority queue to add the new children of the current number, and inserts the current head of the priority queue into the accumulating output.
Since it is non-trivial to update the priority queue to add the new children of the current number, that operation is extracted to a separate function:
(define (update lt? pq ps)
  (let loop ((ps ps) (pq pq))
    (if (null? ps) (pq-rest lt? pq)
        (loop (cdr ps) (pq-insert lt? (* (pq-first pq) (car ps)) pq)))))
The function loops over the elements of the ps set, inserting each into the priority queue in turn; the if returns the updated priority queue, minus its old head, when the ps list is exhausted. The recursive step strips the head of the ps list with cdr and inserts the product of the head of the priority queue and the head of the ps list into the priority queue.
Here are two examples of the algorithm:
> (f '(2 5) 20)
(1 2 4 5 8 10 16 20 25 32 40 50 64 80 100 125 128 160 200 250)
> (f '(2 3 5) 20)
(1 2 3 4 5 6 8 9 10 12 15 16 18 20 24 25 27 30 32 36)
You can run the program at http://ideone.com/sA1nn.
This 2-dimensional exploring algorithm is not exact: it works for the first 25 integers, but then mixes up 625 and 512.
n = 0
exp_before_5 = 2
while true
  i = 0
  do
    output 2^(n-exp_before_5*i) * 5^Max(0, n-exp_before_5*(i+1))
    i <- i + 1
  loop while n-exp_before_5*(i+1) >= 0
  n <- n + 1
end while
Based on user448810's answer, here's a solution that uses heaps and vectors from the STL.
Now, heaps normally output the largest value, so we store the negative of the numbers as a workaround (since a>b <==> -a<-b).
#include <vector>
#include <iostream>
#include <algorithm>

int main()
{
    std::vector<int> primes;
    primes.push_back(2);
    primes.push_back(5);               //Our prime numbers that we get to use

    std::vector<int> heap;             //the heap that is going to store our possible values
    heap.push_back(-1);

    std::vector<int> outputs;
    outputs.push_back(1);

    while(outputs.size() < 10)
    {
        std::pop_heap(heap.begin(), heap.end());
        int nValue = -*heap.rbegin();  //Get current smallest number
        heap.pop_back();
        if(nValue != *outputs.rbegin())  //Is it a repeat?
        {
            outputs.push_back(nValue);
        }
        for(unsigned int i = 0; i < primes.size(); i++)
        {
            heap.push_back(-nValue * primes[i]);  //add new values
            std::push_heap(heap.begin(), heap.end());
        }
    }

    //output our answer
    for(unsigned int i = 0; i < outputs.size(); i++)
    {
        std::cout << outputs[i] << " ";
    }
    std::cout << std::endl;
}
Output:
1 2 4 5 8 10 16 20 25 32
