Performance of Prime Testing with Haskell - algorithm

I have two ways of testing for primes. One of them is called isPrime and the other isBigPrime. What I originally wanted was to test "big" primes against "small" primes that I have already computed, so that the testing becomes faster. Here are my implementations:
intSqrt :: Integer -> Integer
intSqrt n = round $ sqrt $ fromIntegral n

isPrime' :: Integer -> Integer -> Bool
isPrime' 1 m = False
isPrime' n m =
    if m > intSqrt n
        then True
        else if rem n (m+1) == 0
            then False
            else isPrime' n (m+1)

isPrime :: Integer -> Bool
isPrime 2 = True
isPrime 3 = True
isPrime n = isPrime' n 1
isBigPrime' :: Integer -> Int -> Bool
isBigPrime' n i =
    if (smallPrimes !! i) > intSqrt n
        then True
        else if rem n (smallPrimes !! i) == 0
            then False
            else isBigPrime' n (i+1)

smallPrimes = [2, 3, {- ... list of primes up to 1700 ... -}]

-- Start at index 1 because we only go through odd numbers
isBigPrime n = isBigPrime' n 1
primes m = [2] ++ [k | k <- [3,5..m], isPrime k]
bigPrimes m = smallPrimes ++ [k | k <- [1701,1703..m], isBigPrime k]

main = do
    print $ sum $ [Enter Method] 2999999
I chose 1700 as the upper limit because I wanted primes up to 3e6 and sqrt(3e6) ~ 1700. I took the sum of the primes to compare the two algorithms. I expected bigPrimes to be way faster than primes because, first of all, it does far fewer calculations, and it also gets a head start (not a big one, but still). However, to my surprise, bigPrimes was slower than primes. Here are the results:
For primes
Performance counter stats for './p10':
16768,627686 task-clock (msec) # 1,000 CPUs utilized
58 context-switches # 0,003 K/sec
1 cpu-migrations # 0,000 K/sec
6.496 page-faults # 0,387 K/sec
53.416.641.157 cycles # 3,186 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
160.411.056.099 instructions # 3,00 insns per cycle
34.512.352.987 branches # 2058,150 M/sec
10.673.742 branch-misses # 0,03% of all branches
16,773316435 seconds time elapsed
and for bigPrimes
Performance counter stats for './p10':
19111,667046 task-clock (msec) # 0,999 CPUs utilized
259 context-switches # 0,014 K/sec
3 cpu-migrations # 0,000 K/sec
6.278 page-faults # 0,328 K/sec
61.027.453.425 cycles # 3,193 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
198.207.905.034 instructions # 3,25 insns per cycle
34.632.138.061 branches # 1812,094 M/sec
106.102.114 branch-misses # 0,31% of all branches
19,126843560 seconds time elapsed
I was wondering why that would be the case. I suspect that using smallPrimes !! i makes bigPrimes somewhat slower, but I am not entirely sure.

A common antipattern brought from other languages is to iterate over indices and use (!!) to index into a list. In Haskell, it is instead idiomatic to simply iterate over the list itself. So:
isBigPrime' :: Integer -> [Integer] -> Bool
isBigPrime' n [] = True
isBigPrime' n (p:ps) = p > intSqrt n || (rem n p /= 0 && isBigPrime' n ps)

isBigPrime n = isBigPrime' n (drop 1 smallPrimes)
On my machine, your primes takes 25.3s; your bigPrimes takes 20.9s; and my bigPrimes takes 3.2s. There are several other pieces of low-hanging fruit (e.g. using p*p > n instead of p > intSqrt n), but this is by far the most significant one.
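For example, the square-root test can be dropped entirely by comparing p*p against n instead (a sketch of the variant mentioned above, not benchmarked here):

isBigPrime' :: Integer -> [Integer] -> Bool
isBigPrime' n [] = True
isBigPrime' n (p:ps) = p * p > n || (rem n p /= 0 && isBigPrime' n ps)

This also sidesteps any rounding worries in intSqrt, since everything stays in Integer arithmetic.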


Mergesort implementation in Julia

I'm trying to implement the merge sort algorithm in Julia, but I cannot seem to understand the recursion step needed for the algorithm. My code is the following:
mₐ = [1, 10, 7, 4, 3, 6, 8, 2, 9]
b₁(t, z, half₁, half₂)= ((t<=length(half₁)) && (z<=length(half₂))) && (half₁[t]<half₂[z])
b₂(t, z, half₁, half₂)= ((z<=length(half₂)) && (t<=length(half₁)) ) && (half₁[t]>half₂[z])
function Merge(m₁, m₂)
    N = length(m₁) + length(m₂)
    B = zeros(N)
    i = 1
    j = 1
    for k in 1:N
        if b₁(i, j, m₁, m₂)
            B[k] = m₁[i]
            i += 1
        elseif b₂(i, j, m₁, m₂)
            B[k] = m₂[j]
            j += 1
        elseif j >= length(m₂)
            B[k] = m₁[i]
            i += 1
        elseif i >= length(m₁)
            B[k] = m₂[j]
            j += 1
        end
    end
    return B
end

function MergeSort(M)
    if length(M) == 1
        return M
    elseif length(M) == 0
        return nothing
    end
    n = length(M)
    i₁ = n ÷ 2
    i₂ = n - i₁
    h₁ = M[1:i₁]
    h₂ = M[i₂:end]
    C = MergeSort(h₁)
    D = MergeSort(h₂)
    return Merge(C, D)
end

MergeSort(mₐ)
It always gets stuck when C becomes a single element: the function returns it, and then it gets split again. The only fix I found is to switch to a loop once it reaches a single element, but that would no longer be a recursive approach.
Solution
Taking @Sundar R's answer and suggestions, this is a working implementation:
# implementation of MergeSort in Julia
# merge function: joins two ordered arrays, returning a single ordered array
function merge(m₁, m₂)
    N = length(m₁) + length(m₂)
    # create a zeros array of the same element type as the input (Int64 here)
    B = zeros(eltype(m₁), N)
    i = 1
    j = 1
    for k in 1:N
        if !checkbounds(Bool, m₁, i)
            B[k] = m₂[j]
            j += 1
        elseif !checkbounds(Bool, m₂, j)
            B[k] = m₁[i]
            i += 1
        elseif m₁[i] < m₂[j]
            B[k] = m₁[i]
            i += 1
        else
            B[k] = m₂[j]
            j += 1
        end
    end
    return B
end

# mergeSort: recursively sorts an array M by sorting its two halves
function mergeSort(M)
    # base cases
    if length(M) == 1
        return M
    elseif length(M) == 0
        return nothing
    end
    # divide the array in two
    n = length(M)
    i₁ = n ÷ 2
    # be careful with the indexes, thank you @Sundar R
    i₂ = i₁ + 1
    h₁ = M[1:i₁]
    h₂ = M[i₂:end]
    # recursively sort the halves
    C = mergeSort(h₁)
    D = mergeSort(h₂)
    return merge(C, D)
end

# test the function
mₐ = [1, 10, 7, 4, 3, 6, 8, 2, 9]
b = mergeSort(mₐ)
println(b)
The issue is with the indices used for splitting, specifically i₂. n - i₁ is the number of elements in the second half of the array, but not necessarily the index where the second half starts - for that you just want i₂ = i₁ + 1.
With i₂ = n - i₁, when n is 2 i.e. when you come down to [1, 10] as the array to sort, i₁ = n ÷ 2 is 1, and i₂ is (2 - 1) = 1 also. So instead of splitting it into [1], [10], you end up "splitting" it into [1], and [1, 10], hence the infinite looping.
Once you fix that, there's a BoundsError from Merge because of a minor mistake: the elseif conditions should check for >, not >= (since Julia uses 1-based indexing, j is still a valid index when j == length(m₂)).
Some other suggestions:
zeros(N) returns a Float64 array, so the result here will always be a float array. I'd suggest zeros(eltype(m₁), N) instead.
It feels like b₁ and b₂ only complicate the code and make it less clear. I'd suggest a simple nested if there: an outer one to check the indices - look up checkbounds, e.g. checkbounds(Bool, m₁, i) - and an inner one to see which element is greater.
Julia convention is to use lowercase for functions, so merge and mergesort instead of Merge and MergeSort.
To add to the previous answers, which deal with some of the problems in your existing code, here is for reference a relatively efficient and straightforward Julia implementation of mergesort:
# Top-level function will allocate temporary arrays for convenience
function mergesort(A)
    S = similar(A)
    return mergesort!(copy(A), S)
end

# Efficient in-place version
# S is a temporary working (scratch) array
function mergesort!(A, S, n=length(A))
    width = 1
    swapcount = 0
    while width < n
        # A is currently full of sorted runs of length `width` (starting with width=1)
        for i = 1:2*width:n
            # Merge two sorted lists, left and right:
            # left = A[i:i+width-1], right = A[i+width:i+2*width-1]
            merge!(A, i, min(i+width, n+1), min(i+2*width, n+1), S)
        end
        # Swap the pointers of `A` and `S` such that `A` now contains merged
        # runs of length 2*width.
        S, A = A, S
        swapcount += 1
        # Double the width and continue
        width *= 2
    end
    # Optional, if it is important that `A` be sorted in-place:
    if isodd(swapcount)
        # If we've swapped A and S an odd number of times, copy `A` back to `S`
        # since `S` will by now refer to the memory initially provided as input
        # array `A`, which the user will expect to have been sorted in-place
        copyto!(S, A)
    end
    return A
end

# Merge two sorted subarrays, left and right:
# left = A[iₗ:iᵣ-1], right = A[iᵣ:iₑ-1]
@inline function merge!(A, iₗ, iᵣ, iₑ, S)
    left, right = iₗ, iᵣ
    @inbounds for n = iₗ:(iₑ-1)
        if (left < iᵣ) && (right >= iₑ || A[left] <= A[right])
            S[n] = A[left]
            left += 1
        else
            S[n] = A[right]
            right += 1
        end
    end
end
This is enough to get us in the same ballpark as Base's implementation of the same algorithm:
julia> using BenchmarkTools
julia> @benchmark mergesort!(A,B) setup = (A = rand(50); B = similar(A))
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
Range (min … max): 497.062 ns … 1.294 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 501.438 ns ┊ GC (median): 0.00%
Time (mean ± σ): 526.171 ns ± 49.011 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▁ ▁ ▃▇▄ ▁ ▂
█████▇▇▆▇█▇████▇▅▆▅▅▅▆█▆██▄▅▅▄▆██▆▆▄▄▆██▅▃▄██▄▅▅▃▃▃▃▄▅▁▄▄▃▁█ █
497 ns Histogram: log(frequency) by time 718 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> issorted(mergesort(rand(50)))
true
julia> issorted(mergesort(rand(10_000)))
true
julia> @benchmark Base.sort!(A, alg=MergeSort) setup=(A = rand(50))
BenchmarkTools.Trial: 10000 samples with 216 evaluations.
Range (min … max): 344.690 ns … 11.294 μs ┊ GC (min … max): 0.00% … 95.73%
Time (median): 352.917 ns ┊ GC (median): 0.00%
Time (mean ± σ): 401.700 ns ± 378.399 ns ┊ GC (mean ± σ): 3.57% ± 3.76%
█▇▄▄▄▂▁▂▁▂▃▁▁ ▃▂ ▁ ▁▁ ▁
████████████████▇██████▆▆▆▅▆▆▆▆▅▃▅▅▄▅▃▅▅▄▆▅▄▅▄▅▃▄▄██▇▅▆▆▇▆▄▅▅ █
345 ns Histogram: log(frequency) by time 741 ns <
Memory estimate: 336 bytes, allocs estimate: 3.
though both cost a good bit more in terms of both time and memory (the latter due to the need for the working array) in most numeric cases than a similarly efficient pure-Julia implementation of quicksort!:
julia> @benchmark VectorizedStatistics.quicksort!(A) setup = (A = rand(50))
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
Range (min … max): 28.854 ns … 175.821 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 35.268 ns ┊ GC (median): 0.00%
Time (mean ± σ): 38.703 ns ± 7.478 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▃█▁ ▃▃ ▃▆▂ ▂ ▃ ▂ ▁ ▂ ▂
█▆▃▅▁▁▄▅███▆███▆▆███▁▇█▇▅▇█▆▇█▁▆▅▃▅▄▄██▅▆▅▇▅▄▃▁▄▃▁▄▁▃▃▃▁▄▄▇█ █
28.9 ns Histogram: log(frequency) by time 68.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

improve function performance

I am putting together a small program that checks for solutions to Brocard's Problem, or so-called Brown Numbers, and I first created a draft in Ruby:
class Integer
  def factorial
    f = 1; for i in 1..self; f *= i; end; f
  end
end

boundary = 1000
m = 0

# Brown Numbers - pairs of integers (m,n) where n! + 1 == m**2
while m <= boundary
  n = 0
  while n <= boundary
    puts "(#{m},#{n})" if ((n.factorial + 1) == (m ** 2))
    n += 1
  end
  m += 1
end
But I discovered that Haskell is much more appropriate for doing mathematical operations, so I asked a question previously and got an answer pretty quickly on how to translate my Ruby code to Haskell:
results :: [(Integer, Integer)] -- use instead of `Int` to fix overflow issue
results = [(x,y) | x <- [1..1000], y <- [1..1000], 1 + fac x == y*y]
  where fac n = product [1..n]
I changed that slightly so I can run the same operation from and up to whatever numbers I want, because the above goes from 1 up to 1000 or any hardcoded number, but I would like to be able to decide the interval it should go through, ergo:
pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
  where factorial n = product [1..n]
If possible, I would like some examples or pointers on optimisations for improving the speed of the operations, because at this point if I run this operation for an interval such as [100..10000] it takes a long time (I stopped it after 45mins).
PS The performance optimisations are to be applied to the Haskell implementation of the calculations (pairs function), not the ruby one, in case some may wonder which function I am talking about.
Well, how would you speed up the Ruby implementation? Even though they're different languages, similar optimizations can be applied, namely memoization and smarter algorithms.
1. Memoization
Memoization prevents you from calculating the factorial over and over.
Here's your version of pairs:
pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
  where factorial n = product [1..n]
How often does factorial get called? It gets evaluated for every pair, and since we don't remember the values from previous calls, we need (upper - lower)² calls of factorial. Even though a factorial is fairly simple to compute, it doesn't come for free.
What if we instead generate an infinite list of factorials and simply pick the right ones?
pairsMem :: (Integer, Integer) -> [(Integer, Integer)]
pairsMem (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
  where factorial = (factorials !!) . fromInteger
        factorials = scanl (*) 1 [1..]
Now factorials is the list [1,1,2,6,24,…], and factorial simply looks up the corresponding value. How do both versions compare?
Your version
main = print $ pairs (0,1000)
> ghc --make SO.hs -O2 -rtsopts > /dev/null
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
204,022,149,768 bytes allocated in the heap
220,119,948 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
20,308 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 414079 colls, 0 par 2.39s 2.23s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 67.33s ( 67.70s elapsed)
GC time 2.39s ( 2.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 69.72s ( 69.93s elapsed)
%GC time 3.4% (3.2% elapsed)
Alloc rate 3,030,266,322 bytes per MUT second
Productivity 96.6% of total user, 96.3% of total elapsed
Around 68 seconds.
pairsMem
main = print $ pairsMem (0,1000)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
551,558,988 bytes allocated in the heap
644,420 bytes copied during GC
231,120 bytes maximum residency (2 sample(s))
71,504 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1159 colls, 0 par 0.00s 0.01s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.17s ( 2.18s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.17s ( 2.18s elapsed)
%GC time 0.0% (0.3% elapsed)
Alloc rate 253,955,217 bytes per MUT second
Productivity 100.0% of total user, 99.5% of total elapsed
Around two seconds or only 3% of the original time. Not bad for an almost trivial change. However, as you can see, we use twice the memory. After all, we're saving the factorials in a list. However, the total amount of allocated bytes is 0.27% of the non-memoized variant, since we don't need to regenerate the product.
pairsMem (100,10000)
What about large numbers? You said that with (100,10000) you stopped after 45 minutes. How fast is the memoized version?
main = print $ pairsMem (100,10000)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
… 20 minutes later Ctrl+C…
That still takes too long. What else can we do?
2. Smarter pairs
Let's go back to the drawing board. You're checking all pairs (n,m) in (lower,upper). Is this reasonable?
Actually, no, since factorials grow tremendously fast. So for any natural number m, let f(m) be the greatest natural number such that f(m)! <= m. Now, for any m, we only need to check the first f(m) factorials - all others will be greater.
Just for the record, f(10^100) is 69.
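For illustration, f can be computed directly from a running list of factorials (a throwaway sketch; fCount is a made-up name):

fCount :: Integer -> Int
fCount m = length $ takeWhile (<= m) $ scanl (*) 1 [2..]
-- scanl (*) 1 [2..] is [1!, 2!, 3!, ...]; we count how many stay <= m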
Now the strategy is clear: we generate as many factorials as we need and simply check if m * m - 1 is in the list of factorials:
import Data.Maybe (isJust)
import Data.List (elemIndex)

pairsList :: (Integer, Integer) -> [(Integer, Integer)]
pairsList (lower, upper) = [ (m, fromIntegral ret)
                           | m <- [lower..upper]
                           , let l = elemIndex (m*m - 1) fs
                           , isJust l
                           , let Just ret = l
                           ]
  where fs = takeWhile (< upper*upper) $ scanl (*) 1 [1..]
How well does this version fare against pairsMem?
main = print $ pairsList (1, 10^8)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
21,193,518,276 bytes allocated in the heap
2,372,136 bytes copied during GC
58,672 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 40823 colls, 0 par 0.06s 0.11s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 38.17s ( 38.15s elapsed)
GC time 0.06s ( 0.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 38.23s ( 38.26s elapsed)
%GC time 0.2% (0.3% elapsed)
Alloc rate 555,212,922 bytes per MUT second
Productivity 99.8% of total user, 99.8% of total elapsed
Alright, down to 40s. But what if we use a data structure which provides an even more efficient lookup?
3. Using the correct data structure
Since we want efficient lookup, we're going to use Set. The function almost stays the same, however, fs is going to be Set Integer, and the lookup is done via lookupIndex:
import Data.Maybe (isJust)
import qualified Data.Set as S

pairsSet :: (Integer, Integer) -> [(Integer, Integer)]
pairsSet (lower, upper) = [ (m, 1 + fromIntegral ret)
                          | m <- [lower..upper]
                          , let l = S.lookupIndex (m*m - 1) fs
                          , isJust l
                          , let Just ret = l
                          ]
  where fs = S.fromList . takeWhile (< upper*upper) $ scanl (*) 1 [1..]
And here the performance of pairsSet:
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
18,393,520,096 bytes allocated in the heap
2,069,872 bytes copied during GC
58,752 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35630 colls, 0 par 0.06s 0.08s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.52s ( 18.52s elapsed)
GC time 0.06s ( 0.08s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.58s ( 18.60s elapsed)
%GC time 0.3% (0.4% elapsed)
Alloc rate 993,405,304 bytes per MUT second
Productivity 99.7% of total user, 99.5% of total elapsed
This concludes our journey through optimization. By the way, we have reduced the complexity from 𝒪(n³) to 𝒪(n log n) since our data structure gives us a logarithmic search.
From your code, it seems that memoization could be used to speed up the calculation of the factorials.
For every m, the code needs to recompute every n's factorial, which I think is unnecessary.

Haskell: unexpected time-complexity in the computation involving large lists

I am dealing with a computation that has as an intermediate result a list A = [B], i.e. a list of K lists of length L. The time complexity for computing an element of B is controlled by the parameter M and is theoretically linear in M, so I would expect the time complexity for computing A to be O(K*L*M). However, this is not the case, and I don't understand why.
Here is a simple, complete sketch program that exhibits the problem:
import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)

type Point = (Double, Double)

fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (floor q))

standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map initial = initial : (trajectory map $ map initial)

justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ [] = []

subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m

ensemble :: Int -> [Point]
ensemble n = let qs = randoms (mkStdGen 42)
                 ps = randoms (mkStdGen 21)
             in take n $ zip qs ps

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) ->
                 Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m =
    parMap rdeepseq ((map observable . subTrace n m) . orbitGen)

main = let k = 100
           l = 100
           m = 100
           orbitGen = trajectory (standardMap 7)
           observable (p, q) = p^2 - q^2
           initials = ensemble k
           mean xs = (sum xs) / (fromIntegral $ length xs)
           result = (map mean)
                    $ transpose
                    $ ensembleTrace orbitGen observable l m
                    $ initials
       in mapM_ print result
I am compiling with
$ ghc -O2 stdmap.hs -threaded
and running with
$ ./stdmap +RTS -N4 > /dev/null
on an Intel Q6600, Linux 3.6.3-1-ARCH, with GHC 7.6.1, and get the following results
for the different sets of the parameters K, L, M (k, l, m in the code of the program):
(K=200,L=200,M=200) -> real 0m0.774s
user 0m2.856s
sys 0m0.147s
(K=2000,L=200,M=200) -> real 0m7.409s
user 0m28.102s
sys 0m1.080s
(K=200,L=2000,M=200) -> real 0m7.326s
user 0m27.932s
sys 0m1.020s
(K=200,L=200,M=2000) -> real 0m10.581s
user 0m38.564s
sys 0m3.376s
(K=20000,L=200,M=200) -> real 4m22.156s
user 7m30.007s
sys 0m40.321s
(K=200,L=20000,M=200) -> real 1m16.222s
user 4m45.891s
sys 0m15.812s
(K=200,L=200,M=20000) -> real 8m15.060s
user 23m10.909s
sys 9m24.450s
I don't quite understand where the problem of such poor scaling might be. If I understand correctly, the lists are lazy and should not be constructed, since they are consumed in the head-tail direction? As can be observed from the measurements, there is a correlation between excessive real-time consumption and excessive system-time consumption, as the excess is on the system account. But if there is some memory management wasting time, this should still scale linearly in K, L, M.
Help!
EDIT
I made changes in the code according to the suggestions given by Daniel Fischer, which indeed solved the bad scaling with respect to M. As pointed out, by forcing strict evaluation in the trajectory we avoid the construction of large thunks. I understand the performance improvement behind that, but I still don't understand the bad scaling of the original code, because (if I understand correctly) the space and time complexity of constructing the thunk should be linear in M?
Additionally, I still have problems understanding the bad scaling with respect to K (the size of the ensemble). I performed two additional measurements with the improved code for K=8000 and K=16000, keeping L=200, M=200. Scaling up to K=8000 is as expected, but for K=16000 it is already abnormal. The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably reflects in the bad concurrency, which I quantify as the quotient Q = (MUT cpu time) / (MUT real time), which would ideally be equal to the number of CPUs. However, Q ~ 4 for K = 8000 and Q ~ 2 for K = 16000.
Please help me understand the origin of this problem and possible solutions.
K = 8000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
56,905,405,184 bytes allocated in the heap
503,501,680 bytes copied during GC
53,781,168 bytes maximum residency (15 sample(s))
6,289,112 bytes maximum slop
151 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27893 colls, 27893 par 7.85s 1.99s 0.0001s 0.0089s
Gen 1 15 colls, 14 par 1.20s 0.30s 0.0202s 0.0558s
Parallel GC work balance: 23.49% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 95.90s ( 24.28s elapsed)
GC time 9.04s ( 2.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 104.95s ( 26.58s elapsed)
Alloc rate 593,366,811 bytes per MUT second
Productivity 91.4% of total user, 360.9% of total elapsed
gc_alloc_block_sync: 315819
and
K = 16000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
113,809,786,848 bytes allocated in the heap
1,156,991,152 bytes copied during GC
114,778,896 bytes maximum residency (18 sample(s))
11,124,592 bytes maximum slop
300 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 135521 colls, 135521 par 22.83s 6.59s 0.0000s 0.0190s
Gen 1 18 colls, 17 par 2.72s 0.73s 0.0405s 0.1692s
Parallel GC work balance: 18.05% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 221.77s (139.78s elapsed)
GC time 25.56s ( 7.32s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 247.34s (147.10s elapsed)
Alloc rate 513,176,874 bytes per MUT second
Productivity 89.7% of total user, 150.8% of total elapsed
gc_alloc_block_sync: 814824
M. A. D.'s point about fmod is a good one, but it is not necessary to call out to C, and we can do better staying in Haskell land (the ticket the linked thread was about has meanwhile been fixed). The trouble in
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (floor q))
is that type defaulting leads to floor :: Double -> Integer (and consequently fromIntegral :: Integer -> Double) being called. Now, Integer is a comparatively complicated type, with slow operations, and the conversion from Integer to Double is also relatively complicated. The original code (with parameters k = l = 200 and m = 5000) produced the stats
./nstdmap +RTS -s -N2 > /dev/null
60,601,075,392 bytes allocated in the heap
36,832,004,184 bytes copied during GC
2,435,272 bytes maximum residency (13741 sample(s))
887,768 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46734 colls, 46734 par 41.66s 20.87s 0.0004s 0.0058s
Gen 1 13741 colls, 13740 par 23.18s 11.62s 0.0008s 0.0041s
Parallel GC work balance: 60.58% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 34.99s ( 17.60s elapsed)
GC time 64.85s ( 32.49s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 99.84s ( 50.08s elapsed)
Alloc rate 1,732,048,869 bytes per MUT second
Productivity 35.0% of total user, 69.9% of total elapsed
on my machine (-N2 because I have only two physical cores). Simply changing the code to use a type signature floor q :: Int brings that down to
./nstdmap +RTS -s -N2 > /dev/null
52,105,495,488 bytes allocated in the heap
29,957,007,208 bytes copied during GC
2,440,568 bytes maximum residency (10481 sample(s))
893,224 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 36979 colls, 36979 par 32.96s 16.51s 0.0004s 0.0066s
Gen 1 10481 colls, 10480 par 16.65s 8.34s 0.0008s 0.0018s
Parallel GC work balance: 68.64% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.01s ( 0.01s elapsed)
MUT time 29.78s ( 14.94s elapsed)
GC time 49.61s ( 24.85s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 79.40s ( 39.80s elapsed)
Alloc rate 1,749,864,775 bytes per MUT second
Productivity 37.5% of total user, 74.8% of total elapsed
a reduction of about 20% in elapsed time, 13% in MUT time. Not bad. If we look at the code for floor that you get with optimisations, we can see why:
floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
    case double2Int# x of
      n | x <## int2Double# n -> I# (n -# 1#)
        | otherwise           -> I# n

floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
    case decodeDoubleInteger x of
      (# m, e #)
        | e <# 0# ->
          case negateInt# e of
            s | s ># 52#  -> if m < 0 then (-1) else 0
              | otherwise ->
                case TO64 m of
                  n -> FROM64 (n `uncheckedIShiftRA64#` s)
        | otherwise -> shiftLInteger m e
floor :: Double -> Int just uses the machine conversion, while floor :: Double -> Integer needs an expensive decodeDoubleInteger and more branches. But where floor is called here, we know that all involved Doubles are nonnegative, hence floor is the same as truncate, which maps directly to the machine conversion double2Int#, so let's try that instead of floor:
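Presumably the change looks like this (a sketch reconstructed from the description; only the otherwise branch differs from the fmod above):

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b
                            in b * (q - fromIntegral (truncate q :: Int))

With that change: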
INIT time 0.00s ( 0.00s elapsed)
MUT time 29.29s ( 14.70s elapsed)
GC time 49.17s ( 24.62s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 78.45s ( 39.32s elapsed)
a really small reduction (to be expected, the fmod isn't really the bottleneck). For comparison, calling out to C:
INIT time 0.01s ( 0.01s elapsed)
MUT time 31.46s ( 15.78s elapsed)
GC time 54.05s ( 27.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 85.52s ( 42.85s elapsed)
is a bit slower (unsurprisingly, you can execute a number of primops in the time calling out to C takes).
But that's not where the big fish swim. The bad thing is that picking only every m-th element of the trajectories leads to large thunks that cause a lot of allocation and take long to evaluate when the time comes. So let's eliminate that leak and make the trajectories strict:
{-# LANGUAGE BangPatterns #-}

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map !initial@(!a,!b) = initial : (trajectory map $ map initial)
That reduces the allocations and GC time drastically, and as a consequence also the MUT time:
INIT time 0.00s ( 0.00s elapsed)
MUT time 21.83s ( 10.95s elapsed)
GC time 0.72s ( 0.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 22.55s ( 11.31s elapsed)
with the original fmod,
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.26s ( 9.18s elapsed)
GC time 0.58s ( 0.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.84s ( 9.47s elapsed)
with floor q :: Int, and within measuring accuracy the same times for truncate q :: Int (the allocation figures are a bit lower for truncate).
The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably reflects in the bad concurrency
Yes (though as far as I know the more correct term here would be parallelism instead of concurrency). There is a spark pool, and when that's full, any further sparks are not scheduled to be evaluated by whatever thread next has time when its turn comes; the computation is then done without parallelism, from the parent thread. In this case that means that after an initial parallel phase, the computation falls back to sequential.
The size of the spark pool is apparently about 8K (2^13).
If you watch the CPU load via top, you will see that it drops from (close to 100%)*(number of cores) to a much lower value after a while (for me, it was ~100% with -N2 and ~130% with -N4).
The cure is to avoid sparking too much, and letting each spark do some more work. With the quick-and-dirty modification
ensembleTrace orbitGen observable n m =
    withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)
I'm back to 200% with -N2 for practically the entire run and a good productivity,
INIT time 0.00s ( 0.00s elapsed)
MUT time 57.42s ( 29.02s elapsed)
GC time 5.34s ( 2.69s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 62.76s ( 31.71s elapsed)
Alloc rate 1,982,155,167 bytes per MUT second
Productivity 91.5% of total user, 181.1% of total elapsed
and with -N4 it's also fine (even a wee bit faster on the wall-clock - not much because all threads do basically the same, and I have only 2 physical cores),
INIT time 0.00s ( 0.00s elapsed)
MUT time 99.17s ( 26.31s elapsed)
GC time 16.18s ( 4.80s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 115.36s ( 31.12s elapsed)
Alloc rate 1,147,619,609 bytes per MUT second
Productivity 86.0% of total user, 318.7% of total elapsed
since now the spark pool doesn't overflow.
The proper fix is to make the size of the chunks a parameter that is computed from the number of trajectories and available cores so that the number of sparks doesn't exceed the pool size.
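A sketch of what that could look like (my own guess at a reasonable heuristic, not code from the answer; the factor 4 aims at a few chunks per core, and ~8K is the pool size quoted above):

import GHC.Conc (numCapabilities)
import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
import Control.DeepSeq (NFData)

-- hypothetical helper: pick a chunk size so that the number of sparks
-- stays well below the ~8K spark pool while still feeding all cores
chunkSize :: Int -> Int
chunkSize numTrajectories = max 1 (numTrajectories `div` (4 * numCapabilities))

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a)
              -> Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m xs =
    withStrategy (parListChunk (chunkSize (length xs)) rdeepseq)
        (map ((map observable . subTrace n m) . orbitGen) xs)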
After doing some quick profiling I found that these are the serial offenders:
ghc --make -O2 MainNonOpt.hs -threaded -prof -auto-all -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
>>>
COST CENTRE MODULE %time %alloc
fmod Main 46.3 33.3
standardMap Main 28.5 0.0
trajectory Main 23.8 66.6
What's surprising about fmod is the large number of allocations it does, considering it is mostly a numerical function. So the next step is to annotate fmod to see where the problem is:
fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
         | otherwise = {-# SCC "posbranch" #-}
             if a < b then a
             else let q = {-# SCC "division" #-} a / b
                  in {-# SCC "expression" #-} b * (q - {-# SCC "floor" #-} fromIntegral (floor q))
This gives us:
ghc --make -O2 MainNonOpt.hs -threaded -prof -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
COST CENTRE MODULE %time %alloc
MAIN MAIN 61.5 70.0
posbranch Main 16.6 0.0
floor Main 14.9 30.0
expression Main 4.5 0.0
negbranch Main 1.9 0.0
So the bit with floor is the one causing the issues. After looking around, it turns out that the Prelude does not implement some Double RealFrac functions as well as it should (see here), probably causing some boxing/unboxing.
So by following the advice from the link I used a modified version of floor which also made the call to fromIntegral unnecessary:
floor' :: Double -> Double
floor' x = c_floor x
{-# INLINE floor' #-}

foreign import ccall unsafe "math.h floor"
    c_floor :: Double -> Double

fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
         | otherwise = {-# SCC "posbranch" #-}
             if a < b then a
             else let q = {-# SCC "division" #-} a / b
                  in {-# SCC "expression" #-} b * (q - ({-# SCC "floor" #-} floor' q))
EDIT:
As Daniel Fischer points out, there is no need to inline C code to improve the performance. An analogous Haskell function already exists. I'll leave the answer anyway, for further reference.
This does make a difference. On my machine, for k=l=200, M=5000 here are the number for the non-optimized and the optimized version:
Non optimized:
real 0m20.635s
user 1m17.321s
sys 0m4.980s
Optimized:
real 0m14.858s
user 0m55.271s
sys 0m3.815s
The trajectory function may have similar problems, and you can use profiling as above to pinpoint the issue.
A great starting point for profiling in Haskell can be found in this chapter of Real World Haskell.

How to find multiplicative partitions of any integer?

I'm looking for an efficient algorithm for computing the multiplicative partitions for any given integer. For example, the number of such partitions for 12 is 4, which are
12 = 12 x 1 = 4 x 3 = 2 x 2 x 3 = 2 x 6
I've read the Wikipedia article on this, but it doesn't really give me an algorithm for generating the partitions (it only talks about the number of such partitions and, to be honest, even that is not very clear to me!).
The problem I'm looking at requires me to compute multiplicative partitions for very large numbers (> 1 billion), so I was trying to come up with a dynamic programming approach for it (so that finding all possible partitions for a smaller number can be re-used when that smaller number is itself a factor of a bigger number), but so far, I don't know where to begin!
Any ideas/hints would be appreciated - this is not a homework problem, merely something I'm trying to solve because it seems so interesting!
The first thing I would do is get the prime factorization of the number.
From there, I can make a permutation of each subset of the factors, multiplied by the remaining factors at that iteration.
So if you take a number like 24, you get
2 * 2 * 2 * 3    // prime factorization
a   b   c   d

// round 1
2 * (2 * 2 * 3)    a * bcd
2 * (2 * 2 * 3)    b * acd (removed for being dup)
2 * (2 * 2 * 3)    c * abd (removed for being dup)
3 * (2 * 2 * 2)    d * abc
Repeat for all "rounds" (round being the number of factors in the first number of the multiplication), removing duplicates as they come up.
So you end up with something like
// assume we have the prime factorization
// and a partition set to add to
for(int i = 1; i < factors.size; i++) {
    for(List<int> subset : factors.permutate(2)) {
        List<int> otherSubset = factors.copy().remove(subset);
        int subsetTotal = 1;
        for(int p : subset) subsetTotal *= p;
        int otherSubsetTotal = 1;
        for(int p : otherSubset) otherSubsetTotal *= p;
        // assume your partition excludes if it's a duplicate
        partition.add(new FactorSet(subsetTotal, otherSubsetTotal));
    }
}
Of course, the first thing to do is find the prime factorisation of the number, like glowcoder said. Say
n = p^a * q^b * r^c * ...
Then
1. find the multiplicative partitions of m = n / p^a
2. for 0 <= k <= a, find the multiplicative partitions of p^k, which is equivalent to finding the additive partitions of k
3. for each multiplicative partition of m, find all distinct ways to distribute a-k factors p among the factors
4. combine the results of 2. and 3.
It is convenient to treat the multiplicative partitions as lists (or sets) of (divisor, multiplicity) pairs to avoid producing duplicates (for example, the partition 2 x 2 x 3 of 12 becomes [(2,2),(3,1)]).
I've written the code in Haskell because it's the most convenient and concise of the languages I know for this sort of thing:
module MultiPart (multiplicativePartitions) where

import Data.List (sort)
import Math.NumberTheory.Primes (factorise)
import Control.Arrow (first)

multiplicativePartitions :: Integer -> [[Integer]]
multiplicativePartitions n
    | n < 1     = []
    | n == 1    = [[]]
    | otherwise = map ((>>= uncurry (flip replicate)) . sort) . pfPartitions $ factorise n

additivePartitions :: Int -> [[(Int,Int)]]
additivePartitions 0 = [[]]
additivePartitions n
    | n < 0     = []
    | otherwise = aParts n n
  where
    aParts :: Int -> Int -> [[(Int,Int)]]
    aParts 0 _ = [[]]
    aParts 1 m = [[(1,m)]]
    aParts k m = withK ++ aParts (k-1) m
      where
        withK = do
            let q = m `quot` k
            j <- [q,q-1 .. 1]
            [(k,j):prt | let r = m - j*k, prt <- aParts (min (k-1) r) r]

countedPartitions :: Int -> Int -> [[(Int,Int)]]
countedPartitions 0 count = [[(0,count)]]
countedPartitions quant count = cbParts quant quant count
  where
    prep _ 0 = id
    prep m j = ((m,j):)
    cbParts :: Int -> Int -> Int -> [[(Int,Int)]]
    cbParts q 0 c
        | q == 0    = if c == 0 then [[]] else [[(0,c)]]
        | otherwise = error "Oops"
    cbParts q 1 c
        | c < q     = []    -- should never happen
        | c == q    = [[(1,c)]]
        | otherwise = [[(1,q),(0,c-q)]]
    cbParts q m c = do
        let lo = max 0 $ q - c*(m-1)
            hi = q `quot` m
        j <- [lo .. hi]
        let r  = q - j*m
            m' = min (m-1) r
        map (prep m j) $ cbParts r m' (c-j)

primePowerPartitions :: Integer -> Int -> [[(Integer,Int)]]
primePowerPartitions p e = map (map (first (p^))) $ additivePartitions e

distOne :: Integer -> Int -> Integer -> Int -> [[(Integer,Int)]]
distOne _ 0 d k = [[(d,k)]]
distOne p e d k = do
    cap <- countedPartitions e k
    return $ [(p^i*d,m) | (i,m) <- cap]

distribute :: Integer -> Int -> [(Integer,Int)] -> [[(Integer,Int)]]
distribute _ 0 xs = [xs]
distribute p e [(d,k)] = distOne p e d k
distribute p e ((d,k):dks) = do
    j <- [0 .. e]
    dps <- distOne p j d k
    ys <- distribute p (e-j) dks
    return $ dps ++ ys
distribute _ _ [] = []

pfPartitions :: [(Integer,Int)] -> [[(Integer,Int)]]
pfPartitions [] = [[]]
pfPartitions [(p,e)] = primePowerPartitions p e
pfPartitions ((p,e):pps) = do
    cop <- pfPartitions pps
    k <- [0 .. e]
    ppp <- primePowerPartitions p k
    mix <- distribute p (e-k) cop
    return (ppp ++ mix)
It's not particularly optimised, but it does the job.
Some times and results:
Prelude MultiPart> length $ multiplicativePartitions $ 10^10
59521
(0.03 secs, 53535264 bytes)
Prelude MultiPart> length $ multiplicativePartitions $ 10^11
151958
(0.11 secs, 125850200 bytes)
Prelude MultiPart> length $ multiplicativePartitions $ 10^12
379693
(0.26 secs, 296844616 bytes)
Prelude MultiPart> length $ multiplicativePartitions $ product [2 .. 10]
70520
(0.07 secs, 72786128 bytes)
Prelude MultiPart> length $ multiplicativePartitions $ product [2 .. 11]
425240
(0.36 secs, 460094808 bytes)
Prelude MultiPart> length $ multiplicativePartitions $ product [2 .. 12]
2787810
(2.06 secs, 2572962320 bytes)
The 10^k are of course particularly easy because there are only two primes involved (but squarefree numbers are still easier), the factorials get slow earlier. I think by careful organisation of the order and choice of better data structures than lists, there's quite a bit to be gained (probably one should sort the prime factors by exponent, but I don't know whether one should start with the highest exponents or the lowest).
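For example, trying one of the two orders is a one-line wrapper around pfPartitions (a hypothetical helper, not benchmarked):

import Data.List (sortBy)
import Data.Ord (comparing)

-- sort the factorisation by ascending exponent before partitioning;
-- use sortBy (flip (comparing snd)) to try descending order instead
pfPartitionsSorted :: [(Integer, Int)] -> [[(Integer, Int)]]
pfPartitionsSorted = pfPartitions . sortBy (comparing snd)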
Why don't you find all the numbers that can divide the number, and then find the combinations of those numbers whose product equals the number?
Finding all numbers that can divide your number takes O(n).
Then you can search this set for all subsets (with repetition) whose product gives you the number.
Once you have the set of all divisors of the original number, you can do dynamic programming on them to find the combinations whose product is the original number.
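For what it's worth, the divisor-recursion idea can be sketched compactly in Haskell (multPartitions is a made-up name, and this is the plain recursive version, without the dynamic-programming reuse suggested above):

multPartitions :: Integer -> [[Integer]]
multPartitions n = go n 2
  where
    -- partitions of m into factors >= d, emitted in non-decreasing order
    go 1 _ = [[]]
    go m d
        | d * d > m = [[m]]  -- no remaining divisor pair can both be >= d
        | otherwise = [ d : rest | m `mod` d == 0, rest <- go (m `div` d) d ]
                      ++ go m (d + 1)

multPartitions 12 yields [[2,2,3],[2,6],[3,4],[12]], the four partitions from the question.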

Generate a list of primes up to a certain number

I'm trying to generate a list of primes below 1 billion. I tried this, but this kind of structure is pretty awful. Any suggestions?
a <- 1:1000000000
d <- 0
b <- for (i in a) {for (j in 1:i) {if (i %% j !=0) {d <- c(d,i)}}}
That sieve posted by George Dontas is a good starting point. Here's a much faster version, with a running time for primes up to 1e6 of 0.095s as opposed to 30s for the original version.
sieve <- function(n)
{
    n <- as.integer(n)
    if(n > 1e8) stop("n too large")
    primes <- rep(TRUE, n)
    primes[1] <- FALSE
    last.prime <- 2L
    fsqr <- floor(sqrt(n))
    while (last.prime <= fsqr)
    {
        primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
        sel <- which(primes[(last.prime+1):(fsqr+1)])
        if(any(sel)){
            last.prime <- last.prime + min(sel)
        } else last.prime <- fsqr+1
    }
    which(primes)
}
Here are some alternate algorithms below, coded about as fast as possible in R. They are slower than the sieve but a heck of a lot faster than the questioner's original post.
Here's a recursive function that uses mod but is vectorized. It returns for 1e5 almost instantaneously and for 1e6 in under 2s.
primes <- function(n){
    primesR <- function(p, i = 1){
        f <- p %% p[i] == 0 & p != p[i]
        if (any(f)){
            p <- primesR(p[!f], i+1)
        }
        p
    }
    primesR(2:n)
}
The next one isn't recursive and faster again. The code below does primes up to 1e6 in about 1.5s on my machine.
primest <- function(n){
    p <- 2:n
    i <- 1
    while (p[i] <= sqrt(n)) {
        p <- p[p %% p[i] != 0 | p == p[i]]
        i <- i+1
    }
    p
}
BTW, the spuRs package has a number of prime-finding functions, including a sieve of Eratosthenes. I haven't checked to see what their speed is like.
And while I'm writing a very long answer... here's how you'd check in R if one value is prime.
isPrime <- function(x){
    div <- 2:ceiling(sqrt(x))
    !any(x %% div == 0)
}
This is an implementation of the Sieve of Eratosthenes algorithm in R.
sieve <- function(n)
{
    n <- as.integer(n)
    if(n > 1e6) stop("n too large")
    primes <- rep(TRUE, n)
    primes[1] <- FALSE
    last.prime <- 2L
    for(i in last.prime:floor(sqrt(n)))
    {
        primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
        last.prime <- last.prime + min(which(primes[(last.prime+1):n]))
    }
    which(primes)
}

sieve(1000000)
Prime Numbers in R
The OP asked to generate all prime numbers below one billion. All of the answers provided thus far are either not capable of doing this, will take a long time to execute, or are currently not available in R (see the answer by @Charles). The package RcppAlgos (I am the author) is capable of generating the requested output in just over 1 second using only one thread. It is based on the segmented Sieve of Eratosthenes by Kim Walisch.
RcppAlgos
library(RcppAlgos)
system.time(primeSieve(1e9)) ## using 1 thread
user system elapsed
1.099 0.077 1.176
Using Multiple Threads
And in recent versions (i.e. >= 2.3.0), we can utilize multiple threads for even faster generation. For example, now we can generate the primes up to 1 billion in under half a second!
system.time(primeSieve(10^9, nThreads = 8))
user system elapsed
2.046 0.048 0.375
Summary of Available Packages in R for Generating Primes
library(schoolmath)
library(primefactr)
library(sfsmisc)
library(primes)
library(numbers)
library(spuRs)
library(randtoolbox)
library(matlab)
## and 'sieve' from @John
Before we begin, we note that the problems pointed out by @Henrik in schoolmath still exist. Observe:
## 1 is NOT a prime number
schoolmath::primes(start = 1, end = 20)
[1] 1 2 3 5 7 11 13 17 19
## This should return 1, however it is saying that 52
## "prime" numbers less than 10^4 are divisible by 7!!
sum(schoolmath::primes(start = 1, end = 10^4) %% 7L == 0)
[1] 52
The point is, don't use schoolmath for generating primes at this point (no offense to the author... In fact, I have filed an issue with the maintainer).
Let's look at randtoolbox as it appears to be incredibly efficient. Observe:
library(microbenchmark)
## the argument for get.primes is for how many prime numbers you need
## whereas most packages get all primes less than a certain number
microbenchmark(priRandtoolbox = get.primes(78498),
priRcppAlgos = RcppAlgos::primeSieve(10^6), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
priRandtoolbox 1.00000 1.00000 1.000000 1.000000 1.000000 1.0000000 100
priRcppAlgos 12.79832 12.55065 6.493295 7.355044 7.363331 0.3490306 100
A closer look reveals that it is essentially a lookup table (found in the file randtoolbox.c from the source code).
#include "primes.h"

void reconstruct_primes()
{
    int i;
    if (primeNumber[2] == 1)
        for (i = 2; i < 100000; i++)
            primeNumber[i] = primeNumber[i-1] + 2*primeNumber[i];
}
Where primes.h is a header file that contains an array of "halves of differences between prime numbers". Thus, you will be limited by the number of elements in that array for generating primes (i.e. the first one hundred thousand primes). If you are only working with smaller primes (less than 1,299,709 (i.e. the 100,000th prime)) and you are working on a project that requires the nth prime, randtoolbox is the way to go.
Below, we perform benchmarks on the rest of the packages.
Primes up to One Million
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^6),
priNumbers = numbers::Primes(10^6),
priSpuRs = spuRs::primesieve(c(), 2:10^6),
priPrimes = primes::generate_primes(1, 10^6),
priPrimefactr = primefactr::AllPrimesUpTo(10^6),
priSfsmisc = sfsmisc::primes(10^6),
priMatlab = matlab::primes(10^6),
priJohnSieve = sieve(10^6),
unit = "relative")
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.000000 1.00000 1.00000 1.000000 1.00000 1.00000 100
priNumbers 21.550402 23.19917 26.67230 23.140031 24.56783 53.58169 100
priSpuRs 232.682764 223.35847 233.65760 235.924538 236.09220 212.17140 100
priPrimes 46.591868 43.64566 40.72524 39.106107 39.60530 36.47959 100
priPrimefactr 39.609560 40.58511 42.64926 37.835497 38.89907 65.00466 100
priSfsmisc 9.271614 10.68997 12.38100 9.761438 11.97680 38.12275 100
priMatlab 21.756936 24.39900 27.08800 23.433433 24.85569 49.80532 100
priJohnSieve 10.630835 11.46217 12.55619 10.792553 13.30264 38.99460 100
Primes up to Ten Million
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^7),
priNumbers = numbers::Primes(10^7),
priSpuRs = spuRs::primesieve(c(), 2:10^7),
priPrimes = primes::generate_primes(1, 10^7),
priPrimefactr = primefactr::AllPrimesUpTo(10^7),
priSfsmisc = sfsmisc::primes(10^7),
priMatlab = matlab::primes(10^7),
priJohnSieve = sieve(10^7),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 20
priNumbers 30.57896 28.91780 31.26486 30.47751 29.81762 40.43611 20
priSpuRs 533.99400 497.20484 490.39989 494.89262 473.16314 470.87654 20
priPrimes 125.04440 114.71349 112.30075 113.54464 107.92360 103.74659 20
priPrimefactr 52.03477 50.32676 52.28153 51.72503 52.32880 59.55558 20
priSfsmisc 16.89114 16.44673 17.48093 16.64139 18.07987 22.88660 20
priMatlab 30.13476 28.30881 31.70260 30.73251 32.92625 41.21350 20
priJohnSieve 18.25245 17.95183 19.08338 17.92877 18.35414 32.57675 20
Primes up to One Hundred Million
For the next two benchmarks, we only consider RcppAlgos, numbers, sfsmisc, matlab, and the sieve function by @John.
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^8),
priNumbers = numbers::Primes(10^8),
priSfsmisc = sfsmisc::primes(10^8),
priMatlab = matlab::primes(10^8),
priJohnSieve = sieve(10^8),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 20
priNumbers 35.64097 33.75777 32.83526 32.25151 31.74193 31.95457 20
priSfsmisc 21.68673 20.47128 20.01984 19.65887 19.43016 19.51961 20
priMatlab 35.34738 33.55789 32.67803 32.21343 31.56551 31.65399 20
priJohnSieve 23.28720 22.19674 21.64982 21.27136 20.95323 21.31737 20
Primes up to One Billion
N.B. We must remove the condition if(n > 1e8) stop("n too large") in the sieve function.
## See top section
## system.time(primeSieve(10^9))
## user system elapsed
## 1.099 0.077 1.176 ## RcppAlgos single-threaded
## gc()
system.time(matlab::primes(10^9))
user system elapsed
31.780 12.456 45.549 ## ~39x slower than RcppAlgos
## gc()
system.time(numbers::Primes(10^9))
user system elapsed
32.252 9.257 41.441 ## ~35x slower than RcppAlgos
## gc()
system.time(sieve(10^9))
user system elapsed
26.266 3.906 30.201 ## ~26x slower than RcppAlgos
## gc()
system.time(sfsmisc::primes(10^9))
user system elapsed
24.292 3.389 27.710 ## ~24x slower than RcppAlgos
From these comparisons, we see that RcppAlgos scales much better as n gets larger.
_________________________________________________________
| | 1e6 | 1e7 | 1e8 | 1e9 |
| |---------|----------|-----------|-----------
| RcppAlgos | 1.00 | 1.00 | 1.00 | 1.00 |
| sfsmisc | 9.76 | 16.64 | 19.66 | 23.56 |
| JohnSieve | 10.79 | 17.93 | 21.27 | 25.68 |
| numbers | 23.14 | 30.48 | 32.25 | 34.86 |
| matlab | 23.43 | 30.73 | 32.21 | 38.73 |
---------------------------------------------------------
The difference is even more dramatic when we utilize multiple threads:
microbenchmark(ser = primeSieve(1e6),
par = primeSieve(1e6, nThreads = 8), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
ser 1.741342 1.492707 1.481546 1.512804 1.432601 1.275733 100
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
microbenchmark(ser = primeSieve(1e7),
par = primeSieve(1e7, nThreads = 8), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
ser 2.632054 2.50671 2.405262 2.418097 2.306008 2.246153 100
par 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 100
microbenchmark(ser = primeSieve(1e8),
par = primeSieve(1e8, nThreads = 8), unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
ser 2.914836 2.850347 2.761313 2.709214 2.755683 2.438048 20
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
microbenchmark(ser = primeSieve(1e9),
par = primeSieve(1e9, nThreads = 8), unit = "relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
ser 3.081841 2.999521 2.980076 2.987556 2.961563 2.841023 10
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
And multiplying the table above by the respective median times for the serial results:
_____________________________________________________________
| | 1e6 | 1e7 | 1e8 | 1e9 |
| |---------|----------|-----------|-----------
| RcppAlgos-Par | 1.00 | 1.00 | 1.00 | 1.00 |
| RcppAlgos-Ser | 1.51 | 2.42 | 2.71 | 2.99 |
| sfsmisc | 14.76 | 40.24 | 53.26 | 70.39 |
| JohnSieve | 16.32 | 43.36 | 57.62 | 76.72 |
| numbers | 35.01 | 73.70 | 87.37 | 104.15 |
| matlab | 35.44 | 74.31 | 87.26 | 115.71 |
-------------------------------------------------------------
Primes Over a Range
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^9, 10^9 + 10^6),
priNumbers = numbers::Primes(10^9, 10^9 + 10^6),
priPrimes = primes::generate_primes(10^9, 10^9 + 10^6),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.0000 1.0000 1.000 1.0000 1.0000 1.0000 20
priNumbers 115.3000 112.1195 106.295 110.3327 104.9106 81.6943 20
priPrimes 983.7902 948.4493 890.243 919.4345 867.5775 708.9603 20
Primes up to 10 billion in Under 6 Seconds
## primes less than 10 billion
system.time(tenBillion <- RcppAlgos::primeSieve(10^10, nThreads = 8))
user system elapsed
26.077 2.063 5.602
length(tenBillion)
[1] 455052511
## Warning!!!... Large object created
tenBillionSize <- object.size(tenBillion)
print(tenBillionSize, units = "Gb")
3.4 Gb
Primes Over a Range of Very Large Numbers:
Prior to version 2.3.0, we were simply using the same algorithm for numbers of every magnitude. This is okay for smaller numbers when most of the sieving primes have at least one multiple in each segment (Generally, the segment size is limited by the size of L1 Cache ~32KiB). However, when we are dealing with larger numbers, the sieving primes will contain many numbers that will have fewer than one multiple per segment. This situation creates a lot of overhead, as we are performing many worthless checks that pollutes the cache. Thus, we observe much slower generation of primes when the numbers are very large. Observe for version 2.2.0 (See Installing older version of R package):
## Install version 2.2.0
## packageurl <- "http://cran.r-project.org/src/contrib/Archive/RcppAlgos/RcppAlgos_2.2.0.tar.gz"
## install.packages(packageurl, repos=NULL, type="source")
system.time(old <- RcppAlgos::primeSieve(1e15, 1e15 + 1e9))
user system elapsed
7.932 0.134 8.067
And now using the cache friendly improvement originally developed by Tomás Oliveira, we see drastic improvements:
## Reinstall current version from CRAN
## install.packages("RcppAlgos"); library(RcppAlgos)
system.time(cacheFriendly <- primeSieve(1e15, 1e15 + 1e9))
user system elapsed
2.258 0.166 2.424 ## Over 3x faster than older versions
system.time(primeSieve(1e15, 1e15 + 1e9, nThreads = 8))
user system elapsed
4.852 0.780 0.911 ## Over 8x faster using multiple threads
Take Away
There are many great packages available for generating primes
If you are looking for speed in general, there is no match to RcppAlgos::primeSieve, especially for larger numbers.
If you are working with small primes, look no further than randtoolbox::get.primes.
If you need primes in a range, the packages numbers, primes, & RcppAlgos are the way to go.
The importance of good programming practices cannot be overemphasized (e.g. vectorization, using correct data types, etc.). This is most aptly demonstrated by the pure base R solution provided by @John. It is concise, clear, and very efficient.
The best way that I know of to generate all primes (without getting into crazy math) is to use the Sieve of Eratosthenes.
It is pretty straightforward to implement and allows you calculate primes without using division or modulus. The only downside is that it is memory intensive, but various optimizations can be made to improve memory (ignoring all even numbers for instance).
This method should be faster and simpler.
allPrime <- function(n) {
    primes <- rep(TRUE, n)
    primes[1] <- FALSE
    for (i in 1:sqrt(n)) {
        if (primes[i]) primes[seq(i^2, n, i)] <- FALSE
    }
    which(primes)
}
0.12 seconds on my computer for n = 1e6.
I implemented this in function AllPrimesUpTo in package primefactr.
I recommend primegen, Dan Bernstein's implementation of the Atkin-Bernstein sieve. It's very fast and will scale well to other problems. You'll need to pass data out to the program to use it, but I imagine there are ways to do that?
You can also cheat and use the primes() function in the schoolmath package :D
The isPrime() function posted above could use sieve(). One only needs to check whether any of the primes < ceiling(sqrt(x)) divide x with no remainder. 1 and 2 also need to be handled.
isPrime <- function(x) {
    div <- sieve(ceiling(sqrt(x)))
    (x > 1) & ((x == 2) | !any(x %% div == 0))
}
No suggestions, but allow me an extended comment of sorts. I ran this experiment with the following code:
get_primes <- function(n_min, n_max){
    options(scipen=999)
    result = vector()
    for (x in seq(max(n_min,2), n_max)){
        has_factor <- F
        for (p in seq(2, ceiling(sqrt(x)))){
            if(x %% p == 0) has_factor <- T
            if(has_factor == T) break
        }
        if(has_factor == F) result <- c(result,x)
    }
    result
}
and after almost 24 hours of uninterrupted computation, I got a list of 5,245,897 primes. Since π(1,000,000,000) = 50,847,534, it would have taken about 10 days to complete this calculation.
Here is the file of these first ~ 5 million prime numbers.
for (i in 2:1000) {
    a = (2:(i-1))
    b = as.matrix(i%%a)
    c = colSums(b != 0)
    if (c == i-2)
    {
        print(i)
    }
}
Every number (i) is checked against the list (a) of all numbers before it; the improved version below instead checks only against the list of primes (n) built up so far.
Thanks for the suggestions:
prime = function(a,n){
    n = c(2)
    i = 3
    while(i <= a){
        # r must be initialised before the loop, otherwise the first
        # iteration (i = 3, where the for loop body never runs) fails
        r = 0
        for(j in n[n <= sqrt(i)]){
            if (i%%j == 0){
                r = 1
            }
            if(r == 1){break}
        }
        if(r != 1){n = c(n,i)}
        i = i+2
    }
    print(n)
}
