Fastest way to get a hash from a list?

I have a long list of integers that I want to turn into an MD5 hash. What's the quickest way to do this? I have tried two options, both similar. Just wondering if I'm missing an obviously quicker method.
import random
import hashlib
import cPickle as pickle

r = [random.randrange(1, 1000) for _ in range(0, 1000000)]

def method1(r):
    p = pickle.dumps(r, -1)
    return hashlib.md5(p).hexdigest()

def method2(r):
    p = str(r)
    return hashlib.md5(p).hexdigest()

def method3(r):
    p = ','.join(map(str, r))
    return hashlib.md5(p).hexdigest()
Then time it in IPython:
timeit method1(r)
timeit method2(r)
timeit method3(r)
Gives me this:
In [8]: timeit method1(r)
10 loops, best of 3: 68.7 ms per loop
In [9]: timeit method2(r)
10 loops, best of 3: 176 ms per loop
In [10]: timeit method3(r)
1 loops, best of 3: 270 ms per loop
So option 1 is the best I've got, but I have to do this a lot and it's currently the rate-determining step in my code.
Any tips or tricks to get a unique hash from a big list quicker than what's here, using Python 2.7?

You may find this useful. It uses my own custom benchmarking framework (based on timeit) to gather and print the results. Since the variations in speed are primarily due to the need to convert the r list into something hashlib.md5() can work with, I've updated the suite of test cases to show how storing the values in an array.array instead, as @DSM suggested in a comment, would dramatically speed things up. Note that since the integers in the list are all relatively small, I've stored them in an array of short (2-byte) values.
from __future__ import print_function
import sys
import timeit

setup = """
import array
import random
import hashlib
import marshal
import cPickle as pickle
import struct

r = [random.randrange(1, 1000) for _ in range(0, 1000000)]
ra = array.array('h', r)  # create an equivalent array of shorts

def method1(r):
    p = pickle.dumps(r, -1)
    return hashlib.md5(p).hexdigest()

def method2(r):
    p = str(r)
    return hashlib.md5(p).hexdigest()

def method3(r):
    p = ','.join(map(str, r))
    return hashlib.md5(p).hexdigest()

def method4(r):
    fmt = '%dh' % len(r)
    buf = struct.pack(fmt, *r)
    return hashlib.md5(buf).hexdigest()

def method5(r):
    a = array.array('h', r)
    return hashlib.md5(a).hexdigest()

def method6(r):
    m = marshal.dumps(r)
    return hashlib.md5(m).hexdigest()

# using pre-built array...

def pb_method1(ra):
    p = pickle.dumps(ra, -1)
    return hashlib.md5(p).hexdigest()

def pb_method2(ra):
    p = str(ra)
    return hashlib.md5(p).hexdigest()

def pb_method3(ra):
    p = ','.join(map(str, ra))
    return hashlib.md5(p).hexdigest()

def pb_method4(ra):
    fmt = '%dh' % len(ra)
    buf = struct.pack(fmt, *ra)
    return hashlib.md5(buf).hexdigest()

def pb_method5(ra):
    return hashlib.md5(ra).hexdigest()

def pb_method6(ra):
    m = marshal.dumps(ra)
    return hashlib.md5(m).hexdigest()
"""
statements = {
    "pickle.dumps(r, -1)": """
        method1(r)
        """,
    "str(r)": """
        method2(r)
        """,
    "','.join(map(str, r))": """
        method3(r)
        """,
    "struct.pack(fmt, *r)": """
        method4(r)
        """,
    "array.array('h', r)": """
        method5(r)
        """,
    "marshal.dumps(r)": """
        method6(r)
        """,
    # versions using pre-built array...
    "pickle.dumps(ra, -1)": """
        pb_method1(ra)
        """,
    "str(ra)": """
        pb_method2(ra)
        """,
    "','.join(map(str, ra))": """
        pb_method3(ra)
        """,
    "struct.pack(fmt, *ra)": """
        pb_method4(ra)
        """,
    "ra (pre-built)": """
        pb_method5(ra)
        """,
    "marshal.dumps(ra)": """
        pb_method6(ra)
        """,
}
N = 10
R = 3
timings = [(
    idea,
    min(timeit.repeat(statements[idea], setup=setup, repeat=R, number=N)),
) for idea in statements]

longest = max(len(t[0]) for t in timings)  # length of longest name

print('fastest to slowest timings (Python {}.{}.{})\n'.format(*sys.version_info[:3]),
      '  ({:,d} calls, best of {:d})\n'.format(N, R))

ranked = sorted(timings, key=lambda t: t[1])  # sort by speed (fastest first)

for timing in ranked:
    print("{:>{width}} : {:.6f} secs, rel speed {rel:>8.6f}x".format(
        timing[0], timing[1], rel=timing[1]/ranked[0][1], width=longest))
Results:
fastest to slowest timings (Python 2.7.6)
   (10 calls, best of 3)

        ra (pre-built) : 0.037906 secs, rel speed 1.000000x
     marshal.dumps(ra) : 0.177953 secs, rel speed 4.694626x
      marshal.dumps(r) : 0.695606 secs, rel speed 18.350932x
   pickle.dumps(r, -1) : 1.266096 secs, rel speed 33.401179x
   array.array('h', r) : 1.287884 secs, rel speed 33.975950x
  pickle.dumps(ra, -1) : 1.955048 secs, rel speed 51.576558x
  struct.pack(fmt, *r) : 2.085602 secs, rel speed 55.020743x
 struct.pack(fmt, *ra) : 2.357887 secs, rel speed 62.203962x
                str(r) : 2.918623 secs, rel speed 76.996860x
               str(ra) : 3.686666 secs, rel speed 97.258777x
 ','.join(map(str, r)) : 4.701531 secs, rel speed 124.032173x
','.join(map(str, ra)) : 4.968734 secs, rel speed 131.081303x

You can improve performance slightly, simplify your code, and drop an import by using Python's built-in hash function instead of md5 from hashlib:
import random
import cPickle as pickle

r = [random.randrange(1, 1000) for _ in range(0, 1000000)]

def method1(r):
    p = pickle.dumps(r, -1)
    return hash(p)

def method2(r):
    p = str(r)
    return hash(p)

def method3(r):
    p = ','.join(map(str, r))
    return hash(p)
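One caveat worth noting (an addition, not from the original answer): hash() returns a machine-word integer rather than a 128-bit digest, so collisions are more likely than with MD5 and the value is only meaningful within a single Python build:

import hashlib

p = str([1, 2, 3])
print(hash(p))                     # platform-sized int; fine for in-process use
print(hashlib.md5(p).hexdigest())  # 32 hex chars (128 bits); stable across runs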

Related

Runge-Kutta curve fitting extremely slow

I am currently trying to do a regression of a function calculated via an RK4 method performed on a non-linear Volterra integral equation of the second kind. The problem is that the code is extremely slow: one call of the curve_fit function (fitt) takes about 30-40 minutes to generate a result, and there will be many calls to fitt before the parameters are determined, so the whole run takes more than 6 hours. Is there any way to optimize this code? Thanks in advance!
from scipy.special import gamma
from ml_internal import LTInversion
from scipy.optimize import curve_fit, fsolve
from scipy.misc import derivative
from sklearn.metrics import r2_score
from math import comb, factorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Get the data
df = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx')
skipTime = 1
skipIndex = df[df['Time'] == skipTime].index.values[0]
xls = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx', skiprows=np.arange(1, skipIndex+1, 1))

timeDF = xls['Time']
tempDF = xls['Temp']
taDF = xls['Ta']

timeDF = timeDF - timeDF[0]
tempDF = tempDF + 273.15
t0 = tempDF[0]
ta = sum(taDF)/len(taDF)
ta = ta + 273.15

###########################################
# Splitting into intervals
h = 0.05
a = 0
b = timeDF[len(timeDF)-1]
N = int(np.round((b-a)/h))

# Each xi
def xidx(index):
    return a + h*index

# The functions in the image are written here.
def gx(t, lamda, alpha):
    # ml comes from the mittag-leffler package (see PS below)
    return t0 * ml(lamda*(t**alpha), alpha)
gx = np.vectorize(gx)

def kernel(t, s, rad, lamda, alpha, beta):
    if t == s:
        return 0
    return (t-s)**(alpha-1) * ml_(lamda*((t-s)**alpha), alpha, alpha, 1) * (beta*(rad**4) - beta*(ta**4) - lamda*ta)
kernel = np.vectorize(kernel)

############################
# The problem is here!!!!!!
def fx(x, n, lamda, alpha, beta):
    ans = gx(x, lamda, alpha)
    for j in range(n):
        ans += (h/6)*(kernel(x, xidx(j), f0[j], lamda, alpha, beta)
                      + 2*kernel(x, xidx(j+1/2), f1[j], lamda, alpha, beta)
                      + 2*kernel(x, xidx(j+1/2), f2[j], lamda, alpha, beta)
                      + kernel(x, xidx(j+1), f3[j], lamda, alpha, beta))
    return ans

#########################
f0 = np.zeros(N+1)
f0[0] = t0
f1 = np.zeros(N+1)
f2 = np.zeros(N+1)
f3 = np.zeros(N+1)
F = np.zeros((3, N+1))

def fitt(xvalue, lamda, alpha, beta):
    global f0, f1, f2, f3, F
    n = int(np.round(xvalue/h))
    f1[n] = fx(xidx(n) + 1/2, n, lamda, alpha, beta) + (h/2)*kernel(xidx(n + 1/2), xidx(n), f0[n], lamda, alpha, beta)
    f2[n] = fx(xidx(n + 1/2), n, lamda, alpha, beta)
    f3[n] = fx(xidx(n+1), n, lamda, alpha, beta) + h*kernel(xidx(n+1), xidx(n+1/2), f2[n], lamda, alpha, beta)
    if n+1 <= N:
        f0[n+1] = fx(xidx(n+1), n, lamda, alpha, beta) + (h/6)*(
            kernel(xidx(n+1), xidx(n), f0[n], lamda, alpha, beta)
            + 2*kernel(xidx(n+1), xidx(n+1/2), f1[n], lamda, alpha, beta)
            + 2*kernel(xidx(n+1), xidx(n+1/2), f2[n], lamda, alpha, beta)
            + kernel(xidx(n+1), xidx(n+1), f3[n], lamda, alpha, beta))
    if xvalue == timeDF[len(timeDF) - 1]:
        print(f0[n], n)
        returnValue = f0[n]
        f0 = np.zeros(N+1)
        f0[0] = t0
        f1 = np.zeros(N+1)
        f2 = np.zeros(N+1)
        f3 = np.zeros(N+1)
        return returnValue
    print(f0[n], n)
    return f0[n]

fitt = np.vectorize(fitt)

# Fitting, plotting and reporting (adjusted) R-squared
popt, pcov = curve_fit(fitt, timeDF, tempDF, p0=(-0.1317, 0.95, -1e-11), bounds=((-np.inf, 0, -np.inf), (0, 1, 0)))
print(popt)
y_fit = np.array(fitt(timeDF, popt[0], popt[1], popt[2]))

plt.scatter(timeDF, tempDF, color='ORANGE', marker='.', s=0.5)
plt.fill_between(timeDF, tempDF-0.5, tempDF+0.5, color='ORANGE', alpha=0.2)
plt.plot(timeDF, y_fit, color='RED', linewidth=1)
plt.legend(["Experimental data", "Caputo fit"], loc="upper right")
plt.xlabel("Time (min)")
plt.ylabel("Temperature (Kelvin)")
plt.show()
plt.close()

r2 = r2_score(tempDF, y_fit)
print(r2)
adjr2 = 1 - (1 - r2)*((len(xls)-1)/(len(xls)-3-1))
print(adjr2)
I already tried computing the values f0, f1, f2, f3 all at once, but the thing consuming the most time is Fn(x) (the fx function above), which I haven't figured out how to compute all at once. If that were possible, I think the program would run much faster. PS: ml and ml_ are functions from https://github.com/khinsen/mittag-leffler.
These are the necessary functions; Fn is the only one I haven't figured out yet.
There are two typos in the cited image. The combination of x_n and 1/2 is always meant to be the midpoint x_{n+1/2} = x_n + h/2. The second error is a duplication of x_{n+1/2} in the formula for f^{(4)}_n in its third term. The first error is probably producing errors large enough to complicate convergence and to make any limit wrong for the intended problem.
In the Simpson/RK4 step, the 4 fx computations can be reduced to 2.
The F_n implement the left side of the integral equation
F(x) = g(x) + int(s=0 to x) K(x, s, f(s)) ds
where the integral is approximated with the sample sequences f0, ..., f3. Due to the structure of the problem and algorithm, F_n(x_n) = f^0_n = f^4_{n-1}.
Note that K(x,s,f) should be set to zero for s >= x. In the exact version of the equation these values "above the diagonal" are not used.
If an increase in accuracy is needed, for instance to avoid divergence where there is none in the exact solution, you can decrease the step size by a factor of 10 and then sub-sample the f^0_n sequence to produce the numerical guess for the given data. Factors other than 10 are of course also possible.
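To make the masking idea concrete, here is a minimal sketch (my own illustration, not from the answer) of evaluating the kernel over a whole (t, s) grid at once with NumPy broadcasting, zeroing everything at or above the diagonal. It assumes ml_ from the mittag-leffler package accepts array arguments (otherwise wrap it with np.vectorize once, outside the fitting loop):

import numpy as np

def kernel_grid(tgrid, sgrid, fvals, ta, lamda, alpha, beta):
    # tgrid: (N,) sample points x_i; sgrid: (M,) quadrature nodes s_j;
    # fvals: (M,) current estimates f(s_j). Returns the (N, M) matrix
    # K[i, j] = K(t_i, s_j, f_j), zeroed wherever s_j >= t_i.
    dt = tgrid[:, None] - sgrid[None, :]   # (N, M) grid of t - s
    mask = dt > 0                          # strictly below the diagonal
    dt_safe = np.where(mask, dt, 1.0)      # avoid 0**(alpha - 1) blow-ups
    k = (dt_safe**(alpha - 1)
         * ml_(lamda * dt_safe**alpha, alpha, alpha, 1)   # ml_ as in the question
         * (beta * fvals[None, :]**4 - beta * ta**4 - lamda * ta))
    return np.where(mask, k, 0.0)

Each F_n(x) then reduces to a dot product of one row of this matrix with the Simpson weights, which removes the per-element Python overhead of np.vectorize that dominates the runtime.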

Why does pymc.MAP not always return the same value

I am running pymc2 to fit a straight line through my data. The code is shown below (modified from examples I found online). When I call the MAP function multiple times, I get different answers, even though I start with the exact same model. I thought the optimization method, fmin_powell, starts at the supplied value for each parameter. As far as I know, fmin_powell has no random component, so it should always end at the same optimum, yet it doesn't. Why do I keep getting different results?
import numpy as np
import pymc

# observed data
n = 21
a = 6
b = 2
sigma = 2
x = np.linspace(0, 1, n)
np.random.seed(1)
y_obs = a * x + b + np.random.normal(0, sigma, n)

def model():
    # define priors
    a = pymc.Normal('a', mu=0, tau=1 / 10 ** 2, value=5)
    b = pymc.Normal('b', mu=0, tau=1 / 10 ** 2, value=1)
    tau = pymc.Gamma('tau', alpha=0.1, beta=0.1, value=1)

    # define likelihood
    @pymc.deterministic
    def mu(a=a, b=b, x=x):
        return a * x + b
    y = pymc.Normal('y', mu=mu, tau=tau, value=y_obs, observed=True)
    return locals()

ml = model()           # dictionary of all locals
mcmc = pymc.Model(ml)  # MCMC object
mapmcmc = pymc.MAP(mcmc)
mapmcmc.fit(method='fmin_powell')
print(mcmc.a.value, mcmc.b.value, mcmc.tau.value)

ml = model()           # dictionary of all locals
mcmc = pymc.Model(ml)  # MCMC object
mapmcmc = pymc.MAP(mcmc)
mapmcmc.fit(method='fmin_powell')
print(mcmc.a.value, mcmc.b.value, mcmc.tau.value)

ml = model()           # dictionary of all locals
mcmc = pymc.Model(ml)  # MCMC object
mapmcmc = pymc.MAP(mcmc)
mapmcmc.fit(method='fmin_powell')
print(mcmc.a.value, mcmc.b.value, mcmc.tau.value)

Numpy version of rolling MAD (mean absolute deviation)

How do I make a rolling version of the following MAD function?
from numpy import mean, absolute

def mad(data, axis=None):
    return mean(absolute(data - mean(data, axis)), axis)
This code is an answer to this question
At the moment I convert the NumPy array to a pandas DataFrame, apply this function, and then convert the result back to NumPy:
pandasDataFrame.rolling(window=90).apply(mad)
but this is inefficient on larger DataFrames. How can I get a rolling window over the same function in NumPy, without looping, that gives the same result?
Here's a vectorized NumPy approach -
import numpy as np

# From this post : http://stackoverflow.com/a/40085052/3293881
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

# From this post : http://stackoverflow.com/a/14314054/3293881 by @Jaime
def moving_average(a, n=3):
    # difference of cumulative sums n apart yields each window's sum
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

def mad_numpy(a, W):
    a2D = strided_app(a, W, 1)
    return np.absolute(a2D - moving_average(a, W)[:, None]).mean(1)
Runtime test -
In [617]: data = np.random.randint(0,9,(10000))
...: df = pd.DataFrame(data)
...:
In [618]: pandas_out = pd.rolling_apply(df,90,mad).values.ravel()
In [619]: numpy_out = mad_numpy(data,90)
In [620]: np.allclose(pandas_out[89:], numpy_out) # Nans part clipped
Out[620]: True
In [621]: %timeit pd.rolling_apply(df,90,mad)
10 loops, best of 3: 111 ms per loop
In [622]: %timeit mad_numpy(data,90)
100 loops, best of 3: 3.4 ms per loop
In [623]: 111/3.4
Out[623]: 32.64705882352941
Huge 32x+ speedup there over the loopy pandas solution!
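For reference (not part of the original benchmark), pd.rolling_apply was removed in later pandas releases; on a modern pandas (0.23 or newer) the loopy baseline would be spelled like this, with raw=True passing plain ndarrays to mad:

pandas_out = df.rolling(window=90).apply(mad, raw=True).values.ravel()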

Fitting a capped Poisson process with a variable rate

I'm trying to estimate the rate of a Poisson process where the rate varies over time using the maximum a posteriori estimate. Here's a simplified example with a rate varying linearly (λ = ax+b) :
import numpy as np
import pymc

# Observation
a_actual = 1.3
b_actual = 2.0
t = np.arange(10)
obs = np.random.poisson(a_actual * t + b_actual)

# Model
a = pymc.Uniform(name='a', value=1., lower=0, upper=10)
b = pymc.Uniform(name='b', value=1., lower=0, upper=10)

@pymc.deterministic
def linear(a=a, b=b):
    return a * t + b

r = pymc.Poisson(mu=linear, name='r', value=obs, observed=True)

model = pymc.Model([a, b, r])
map = pymc.MAP(model)
map.fit()
map.revert_to_max()

print "a :", a._value
print "b :", b._value
This is working fine. But my actual Poisson process is capped by a deterministic value. Since I can't associate my observed values with a Deterministic function, I'm adding a Normal stochastic with a small variance for my observations:
import numpy as np
import pymc

# Observation
a_actual = 1.3
b_actual = 2.0
t = np.arange(10)
obs = np.random.poisson(a_actual * t + b_actual).clip(0, 10)

# Model
a = pymc.Uniform(name='a', value=1., lower=0, upper=10)
b = pymc.Uniform(name='b', value=1., lower=0, upper=10)

@pymc.deterministic
def linear(a=a, b=b):
    return a * t + b

r = pymc.Poisson(mu=linear, name='r')

@pymc.deterministic
def clip(r=r):
    return r.clip(0, 10)

rc = pymc.Normal(mu=r, tau=0.001, name='rc', value=obs, observed=True)

model = pymc.Model([a, b, r, rc])
map = pymc.MAP(model)
map.fit()
map.revert_to_max()

print "a :", a._value
print "b :", b._value
This code is producing the following error :
Traceback (most recent call last):
  File "pymc-bug-2.py", line 59, in <module>
    map.revert_to_max()
  File "pymc/NormalApproximation.py", line 486, in revert_to_max
    self._set_stochastics([self.mu[s] for s in self.stochastics])
  File "pymc/NormalApproximation.py", line 58, in __getitem__
    tot_len += self.owner.stochastic_len[p]
KeyError: 0
Any idea what I'm doing wrong?
By "Capped" do you mean that it is a truncated Poisson? It appears thats what you are saying. If it were a left truncation (which is more common), you could use the TruncatedPoisson distribution, but since you are doing a right truncation, you cannot (we should have made this more general!). What you are trying will not work -- the Poisson object has no clip() method. What you can do is use a factor potential. It would look like this:
@pymc.potential
def clip(r=r):
    if np.any(r > 10):
        return -np.inf
    return 0
This will constrain the values of r to be at most 10. Refer to the PyMC docs for information on the Potential class.
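To connect this back to the model above, here is a minimal sketch (my illustration, not from the original answer) that keeps r latent, keeps the Normal link to the capped observations, and swaps the deterministic clip for the potential:

# Sketch: the potential joins the model like any other node and vetoes
# any state in which a latent draw exceeds the cap of 10.
r = pymc.Poisson(mu=linear, name='r')
rc = pymc.Normal(mu=r, tau=0.001, name='rc', value=obs, observed=True)

model = pymc.Model([a, b, r, rc, clip])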

Find prime numbers using Scala. Help me to improve

I wrote this code to find the prime numbers less than a given number i in Scala.
def findPrime(i: Int): List[Int] = i match {
  case 2 => List(2)
  case _ => {
    val primeList = findPrime(i - 1)
    if (isPrime(i, primeList)) i :: primeList else primeList
  }
}

def isPrime(num: Int, prePrimes: List[Int]): Boolean = prePrimes.forall(num % _ != 0)
But I get the feeling that the findPrime function, especially this part:
case _ => {
  val primeList = findPrime(i - 1)
  if (isPrime(i, primeList)) i :: primeList else primeList
}
is not quite in the functional style.
I am still learning functional programming. Can anyone help me improve this code to make it more functional?
Many thanks.
Here's a functional implementation of the Sieve of Eratosthenes, as presented in Odersky's "Functional Programming Principles in Scala" Coursera course:
// Sieving integral numbers
def sieve(s: Stream[Int]): Stream[Int] = {
  s.head #:: sieve(s.tail.filter(_ % s.head != 0))
}

// All primes as a lazy sequence
val primes = sieve(Stream.from(2))

// Dumping the first five primes
print(primes.take(5).toList) // List(2, 3, 5, 7, 11)
The style looks fine to me. Although the Sieve of Eratosthenes is a very efficient way to find prime numbers, your approach works well too, since you are only testing for division against known primes. You need to watch out however--your recursive function is not tail recursive. A tail recursive function does not modify the result of the recursive call--in your example you prepend to the result of the recursive call. This means that you will have a long call stack and so findPrime will not work for large i. Here is a tail-recursive solution.
def primesUnder(n: Int): List[Int] = {
  require(n >= 2)

  def rec(i: Int, primes: List[Int]): List[Int] = {
    if (i >= n) primes
    else if (prime(i, primes)) rec(i + 1, i :: primes)
    else rec(i + 1, primes)
  }

  rec(2, List()).reverse
}

def prime(num: Int, factors: List[Int]): Boolean = factors.forall(num % _ != 0)
This solution isn't prettier--it's more of a detail to get your solution to work for large arguments. Since the list is built up backwards to take advantage of fast prepends, the list needs to be reversed. As an alternative, you could use an Array, a Vector or a ListBuffer to append the results. With the Array, however, you would need to estimate how much memory to allocate for it. Fortunately we know that pi(n) is approximately n / ln(n), so you can choose a reasonable size. Array and ListBuffer are also mutable data types, which goes against your desire for functional style.
Update: to get good performance out of the Sieve of Eratosthenes I think you'll need to store data in a native array, which also goes against your desire for style in functional programming. There might be a creative functional implementation though!
Update: oops! Missed it! This approach works well too if you only divide by primes less than the square root of the number you are testing! I missed this, and unfortunately it's not easy to adjust my solution to do this because I'm storing the primes backwards.
Update: here's a very non-functional solution that at least only checks up to the square root.
import scala.collection.mutable.ListBuffer

def primesUnder(n: Int): List[Int] = {
  require(n >= 2)
  val primes = ListBuffer(2)
  for (i <- 3 to n) {
    if (prime(i, primes.iterator)) {
      primes += i
    }
  }
  primes.toList
}

// factors must be in sorted order
def prime(num: Int, factors: Iterator[Int]): Boolean =
  factors.takeWhile(_ <= math.sqrt(num).toInt).forall(num % _ != 0)
Or I could use Vectors with my original approach. Vectors are probably not the best solution, though, because they don't have true O(1) append, only amortized O(1).
As schmmd mentions, you want it to be tail-recursive, and you also want it to be lazy. Fortunately there is a perfect data structure for this: Stream.
This is a very efficient prime calculator implemented as a Stream, with a few optimisations:
object Prime {
  def is(i: Long): Boolean =
    if (i == 2) true
    else if ((i & 1) == 0) false // efficient div by 2
    else prime(i)

  def primes: Stream[Long] = 2 #:: prime3

  private val prime3: Stream[Long] = {
    @annotation.tailrec
    def nextPrime(i: Long): Long =
      if (prime(i)) i else nextPrime(i + 2) // tail

    def next(i: Long): Stream[Long] =
      i #:: next(nextPrime(i + 2))

    3 #:: next(5)
  }

  // assumes not even, check evenness before calling
  // perf note: must pass partially applied >= method
  def prime(i: Long): Boolean =
    prime3 takeWhile (math.sqrt(i).>= _) forall { i % _ != 0 }
}
Prime.is is the prime check predicate, and Prime.primes returns a Stream of all prime numbers. prime3 is where the Stream is computed, using the prime predicate to check for all prime divisors less than the square root of i.
import scala.collection.mutable

/**
 * @return BitSet p such that p(x) is true iff x is prime
 */
def sieveOfEratosthenes(n: Int) = {
  val isPrime = mutable.BitSet(2 to n: _*)
  for (p <- 2 to Math.sqrt(n).toInt if isPrime(p)) {
    isPrime --= p * p to n by p // mark all multiples of p as composite
  }
  isPrime.toImmutable
}
A sieve method is your best bet for small lists of numbers (up to 10-100 million or so).
see: Sieve of Eratosthenes
Even if you want to find much larger numbers, you can use the list you generate with this method as divisors for testing numbers up to n^2, where n is the limit of your list.
@mfa has mentioned using a Sieve of Eratosthenes (SoE) and @Luigi Plinge has mentioned that this should be done using functional code, so @netzwerg has posted a non-SoE version; here, I post an "almost" functional version of the SoE, using completely immutable state except for the contents of a mutable BitSet (mutable rather than immutable for performance), that I posted as an answer to another question:
object SoE {
  def makeSoE_Primes(top: Int): Iterator[Int] = {
    val topndx = (top - 3) / 2
    val nonprms = new scala.collection.mutable.BitSet(topndx + 1)

    def cullp(i: Int) = {
      import scala.annotation.tailrec
      val p = i + i + 3
      @tailrec def cull(c: Int): Unit = if (c <= topndx) { nonprms += c; cull(c + p) }
      cull((p * p - 3) >>> 1)
    }

    (0 to (Math.sqrt(top).toInt - 3) >>> 1).filterNot { nonprms }.foreach { cullp }
    Iterator.single(2) ++ (0 to topndx).filterNot { nonprms }.map { i: Int => i + i + 3 }
  }
}
How about this:
def getPrimeUnder(n: Int) = {
  require(n >= 2)
  val ol = (3 to n by 2).toList // odd numbers

  def pn(ol: List[Int], pl: List[Int]): List[Int] = ol match {
    case Nil => pl
    case _ if pl.exists(ol.head % _ == 0) => pn(ol.tail, pl)
    case _ => pn(ol.tail, ol.head :: pl)
  }

  pn(ol, List(2)).reverse
}
It's pretty fast for me on my Mac: getting all primes under 100k takes around 2.5 seconds.
A Scala FP approach:
// returns the list of primes below `number`
def primes(number: Int): List[Int] = {
  number match {
    case a if (a <= 3) => (1 to a).toList
    case x => (1 to x - 1).filter(b => isPrime(b)).toList
  }
}

// checks if a number is prime
// note: this implementation treats 1 as prime
def isPrime(number: Int): Boolean = {
  number match {
    case 1 => true
    case x => Nil == {
      (2 to math.sqrt(number).toInt).filter(y => x % y == 0)
    }
  }
}
import scala.collection.immutable

def primeNumber(range: Int): Unit = {
  val primeNumbers: immutable.IndexedSeq[AnyVal] =
    for (number: Int <- 2 to range) yield {
      // inclusive upper bound so perfect squares (4, 9, 25, ...) are caught
      val isPrime = !(2 to Math.sqrt(number).toInt).exists(x => number % x == 0)
      if (isPrime) number
    }
  for (prime <- primeNumbers) println(prime)
}
object Primes {
  private lazy val notDivisibleBy2: Stream[Long] = 3L #:: notDivisibleBy2.map(_ + 2)

  private lazy val notDivisibleBy2Or3: Stream[Long] = notDivisibleBy2
    .grouped(3)
    .map(_.slice(1, 3))
    .flatten
    .toStream

  private lazy val notDivisibleBy2Or3Or5: Stream[Long] = notDivisibleBy2Or3
    .grouped(10)
    .map { g =>
      g.slice(1, 7) ++ g.slice(8, 10)
    }
    .flatten
    .toStream

  lazy val primes: Stream[Long] = 2L #::
    notDivisibleBy2.head #::
    notDivisibleBy2Or3.head #::
    notDivisibleBy2Or3Or5.filter { i =>
      i < 49 || primes.takeWhile(_ <= Math.sqrt(i).toLong).forall(i % _ != 0)
    }

  def apply(n: Long): Stream[Long] = primes.takeWhile(_ <= n)

  def getPrimeUnder(n: Long): Long = Primes(n).last
}
