Execution time of a Julia program to count primes - performance

I am experimenting a bit with Julia, since I've heard that it is well suited to scientific computing and its syntax is reminiscent of Python. I tried to write and run a program that counts the prime numbers below a given n, but the performance is not what I had hoped for.
Here I post my code, with the disclaimer that I literally started programming in Julia yesterday and I am almost sure that something is wrong:
n = 250000
counter = 0

function countPrime(counter)
    for i = 1:n
        # print("begin counter= ", counter, "\n")
        isPrime = true
        # print("i= ", i, "\n")
        for j = 2:(i-1)
            if (i%j) == 0
                isPrime = false
                # print("j= ", j, "\n")
                break
            end
        end
        (isPrime==true) ? counter += 1 : counter
        # print("Counter= ", counter, "\n")
    end
    return counter
end

println(countPrime(counter))
The thing is that the same program ported to C runs in about 5 seconds, while this Julia version takes about 3 minutes and 50 seconds, which seems odd to me since I thought Julia was a compiled language. What's happening?

Here is how I would change it:
function countPrime(n)
    counter = 0
    for i in 1:n
        isPrime = true
        for j in 2:i-1
            if i % j == 0
                isPrime = false
                break
            end
        end
        isPrime && (counter += 1)
    end
    return counter
end
This code runs in about 5 seconds on my laptop. Apart from stylistic changes, the major change is that you should pass n as a parameter to your function and define the counter variable inside the function.
The changes follow one of the first pieces of advice in the Performance Tips section of the Julia Manual.
The point is that when you use a global variable, the Julia compiler cannot make any assumptions about the type of that variable (as it might change after the function has been compiled), so it defensively assumes that it might be anything, which slows things down.
As for stylistic changes, note that (isPrime==true) ? counter += 1 : counter can be written simply as isPrime && (counter += 1), since you want to increment the counter only when isPrime is true; the ternary operator ? : is not needed here.
To give an MWE of the problem with using global variables in functions:
julia> x = 10
10

julia> f() = x
f (generic function with 1 method)

julia> @code_warntype f()
MethodInstance for f()
  from f() in Main at REPL[2]:1
Arguments
  #self#::Core.Const(f)
Body::Any
1 ─ return Main.x
You can see that inside the f function you refer to the global variable x. Therefore, when Julia compiles f, it must assume that the value of x can be of any type (which in Julia is called Any). Working with such values is slow, as the compiler cannot apply optimizations that would take advantage of a more specific value type.
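The standard fixes are to pass the value as an argument, or to mark the global as const if you really need one. A minimal sketch, continuing the session above:

julia> g(x) = x          # fix 1: pass the value as an argument
g (generic function with 1 method)

julia> const y = 10;     # fix 2: if a global is unavoidable, make it const

julia> h() = y
h (generic function with 1 method)

With these definitions, @code_warntype g(10) and @code_warntype h() both infer a concrete Int64 body, since the compiler can now rely on the type never changing.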

Related

Suggestions for performance improvement in Julia

I'm making my first effort to move from Matlab to Julia and have found my code to improve by ~3x, but I still think there is more to come. I'm not using any global variables in the function and have preallocated all the arrays used (I think?). Any thoughts on how it could be sped up even further would be greatly appreciated; I'll fully convert even at the current improvement, I think!
function word_sim(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used word. Runs the process until time omega
    words = zeros(Int32, 1, omega) # to store the words
    tests = rand(1, omega)         # will compare mu to these
    words[1] = 1                   # initialize the words
    next_word = 2                  # will be the next word used
    words[tau+1] = omega + 1       # max possible word so insert that at time tau
    innovates = mu .> tests        # when we'll make a new word
    for i = 2:tau # simulate the process
        if innovates[i] == 1 # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    for i = (tau+2):omega
        if innovates[i] == 1 # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = sum(words .== (omega + 1)) # count how many times our word occurred
    return result
end
and when I run it with these values it takes ~0.26 seconds on my PC:
using Statistics

@time begin
    nsim = 10^3
    omega = 100
    seed = [0:1:(omega-1);]
    mu = 0.01
    results = zeros(Float64, 1, length(seed))
    pops = zeros(Int64, 1, nsim)
    for tau in seed
        for jj = 1:nsim
            pops[jj] = word_sim(tau, omega, mu)
        end
        results[tau+1] = mean(pops)
    end
end
Or perhaps I'd be better off writing the code in C++? Julia was my first choice as I've heard rave reviews about its syntax, which to be honest is fantastic!
Any comments greatly appreciated.
A 3x speedup is a nice start, but it turns out there are a few more things you can do to improve performance significantly!
As a starting point, using your example posted above in Julia 1.6.1, I get
0.301665 seconds (798.10 k allocations: 164.778 MiB, 12.70% gc time)
That's a lot of allocations, and a fair amount of garbage collector ("gc") time, so it seems we're producing a fair amount of garbage here. Some of the culprits are lines like
tests = rand(1,omega) # will compare mu to these
or
innovates = mu .> tests; # when we'll make a new word
In languages like Matlab or Python, precalculating these things a whole vector at a time can be good for performance, but in Julia it's generally not necessary, and can even hurt, because each of these lines causes a brand-new array to be allocated. If we remove them and just generate our tests on the fly, we can avoid these allocations. One other line that allocates here is
result = sum(words .== (omega + 1))
where you first build a whole new array before taking its sum. You could avoid this by writing it as a for loop (even though this may feel wrong coming from Matlab, it's quite fast in Julia). Or, to keep it a one-liner, use either count or sum with a function that does the comparison as the first argument:
result = count(x->(x == omega+1), words)
(in this example, just using an anonymous function x->(x == omega+1)).
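For reference, the explicit loop version mentioned above might look like this (a minimal sketch; it produces the same count without allocating a temporary array):

result = 0
for w in words
    if w == omega + 1
        result += 1
    end
end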
Adding up these changes so far, then:
function word_sim(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used word. Runs the process until time omega
    words = zeros(Int32, 1, omega) # to store the words
    words[1] = 1                   # initialize the words
    next_word = 2                  # will be the next word used
    words[tau+1] = omega + 1       # max possible word so insert that at time tau
    for i = 2:tau # simulate the process
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    for i = (tau+2):omega
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(x->(x == omega+1), words) # count how many times our word occurred
    return result
end
Using the same timing code, this now brings us down to
0.177766 seconds (298.10 k allocations: 51.863 MiB, 13.01% gc time)
So about half the time and half the allocations. There's still more though!
First, let's move the allocation of the words array outside of the word_sim function and instead make an in-place version of that function. We can also speed things up by adding @inbounds to the tight for loops.
function word_sim!(words::AbstractArray, tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used word. Runs the process until time omega
    fill!(words, 0) # Probably not necessary actually, but I haven't spent enough time looking at the code to be sure
    words[1] = 1             # initialize the words
    next_word = 2            # will be the next word used
    words[tau+1] = omega + 1 # max possible word so insert that at time tau
    @inbounds for i = 2:tau # simulate the process
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    @inbounds for i = (tau+2):omega
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(x->(x == omega+1), words) # count how many times our word occurred
    return result
end
In-place functions that modify one of their input arguments are usually denoted by a ! at the end of their name by convention in Julia, hence the new function name.
Since we have to modify the timing code a bit to pre-allocate words now, let's also take the opportunity to put that timing code into a function to avoid any globals in the timing.
function run_word_sim()
    nsim = 10^3
    omega = 100
    seed = [0:1:(omega-1);]
    mu = 0.01
    results = zeros(Float64, 1, length(seed))
    pops = zeros(Int64, 1, nsim)
    words = zeros(Int32, 1, omega) # to store the words
    for tau in seed
        for jj = 1:nsim
            pops[jj] = word_sim!(words, tau, omega, mu)
        end
        results[tau+1] = mean(pops)
    end
    return results
end
Then, to get the most accurate timing results (and optionally some useful plots and statistics), we can use the BenchmarkTools package and its @btime or @benchmark macros:
julia> using BenchmarkTools

julia> @btime run_word_sim()
  124.178 ms (4 allocations: 10.17 KiB)
So, almost another 3x speedup, with allocations and memory usage reduced by four or five orders of magnitude, down to only the four arrays used in the timing code (seed, results, pops and words).
For the absolute maximum performance, you could possibly go even further with LoopVectorization.jl and its @turbo macro, though it would likely require a change in algorithm, since these loops depend on previous state and so don't appear to be compatible with loop re-ordering. You could, however, turn the count into a for loop and @turbo that for a slight additional speedup.
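As a rough sketch of that last idea (hedged: this assumes LoopVectorization.jl is installed, count_matches is my name rather than part of the original code, and I haven't benchmarked this exact variant):

using LoopVectorization

# Count occurrences of `target` with a @turbo-compatible reduction loop.
function count_matches(words, target)
    c = 0
    @turbo for i in eachindex(words)
        c += words[i] == target  # the comparison contributes 0 or 1
    end
    return c
end

The final line of word_sim! would then become result = count_matches(words, omega + 1).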
There are also other options for potentially faster random number generation, such as VectorizedRNG.jl, as discussed in the discourse thread linked in the comments. While allocating a new vector of random numbers on each call of word_sim is likely not optimal, RNG is generally faster when you can generate a lot of random numbers at once, so passing a pre-allocated buffer of random numbers to word_sim! and filling it in place with rand! (as provided by either the Random stdlib or VectorizedRNG) could yield a significant additional speedup, as sketched below.
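Here is a hedged sketch of that buffered approach (the extra tests argument and the function name are my additions, not part of the original code): the caller preallocates tests = zeros(omega) once, and each call refills it in place.

using Random

function word_sim_buffered!(words::AbstractArray, tests::AbstractVector,
                            tau::Int, omega::Int, mu::Float64)
    rand!(tests)               # refill the threshold buffer in place; no allocation
    words[1] = 1
    next_word = 2
    words[tau+1] = omega + 1
    @inbounds for i = 2:tau
        if mu > tests[i]       # innovate
            words[i] = next_word
            next_word += 1
        else                   # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    @inbounds for i = (tau+2):omega
        if mu > tests[i]       # innovate
            words[i] = next_word
            next_word += 1
        else                   # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    return count(x -> x == omega + 1, words)
end

Note that the copy branch still draws its random index one at a time; only the threshold draws are batched, so the gain would come from the single batched rand! call.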
Some of the tricks and rules of thumb used in this answer are discussed more generally in https://github.com/brenhinkeller/JuliaAdviceForMatlabProgrammers, along with a few other general Matlab -> Julia tips.

Understanding this Ruby program concerned with prime numbers

This is a program that came from a textbook:
# Initialize our counter
i = 1
# i: [0, 100]
while (i <= 100)
  # Initialize prime flag
  prime_flag = true
  j = 2
  # Test divisibility of i from [0, i/2]
  while (j <= i / 2)
    # puts " i ==> " + i.to_s + " j ==> " + j.to_s
    if (i % j == 0)
      prime_flag = false
      # break
    end
    j = j + 1
  end
  # We found a prime!
  if prime_flag
    puts "Prime ==> " + i.to_s
  end
  # Increment the counter
  i += 1
end
The while (j <= i / 2) introduces a new loop. Why is this written if we are trying to find prime numbers? Prime numbers don't have square roots. What is the purpose of j being <= i / 2? I do not understand why j is introduced.
You are correct that you should be only checking numbers <= floor(sqrt(i)). The above code is unnecessarily checking numbers from ceil(sqrt(i)) through i/2. It would give the correct answer, however.
In addition, this is not very Ruby-like code. It's terrible and the author should feel terrible (unless they intended to show you something bad in order for you to be amazed when you see how you can write it better!).
Here's the same code done in a more Ruby-like manner. Note that prime? could definitely be a one-liner, but I split things onto more lines for readability in the context of the question:
def prime?(i)                       # Define a function that takes 1 parameter `i`
  max_num_to_check = Math.sqrt(i)   # No need to check numbers greater than sqrt(i)
  (2..max_num_to_check).all? do |j| # Return `true` if the following line is true for
                                    # all numbers in [2, max_num_to_check]
    i % j != 0                      # true if `i` is not evenly divisible by `j`. Any
                                    # input that evaluates to false here is not prime.
  end
end

# Test primality of numbers [1,100]
(1..100).each { |n| puts "Prime ==> #{n}" if prime? n }
I think the biggest differences between your book and this code are:
The algorithm is different in that we do not check all values, but rather limit the checks to <= sqrt(i). We also stop checking once we know a number is not prime.
We iterate over Ranges rather than keeping counters. This is slightly higher level and easier to read once you get the syntax.
I split the code into two parts, a function that calculates whether a parameter is prime or not, and then we iterate over a Range of inputs (1..100). This seems like a good division of functionality to me, helping readability.
Some language features used here not in your example:
If statements can go after expressions, and the expression is only evaluated if the predicate (the thing after the if) evaluates to true. This can make some statements more readable.
A range is written (x..y) and allows you to quickly describe a series of values that you can iterate over without keeping counters.
Code inside
do |param1, ..., paramN| <CODE>; end
or
{|param1, ..., paramN| <CODE>}
is called a block. It's an anonymous function (a function passed in as a parameter to another function/method). We use this with all? and each here.
each is used to run a block of code on each element of a collection, ignoring the return value
all? can be used to determine whether a block returns true for every item in a collection.
If you're not familiar with passing code into functions, this is probably a little confusing. It allows you to execute different code based on your needs. For example, each runs the supplied block for every item in the collection. You could do anything in that block without changing the definition of each... you just give it a block and it calls that block with the proper parameters. Here is a good blog post to get you started on how this works and what else you can use it for.

Julia: Return early from pmap()

Say I have something like the following:
function f(x)
    some_test ? true : false
end
If I do pmap(f, some_array) I'll get an array of Bools, and I'd like to do something if it contains a false (i.e. if contains(==, p, false)). In fact, I'd like to act as soon as there is at least one false: if some_array is very, very large, I would like pmap to stop once it finds its first false.
some_test may be quite involved so I've read that a parallel for loop is not the way to go.
E.g if I have
p = pmap(f, some_array, [N for i = 1:some_large_value])
if contains(==, p, false)
    return false
else
    return true
end
and a false appears when i=100, how can I stop pmap from checking 101:some_large_value?
As another example of the behavior I'd like, take this example from ?pmap.
julia> pmap(x->iseven(x) ? error("foo") : x, 1:4; on_error=ex->0)
4-element Array{Int64,1}:
 1
 0
 3
 0
Instead of on_error=ex->0, I'd like pmap to return on the first even number. Something like
pmap(x->iseven(x) ? return : x, 1:4)
which would ideally result in only a 1-element Array{Int64,1}.
This is generally hard to do, since other tasks may have already started. If you're not that worried about doing a few extra runs, one way is to modify the pmap example from the parallel computing docs
function pmap_chk(f, lst)
    np = nprocs() # determine the number of processes available
    n = length(lst)
    results = Vector{Any}(n)
    i = 0
    nextidx() = (i += 1; i)
    done = false
    isdone() = done
    setdone(flag) = (done = flag)
    @sync begin
        for p = 1:np
            if p != myid() || np == 1
                @async begin
                    while !isdone()
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        r, flag = remotecall_fetch(f, p, lst[idx])
                        results[idx] = r
                        if flag
                            setdone(flag)
                        end
                    end
                end
            end
        end
    end
    resize!(results, i)
end
Here f should return a tuple containing the result and whether or not it is done, e.g.
julia> pmap_chk(1:100) do f
           r = rand()
           sleep(r)
           (r, r > 0.9)
       end
15-element Array{Any,1}:
 0.197364
 0.60551
 0.794526
 0.105827
 0.612087
 0.170032
 0.8584
 0.533681
 0.46603
 0.901562
 0.0894842
 0.718619
 0.501523
 0.407671
 0.514958
Note that it doesn't stop immediately.
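To map this back to the original question, a hypothetical usage (reusing f and some_array from the question) would flag completion as soon as a false shows up:

res = pmap_chk(some_array) do x
    v = f(x)         # the Bool we actually care about
    (v, v == false)  # second element signals "done" once a false appears
end
all_passed = all(res)  # false if any completed run returned false

Since a few extra evaluations may finish after the first false, res can contain more elements than strictly needed, but the overall answer is unaffected.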

How to decide between Global or Local vector for m choose n

Suppose we want to write code that prints all the ways of selecting n out of m options.
I think the programming language does not matter, but if I should state it: Python.
I put the assignments in a vector A. Is it better to define A as a global variable or to pass it to the function each time? Why?
# assumes a global list A of length m, e.g. A = [0] * m
def choose(ind, n):
    if n == 0:
        print(A)
        return
    elif len(A) <= ind:
        return
    else:
        A[ind] = 1
        choose(ind + 1, n - 1)
        A[ind] = 0
        choose(ind + 1, n)
Always prefer passing over mutating globals whenever feasible.
Say you have the following functions:
def some_fun1(n):
    return n + 1

m = 1

def some_fun2():
    return m + 1
With the first function, you can load up your REPL and throw data at it just by passing arguments. Your testing of that pure function has zero effect on the rest of the program, which makes testing significantly easier.
With the second function, any time you need to test it, you must manually set all the globals the function relies on, which could affect the operation of other functions if they rely on the same globals. This makes testing harder, and for that reason, among others, mutating globals should be avoided.

while-loop faster than for when returning iterator

I'm trying to oversimplify this as much as possible.
Functions f1 and f2 implement a very simplified version of roulette wheel selection over a Vector R. The only difference between them is that f1 uses a for loop and f2 a while loop. Both functions return the index of the array at which the condition was met.
R = rand(100)

function f1(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
end

function f2(X::Vector)
    l = length(X)
    r = rand()*X[l]
    i = 1
    while true
        if r <= X[i]
            return i
        end
        i += 1
    end
end
Now I created a couple of test functions...
M is the number of times we repeat the function execution.
Now this is critical... I want to store the values I get from the functions because I need them later. To oversimplify the code, I just created a new variable r in which I sum up the returns from the functions.
function test01(M, R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f1(cumR)
        r += a
    end
    return r
end

function test02(M, R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f2(cumR)
        r += a
    end
    return r
end
So, next I get:
@time test01(1e7, R)
elapsed time: 1.263974802 seconds (320000832 bytes allocated, 15.06% gc time)

@time test02(1e7, R)
elapsed time: 0.57086421 seconds (1088 bytes allocated)
So, for some reason I can't figure out, f1 allocates a lot of memory, and the amount grows with M.
I said the line r += a was critical because, if I remove it from both test functions, I get the same timing from both tests, so no problem! So I thought there was a problem with the type of a being returned by the functions (because f1 returns the iterator of the for loop, and f2 uses its own variable i "manually declared" inside the function).
But...
aa = f1(cumsum(R))
bb = f2(cumsum(R))
typeof(aa) == typeof(bb)
true
So... what the hell is going on???
I apologize if this is some sort of basic question, but I've been going over this for over 3 hours now and couldn't find an answer. Even though the functions are fixed by using a while loop, I hate not knowing what's going on.
Thanks.
When you see lots of surprising allocations like that, a good first thing to check is type-stability. The @code_warntype macro is very helpful here:
julia> @code_warntype f1(R)
# … lots of annotated code, but the important part is this last line:
end::Union{Int64,Void}

Compare that to f2:

julia> @code_warntype f2(R)
# ...
end::Int64
So, why are the two different? Julia thinks that f1 might sometimes return nothing (which is of type Void)! Look again at your f1 function: what would happen if the last element of X were NaN? Execution would just fall off the end of the function with no explicit return statement. In f2, however, you'd end up indexing beyond the bounds of X and get an error instead. Fix this type-instability by deciding what to do if the loop completes without finding the answer, and you'll see much more similar timings.
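For instance, a minimal fix (one option among several, shown here as a sketch) is to give f1 an explicit fallback return so that every code path yields an Int:

function f1(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
    return l  # fallback if the loop finds no match, so the return type is always Int
end

With this change, @code_warntype should report end::Int64 for f1 as well.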
As I stated in the comment, your functions f1 and f2 both draw a random number inside the function body and use it as the stopping criterion. Thus, there is no deterministic way to measure which of the functions is faster (the outcome doesn't depend only on the implementation).
You can change f1 and f2 to accept r as a parameter:
function f1(X::Vector, r)
    for i = 1:length(X)
        if r <= X[i]
            return i
        end
    end
end

function f2(X::Vector, r)
    i = 1
    while i <= length(X)
        if r <= X[i]
            return i
        end
        i += 1
    end
end
And then measure the time properly with the same R and r for both functions:
julia> R = cumsum(rand(100));

julia> r = rand(1_000_000) * R[end];  # generate 1_000_000 random thresholds

julia> @time for i = 1:length(r); f1(R, r[i]); end;
  0.177048 seconds (4.00 M allocations: 76.278 MB, 2.70% gc time)

julia> @time for i = 1:length(r); f2(R, r[i]); end;
  0.173244 seconds (4.00 M allocations: 76.278 MB, 2.76% gc time)
As you can see, the timings are now nearly identical. Any remaining difference is caused by external factors (warm-up, or the processor being busy with other tasks).
