I am working on a scientific code that is experiencing issues with parallelization.
The parallel version is slower than the serial one and I am not sure if the right approaches are used for this application.
How can I improve the performance of the parallel calculation?
Is the right approach being used or should other packages / functions be considered for parallelization?
I have already tried a larger workload; however, this makes no difference.
I suspect the problem is somehow due to data movement between workers, but I don't know how to check or improve this.
Parallel programming with Julia is still relatively new for me, so I am very grateful for any help!
The simulation code is something of a benchmark for the Julia programming language: our team is considering Julia for all future projects if clear performance advantages over the current workflow can be demonstrated.
Because of this, I would like to maximize performance, especially since calculations with very large models, and possibly runs on a cluster, are planned.
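Regarding the data-movement suspicion: the one check I came up with (a minimal sketch, sizes arbitrary) is to compare how long a remotecall_fetch takes on a plain Array, which is serialized to the worker on every call, with the same call on a SharedArray, where only a lightweight handle travels:
using Distributed
addprocs(4)
@everywhere using SharedArrays

A = SharedArray{Float64}(3, 10_000)  # backed by shared memory, visible to all workers
B = rand(3, 10_000)                  # plain Array, serialized to the worker on each call

w = first(workers())
remotecall_fetch(sum, w, B); remotecall_fetch(sum, w, A)  # warm up / compile
@time remotecall_fetch(sum, w, B)  # pays the cost of shipping B to w
@time remotecall_fetch(sum, w, A)  # only the SharedArray handle is sent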
Minimum Working Example
The critical parts of the code can be broken down to the following example.
I start the process as follows:
using Distributed
addprocs();
@everywhere using SharedArrays, LinearAlgebra, Test
First I define the simulation model, containing all data used for the calculations.
Is it actually okay to store SharedArrays with other data in a struct or should a different approach be used?
@everywhere struct Model
    idx::Vector{Tuple{Int,Int}}  # index pairs
    A::SharedMatrix{Float64}     # results, will be constantly updated
    B::Vector{Float64}           # part of pre-processing, will only be read
end
See the non-parallel version of the function used for the update of the model below.
function update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
For parallelization, I simply tried the following. Would an approach with pmap perhaps be better in this case?
@everywhere function parallel_update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    @sync @distributed for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
To test the results I use the following function:
@everywhere function test_my_code()
    # provide some data
    n = 10000000
    idx = [(rand(1:n),rand(1:n)) for k=1:n]
    A = SharedArray(hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...))
    B = [rand(0.:1000.) for k=1:n]
    # define models
    model1 = Model(idx,A,B)
    model2 = Model(idx,A,B)
    # test and compare results
    @time update(model1,2.)
    @time parallel_update(model2,2.)
    @test model1 == model2
end
julia> test_my_code() # first run
6.350694 seconds (50.00 M allocations: 5.215 GiB, 13.66% gc time)
11.422999 seconds (6.69 k allocations: 446.156 KiB)
Test Passed
julia> test_my_code() # second run
6.286828 seconds (50.00 M allocations: 5.215 GiB, 18.35% gc time)
6.297144 seconds (2.92 k allocations: 143.516 KiB)
Test Passed
Note: significant performance improvements for the serial code
I was already able to significantly improve the performance of the serial function and reduce the number of allocations to zero.
Since this seems to make no difference to the parallelization problem, I used the shorter, easier-to-read version for the previous example.
See the serial code below.
using LinearAlgebra, Test
struct Model
    idx::Vector{Tuple{Int,Int}}
    A::Matrix{Float64}
    B::Vector{Float64}
end

function update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
function update_fast(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                 (m.A[2,i]-m.A[2,j])^2 +
                 (m.A[3,i]-m.A[3,j])^2)
        k = factor * m.B[cnt]
        m.A[1,i] += k*L
        m.A[2,i] += k*L
        m.A[3,i] += k*L
        m.A[1,j] -= k*L
        m.A[2,j] -= k*L
        m.A[3,j] -= k*L
    end
end
function test_serial_speedup()
    n = 10000000
    idx = [(rand(1:n),rand(1:n)) for k=1:n]
    A = hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...)
    B = [rand(0.:1000.) for k=1:n]
    model1 = Model(idx,A,B)
    model2 = Model(idx,A,B)
    @time update(model1,2.)
    @time update_fast(model2,2.)
    @test model1 == model2
end
julia> test_serial_speedup()
5.008049 seconds (50.00 M allocations: 5.215 GiB, 18.14% gc time)
0.464986 seconds
Test Passed
I have a simple function which appears at several places in my Julia code and is run millions of times inside a loop. The function essentially does rand([1,-1,im,-im]); that is, it picks one of four given values. I noticed that this function takes a substantial amount of time in my huge loop, so I tried to write it in a slightly faster way like this:
function qpsk()
    temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
    temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
    temp1*temp2
end
Then, it is typically called like this:
sig = complex(zeros(N))
for i = 1:N
    sig[i] = qpsk()
end
Now, is there any way to further optimize this function, or use another faster method? Appreciate your help.
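(For reference, here is roughly how I time the candidates; a minimal sketch using BenchmarkTools, with N picked arbitrarily:)
using BenchmarkTools

N = 10^6
sig = complex(zeros(N));

# time the whole fill loop rather than a single ~ns call,
# so the per-call overhead is averaged over many iterations
@btime for i = 1:$N
    $sig[i] = qpsk()
end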
Comments on current answers:
The answer of @DanGetz (22 lines??) doesn't solve the problem, because at the moment Julia is not as good with vectorized code as with explicit loops. Also, my simple one-line qpsk2(s) below is about 2X faster than those "cryptic" 22 lines of code in Dan's original answer (though it does create a vector, which adds time).
But the questions remain: why didn't they implement something like qpsk1 below? And why is my original qpsk with branching more than 3X faster than the straightforward qpsk4(s) below?
I added more versions below to guide the discussion, in case more experienced people would like to jump in.
qpsk1(s) = s[1+(rand(Int8)&3)] # Blazingly fast
qpsk2(s) = s[1+rand(Bool)+2rand(Bool)] # Very fast
qpsk3(s) = s[rand(1:4,1)] # Compiler issue here?
qpsk4(s) = s[rand(1:4)] # Why slow?
qpsk5(s) = rand([s]) # Ridiculously slow!!
function test_orig(n) # test qpsk(); very fast (branching!), why?
    for i = 1:n
        qpsk()
    end
end
using StaticArrays
function test(func, n) # test all of qpsk1 through qpsk5
    s = SVector(1,-1,im,-im)
    for i = 1:n
        func(s)
    end
end
@time test(qpsk1,10^8)  # 0.554994 seconds (5 allocations: 176 bytes)
@time test(qpsk2,10^8)  # 0.755286 seconds (5 allocations: 176 bytes)
@time test(qpsk3,10^8)  # 13.431529 seconds (400 M allocations: 26.822 GiB, 20.68% gc time)
@time test(qpsk4,10^8)  # 2.520085 seconds (5 allocations: 176 bytes)
@time test(qpsk5,10^8)  # 10.881852 seconds (200 M allocations: 20.862 GiB, 19.76% gc time)
@time test_orig(10^8)   # 0.771778 seconds (5 allocations: 176 bytes)
@time nqpsk2(10^8);     # 1.402830 seconds (9 allocations: 1.490 GiB, 6.39% gc time)
Summary of answer
[(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]
generates a length N vector faster.
Answer
Calculating the random bits is the bulk of the work, so exploring Chris' idea from the comments of using RandomNumbers.jl is worth a shot. Additionally, we can use @rickhg12hs's idea of extracting more bits from each random number generated. Regardless, generating a block of values together is essential for better optimization.
For example, the following code (nqpsk1 uses qpsk from the question as the baseline; nqpsk2 is a suggested improvement):
function qpsk()
    temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
    temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
    temp1*temp2
end

nqpsk1(n::Int) = [qpsk() for i=1:n]

nqpsk2(n::Int) = begin
    res = zeros(Int,2*n)
    blocks = n >>> 4                # use blocks of 16 values
    btail = n & 0x000000000000000f  # in case n is not a multiple of 16
    pos = 1
    @inbounds for i=1:blocks
        bits = rand(UInt32)         # get random bits for a whole block
        for j=1:16
            b1 = Bool(bits & 1)
            bits >>>= 1
            b2 = Bool(bits & 1)
            bits >>>= 1
            res[pos+b1] = (-1)^b2
            pos += 2
        end
    end
    @inbounds for i=1:btail
        res[pos+rand(Bool)] = (-1)^rand(Bool)
        pos += 2
    end
    return reinterpret(Complex{Int64},res)
end
achieved a >4x improvement on my setup (Julia 0.7):
julia> using BenchmarkTools
julia> @btime nqpsk1(320);
8.791 μs (323 allocations: 15.19 KiB)
julia> @btime nqpsk2(320);
1.056 μs (3 allocations: 5.20 KiB)
Update
With only a modest compromise in speed (and some allocation), but much better-looking code:
function nqpsk3(n::Int)
    res = zeros(Int,2n)
    rv1 = rand!(BitVector(n))
    rv2 = rand!(BitVector(n))
    @inbounds for (b1,b2,i) in zip(rv1,rv2,1:2:2n)
        res[i+b1] = (-1)^b2
    end
    return reinterpret(Complex{Int},res)
end
The benchmark:
julia> @btime nqpsk3(320);
1.780 μs (11 allocations: 5.83 KiB)
Addendum
And the one-(wrapped-)line version does OK (2.48 μs) too:
nqpsk4(n) = [(1+0im,-1+0im,0+im,0-im)[2b1+b2+1] for
(b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Finally, the real one-line version (1.96 μs):
nqpsk5(n) = [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
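(Note: on Julia 0.7+, rand! lives in the Random standard library, so the rand!(BitVector(n)) variants above additionally assume a using Random at the top.)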
Latest state of investigation
My current best solution is the following:
function g(pX::Array{Complex{Float64},1})
    tab = [1.0,im,-1.0,-im]
    bits = UInt128(0)
    @inbounds for i = 1 : length(pX)
        bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
        pX[i] = tab[(bits & 3)+1]
    end
end
sig = complex(zeros(1280));
using BenchmarkTools
@btime g(sig)
3.838 μs (13 allocations: 464 bytes)
This is better than my optimized version of Dan Getz's code, which runs with the same N, and I find it much more readable:
4.236 μs (4 allocations: 20.16 KiB)
However, the performance is extremely fragile. Just look at the subtle differences in this 36-times-slower version:
function g(pX::Array{Complex{Float64},1})
    tab = [1,im,-1,-im]
    bits = 0
    for i = 1 : length(pX)
        bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
        pX[i] = tab[(bits & 3)+1]
    end
end
138.320 μs (10209 allocations: 319.14 KiB)
Did you find the differences?
no conversion from Int64 to Float64 (tab holds Complex{Float64} values in the fast version)
type stability: bits always stays a UInt128 (see the @code_warntype sketch below)
range checking disabled with @inbounds
(To follow the convention, g() should also be renamed to g!().)
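A quick way to see the type-stability difference is @code_warntype (a sketch; names as above):
sig = complex(zeros(1280));
@code_warntype g(sig)
# in the slow version the local `bits` is inferred as Union{Int64, UInt128},
# because it starts as the Int literal 0 but is later assigned rand(UInt128);
# in the fast version it stays a concrete UInt128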
Below you find the evolution toward the currently best-timed solution.
My first attempt at an answer addressed two general weaknesses:
a) calling functions is expensive due to call overhead.
b) complex calculations are more time consuming than a lookup.
This ended up with the proposal
cases = [1+0im,0+1im,-1+0im,0-1im]
g() = cases[rand(1:4)]
# to use, just call g()
g()
What happened?
Why did a) not succeed?
using BenchmarkTools
test(n) = [g() for i = 1:n]
g() = rand()   # redefine g()'s body with each expression from the table below
@btime test(800);
This results in the following timings (in μs):
rand() => 5.784
rand(Float32) => 5.604
rand(Float64) => 5.821
rand(Bool) => 5.167
rand(Int8) => 5.126
rand(Int16) => 5.171
rand(Int32) => 5.631
rand(Int64) => 7.980
rand(Int128) => 10.549
rand(1:4) => 28.603
(rand(Int8) % 4) + 1 => 6.053
(rand(Int8) & 3) + 1 => 5.843
rand(0:255) => 28.568
rand(UInt8) => 5.104
rand([1,2,3,4]) => 58.437
l = [1,2,3,4]; g() = rand(l) => 47.399
rand(l, 1) => 70.052
m = (1,2,3,4); rand(m) => 124.311
0 => 0.872
0.0 => 0.887
Int8(0) => 0.113
return => 0.33
(running Julia 0.6 on Ubuntu)
How to judge the results
Requesting Float32 and Float64 needs the same time. This may be an indicator that Float64 does NOT use the full mantissa (52 bits) for the random value.
rand for Bool, Int8 and Int16 needs nearly the same time. Probably the same algorithm, just using fewer bits.
rand for Int32 is slightly more time consuming. Int64 and Int128 take more time, but less than proportionally to their width.
rand(1:4) takes surprisingly much more time. It should be in the range of rand(Int8), since (rand(Int8) & 3) + 1 gives an equivalent uniform draw. Even if I hurt somebody's religious feelings, this is just poor code.
The same goes for rand(UInt8) versus rand(0:255).
The performance of rand with arrays and tuples is far from acceptable!
Why did b) not succeed?
Julia seems unable to look up efficiently from tuples or arrays. But even if the lookup were fast, the rand call would dominate.
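To separate the lookup cost from the rand cost, here is a small sketch (variable names are mine) that times indexing alone with a pre-drawn index:
using BenchmarkTools

s_arr = [1, -1, im, -im]   # Vector{Complex{Int}}: homogeneous, heap-allocated
s_tup = (1, -1, im, -im)   # heterogeneous tuple (Int and Complex{Bool} elements),
                           # so indexing with a runtime index is type-unstable
idx = rand(1:4)
@btime $s_arr[$idx]
@btime $s_tup[$idx]
@btime rand(1:4)           # the draw itself, for comparison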
Other approaches
Dan Getz's approach uses all bits from a rand call, so his first algorithm needs only 1/16 of a rand call per value. However, this can be improved by using UInt128, since then only 1/64 of a call per value is required.
On my machine, Dan Getz's original code takes 17.314 μs for 1280 values, while the modified code takes 4.595 μs. The improvement is proportional to the reduced number of calls to rand:
test2(n::Int) = begin
    res = zeros(Int,2*n)
    blocks = n >>> 6                # use blocks of 64 values (128 bits)
    btail = n & 0x000000000000003f  # in case n is not a multiple of 64
    pos = 1
    @inbounds for i=1:blocks
        bits = rand(UInt128)        # get random bits for a whole block
        for j=1:64
            b1 = Bool(bits & 1)
            bits >>>= 1
            b2 = Bool(bits & 1)
            bits >>>= 1
            res[pos+b1] = (-1)^b2
            pos += 2
        end
    end
    @inbounds for i=1:btail
        res[pos+rand(Bool)] = (-1)^rand(Bool)
        pos += 2
    end
    return reinterpret(Complex{Int64},res)
end
@btime test2(1280);
However, using reinterpret requires knowing the bit layout of the different structures, which is not a really good idea.
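A sketch of a variant that builds the Complex values directly instead of reinterpreting raw ints (same bit-extraction idea, one branch per element; untimed, and written with the 0.6-era syntax used here):
function test3(n::Int)
    res = Vector{Complex{Int}}(n)   # uninitialized (Julia 0.6 constructor)
    bits = rand(UInt128)
    avail = 64                      # 2-bit draws left in `bits`
    @inbounds for i = 1:n
        if avail == 0
            bits = rand(UInt128)
            avail = 64
        end
        b1 = Bool(bits & 1); bits >>>= 1
        b2 = Bool(bits & 1); bits >>>= 1
        avail -= 1
        v = (-1)^b2
        # matches res[pos+b1] = (-1)^b2 above: real slot if !b1, imag slot if b1
        res[i] = b1 ? Complex(0, v) : Complex(v, 0)
    end
    return res
end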
High-level view
In the end, all the questioner has coded is a complicated way to build an array of random numbers from 1 to 4 (or 0 to 3). I would try to optimize the next step in the questioner's pipeline, but no info about it was supplied.
In the following case Julia performs much better, which sounds somewhat strange. More values returned, less time??
@btime rand(0:3, 1280)
=> 24.377
PS:
Just to compare the numbers with Dan Getz's last approach, the following code takes 27.004 μs:
N = 1280
@btime [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]
I am trying to do this task: codewars kata
Description:
In John's car the GPS records every s seconds the distance travelled from an origin (distances are measured in an arbitrary but consistent unit). For example, below is part of a record with s = 15:
x = [0.0, 0.19, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25]
The sections are:
0.0-0.19, 0.19-0.5, 0.5-0.75, 0.75-1.0, 1.0-1.25, 1.25-1.50, 1.5-1.75, 1.75-2.0, 2.0-2.25
We can calculate John's average hourly speed on every section and we get:
[45.6, 74.4, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0]
Given s and x, the task is to return, as an integer, the floor of the maximum average speed per hour obtained on the sections of x. If the length of x is less than or equal to 1, return 0: the car didn't move.
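For instance, the first section covers 0.19 units in s = 15 seconds, so its average hourly speed is 0.19 * 3600 / 15 = 45.6.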
Example:
with the above data, your function gps(s, x) should return 74.
My code:
def gps(s, x)
  i = 0
  speed = 0
  max = 0
  0 if x.length <= 1
  while i < x.length - 2
    speed = get_speed(x[i].to_f, x[i + 1].to_f, s)
    max = speed if speed > max
    i += 1
  end
  print max.floor
end

def get_speed(a, b, s)
  ((b - a).abs * ((60 / s) * 60))
end
The problem is that some tests don't pass.
Tests:
gps(20, [0.0, 0.23, 0.46, 0.69, 0.92, 1.15, 1.38, 1.61]) result: 41 - correct
gps(12, [0.0, 0.11, 0.22, 0.33, 0.44, 0.65, 1.08, 1.26, 1.68, 1.89, 2.1, 2.31, 2.52, 3.25]) result: 77 - incorrect, should be 219.
I have no idea where I went wrong. Could someone give me a hint to resolve the problem?
@mcfinnigan's answer correctly identifies the immediate mistake in your code, but the real underlying cause is that you are not writing idiomatic Ruby. If you were writing idiomatic Ruby (instead of FORTRAN in Ruby syntax, as you are doing), then you would use iterators instead of manually fiddling with loop indices, and the problem wouldn't even arise in the first place. Something like this:
def gps(interval, measurements)
  compute_result(interval, measurements).tap(&method(:print))
end

private

def compute_result(interval, measurements)
  return 0 if measurements.length <= 1
  hourly_speed(max_distance(*distances(*measurements)), interval)
end

def distances(*measurements)
  measurements.
    each_cons(2).        # iterate over all consecutive pairs
    map {|a, b| b - a }  # transform to list of distances travelled
end

def max_distance(*distances)
  distances.max
end

def hourly_speed(distance, time_in_seconds)
  seconds_per_hour = 60.0 * 60
  (distance * seconds_per_hour / time_in_seconds).floor
end
As you can see, there are no loops, no indices, and no loop conditions; in fact, apart from the edge case of an empty measurements array, there are no conditions at all, and so there are no conditions, indices, loops, etc. to get wrong.
The problem is broken down into smaller subproblems that can be tested and debugged individually. Every method returns a value (instead of just printing to the console), which makes it possible to easily test it automatically (and also to reuse it in other methods).
while i < x.length - 2
This appears to be the issue. A classic off-by-one error; you are not considering the final element in your array.
Change your condition to
while i < x.length - 1
and your bug goes away.
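You can check this against the failing test: with 14 measurements, the old condition stops at i == 11, so the final pair (2.52, 3.25) is never examined, and that pair is exactly the fastest section: (3.25 - 2.52) * 3600 / 12 = 219.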
As part of my script, I have some code as follows (devectorized Julia, as far as possible):
for kk=1:n  # main loop
    for j=1:m
        rhs[j] = 2*u0[j] - alf*dt*u1[j] - 2*mu*u2[j];
    end
    c = lhs\rhs';  # c: coefficients to be obtained
    u2 = c'*h;
    u1 = c'*p .- c'*f;
    u0 = c'*Q - c'*f*x;
    for j=1:m
        for i=1:m
            lhs[j,i] = 2*(Q[i,j]-x[j]*f[i]) + alf*dt*(p[i,j]-f[i]) +
                       eps*dt*(Q[i,j]-x[j]*f[i])*u1[j] +
                       eps*u0[j]*dt*(p[i,j]-f[i]) - 2*mu*h[i,j];
        end
    end
end
where h, p, Q, and lhs are m×m matrices; u0, u1, u2, rhs, and x are 1×m arrays; alf, dt, mu, and eps are scalar constants; and f, c are m×1 arrays. I preallocated the matrices and arrays at the start of the script. The vectorized form of the above code is as follows:
for kk=1:n  # main loop
    rhs = 2*u0 - alf*dt*u1 - 2*mu*u2;
    c = lhs\rhs';  # c: coefficients to be obtained
    u2 = c'*h;
    u1 = c'*p .- c'*f;
    u0 = c'*Q - c'*f*x;
    lhs = 2*(Q-f*x) + alf*dt*(p.-f) + eps*dt*(Q-f*x).*u1 + eps*dt*u0.*(p.-f) - 2*mu*h;
    lhs = lhs';
end
For example, for n=100 and m=64 the elapsed times are as follows:
devectorized julia: 1.8 seconds
vectorized julia: 0.2 seconds
vectorized numpy: 0.04 seconds
The vectorized Julia code is approximately 9 times faster than the devectorized Julia code, and the vectorized NumPy code is approximately 5 times faster than the vectorized Julia code.
For n=500 and m=256
devectorized julia: 85.589233013 seconds
vectorized julia: 8.232898003 seconds
vectorized numpy: 1.62000012398 seconds
My question: is it possible to increase the performance of Julia in this case?
I think it's also possible to devectorize the calculation of u0, u1, u2, like this:
function vectorized()
    m = [1.0 2.0 3.0; 1.0 2.0 3.0; 1.0 2.0 3.0]
    c = [1.0, 2.0, 3.0]
    for i in 1:100000
        x1 = c'*m
        x2 = c'*m
        x3 = c'*m
    end
    return
end

function vectime(N)
    timings = Array(Float64, N)
    # force compilation
    vectorized()
    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end
    return timings
end

println("vectorized=", mean(vectime(20)))
function devectorized()
    m = [1.0 2.0 3.0; 1.0 2.0 3.0; 1.0 2.0 3.0]
    c = [1.0, 2.0, 3.0]
    x1 = [0.0, 0.0, 0.0]
    x2 = [0.0, 0.0, 0.0]
    x3 = [0.0, 0.0, 0.0]
    mx = 3
    for i in 1:100000
        for k in 1:mx
            for kk in 1:mx
                # x[kk] accumulates the kk-th entry of c'*m
                x1[kk] = x1[kk] + c[k]*m[k,kk];
                x2[kk] = x2[kk] + c[k]*m[k,kk];
                x3[kk] = x3[kk] + c[k]*m[k,kk];
            end
        end
    end
    return
end

function dvectime(N)
    timings = Array(Float64, N)
    # force compilation
    devectorized()
    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end
    return timings
end

println("devectorized=", mean(dvectime(20)))
The above code gives:
vectorized=0.17680755404999998
devectorized=0.00441064295
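Another option, without giving up the readable vectorized form, would be to preallocate the result vectors and use the era's in-place multiplication routines, so each iteration reuses buffers instead of allocating fresh arrays. A sketch (untimed; At_mul_B!(y, A, x) computes A'*x in place, which equals (c'*m)' here; on Julia 0.7+ this would be mul!(y, transpose(m), c)):
function vectorized_inplace()
    m = [1.0 2.0 3.0; 1.0 2.0 3.0; 1.0 2.0 3.0]
    c = [1.0, 2.0, 3.0]
    x1 = zeros(3); x2 = zeros(3); x3 = zeros(3)   # preallocated once
    for i in 1:100000
        At_mul_B!(x1, m, c)   # x1 = m'*c, no temporaries per iteration
        At_mul_B!(x2, m, c)
        At_mul_B!(x3, m, c)
    end
    return
end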
I am interested in using Julia SharedArrays for a scientific computing project. My current implementation appeals to BLAS for all matrix-vector operations, but I thought that perhaps a SharedArray would offer some speedup on multicore machines. My idea is to simply update an output vector index-by-index, farming the index updates to worker processes.
Previous discussions here about SharedArrays and here about shared memory objects did not offer clear guidance on this issue. It seems intuitively simple enough, but after testing, I'm somewhat confused as to why this approach works so poorly (see code below). For starters, it seems like @parallel for allocates a lot of memory. And if I prefix the loop with @sync, which seems like a smart thing to do if the whole output vector is required later, then the parallel loop is substantially slower (though without @sync, the loop is mighty quick).
Have I incorrectly interpreted the proper use of the SharedArray object? Or perhaps did I inefficiently assign the calculations?
### test for speed gain w/ SharedArray vs. Array ###

# problem dimensions
n = 10000; p = 25000

# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)

# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)

# make SharedArrays
X = convert(SharedArray{Float64,2}, x)
Y = convert(SharedArray{Float64,1}, y)
Z = convert(SharedArray{Float64,1}, z)

# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)

# does BLAS work equally well for SharedArrays?
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))") # should be true

# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'
@parallel for i = 1:n
    Z[i] = dot(Y, Xt[:,i])
end

# now time the synchronized copy of this
@time @sync @parallel for i = 1:n
    Z[i] = dot(Y, Xt[:,i])
end

# still get same result?
println("$(isequal(z,Z))") # should be true
Output from test with 4 workers + 1 master node:
elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true
You're running into several issues, of which the most important is that Xt[:,i] creates a new array (allocating memory). Here's a demonstration that gets you closer to what you want:
n = 10000; p = 25000

# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)

# make SharedArrays
X = convert(SharedArray, x)
Y = convert(SharedArray, y)
Z = convert(SharedArray, z)

Xt = X'

@everywhere function dotcol(a, B, j)
    length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
    s = 0.0
    @inbounds @simd for i = 1:length(a)
        s += a[i]*B[i,j]
    end
    s
end

function run1!(Z, Y, Xt)
    for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end

function runp!(Z, Y, Xt)
    @sync @parallel for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end

run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))
fill!(Z, -1)
@time runp!(Z, Y, Xt)
@show sdata(Z) == zc
Results (when starting julia -p 8):
julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
For comparison, when running on this same machine:
julia> blas_set_num_threads(8)
julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)
So the raw Julia implementation is at least competitive with BLAS.
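As an aside, on Julia versions that have view (0.5+), the allocation from Xt[:,i] in the original loop can also be avoided without a hand-written kernel, at the cost of going through a SubArray (a sketch, not benchmarked here):
# no-copy column access: view creates a lightweight SubArray instead of copying
@sync @parallel for j = 1:size(Xt, 2)
    Z[j] = dot(Y, view(Xt, :, j))
end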