Originally this problem came up on mathematica.SE, but since multiple programming languages have been involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for c = 1:s
    for r = 1:s
        H(r,c) = 1/(r+c-1);
    end
end
toc
% Elapsed time is 1.359625 seconds.... For-loop
tic;
% vectorized version
c = 1:s;
r = c';
HH = 1./(r+c-1);
toc
% Elapsed time is 0.047916 seconds.... Vectorized
isequal(H,HH)
the vectorized code piece is more than 25 times faster than the pure for-loop code piece. Though I don't have access to MATLAB and so cannot test the sample myself, the timing of 1.359625 seconds suggests it was tested on an average PC, much like mine.
But I cannot reproduce the timing with other languages like Fortran or Julia! (Both are famous for their numerical performance, though I admit I'm by no means an expert in Fortran or Julia.)
The following are the samples I used for testing. I'm using a laptop with an i7-8565U CPU and Windows 10.
Fortran
The Fortran code is compiled with gfortran (TDM-GCC 10.3.0-2) with the -Ofast option.
program tst
    use, intrinsic :: iso_fortran_env
    implicit none
    integer, parameter :: s = 15000
    integer :: r, c
    real(real64) :: hmn(s,s)
    do r = 1, s
        do c = 1, s
            hmn(r,c) = 1._real64/(r + c - 1)
        end do
    end do
    print *, hmn(s,s)
end program
compilation timing: 0.2057823 seconds
execution timing: 0.7179657 seconds
Julia
The Julia version is 1.6.3.
@time (s=15000; Hmm=[1. /(r+c-1) for r=1:s,c=1:s];)
Timing: 0.7945998 seconds
Here come the questions:
1. Is the timing of MATLAB reliable?
2. If the answer to the 1st question is yes, then how can we reproduce the performance (for a 2 GHz CPU, the timing should be around 0.05 seconds) with Julia, Fortran, or any other programming language?
Just to add on the Julia side - make sure you use BenchmarkTools to benchmark, wrap the code you want to benchmark in functions so as not to benchmark in global scope, and interpolate any variables you pass to @btime.
Here's how I would do it:
julia> s = 15_000;

julia> function f_loop!(H)
           for c ∈ 1:size(H, 1)
               for r ∈ 1:size(H, 1)
                   H[r, c] = 1 / (r + c - 1)
               end
           end
       end
f_loop! (generic function with 1 method)

julia> function f_vec!(H)
           c = 1:size(H, 1)
           r = c'
           H .= 1 ./ (r .+ c .- 1)
       end
f_vec! (generic function with 1 method)

julia> H = zeros(s, s);

julia> using BenchmarkTools

julia> @btime f_loop!($H);
  625.891 ms (0 allocations: 0 bytes)

julia> H = zeros(s, s);

julia> @btime f_vec!($H);
  625.248 ms (0 allocations: 0 bytes)
So both versions come in at the same time, which is what I'd expect for such a straightforward operation, where properly type-inferred code should compile down to roughly the same machine code.
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example; however:
On first run (clear workspace):
- Loop approach takes 2.08 sec
- Vectorised approach takes 1.04 sec
- Vectorisation saves 50% execution time
On second run (workspace not cleared):
- Loop approach takes 2.55 sec
- Vectorised approach takes 0.065 sec
- Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num | Workspace cleared | s | Loop (sec) | Vectorised (sec)
1 | Yes | 15000 | 2.10 | 1.41
2 | No | 15000 | 2.73 | 0.07
3 | No | 15000 | 2.50 | 0.07
4 | No | 15001 | 2.74 | 1.73
See the variation between tests 2 and 3; this is why timeit would have been helpful for an average runtime (see footnote). The difference in output size between tests 3 and 4 is pretty small, but for the vectorised approach the execution time returns to a similar magnitude to that of test 1, suggesting that the re-allocation needed to create HH costs most of the time.
Footnote: tic/toc timings in MATLAB can be improved by using the in-built timeit function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit though is that it explicitly "warms up" (quoting a comment) the tic/toc function by calling it a couple of times. You can see when running tic/toc a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead for getting the timer initialised.
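For readers following along in Julia rather than MATLAB, the BenchmarkTools package plays a role comparable to timeit: it warms the code up and reports statistics over many samples instead of a single measurement. A minimal sketch of my own (not part of the original answer):

using BenchmarkTools

x = rand(1000)
b = @benchmark sum($x)   # many warmed-up samples, like timeit's repeated runs
minimum(b)               # best-case sample; far less noisy than a single @time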
I hope that the following modified benchmark sheds some new light on the problem:
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for i = 1:10
    for c = 1:s
        for r = 1:s
            H(r,c) = H(r,c) + 1/(r+c-1+i);
        end
    end
end
toc
tic;
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i = 1:10
    HH = HH + 1./(r+c-1+i);
end
toc
isequal(H,HH)
In this case any kind of "caching" is avoided by changing the matrix H (HH) at each iteration of the for-loop over i.
In this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So there is still a performance improvement (~3x) due to the vectorization, which probably comes from the implicit multi-threading of vectorized MATLAB commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.
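If implicit multi-threading really is where MATLAB's remaining advantage comes from, one way to probe that from the Julia side is to thread the loop explicitly. The following is a hedged sketch of my own (not from any of the posts above) and assumes Julia was started with several threads, e.g. julia -t auto:

using Base.Threads

# Fill the same matrix, handing each thread a share of the columns.
function f_threads!(H)
    n = size(H, 1)
    @threads for c in 1:n
        for r in 1:n
            @inbounds H[r, c] = 1 / (r + c - 1)
        end
    end
    return H
end

H = zeros(15_000, 15_000)
f_threads!(H)        # warm up (compilation)
@time f_threads!(H)  # the timing depends on how many threads are available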
To add to this, here is a simple python script which does the vectorized operation with numpy:
from timeit import default_timer
import numpy as np

s = 15000

start = default_timer()
# for-loop (note: Python is 0-indexed, so this is a rough translation of the
# MATLAB loop rather than an element-for-element replica of H)
H = np.zeros([s, s])
for c in range(1, s):
    for r in range(1, s):
        H[r, c] = 1 / (r + c - 1)
end = default_timer()
print(end - start)

start = default_timer()
# vectorized
c = np.arange(1, s).reshape([1, -1])
r = c.T
HH = 1 / (c + r - 1)
end = default_timer()
print(end - start)
for-loop: 32.94566780002788 seconds
vectorized: 0.494859800033737 seconds
While the for-loop version is terribly slow, the vectorized version is faster than the posted fortran/julia times. Numpy internally tries to use special SIMD hardware instructions to speed up arithmetic on vectors, which can make a significant difference. It's possible that the fortran/julia compilers weren't able to generate those instructions from the provided code, but numpy/matlab were able to. However, Matlab is still about 10x faster than the numpy code, which I don't think would be explained by better use of SIMD instructions. Instead, they may also be using multiple threads to parallelize the computation, since the matrix is fairly large.
Ultimately, I think the matlab numbers are plausible, but I'm not sure exactly how they're getting their speedup.
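On the Julia side you can at least nudge the compiler toward SIMD explicitly. This is a hedged sketch of mine, not code from the thread, and whether the division actually vectorizes depends on the CPU and the optimization level:

# Skip bounds checks and ask the compiler to try vectorizing the inner loop.
function f_simd!(H)
    n = size(H, 1)
    for c in 1:n
        @inbounds @simd for r in 1:n
            H[r, c] = 1 / (r + c - 1)
        end
    end
    return H
end

Comparing @btime f_simd!($H) against the plain loop versions shown earlier tells you whether SIMD is the missing ingredient on a given machine.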
I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k = 1:N
    vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + ...
        a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: using the profiler to check execution times, the loop above runs in about 3.3 s with N = 5e4, length(vec1) = 3e6 and length(vec2) = 1.7e3; the portions indexed by idx_start/end are on average slightly shorter than the latter, although not significantly so.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase N and length(vec1) in particular by 1 or 2 orders of magnitude, and with such a loop it will take considerably longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k = 1:N
    vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
    % use = idx_start1(k):idx_end1(k);
    % vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but much to my surprise they actually made it slower. Generally, I would create a variable for an array that is used more than once, thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 s for the commented-out code.
I've noticed a strange behavior of julia during a matrix copy.
Consider the following three functions:
function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = B[1:n,1:n]
    return nothing
end

function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1; jj = 1;
    while ii <= n
        jj = 1 #(*)
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end

function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = view(B, 1:n, 1:n)
    return nothing
end
Edit: 1) I tested whether the code would throw a BoundsError, so the line marked with jj = 1 #(*) was missing in the initial code. The testing results were already from the fixed version, so they remain unchanged. 2) I've added the view variant, thanks to @Colin T Bowers for addressing both issues.
It seems like both functions should lead to more or less the same code. Yet I get for
A = fill!(Matrix{Int32}(2^12,2^12),2); B = Int32.(eye(2^12));
the results
@timev priv_memcopyBtoA!(A,B, 2000)
0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns): 152511699
bytes allocated: 16000304
pool allocs: 9
malloc() calls: 1
GC pauses: 1
and
@timev priv_memcopyBtoA2!(A,B, 2000)
0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated: 160
pool allocs: 4
and
@timev priv_memcopyBtoA3!(A,B, 2000)
0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated: 224
pool allocs: 7
That's a drastic difference. It's also surprising. I've expected the first version to be like memcopy, which is hard to beat for a large memory block.
The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=) and the bounds check in each assignment. Yet each assignment takes just ~3 ns.
Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference becomes small, but it remains: still a factor of ~2.5 between version 3 and version 2.
So why is the "memcopy" version not as efficient as the "assignment" version?
Firstly, your code contains a bug. Run this:
A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)
then:
julia> A
2×2 Array{Int64,2}:
5 2
7 4
You need to re-assign jj back to 1 before each run of the inner while loop, ie:
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1
    while ii <= n
        jj = 1
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end
Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in julia create temporary arrays. So in this line:
A[1:n,1:n] = B[1:n,1:n]
the right-hand side operation creates a temporary nxn array, and then assigns the temporary array to the left-hand side.
If you wanted to avoid the temporary array allocation, you would instead write:
A[1:n,1:n] = view(B, 1:n, 1:n)
and you'll notice that the timings of the two methods are now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code. I would still expect the explicit loop to be faster than the view method.
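Another variant worth mentioning (my addition, not from the original answer) is to broadcast the assignment as well, which keeps both sides as views and writes directly into A without any temporary:

function priv_memcopyBtoA4!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    # .= assigns element-wise in place; @views turns the right-hand slice into a view
    @views A[1:n, 1:n] .= B[1:n, 1:n]
    return nothing
end

This should behave much like the view-based version, since there is no n×n temporary to allocate or garbage-collect.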
As for the garbage collection stuff, that is just a result of your method of timing. Much better to use @btime from the package BenchmarkTools, which uses various tricks to avoid traps like timing garbage collection etc.
Why is A[1:n,1:n] = view(B, 1:n, 1:n) or variants of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.
view returns an iterator which contains a pointer to the parent B and information about how to compute the indices which should be copied. A[1:n,1:n] = ... is parsed to a call to _setindex!(...). After that, and a few calls down the call chain, the main work is done by:
# .\abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end

# .\multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1, tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end
getindex takes a view V and an index I. We get the view from B and the index I from A. In each step, reindex computes from the view V and the index I the indices needed to fetch an element of B; that element is called r and is returned. Finally r is written to A.
After each copy, inc increments the index I to the next element of A and tests whether we are done. Note that the code is from v0.6.3, but on master it's more or less the same.
In principle the code could be reduced to a set of while loops, yet it is more general. It works for arbitrary views of B and arbitrary slices of the form a:b:c, and for an arbitrary number of matrix dimensions; in our case N is 2.
Since the functions are more complex, the compiler doesn't optimize them as well. I.e. there is a recommendation that the compiler should inline them, but it doesn't do that, which shows that the functions involved are non-trivial.
For a set of loops the compiler reduces the innermost loop to three additions (each for a pointer to A and B and one for the loop index) and a single copy instruction.
tl;dr The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is non-trivial and handles the general case. This induces overhead. A set of while loops is already optimized to a special case.
Note that the performance depends on the compiler. If one looks at the one-dimensional case A[1:n] = view(B, 1:n), it's faster than a while loop because the compiler vectorizes the code. Yet for higher dimensions N > 2 the difference grows.
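If you want to shave off the remaining per-assignment overhead that the question mentions (the bounds check in particular), one option is to mark the inner loop with @inbounds. This is my own hedged addition and assumes the caller guarantees that n never exceeds the array bounds:

function priv_memcopyBtoA2_inbounds!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    for ii in 1:n
        @inbounds for jj in 1:n   # bounds checks skipped; n must be valid for both A and B
            A[jj, ii] = B[jj, ii]
        end
    end
    return nothing
end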
Consider the basic iteration to generate N random numbers and save them in an array (assume that we are not interested in array comprehensions and that we don't know about calling rand(N)):
function random_numbers(N::Int)
    array = zeros(N)
    for i in 1:N
        array[i] = rand()
    end
    array
end
I am interested in a similar function that takes advantage of the cores of my laptop to generate the same array. I have checked this nice blog where the macros @everywhere, @spawn and @parallel are introduced, but there the calculation is carried out "on-the-fly" and an array is not needed to save the data.
I have the impression that this is very basic and can be done easily using perhaps the function pmap but I am unfamiliar with parallel computing.
My aim is to apply this method to a function that I have built to generate random numbers drawn from an unusual distribution.
I would recommend doing a more careful initialization of the random number generators in the parallel processes, e.g.:
# choose the seed you want
@everywhere srand(1)
# replace 10 below by maximum process id in your case
@everywhere const LOCAL_R = randjump(Base.GLOBAL_RNG, 10)[myid()]
# here is an example usage
@everywhere f() = rand(LOCAL_R)
In this way you:
make sure that your results are reproducible;
have control that there is no overlap between random sequences generated by different processes.
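A hypothetical usage sketch under the same (Julia 0.6-era) setup as above, just to show how the per-process generators would then be consumed; the helper name draw_many is my own:

# Each call to f() on a worker uses that worker's jumped-ahead generator LOCAL_R,
# so the parallel draws come from non-overlapping random streams.
draw_many(n) = pmap(i -> f(), 1:n)

samples = draw_many(1_000)   # 1000 independent draws spread across the workers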
As suggested in the comment, more clarification in the question is always welcome. However, it seems pmap will do what is required. The relevant documentation is here.
The following is an example. Note that the time spent in the pmap method is about half that of the regular map. With 16 cores, the situation might be substantially better:
julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> @everywhere long_rand() = foldl(+, 0, (randn() for i=1:10_000_000))

julia> long_rand()
-1165.9596619177153

julia> @time map(x->long_rand(), zeros(10,10))
  8.455930 seconds (204.89 k allocations: 11.069 MiB)
10×10 Array{Float64,2}:
⋮

julia> @time pmap(x->long_rand(), zeros(10,10));
  6.125479 seconds (773.08 k allocations: 42.242 MiB, 0.25% gc time)

julia> @time pmap(x->long_rand(), zeros(10,10))
  4.609745 seconds (20.99 k allocations: 954.991 KiB)
10×10 Array{Float64,2}:
⋮
I have been working in Julia for a few months now and I am interested in writing some of my code in parallel. I am working on a problem where I use 1 model to generate data for several different receivers (the data for each receiver is a vector). The data for each receiver can be computed independently, which leads me to believe I should be able to use the pmap function. My plan is to initialize the data as a 2D SharedArray (each column represents the data for 1 receiver) and then have pmap loop over each of the columns. However I am finding that using SharedArrays with pmap is no faster than working in serial using map. I wrote the following dummy code to illustrate this point.
@everywhere function Dummy(icol,model,data,A,B)
    nx = 250
    nz = 250
    nh = 50
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end

function main()
    nx = 250
    nz = 250
    nh = 50
    nt = 500
    ncol = 100
    model1 = rand(nz,nx,nh)
    model2 = copy(model1)
    model3 = convert(SharedArray,model1)
    data1 = zeros(Float64,nt,ncol)
    data2 = SharedArray(Float64,nt,ncol)
    data3 = SharedArray(Float64,nt,ncol)
    A1 = rand(nz,nx,nh)
    A2 = copy(A1)
    A3 = convert(SharedArray,A1)
    B1 = rand(nz,nx,nh)
    B2 = copy(B1)
    B3 = convert(SharedArray,B1)
    @time map((arg)->Dummy(arg,model1,data1,A1,B1),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model2,data2,A2,B2),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model3,data3,A3,B3),[icol for icol = 1:ncol])
    println(data1==data2)
    println(data1==data3)
end

main()
I start the Julia session with julia -p 3 and run the script. The times for the 3 tests are 1.4 s, 4.7 s, and 1.6 s respectively. Using SharedArrays with pmap (1.6 s runtime) hasn't provided any improvement in speed compared with regular Arrays with map (1.4 s). I am also confused as to why the 2nd case (data as a SharedArray, all other inputs as a regular Array, with pmap) is so slow. What do I need to change in order to benefit from working in parallel?
Preface: yes, there actually is a solution to your issue. See code at bottom for that. But, before I get there, I'll go into some explanation.
I think the root of the problem here is memory access. First off, although I haven't rigorously investigated it, I suspect that there are a moderate number of improvements that could be made to Julia's underlying code in order to improve the way that it handles memory access in parallel processing. Nevertheless, in this case, I suspect that any underlying issues with the base code, if such actually exist, aren't so much at fault. Instead, I think it is useful to think carefully about what exactly is going on in your code and what it means vis-a-vis memory access.
A key thing to keep in mind when working in Julia is that it stores Arrays in column-major order. That is, it stores them as stacks of columns on top of each other. This generalizes to dimensions > 2 as well. See this very helpful segment of the Julia performance tips for more info. The implication of this is that it is fast to access one row after another after another in a single column. But, if you need to be jumping around columns, then you get into trouble. Yes, accessing ram memory might be relatively fast, but accessing cache memory is much, much faster, and so if your code allows for a single column or so to be loaded from ram into cache and then worked on, then you'll do much better than if you need to be doing lots of swapping between ram and cache. Here in your code, you're switching from column to column between your computations like nobody's business. For instance, in your pmap each process gets a different column of the shared array to work on. Then, each goes down the rows of that column and modifies the values in it. But, since they are trying to work in parallel with one another, and the whole array is too big to fit into your cache, there is lots of swapping between ram and cache that happens and that really slows you down. In theory, perhaps a clever enough under-the-hood memory management system could be devised to address this, but I don't really know - that goes beyond my pay grade. The same thing of course is happening to your accesses to your other objects.
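To make the column-major point concrete, here is a small illustration of my own (not part of the original answer); the only difference between the two functions is which index the inner loop runs over:

# Column-major friendly: the row index i varies fastest, so memory is touched contiguously.
function colmajor_sum(X)
    s = 0.0
    for j in 1:size(X, 2), i in 1:size(X, 1)
        s += X[i, j]
    end
    return s
end

# Cache-unfriendly: the column index j varies fastest, so every access jumps a whole column ahead.
function rowmajor_sum(X)
    s = 0.0
    for i in 1:size(X, 1), j in 1:size(X, 2)
        s += X[i, j]
    end
    return s
end

On a large matrix the first version is typically several times faster, purely because of memory layout.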
Another thing to keep in mind in general when parallelizing is your ratio of flops (i.e. computer calculations) to read/write operations. Flops tend to parallelize well, you can have different cores, processes, etc. doing their own little computations on their own bits of data that they hold in their tiny caches. But, read/write operations don't parallelize so well. There are some things that can be done to engineer hardware systems to improve on this. But in general, if you have a given computer system with say, two cores, and you add four more cores to it, your ability to perform flops will increase three fold, but your ability to read/write data to/from ram won't really improve so much. (note: this is an oversimplification, a lot depends on your system). Nevertheless, in general, the higher your ratio of flops to read/writes, the more benefits you can expect from parallelism. In your case, your code involves a decent number of read/writes (all of those accesses to your different arrays) for a relatively small number of flops (a few multiplications and an addition). It's just something to keep in mind.
Fortunately, your case is amenable to some good speedups from parallelism if written correctly. In my experience with Julia, all of my most successful parallelism comes when I can break data up and have workers process chunks of it separately. Your case happens to be amenable to that. Below is an example of some code I wrote that does that. You can see that it gets nearly a 3x increase in speed going from one processor to three. The code is a bit crude in places, but it at least demonstrates the general idea of how something like this could be approached. I give a few comments on the code afterwards.
addprocs(3)

nx = 250;
nz = 250;
nh = 50;
nt = 250;
@everywhere ncol = 100;

model = rand(nz,nx,nh);
data = SharedArray(Float64,nt,ncol);
A = rand(nz,nx,nh);
B = rand(nz,nx,nh);

function distribute_data(X, obj_name_on_worker::Symbol, dim)
    size_per_worker = floor(Int, size(X,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(X,1)
        end
        println(StartIdx:EndIdx)
        if dim == 3
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:,:])))
        elseif dim == 2
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:])))
        end
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker - 1
    end
end

distribute_data(model, :model, 3)
distribute_data(A, :A, 3)
distribute_data(B, :B, 3)
distribute_data(data, :data, 2)

@everywhere function Dummy(icol,model,data,A,B)
    nx = size(model, 2)
    nz = size(A,1)
    nh = size(model, 3)
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end

regular_test() = map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])

function parallel_test()
    @everywhere begin
        if myid() != 1
            map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
        end
    end
end

@time regular_test();   # 2.120631 seconds (307 allocations: 11.313 KB)
@time parallel_test();  # 0.918850 seconds (5.70 k allocations: 337.250 KB)

getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))

function recombine_data(Data::Symbol)
    Results = cell(nworkers())
    for (idx, pid) in enumerate(workers())
        Results[idx] = getfrom(pid, Data)
    end
    return vcat(Results...)
end

@time P_Data = recombine_data(:data);  # 0.003132 seconds

P_Data == data  ## true
Comments
The use of the SharedArray is quite superfluous here. I just used it since it lends itself easily to being modified in place, which is how your code is originally written. This let me work more directly based on what you wrote without modifying it as much.
I didn't include the step to bring the data back in the time trial, but as you can see, it's quite a trivial amount of time in this case. In other situations, it might be less trivial, but data movement is just one of those issues that you face with parallelism.
When doing time trials in general, it is considered best practice to run the function once (in order to compile the code) and then run it again to get the times. That's what I did here; see the short sketch after these comments.
See this SO post for where I got inspiration for some of these functions that I used here.
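As a tiny illustration of the warm-up-then-time practice mentioned in the comments above (my addition, using the functions defined earlier):

parallel_test()         # first call includes JIT compilation
@time parallel_test()   # second call measures the steady-state runtime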