I have a function, say foo(), that returns an int value, and I have to pass two different values to this function to obtain two different values that have to be summed up, e.g.
result = foo(2) + foo(37)
and I would like foo(2) and foo(37) to be calculated in parallel (at the same time). It may help to have two versions of foo, one that uses a for loop and another that is recursive. I am quite new to Julia and parallel programming, but I would like to get this problem going so that I can keep working on it until I can build it as a web app with Genie.jl. Also, any resources for learning about parallel programming with Julia besides its docs will be highly appreciated!
If you want to use processes for the parallelisation you can use the "distributed for" loop:
8.2.3. Aggregate results
The second situation is when you want to perform a small operation on each of the items but you also want to perform an “aggregation function” at the end to retrieve a scalar value (or an array if the input is a matrix).
In these cases, you can use the @distributed (aggregationFunction) for construct.
As an example, you run in parallel a division by 2 and then use the sum as the aggregation function (assume three working processes are available):
function f(n)
    s = 0.0
    for i = 1:n
        s += i/2
    end
    return s
end
function pf(n)
    s = @distributed (+) for i = 1:n # aggregate using sum on variable s
        i/2
        # last element of for cycle is used by the aggregator
    end
    return s
end
@benchmark f(10000000) # median time: 11.478 ms
@benchmark pf(10000000) # median time: 4.458 ms
(From Julia Quick Syntax Reference)
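To actually run pf, the worker processes have to exist beforehand; a minimal setup (assuming Julia ≥ 1.0, where @distributed is provided by the Distributed standard library and @benchmark by BenchmarkTools) might be:
using Distributed, BenchmarkTools
addprocs(3)   # three worker processes, as assumed in the example above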
Alternatively you can use threads. Julia already has multi-threading, but Julia 1.3 (due in a few days/weeks, at rc4 at the time of writing) will introduce a comprehensive threading API.
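For the literal example in the question, a minimal sketch with Distributed could look like this (foo here is just a toy stand-in for the real function, which would likewise have to be defined with @everywhere):
using Distributed
addprocs(2)

@everywhere foo(n) = sum(1:n)   # toy stand-in for the real foo

# run each call on its own worker, then fetch both results and add them
f1 = @spawnat workers()[1] foo(2)
f2 = @spawnat workers()[2] foo(37)
result = fetch(f1) + fetch(f2)
With the Julia 1.3 threading API the same idea becomes t = Threads.@spawn foo(2) followed by fetch(t), with no worker processes involved.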
This is my main code of parallel operation:
using Distributed
using SharedArrays
nprocs()
addprocs(7)
Now, I need to store a time-dependent variable:
variable = SharedArray{ComplexF64, 3}(Dim, steps, paths)
Note that "steps" and "paths" denote time series and total number of trajectories, respectively. However, if i define this variable, i will meet with the out of memory probelm because Dim=10000, steps=600, and paths=1000, though i can use multiple kernels to achieve parallel operation. The code of parallel operation can be written as
@sync @distributed for path = 1:paths
    ...
    variable[:, :, path] = matrix_var
end
Actually, this variable is not my final result, and the result is
final_var = sum(variable, dims=3)
, which represents the summation of all trajectories.
Thus, I want to deal with the out-of-memory problem while still using parallel operation. If I drop the "paths" dimension when I define this variable, the out-of-memory problem vanishes, but the parallel operation becomes invalid. I hope there is a solution to overcome this.
It seems that for each value of path you should create the variable locally rather than in one huge array. Your code might look more or less like this:
final_vars = @distributed (append!) for path = 1:paths
    # create a local variable for a single path
    locvariable = Array{ComplexF64, 2}(undef, Dim, steps)
    # at any time locvariable exists in at most nprocs() copies
    # load data into locvariable specific to this path and do your job
    final_var = sum(locvariable, dims=2)
    [final_var] # in this way you will get a vector of arrays
end
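Since only the sum over trajectories is needed in the end, a variant of the same idea (a sketch, reusing the names from the question) is to let @distributed add the per-path matrices directly, so that nothing per-path has to be kept:
final_var = @distributed (+) for path = 1:paths
    locvariable = Array{ComplexF64, 2}(undef, Dim, steps)
    # fill locvariable for this path ...
    locvariable   # the (+) reducer sums these Dim x steps matrices over all paths
end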
I am currently writing a numerical solver in Julia. I don't think the math behind it matters too much. It all boils down to the fact that a specific operation is executed several times and uses a large percentage (~80%) of the running time.
I tried to reduce it as much as possible and present you this piece of code, which can be saved as dummy.jl and then executed via include("dummy.jl") followed by dummy(10) (for compilation) and then dummy(1000).
function dummy(N::Int64)
    A = rand(N,N)
    @time timethis(A)
end
function timethis(A::Array{Float64,2})
    dummyvariable = 0.0
    for k = 1:100 # just repeat a few times
        for i = 2:size(A)[1]-1
            for j = 2:size(A)[2]-1
                dummyvariable += slopefit(A[i-1,j], A[i,j], A[i+1,j], 2.0)
                dummyvariable += slopefit(A[i,j-1], A[i,j], A[i,j+1], 2.0)
            end
        end
    end
    println(dummyvariable)
end
@inline function minmod(x::Float64, y::Float64)
    return sign(x) * max(0.0, min(abs(x), y*sign(x)));
end

@inline function slopefit(left::Float64, center::Float64, right::Float64, theta::Float64)
    # arg = ccall((:minmod,"libminmod"), Float64, (Float64,Float64), 0.5*(right-left), theta*(center-left));
    # result = ccall((:minmod,"libminmod"), Float64, (Float64,Float64), theta*(right-center), arg);
    # return result
    tmp = minmod(0.5*(right-left), theta*(center-left));
    return minmod(theta*(right-center), tmp);
    # return 1.0
end
Here, timethis shall imitate the part of the code where I spend a lot of time. I notice that slopefit is extremely expensive to execute.
For example, dummy(1000) takes roughly 4 seconds on my machine. If, instead, slopefit always just returned 1 and did not compute anything, the time would go down to one tenth of the overall time.
Now, obviously there is no free lunch.
I am aware that this is simply a costly operation. But I would still like to optimize it as much as possible, given that a lot of time is spent in something that looks like it could be optimized easily, as it is just a few lines of code.
So far, I tried to implement minmod and slopefit as C-functions and call them, however that just increased computing time (maybe I did it wrong).
So my question is, what possibilities do I have to optimize the call of slopefit?
Note that in the actual code, the arguments of slopefit are not the ones shown here, but depend on conditional statements, which makes everything hard to vectorize (whether that would bring any performance gain, I am not sure).
There are two levels of optimization I can think of.
First: the following implementation of minmod will be faster as it avoids branching (I understand this is the functionality you want):
@inline minmod(x::Float64, y::Float64) = ifelse(x<0, clamp(y, x, 0.0), clamp(y, 0.0, x))
Second: you can use @inbounds to speed up the loop a bit:
@inbounds for i=2:size(A)[1]-1
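Putting the two suggestions together, the hot loop might look like this (a sketch; slopefit stays exactly as in the question):
@inline minmod(x::Float64, y::Float64) = ifelse(x<0, clamp(y, x, 0.0), clamp(y, 0.0, x))

function timethis(A::Array{Float64,2})
    dummyvariable = 0.0
    for k = 1:100
        @inbounds for i = 2:size(A,1)-1, j = 2:size(A,2)-1
            dummyvariable += slopefit(A[i-1,j], A[i,j], A[i+1,j], 2.0)
            dummyvariable += slopefit(A[i,j-1], A[i,j], A[i,j+1], 2.0)
        end
    end
    println(dummyvariable)
end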
I use the @parallel for macro to run simulations for a range of parameters. Each run results in a 1-dimensional vector. In the end I would like to collect the results in a DataFrame.
Up until now I had always created an intermediate array and reduced the for-loop with vcat; then constructed the DataFrame. I thought it might also work to push! the result of each calculation to the master process via remotecall. A minimal example would look like
X = Float64[]
@sync @parallel for i in linspace(1.,10.,10)
    remotecall_fetch(()->push!(X,i), 1)
end
The result is consistently an array X with 9, not 10, elements. The number of dropped elements becomes larger as more workers are added.
This is on julia-0.6.1.
I thought I had understood julia's parallel computing structure, but it seems not.
What is the reason for this behavior? And how can I do it better and safely?
I suspect you're triggering a race condition, though I couldn't say where.
If you only need to return one value per iteration, I would suggest just using pmap:
pmap(linspace(1.,10.,10)) do i
    i
end
otherwise, if each iteration can return multiple values, it would probably be best to use RemoteChannels.
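A sketch of the RemoteChannel approach, in the question's Julia 0.6 syntax (the channel capacity of 10 is arbitrary here):
# a channel owned by process 1 that all workers can safely put! into
results = RemoteChannel(() -> Channel{Float64}(10))

@sync @parallel for i in linspace(1., 10., 10)
    put!(results, i)
end

X = [take!(results) for _ in 1:10]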
I'm trying to do some statistical analysis using Julia. The code consists of the files script.jl (e.g. initialisation of the data) and algorithm.jl.
The number of simulations is large (at least 100,000) so it makes sense to use parallel processing.
The code below is just some pseudocode to illustrate my question —
function script(simulations::Int64)
    # initialise input data
    ...
    # initialise other variables for statistical analysis using zeros()
    ...
    require("algorithm.jl")
    @parallel for z = 1:simulations
        while true
            choices = algorithm(data);
            if length(choices) == 0
                break
            else
                # process choices and pick one (which alters the data)
                ...
            end
        end
    end
    # display results of statistical analysis
    ...
end
and
function algorithm(data)
    # actual algorithm
    ...
    return choices;
end
As an example, I would like to know how many choices there are on average, what the most common choice is, and so on. For this purpose I need to save some data from choices (in the for-loop) to the statistical analysis variables (initialised before the for-loop) and display the results (after the for-loop).
I've read about using @spawn and fetch() and functions like pmap(), but I'm not sure how I should proceed. Just using the variables inside the for-loop does not work, as each proc gets its own copy, so the values of the statistical analysis variables after the for-loop will just be zeros.
[Edit] In Julia I use include("script.jl") and script(100000) to run the simulations; there are no issues when using a single proc. However, when using multiple procs (e.g. via addprocs(3)), all statistical variables are zeros after the for-loop, which is to be expected.
It seems that you want to parallelize an inherently serial operation, because each operation depends on the result of another one (in this case data).
I think if you could implement the above code like this:
@parallel (dosomethingwithdata) for z = 1:simulations
    while true
        choices = algorithm(data, z);
        if length(choices) == 0
            break
        else
            # process choices and pick one (which alters the data)
            ...
        end
    end
    data # the last expression of the loop body is passed to the reducer
end
then you may find a parallel solution for the problem.
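For instance, if each simulation only needs to report a single statistic such as the number of choices it made, a concrete (hypothetical) sketch with vcat as the reducer could be (algorithm would have to be defined on all workers, e.g. via @everywhere):
counts = @parallel (vcat) for z = 1:simulations
    local_data = deepcopy(data)   # each simulation works on its own copy of the data
    nchoices = 0
    while true
        choices = algorithm(local_data)
        length(choices) == 0 && break
        # process choices and pick one (which alters local_data) ...
        nchoices += 1
    end
    [nchoices]   # the last expression of the body is fed to vcat
end
# afterwards the usual statistics can be computed on the master, e.g. mean(counts)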
I was reading Parallel Computing docs of Julia, and having never done any parallel coding, I was left wanting a gentler intro. So, I thought of a (probably) simple problem that I couldn't figure out how to code in parallel Julia paradigm.
Let's say I have a matrix/dataframe df from some experiment. Its N rows are variables, and its M columns are samples. I have a method pwCorr(..) that calculates the pairwise correlation of rows. If I wanted an NxN matrix of all the pairwise correlations, I'd probably run a for-loop that iterates N*N/2 times (upper or lower triangle of the matrix) and fills in the values; however, this seems like a perfect thing to parallelize, since each of the pwCorr() calls is independent of the others. (Am I correct in thinking this way about what can be parallelized, and what cannot?)
To do this, I feel like I'd have to create a DArray that gets filled by a @parallel for loop. And if so, I'm not sure how this can be achieved in Julia. If that's not the right approach, I guess I don't even know where to begin.
This should work. First you need to propagate the top-level variable (data) to all the workers:
for pid in workers()
    remotecall(pid, x->(global data; data=x; nothing), data)
end
then perform the computation in chunks using the DArray constructor with some fancy indexing:
corrs = DArray((20,20)) do I
    out = zeros(length(I[1]), length(I[2]))
    for i=I[1], j=I[2]
        if i<j
            out[i-minimum(I[1])+1, j-minimum(I[2])+1] = 0.0
        else
            out[i-minimum(I[1])+1, j-minimum(I[2])+1] = cor(vec(data[i,:]), vec(data[j,:]))
        end
    end
    out
end
In more detail, the DArray constructor takes a function which takes a tuple of index ranges and returns a chunk of the resulting matrix which corresponds to those index ranges. In the code above, I is the tuple of ranges with I[1] being the first range. You can see this more clearly with:
julia> DArray((10,10)) do I
           println(I)
           return zeros(length(I[1]),length(I[2]))
       end
        From worker 2:  (1:10,1:5)
        From worker 3:  (1:10,6:10)
where you can see it split the array into two chunks on the second axis.
The trickiest part of the example was converting from these 'global' index ranges to local index ranges by subtracting off the minimum element and then adding back 1 for Julia's 1-based indexing.
Hope that helps!