I'm trying to distribute a function that outputs a vector into an array.
I followed this post with something like the following code:
a = distribute([Float64[] for _ in 1:nrow(df)])
#sync #distributed for i in 1:nrow(df)
append!(localpart(a)[i], foo(df[i]))
end
But I get the following error:
BoundsError: attempt to access 145-element Vector{Vector{Float64}} at index [147]
I've only ever parallelized with SharedArrays, which aren't an option, since I need to store vectors in the shared array. Any and all advice would be life-saving.
Each localpart is indexed starting from one.
Hence you need to convert between a global index and the local index.
The function DistributedArrays.localindices() returns a one element tuple that contains a range of global indices that are mapped to the localpart.
This information can be in turn used for the index conversion:
#sync #distributed for i in 1:nrow(df)
id = i - DistributedArrays.localindices(a)[1][1] + 1
push!(localpart(a)[id], f(df[i,1]))
end
EDIT
to understand how localindiced work look at this code:
julia> addprocs(4);
julia> a = distribute([Float64[] for _ in 1:14]);
julia> fetch(#spawnat 2 DistributedArrays.localindices(a))
(1:4,)
julia> fetch(#spawnat 3 DistributedArrays.localindices(a))
(5:8,)
julia> fetch(#spawnat 4 DistributedArrays.localindices(a))
(9:11,)
julia> fetch(#spawnat 5 DistributedArrays.localindices(a))
(12:14,)
My multi processing needs are very simple: I work in machine learning, and I sometimes need to evaluate an algorithm in multiple datasets, or multiple algorithms in a dataset, or some such. I just need to run a function with some arguments and get a number.
I need no RPC, shared data, nothing.
In Julia, I am getting an error with the following code:
type Model
param
end
# 1. I have several algorithms/models
models = [Model(i) for i in 1:50]
# 2. I have one dataset
X = rand(50, 5)
# 3. I want to paralelize this function
#everywhere function transform(m)
sum(X .* m.param)
end
addprocs(3)
println(pmap(transform, models))
I keep getting errors such as,
ERROR: LoadError: On worker 2:
UndefVarError: #transform not defined
Also, is there a way to avoid having to place #everywhere everywhere? Can I just tell that all variables should be copied over to the workers when they are created (as is done in Python multiprocessing)?
My typical code looks obviously much more complicated than this, with models ranging several files.
For reference, this is what I would do in Python:
import numpy as np
import time
# 1. I have several algorithms/models
class Model:
def __init__(self, param):
self.param = param
models = [Model(i) for i in range(1,51)]
# 2. I have one dataset
X = np.random.random((50, 5))
# 3. I want to paralelize this function
def transform(m):
return np.sum(X * m.param)
import multiprocessing
pool = multiprocessing.Pool(4)
print(pool.map(transform, models))
Core issues is you need to add the processes before you attempt to define things on them.
addprocs should always be the first thing you do, before using even (see below).
This is why it is often done with the -p flag when you start julia.
Or with a ---machinefile <file> or with a -L <file>
#everywhere exectutes the code on all processes the currently exist.
i.e. process added after the #everywhere do not have the code executed on them.
Also you missed a few #everywheres.
addprocs(3)
#everywhere type Model
param
end
# 1. I have several algorithms/models
models = [Model(i) for i in 1:50]
# 2. I have one dataset
#everywhere X = rand(50, 5)
# 3. I want to paralelize this function
#everywhere function transform(m)
sum(X .* m.param)
end
println(pmap(transform, models))
Alternatives with fewer #everywheres.
use a block to send a whole block of code #everywhere
addprocs(3)
#everywhere begin
type Model
param
end
X = rand(50, 5)
function transform(m)
sum(X .* m.param)
end
end
models = [Model(i) for i in 1:50]
println(pmap(transform, models))
Use local variables
Local variables (including functions), are sent as required.
though this doesn't help for types.
addprocs(3)
#everywhere type Model
param
end
function main()
X = rand(50, 5)
models = [Model(i) for i in 1:50]
function transform(m)
sum(X .* m.param)
end
println(pmap(transform, models))
end
main()
Use modules
When you using Foo the module Foo is loaded on all processes.
But not brought into scope.
It is a bit weird and counter intuitive.
So much so that I can't conjure a working example of it.
but someone else might.
I was wondering how can I update an array in a #parallel for loop in a function and return the results. Here is a simple example:
addprocs(2)
function parallel_func()
a = Dict{Int64, Int64}()
#sync #parallel for i in 1:10
a[i] = 2*i
end
println(length(a))
return a
end
a = parallel_func()
println(length(a))
Here, a is empty after running the for loop with #parallel macro.
I know #parallel copies the data on each worker and does not touch the original data, but I thought there might be a way to fetch the data from all the workers. I appreciate if you can comment on any alternatives to expedite a for loop like the example above.
#parallel has a return. You can return an array by reducing with vcat or hcat:
arr = #sync #parallel (vcat) for i in 1:10
...
element
end
You can cat to an array of Pairs and build a Dict from that.
Edit
I noticed that you're actually looking to mutate an array. In that case you should use a SharedArray or a DistributedArray.
I am using GLPK with Julia, and using the methods written by spencerlyon
sendto(2, lp = lp) #lp is type GLPK.Prob
However, I cant seem to send a type GLPK.Prob between workers. Whenever I do try to send a type GLPK.Prob, it gets 'sent' and calling
remotecall_fetch(2, whos)
confirms that the GLPK.Prob got sent
The problem appears when I try to solve it by calling
simplex(lp)
the error
GLPK.GLPKError("invalid GLPK.Prob")
appears. I know that the GLPK.Prob isnt originally an invalid GLPK.Prob and if I decide to construct the GLPK.Prob type explicitly on another worker, fx worker 2, calling simplex runs just fine
This is a problem as the GLPK.Prob is generated from a custom type of mine that is a bit on the heavy side
tl;dr Are there possibly some types that cannot be sent between workers properly?
Update
I see now that calling
remotecall_fetch(2, simplex, lp)
will return the above GLPK error
Furthermore I've just noticed that the GLPK module has got a method called
GLPK.copy_prob(GLPK.Prob, GLPK.Prob, Int)
but deepcopy (and certainly not copy) wont work when copying a GLPK.Prob
Example
function create_lp()
lp = GLPK.Prob()
GLPK.set_prob_name(lp, "sample")
GLPK.term_out(GLPK.OFF)
GLPK.set_obj_dir(lp, GLPK.MAX)
GLPK.add_rows(lp, 3)
GLPK.set_row_bnds(lp,1,GLPK.UP,0,100)
GLPK.set_row_bnds(lp,2,GLPK.UP,0,600)
GLPK.set_row_bnds(lp,3,GLPK.UP,0,300)
GLPK.add_cols(lp, 3)
GLPK.set_col_bnds(lp,1,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,1,10)
GLPK.set_col_bnds(lp,2,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,2,6)
GLPK.set_col_bnds(lp,3,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,3,4)
s = spzeros(3,3)
s[1,1] = 1
s[1,2] = 1
s[1,3] = 1
s[2,1] = 10
s[3,1] = 2
s[2,2] = 4
s[3,2] = 2
s[2,3] = 5
s[3,3] = 6
GLPK.load_matrix(lp, s)
return lp
end
This will return a lp::GLPK.Prob() which will return 733.33 when running
simplex(lp)
result = get_obj_val(lp)#returns 733.33
However, doing
addprocs(1)
remotecall_fetch(2, simplex, lp)
will result in the error above
It looks like the problem is that your lp object contains a pointer.
julia> lp = create_lp()
GLPK.Prob(Ptr{Void} #0x00007fa73b1eb330)
Unfortunately, working with pointers and parallel processing is difficult - if different processes have different memory spaces then it won't be clear which memory address the process should look at in order to access the memory that the pointer points to. These issues can be overcome, but apparently they require individual work for each data type that involves said pointers, see this GitHub discussion for more.
Thus, my thought would be that if you want to access the pointer on the worker, you could just create it on that worker. E.g.
using GLPK
addprocs(2)
#everywhere begin
using GLPK
function create_lp()
lp = GLPK.Prob()
GLPK.set_prob_name(lp, "sample")
GLPK.term_out(GLPK.OFF)
GLPK.set_obj_dir(lp, GLPK.MAX)
GLPK.add_rows(lp, 3)
GLPK.set_row_bnds(lp,1,GLPK.UP,0,100)
GLPK.set_row_bnds(lp,2,GLPK.UP,0,600)
GLPK.set_row_bnds(lp,3,GLPK.UP,0,300)
GLPK.add_cols(lp, 3)
GLPK.set_col_bnds(lp,1,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,1,10)
GLPK.set_col_bnds(lp,2,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,2,6)
GLPK.set_col_bnds(lp,3,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,3,4)
s = spzeros(3,3)
s[1,1] = 1
s[1,2] = 1
s[1,3] = 1
s[2,1] = 10
s[3,1] = 2
s[2,2] = 4
s[3,2] = 2
s[2,3] = 5
s[3,3] = 6
GLPK.load_matrix(lp, s)
return lp
end
end
a = #spawnat 2 eval(:(lp = create_lp()))
b = #spawnat 2 eval(:(result = simplex(lp)))
fetch(b)
See the documentation below on #spawn for more info on using it, as it can take a bit of getting used to.
The macros #spawn and #spawnat are two of the tools that Julia makes available to assign tasks to workers. Here is an example:
julia> #spawnat 2 println("hello world")
RemoteRef{Channel{Any}}(2,1,3)
julia> From worker 2: hello world
Both of these macros will evaluate an expression on a worker process. The only difference between the two is that #spawnat allows you to choose which worker will evaluate the expression (in the example above worker 2 is specified) whereas with #spawn a worker will be automatically chosen, based on availability.
In the above example, we simply had worker 2 execute the println function. There was nothing of interest to return or retrieve from this. Often, however, the expression we sent to the worker will yield something we wish to retrieve. Notice in the example above, when we called #spawnat, before we got the printout from worker 2, we saw the following:
RemoteRef{Channel{Any}}(2,1,3)
This indicates that the #spawnat macro will return a RemoteRef type object. This object in turn will contain the return values from our expression that is sent to the worker. If we want to retrieve those values, we can first assign the RemoteRef that #spawnat returns to an object and then, and then use the fetch() function which operates on a RemoteRef type object, to retrieve the results stored from an evaluation performed on a worker.
julia> result = #spawnat 2 2 + 5
RemoteRef{Channel{Any}}(2,1,26)
julia> fetch(result)
7
The key to being able to use #spawn effectively is understanding the nature behind the expressions that it operates on. Using #spawn to send commands to workers is slightly more complicated than just typing directly what you would type if you were running an "interpreter" on one of the workers or executing code natively on them. For instance, suppose we wished to use #spawnat to assign a value to a variable on a worker. We might try:
#spawnat 2 a = 5
RemoteRef{Channel{Any}}(2,1,2)
Did it work? Well, let's see by having worker 2 try to print a.
julia> #spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,4)
julia>
Nothing happened. Why? We can investigate this more by using fetch() as above. fetch() can be very handy because it will retrieve not just successful results but also error messages as well. Without it, we might not even know that something has gone wrong.
julia> result = #spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,5)
julia> fetch(result)
ERROR: On worker 2:
UndefVarError: a not defined
The error message says that a is not defined on worker 2. But why is this? The reason is that we need to wrap our assignment operation into an expression that we then use #spawn to tell the worker to evaluate. Below is an example, with explanation following:
julia> #spawnat 2 eval(:(a = 2))
RemoteRef{Channel{Any}}(2,1,7)
julia> #spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,8)
julia> From worker 2: 2
The :() syntax is what Julia uses to designate expressions. We then use the eval() function in Julia, which evaluates an expression, and we use the #spawnat macro to instruct that the expression be evaluated on worker 2.
We could also achieve the same result as:
julia> #spawnat(2, eval(parse("c = 5")))
RemoteRef{Channel{Any}}(2,1,9)
julia> #spawnat 2 println(c)
RemoteRef{Channel{Any}}(2,1,10)
julia> From worker 2: 5
This example demonstrates two additional notions. First, we see that we can also create an expression using the parse() function called on a string. Secondly, we see that we can use parentheses when calling #spawnat, in situations where this might make our syntax more clear and manageable.