Related
I wrote a function that acts on each combination of columns in an input matrix. It uses multiple for loops and is very slow, so I am trying to parallelize it to use the maximum number of threads on my computer.
I am having difficulty finding the correct syntax to set this up. I'm using the Parallel package in octave, and have tried several ways to set up the calls. Here are two of them, in a simplified form, as well as a non-parallel version that I believe works:
function A = parallelExample(M)
pkg load parallel;
# Get total count of columns
ct = columns(M);
# Generate column pairs
I = nchoosek([1:ct],2);
ops = rows(I);
slice = ones(1, ops);
Ic = mat2cell(I, slice, 2);
## # Non-parallel
## A = zeros(1, ops);
## for i = 1:ops
## A(i) = cmbtest(Ic{i}, M);
## endfor
# Parallelized call v1
A = parcellfun(nproc, #cmbtest, Ic, {M});
## # Parallelized call v2
## afun = #(x) cmbtest(x, M);
## A = parcellfun(nproc, afun, Ic);
endfunction
# function to apply
function P = cmbtest(indices, matrix)
colset = matrix(:,indices);
product = colset(:,1) .* colset(:,2);
P = sum(product);
endfunction
For both of these examples I generate every combination of two columns and convert those pairs into a cell array that the parcellfun function should split up. In the first, I attempt to convert the input matrix M into a 1x1 cell array so it goes to each parallel instance in the same form. I get the error 'C must be a cell array' but this must be internal to the parcellfun function. In the second, I attempt to define an anonymous function that includes the matrix. The error I get here specifies that 'cmbtest' is undefined.
(Naturally, the actual function I'm trying to apply is far more complex than cmbtest here)
Other things I have tried:
Put M into a global variable so it doesn't need to be passed. Seemed to be impossible to put a global variable in a function file, though I may just be having syntax issues.
Make cmbtest a nested function so it can access M (parcellfun doesn't support that)
I'm out of ideas at this point and could use help figuring out how to get this to work.
Converting my comments above to an answer.
When performing parallel operations, it is useful to think of each parallel worker that will result as separate and independent octave instances, which need to have appropriate access to all functions and variables they will require in order to do their independent work.
Therefore, do not rely on subfunctions when calling parcellfun from a main function, since this might lead to errors if the worker is unable to access the subfunction directly under the hood.
In this case, separating the subfunction into its own file fixed the problem.
I am using GLPK with Julia, and using the methods written by spencerlyon
sendto(2, lp = lp) #lp is type GLPK.Prob
However, I cant seem to send a type GLPK.Prob between workers. Whenever I do try to send a type GLPK.Prob, it gets 'sent' and calling
remotecall_fetch(2, whos)
confirms that the GLPK.Prob got sent
The problem appears when I try to solve it by calling
simplex(lp)
the error
GLPK.GLPKError("invalid GLPK.Prob")
appears. I know that the GLPK.Prob isnt originally an invalid GLPK.Prob and if I decide to construct the GLPK.Prob type explicitly on another worker, fx worker 2, calling simplex runs just fine
This is a problem as the GLPK.Prob is generated from a custom type of mine that is a bit on the heavy side
tl;dr Are there possibly some types that cannot be sent between workers properly?
Update
I see now that calling
remotecall_fetch(2, simplex, lp)
will return the above GLPK error
Furthermore I've just noticed that the GLPK module has got a method called
GLPK.copy_prob(GLPK.Prob, GLPK.Prob, Int)
but deepcopy (and certainly not copy) wont work when copying a GLPK.Prob
Example
function create_lp()
lp = GLPK.Prob()
GLPK.set_prob_name(lp, "sample")
GLPK.term_out(GLPK.OFF)
GLPK.set_obj_dir(lp, GLPK.MAX)
GLPK.add_rows(lp, 3)
GLPK.set_row_bnds(lp,1,GLPK.UP,0,100)
GLPK.set_row_bnds(lp,2,GLPK.UP,0,600)
GLPK.set_row_bnds(lp,3,GLPK.UP,0,300)
GLPK.add_cols(lp, 3)
GLPK.set_col_bnds(lp,1,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,1,10)
GLPK.set_col_bnds(lp,2,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,2,6)
GLPK.set_col_bnds(lp,3,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,3,4)
s = spzeros(3,3)
s[1,1] = 1
s[1,2] = 1
s[1,3] = 1
s[2,1] = 10
s[3,1] = 2
s[2,2] = 4
s[3,2] = 2
s[2,3] = 5
s[3,3] = 6
GLPK.load_matrix(lp, s)
return lp
end
This will return a lp::GLPK.Prob() which will return 733.33 when running
simplex(lp)
result = get_obj_val(lp)#returns 733.33
However, doing
addprocs(1)
remotecall_fetch(2, simplex, lp)
will result in the error above
It looks like the problem is that your lp object contains a pointer.
julia> lp = create_lp()
GLPK.Prob(Ptr{Void} #0x00007fa73b1eb330)
Unfortunately, working with pointers and parallel processing is difficult - if different processes have different memory spaces then it won't be clear which memory address the process should look at in order to access the memory that the pointer points to. These issues can be overcome, but apparently they require individual work for each data type that involves said pointers, see this GitHub discussion for more.
Thus, my thought would be that if you want to access the pointer on the worker, you could just create it on that worker. E.g.
using GLPK
addprocs(2)
#everywhere begin
using GLPK
function create_lp()
lp = GLPK.Prob()
GLPK.set_prob_name(lp, "sample")
GLPK.term_out(GLPK.OFF)
GLPK.set_obj_dir(lp, GLPK.MAX)
GLPK.add_rows(lp, 3)
GLPK.set_row_bnds(lp,1,GLPK.UP,0,100)
GLPK.set_row_bnds(lp,2,GLPK.UP,0,600)
GLPK.set_row_bnds(lp,3,GLPK.UP,0,300)
GLPK.add_cols(lp, 3)
GLPK.set_col_bnds(lp,1,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,1,10)
GLPK.set_col_bnds(lp,2,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,2,6)
GLPK.set_col_bnds(lp,3,GLPK.LO,0,0)
GLPK.set_obj_coef(lp,3,4)
s = spzeros(3,3)
s[1,1] = 1
s[1,2] = 1
s[1,3] = 1
s[2,1] = 10
s[3,1] = 2
s[2,2] = 4
s[3,2] = 2
s[2,3] = 5
s[3,3] = 6
GLPK.load_matrix(lp, s)
return lp
end
end
a = #spawnat 2 eval(:(lp = create_lp()))
b = #spawnat 2 eval(:(result = simplex(lp)))
fetch(b)
See the documentation below on #spawn for more info on using it, as it can take a bit of getting used to.
The macros #spawn and #spawnat are two of the tools that Julia makes available to assign tasks to workers. Here is an example:
julia> #spawnat 2 println("hello world")
RemoteRef{Channel{Any}}(2,1,3)
julia> From worker 2: hello world
Both of these macros will evaluate an expression on a worker process. The only difference between the two is that #spawnat allows you to choose which worker will evaluate the expression (in the example above worker 2 is specified) whereas with #spawn a worker will be automatically chosen, based on availability.
In the above example, we simply had worker 2 execute the println function. There was nothing of interest to return or retrieve from this. Often, however, the expression we sent to the worker will yield something we wish to retrieve. Notice in the example above, when we called #spawnat, before we got the printout from worker 2, we saw the following:
RemoteRef{Channel{Any}}(2,1,3)
This indicates that the #spawnat macro will return a RemoteRef type object. This object in turn will contain the return values from our expression that is sent to the worker. If we want to retrieve those values, we can first assign the RemoteRef that #spawnat returns to an object and then, and then use the fetch() function which operates on a RemoteRef type object, to retrieve the results stored from an evaluation performed on a worker.
julia> result = #spawnat 2 2 + 5
RemoteRef{Channel{Any}}(2,1,26)
julia> fetch(result)
7
The key to being able to use #spawn effectively is understanding the nature behind the expressions that it operates on. Using #spawn to send commands to workers is slightly more complicated than just typing directly what you would type if you were running an "interpreter" on one of the workers or executing code natively on them. For instance, suppose we wished to use #spawnat to assign a value to a variable on a worker. We might try:
#spawnat 2 a = 5
RemoteRef{Channel{Any}}(2,1,2)
Did it work? Well, let's see by having worker 2 try to print a.
julia> #spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,4)
julia>
Nothing happened. Why? We can investigate this more by using fetch() as above. fetch() can be very handy because it will retrieve not just successful results but also error messages as well. Without it, we might not even know that something has gone wrong.
julia> result = #spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,5)
julia> fetch(result)
ERROR: On worker 2:
UndefVarError: a not defined
The error message says that a is not defined on worker 2. But why is this? The reason is that we need to wrap our assignment operation into an expression that we then use #spawn to tell the worker to evaluate. Below is an example, with explanation following:
julia> #spawnat 2 eval(:(a = 2))
RemoteRef{Channel{Any}}(2,1,7)
julia> #spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,8)
julia> From worker 2: 2
The :() syntax is what Julia uses to designate expressions. We then use the eval() function in Julia, which evaluates an expression, and we use the #spawnat macro to instruct that the expression be evaluated on worker 2.
We could also achieve the same result as:
julia> #spawnat(2, eval(parse("c = 5")))
RemoteRef{Channel{Any}}(2,1,9)
julia> #spawnat 2 println(c)
RemoteRef{Channel{Any}}(2,1,10)
julia> From worker 2: 5
This example demonstrates two additional notions. First, we see that we can also create an expression using the parse() function called on a string. Secondly, we see that we can use parentheses when calling #spawnat, in situations where this might make our syntax more clear and manageable.
I found this post - Shared array usage in Julia, which is clearly close but I still don't really understand what to do in my case.
I am trying to pass a shared array to a function I define, and call that function using #everywhere. The following, which has no shared array, works:
#everywhere mat = rand(3,3)
#everywhere foo1(x::Array) = det(x)
Then this
#everywhere println(foo1(mat))
properly produces different results from each worker. Now let me include a shared array:
test = SharedArray(Float64,10)
#everywhere foo2(x::Array,y::SharedArray) = det(x) + sum(y)
Then this
#everywhere println(foo2(mat,test))
fails on the workers.
ERROR: On worker 2:
UndefVarError: test not defined
etc. I can get what I want like this:
for w in procs()
#spawnat w println(foo2(eval(:mat),test))
end
This works - but is it optimal? Is there a way to make it work with #everywhere?
While it's tempting to use "named variables" on workers, it generally seems to work better if you access them via references. Schematically, you might do something like this:
mat = [#spawnat p rand(3,3) for p in workers()] # process 1 holds references to objects on workers
#sync for (i, p) in enumerate(workers())
#spawnat p foo(mat[i], sharedarray)
end
So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = #parallel (+) for p in partitions(1:n)
int(is_valid(p))
end
println(valid_num)
This would use the #parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = #parallel (+) for i=1:200000000
Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running over night and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!
The below code gives 511, the number of partitions of size 2 of a set of 10.
using Iterators
s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p)==2
valid_num = #parallel (+) for i = 1:30
sum(map(is_valid, takenth(chain(1:29,drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p)==2
valid_num = #parallel (+) for i = 1:30
foldl((x,y)->(x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next
immutable TakeEvery{Itr}
itr::Itr
start::Any
value::Any
flag::Bool
skip::Int64
end
function take_every(itr, offset, skip)
value, state = Nothing, start(itr)
for i = 1:(offset+1)
if done(itr, state)
return TakeEvery(itr, state, value, false, skip)
end
value, state = next(itr, state)
end
if done(itr, state)
TakeEvery(itr, state, value, true, skip)
else
TakeEvery(itr, state, value, false, skip)
end
end
function start{Itr}(itr::TakeEvery{Itr})
itr.value, itr.start, itr.flag
end
function next{Itr}(itr::TakeEvery{Itr}, state)
value, state_, flag = state
for i=1:itr.skip
if done(itr.itr, state_)
return state[1], (value, state_, false)
end
value, state_ = next(itr.itr, state_)
end
if done(itr.itr, state_)
state[1], (value, state_, !flag)
else
state[1], (value, state_, false)
end
end
function done{Itr}(itr::TakeEvery{Itr}, state)
done(itr.itr, state[2]) && !state[3]
end
One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter,state,n)
i = n
arr = Array[]
while !done(iter,state) && (i>0)
a,state = next(iter,state)
push!(arr,a)
i = i-1
end
return arr, state
end
function get_part(npart,npar)
valid_num = 0
p = partitions(1:npart)
s = start(p)
while !done(p,s)
arr,s = my_take(p,s,npar)
valid_num += #parallel (+) for a in arr
length(a)
end
end
return valid_num
end
valid_num = #time get_part(10,30)
I was going to use the take() method to realize up to npar items from the iterator but take() appears to be deprecated so I've included my own implementation which I've called my_take(). The getPart() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.
I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state:State, datum::Vector{String})
...
score::Vector{Float64} #Size of score = 10000
I have what amounts to a nested loop.
The outer loop typically would runs for 500 iterations
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
score_total = zeros(10100)
for datum in data
score_total+=MyScoringOperation(datum)
end
end
On one computer, on a small test case of 3000 (rather than 50,000) this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia 3.9 installed (and can get 3 more easily, and then can get hundreds more at the next scale).
I have basic experience with #parallel, however it seems like it is spending a lot of time copying the constant (It more or less hang on the smaller testing case)
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
score_total = #parallel(+) for datum in data
MyScoringOperation(state, datum)
end
state = update(state, score_total)
end
My understanding of the way this implementation works with #parallel is that it:
For Each ii:
partitions data into a chuck for each worker
sends that chuck to each worker
works all process there chunks
main procedure sums the results as they arrive.
I would like to remove step 2,
so that instead of sending a chunk of data to each worker,
I just send a range of indexes to each worker, and they look it up from their own copy of data. or even better, only giving each only their own chunk, and having them reuse it each time (saving on a lot of RAM).
Profiling backs up my belief about the functioning of #parellel.
For a similarly scoped problem (with even smaller data),
the non-parallel version runs in 0.09seconds,
and the parallel runs in
And the profiler shows almost all the time is spent 185 seconds.
Profiler shows almost 100% of this is spend interacting with network IO.
This should get you started:
function get_chunks(data::Vector, nchunks::Int)
base_len, remainder = divrem(length(data),nchunks)
chunk_len = fill(base_len,nchunks)
chunk_len[1:remainder]+=1 #remained will always be less than nchunks
function _it()
for ii in 1:nchunks
chunk_start = sum(chunk_len[1:ii-1])+1
chunk_end = chunk_start + chunk_len[ii] -1
chunk = data[chunk_start: chunk_end]
produce(chunk)
end
end
Task(_it)
end
function r_chunk_data(data::Vector)
all_chuncks = get_chunks(data, nworkers()) |> collect;
remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chuncks[ii]) for (ii,pid) in enumerate(workers())]
#Have to add the type annotation sas otherwise it thinks that, RemoteRef(pid) might return a RemoteValue
end
function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
total = nothing
#TODO: consider strongly wrapping total in a lock, when in 0.4, so that it is garenteed safe
#sync for rr in rem_results
function gather(rr)
res=fetch(rr)
if total===nothing
total=res
else
total=red_acc(total,res)
end
end
#async gather(rr)
end
total
end
function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
rem_results = map(r_chunks) do rchunk
function do_mapred()
#assert r_chunk.where==myid()
#pipe r_chunk |> fetch |> map(map_fun,_) |> reduce(red_acc, _)
end
remotecall(r_chunk.where,do_mapred)
end
#pipe rem_results|> convert(Vector{RemoteRef},_) |> fetch_reduce(red_acc, _)
end
rchunk_data breaks the data into chunks, (defined by get_chunks method) and sends those chunks each to a different worker, where they are stored in RemoteRefs.
The RemoteRefs are references to memory on your other proccesses(and potentially computers), that
prechunked_map_reduce does a variation on a kind of map reduce to have each worker first run map_fun on each of it's chucks elements, then reduce over all the elements in its chuck using red_acc (a reduction accumulator function). Finally each worker returns there result which is then combined by reducing them all together using red_acc this time using the fetch_reduce so that we can add the first ones completed first.
fetch_reduce is a nonblocking fetch and reduce operation. I believe it has no raceconditions, though this maybe because of a implementation detail in #async and #sync. When julia 0.4 comes out, it is easy enough to put a lock in to make it obviously have no race conditions.
This code isn't really battle hardened. I don;t believe the
You also might want to look at making the chuck size tunable, so that you can seen more data to faster workers (if some have better network or faster cpus)
You need to reexpress your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100),eye(100)])[:] for _ in 1:3000] #480Mb
chunk_data(:data, data)
#time prechunked_mapreduce(:data, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
#time reduce(+,map(mean,data))
took ~0.06 seconds.