Does @everywhere not load a function on the master? - parallel-processing

I made a module with an if condition on the number of cores. If the number of cores is more than 1, it takes the parallel route; otherwise, it takes the serial route, as seen in the code below:
module mymodule
import Pkg
using Distributed
if nworkers() > 1
    @everywhere using Pkg
    @everywhere Pkg.activate(".")
    @everywhere Pkg.instantiate()
    @everywhere using CSV
    @everywhere include("src/myfuncs.jl")
    function func()
        df = CSV.read(file);
        .......
    end
else
    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()
    using CSV
    include("src/myfuncs.jl")
    function func()
        df = CSV.read(file);
        .......
    end
end
end #mymodule
1) When I start a Julia session with workers, e.g., julia -p 8, I get an error saying ERROR: UndefVarError: CSV not defined. On the other hand, when a session is started simply as julia, there is no error. The Project.toml and Manifest.toml files are one level higher than src. Do I have to load on the master before using @everywhere, like this:
include("src/myfuncs.jl")
#everywhere include("src/myfuncs.jl")
2) Moreover, I find that when the program goes the serial route it can't find the myfuncs.jl file, because it is itself already in the src folder (it looks for src/src/myfuncs.jl); this behavior is confusing me.
Can someone share their thoughts here?

@everywhere does execute on all workers and on the master. However:
1) Sometimes, if you have bad luck and the module you are importing is not yet compiled, a race condition can occur (not always reproducible, but reported by several users on StackOverflow); hence, the best bet is to always write the code like this (note that if your cluster is distributed across many servers, this might not be enough):
using Distributed
@everywhere using Distributed
using CSV
@everywhere using CSV
2) Modify your code to run using before Pkg.activate.
3) I am not sure what you want to achieve with @everywhere Pkg.instantiate(), but what you are doing now certainly cannot be good: you must make sure that it is not run in more than one copy per cluster node.
4) Finally, there is no need to separate your code depending on the number of workers; see the safe pattern in point (1) and the sketch below.
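For illustration, applying that pattern to the module from the question could look roughly like this. It is a sketch only: it keeps the question's file layout and leaves the Pkg.activate/Pkg.instantiate handling from point (3) aside.
module mymodule

using Distributed
@everywhere using Distributed

using CSV              # load and compile CSV on the master first...
@everywhere using CSV  # ...then bring it into scope on every worker

include("src/myfuncs.jl")
@everywhere include("src/myfuncs.jl")

function func()
    df = CSV.read(file);
    .......
end

end #mymodule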
Hope that helps!

Related

Passing multiple arguments to external programs in a Pipeline

I'm trying to build a pipeline for NGS data.
I made a small example pipeline for passing commands to the shell. The example pipeline has two scripts that are called from the shell and simply concatenate (sumtool.py) and multiply (multool.py) values in many dataframes (10 in this case). My wrapper (wrapper.py) handles the input and passes the commands that run the scripts in order. Here is the relevant part of the code from the wrapper:
from functools import wraps
from subprocess import Popen  # already imported in the full wrapper (see Note 2)

def run_cmd(orig_func):
    @wraps(orig_func)
    def wrapper(*args, **kwargs):
        cmdls = orig_func(*args, **kwargs)
        cmdc = ' '.join(str(arg) for arg in cmdls)
        cmd = cmdc.replace(',', '')
        Popen(cmd, shell=True).wait()
    return wrapper

@run_cmd
def runsumtool(*args):
    return args

for file in getcsv():
    runsumtool('python3', 'sumtool.py', '--infile={}'.format(file), '--outfile={}'.format(dirlist[1]))
This works all right, but I want to be able to pass all the commands for the first script at once, for all the dataframes, wait for them to finish, and then run the second script with all the commands at once for every dataframe. Since Popen(cmd, shell=True).wait() waits for each command in turn, it takes much longer.
I tried to incorporate luigi for a solution, but I wasn't successful at running external programs or passing multiple inputs/outputs with luigi. Any tip on that is appreciated.
Another solution I'm imagining is passing the samples individually, all at once, but I'm not sure how to express that in Python (or any other language, really). This would also solve the I/O problem with luigi.
Thanks.
Note 1: This is a small example pipeline I built. My main purpose is to call programs like bwa and picard in a pipeline, which I cannot import.
Note 2: I'm using Popen from subprocess already. You can find it between lines 4 and 5.

Prevent overwriting modules in Julia parallelization

I've written a Julia module with various functions which I call to analyze data. Several of these functions depend on packages, which are included at the start of the file "NeuroTools.jl":
module NeuroTools
using MAT, PyPlot, PyCall;
function getHists(channels::Array{Int8,2}...
Many of the functions I have are useful to run in parallel, so I wrote a driver script to map functions to different threads using remotecall/fetch. To load the functions on each thread, I launch Julia with the -L option to load my module on each worker.
julia -p 16 -L NeuroTools.jl parallelize.jl
To bring the loaded functions into scope, the "parallelize.jl" script has the line
@everywhere using NeuroTools
My parallel function works and executes properly, but each worker thread spits out a bunch of warnings from the modules being overwritten.
WARNING: replacing module MAT
WARNING: Method definition read(Union{HDF5.HDF5Dataset, HDF5.HDF5Datatype, HDF5.HDF5Group}, Type{Bool}) in module MAT_HDF5...
(continues for many lines)
Is there a way to load the module differently or change the scope to prevent all these warnings? The documentation does not seem entirely clear on this issue.
Coincidentally, I was looking for the same thing this morning. You can silence the current process's output with
(rd, wr) = redirect_stdout()
To silence a worker rather than the master, you'd need to call it remotely:
remotecall_fetch(worker_id, redirect_stdout)
If you want to completely turn the output off, that will work. If you want to be able to turn it back on again, you could do:
out = STDOUT
(a, b) = redirect_stdout()
#then to turn it back on, do:
redirect_stdout(out)
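Putting those two pieces together, one rough sketch for hiding the warnings on every process while the module loads (assuming the pre-1.0 STDOUT binding used above) would be:
@everywhere out = STDOUT                  # keep a handle to the real stdout on each process
@everywhere (rd, wr) = redirect_stdout()  # silence output everywhere
@everywhere using NeuroTools              # the "replacing module" warnings are discarded
@everywhere redirect_stdout(out)          # restore normal output on every process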
This is fixed in the more recent releases, and @everywhere using ... is right if you really need the module in scope on all workers. This GitHub issue talks about the problem and has links to some of the other relevant discussions.
If you are still using an older version of Julia where this was the case, just write using NeuroTools in NeuroTools.jl after defining the module, instead of executing @everywhere using NeuroTools.
The Parallel Computing section of the Julia documentation for version 0.5 says,
using DummyModule causes the module to be loaded on all processes; however, the module is brought into scope only on the one executing the statement.
Executing @everywhere using NeuroTools used to tell each process to load the module on all processes, and the result was a pile of replacing module warnings.
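For those older versions, the workaround mentioned above amounts to ending NeuroTools.jl with a plain using statement, roughly like this (a sketch; the module contents are abbreviated):
# NeuroTools.jl
module NeuroTools
using MAT, PyPlot, PyCall;
# ... function definitions ...
end

using NeuroTools  # brings the module into scope on whichever process loads this file via -L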

Choosing a better parallel architecture in Python

I am working on a data wrangling problem in Python which processes a dirty Excel file into a clean Excel file.
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options: 1) using multithreading, 2) using the multiprocessing module, 3) using the ParallelPython module.
I have a basic idea of the three methods; I would like to know which method is best and why.
In brief, processing a SINGLE dirty Excel file today takes 3 minutes.
Objective: to introduce parallelism/concurrency to process multiple files at once.
I am looking for the best method of parallelism to achieve the objective.
Since your process is mostly CPU-bound, multithreading won't be fast because of the GIL...
I would recommend multiprocessing or concurrent.futures since they are a bit simpler than ParallelPython (only a bit :) )
example:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
        print('%s is now clean!' % (file_path))
        # do something with clean_file if you want
Only if you need to distribute the load between servers would I recommend ParallelPython.

Using julia in a cluster

I've been using Julia in parallel on my computer successfully, but I want to increase the number of processors/workers I use, so I plan to use my departmental cluster (UCL Econ). When just using Julia on my computer, I have two separate files. FileA contains all the functions I use, including the main function funcy(x,y,z). FileB calls this function over several processors as follows:
addprocs(4)
require("FileA.jl")
solution = pmap(imw -> funcy(imw,y,z), 1:10)
When I try to run this on the cluster, the require statement does not seem to work (though I don't get an explicit error output, which is frustrating). Any advice?

Julia parallel programming - Making existing function available to all workers

I am faced with the following problem:
I have a function called TrainModel that runs for a very long time on a single thread. When it finishes computing, it returns a function as an output argument; let's call it f. When I inquire about the type of this f, Julia returns:
(generic function with 1 method)
(I am not sure if this last piece of information is useful to anyone reading this.)
Now, in a second step, I need to apply the function f to a very large array of values. This is a step that I would like to parallelise. Having started Julia with multiple processes, e.g.
julia -p 4
ideally, I would use:
pmap(f, my_values)
or perhaps:
aux = @parallel (hcat) for ii=1:100000000
f(my_values[ii])
end
Unfortunately, this doesn't work. Julia complains that the workers are not aware of the function f, i.e. I get the message:
ERROR: function f not defined on process 2
How can I make the function f available to all workers? Obviously a "dirty" solution would be to run the time-consuming function TrainModel on all workers, perhaps like this:
@everywhere f = TrainModel( ... )
but this would be a waste of CPU when all I want is for the result f to be available to all workers.
Though I searched for posts with similar problems, so far I could not find an answer...
Thanks in advance!
best,
N.
The approach of returning the function seems elegant, but unfortunately, unlike JavaScript, Julia does not resolve all the variables when creating the functions.
Technically, your training function could produce the source code of the function with literal values for all the trained parameters. You could then pass it to each of the worker processes, which can parse it in their environment into a callable function.
I suggest returning a data structure that contains all the information needed to produce the trained function: weights of an ANN, support vectors, decision rules, etc.
Then define the "trained" function on the worker processes so that it utilizes the trained parameters. You might want the ability to save the results of the training to disk anyway, so that you can easily reproduce your computations. A sketch of this approach is shown below.
There is a Unix-only solution based on the PTools.jl package (https://github.com/amitmurthy/PTools.jl).
It relies on parallelism via forking instead of Julia's built-in mechanism. Forked processes are spawned with the same workspace as the main process, so all functions and variables are directly available to the workers.
This is similar to the Fork clusters in the R parallel package, so it can be used like the mclapply function.
The function of interest is pfork(n::Integer, f::Function, args...), and one noticeable difference from mclapply in R is that the function f must take the index of the worker as its first argument.
An example:
Pkg.add("PTools")
Pkg.checkout("PTools") #to get the last version, else the package does not build at the time of writing
using PTools
f(workid,x) = x[workid] + 1
pfork(3, f, [1,2,3,4,5]) #Only the three first elements of the array will be computed
3-element Array{Any,1}:
2
3
4
I expect that an interface to pfork will be built so that the first argument of the function will not need to be the index of the worker, but for the time being it can be used to solve the problem.
