Prevent overwriting modules in Julia parallelization - parallel-processing

I've written a Julia module with various functions which I call to analyze data. Several of these functions are dependent on packages, which are included at the start of the file "NeuroTools.jl."
module NeuroTools
using MAT, PyPlot, PyCall;
function getHists(channels::Array{Int8,2}...
Many of my functions are useful to run in parallel, so I wrote a driver script that maps functions to different workers using remotecall/fetch. To make the functions available on each worker, I launch Julia with the -L option so my module is loaded on every process:
julia -p 16 -L NeuroTools.jl parallelize.jl
To bring the loaded functions into scope, the "parallelize.jl" script has the line
@everywhere using NeuroTools
My parallel function works and executes properly, but each worker process spits out a pile of warnings about the modules being overwritten.
WARNING: replacing module MAT
WARNING: Method definition read(Union{HDF5.HDF5Dataset, HDF5.HDF5Datatype, HDF5.HDF5Group}, Type{Bool}) in module MAT_HDF5...
(continues for many lines)
Is there a way to load the module differently or change the scope to prevent all these warnings? The documentation does not seem entirely clear on this issue.

Coincidentally, I was looking for the same thing this morning. If you want to silence the output completely, this works:
(rd, wr) = redirect_stdout()
Since the warnings are printed on the workers, you'd need to call it there, e.g.:
remotecall_fetch(worker_id, redirect_stdout)
If you want to be able to turn the output back on later, save the original stream first:
out = STDOUT
(a, b) = redirect_stdout()
# then to turn it back on, do:
redirect_stdout(out)
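On current Julia (1.x) the same idea still applies, though remotecall_fetch now takes the function first and the "replacing module" warnings are printed to stderr rather than stdout. A minimal sketch, assuming a recent Julia where the standard streams can be redirected to devnull:
using Distributed
for p in workers()
    # discard the worker's stderr/stdout so its load-time warnings are not echoed back;
    # the anonymous function avoids shipping the redirected stream object to the master
    remotecall_fetch(() -> (redirect_stderr(devnull); redirect_stdout(devnull); nothing), p)
end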

This is fixed in more recent releases, and @everywhere using ... is right if you really need the module in scope on all workers. This GitHub issue talks about the problem and has links to some of the other relevant discussions.
If you are still on an older version of Julia where this was the case, just write using NeuroTools in NeuroTools.jl after the module definition, instead of executing @everywhere using NeuroTools.
The Parallel Computing section of the Julia documentation for version 0.5 says,
using DummyModule causes the module to be loaded on all processes; however, the module is brought into scope only on the one executing the statement.
Executing @everywhere using NeuroTools used to tell each process to load the module on all processes, and the result was a pile of "replacing module" warnings.
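On current Julia (1.x), a driver along these lines is enough; the sketch below assumes NeuroTools.jl sits in the working directory and that channel_blocks is a hypothetical collection of inputs for getHists:
using Distributed
addprocs(16)
@everywhere include("NeuroTools.jl")    # define the module on the master and every worker
@everywhere using .NeuroTools           # bring it into scope everywhere (relative, since it is not a package)
results = pmap(NeuroTools.getHists, channel_blocks)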

Related

Julia seems to be very slow

I am running the code shown in this question. I expected it to run faster the second and third time (on the first run it takes time to compile the code). However, it seems to take the same amount of time as the first run. How can I make this code run faster?
Edit: I am running the code by giving command on Linux terminal: julia mycode.jl
I tried following the instructions in the answer by @Przemyslaw Szufel but got the following error:
julia> create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
ERROR: MethodError: no method matching create_sysimage(::Array{String,1}; sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
Closest candidates are:
create_sysimage() at /home/cardio/.julia/packages/PackageCompiler/2yhCw/src/PackageCompiler.jl:462 got unsupported keyword arguments "sysimage_path", "precompile_execution_file"
create_sysimage(::Union{Array{Symbol,1}, Symbol}; sysimage_path, project, precompile_execution_file, precompile_statements_file, incremental, filter_stdlibs, replace_default, base_sysimage, isapp, julia_init_c_file, version, compat_level, soname, cpu_target, script) at /home/cardio/.julia/packages/PackageCompiler/2yhCw/src/PackageCompiler.jl:462
Stacktrace:
[1] top-level scope at REPL[25]:1
I am using Julia on Debian Stable Linux: Debian ⛬ julia/1.5.3+dfsg-3
In Julia, packages are compiled the first time they are used within a Julia session. Hence starting a new Julia process means that Plots.jl gets compiled again each time. This is quite a big package, so it takes a significant time to compile.
To circumvent this, use PackageCompiler.jl and compile Plots.jl into a system image that Julia can reuse later.
The basic steps include:
using PackageCompiler
create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
After this is done you will need to run your code as:
julia --sysimage sys_plots.so mycode.jl
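The precompile_execution_file passed above is just an ordinary script that exercises the code you want baked into the image; the exact contents of precompile_plots.jl below are an assumption, the point is only to trigger compilation of typical Plots calls:
# precompile_plots.jl -- run during sysimage creation to record which methods to compile
using Plots
p = plot(rand(100), rand(100))
scatter!(p, rand(100), rand(100))
savefig(p, tempname() * ".png")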
Similarly, you could have added MultivariateStats and RDatasets to the generated sysimage, but I do not think they cause any significant delay.
Note that if the subsequent runs are part of your development process (rather than your production system) and you are, e.g., developing a Julia module, then you could consider using Revise.jl during development instead of precompiling a sysimage. Having a sysimage means you need to rebuild it every time you update your Julia packages, so I would consider this approach for production rather than development (it depends on your exact scenario).
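For completeness, a development-session sketch with Revise.jl; main() is a hypothetical entry point defined in mycode.jl:
using Revise
includet("mycode.jl")   # track the file; edits are re-evaluated automatically in the session
main()                  # re-run after editing without restarting Julia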
I had this problem and almost went back to Python, but now I run scripts in the REPL with include. It is much faster this way.
Note: First run will be slow but subsequent runs in the same REPL session will be fast even if the script is edited.
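The workflow is simply this (a sketch, using the mycode.jl name from the question):
julia> include("mycode.jl")   # first run in the session: slow while packages compile
julia> include("mycode.jl")   # later runs in the same session: fast, even after edits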
Fedora 36, Julia 1.8.1

Limitations on file append when using in multi-processed environment

My process creates a log file and appends new lines at the end of the file by opening it in append mode, e.g.:
fopen("log.txt", "a");
The order of the writes is not critical, but I need to ensure that fopen always succeeds. My question is: can the call above be executed from multiple processes at the same time on Windows, Linux, and macOS without any race condition?
If not, what is the most common and easy way to ensure I can write to the log file? There is file locking, but a separate lock file (e.g. log.txt.lock) is also possible. Could anyone share some insights or resources that go into more detail?
If you do not use any synchronization between processes, you will very likely hit moments when several processes try to write to the file at once, and the best you can hope for is a mess of interleaved strings.
To synchronize work across several processes (with the multiprocessing module), use a Lock. It prevents several processes from executing the protected section simultaneously.
It will look something like this:
import multiprocessing

# create the lock in the main process and "send" it to the child processes
lock = multiprocessing.Lock()
# ...
# in the child process
with lock:
    do_some_work()
If you need more detailed example, feel free to ask.
Also you can check example in official docs

Does #everywhere not load a function on the master?

I made a module with an if condition on the number of cores. If the number of cores is more than 1, the route is parallel; otherwise, it goes the serial route, as seen in the code below:
module mymodule
import Pkg
using Distributed
if nworkers() > 1
    @everywhere using Pkg
    @everywhere Pkg.activate(".")
    @everywhere Pkg.instantiate()
    @everywhere using CSV
    @everywhere include("src/myfuncs.jl")
    function func()
        df=CSV.read(file);
        .......
    end
else
    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()
    using CSV
    include("src/myfuncs.jl")
    function func()
        df=CSV.read(file);
        .......
    end
end
end #mymodule
1) When I start a Julia session with, e.g., julia -p 8, I get an error saying ERROR: UndefVarError: CSV not defined. On the other hand, when the session is started simply as julia there is no error. The Project.toml & Manifest.toml files are one level higher than src. Do I have to load on the master before using @everywhere, like
include("src/myfuncs.jl")
#everywhere include("src/myfuncs.jl")
2) Moreover, I find that when the program goes the serial route it can't find the myfuncs.jl file because it is already in the src folder (it looks for src/src/myfuncs.jl); this behavior is confusing me.
Can someone share their thoughts here?
@everywhere does execute on all workers and the master. However:
1. Sometimes, if you are unlucky and the module you are importing is not yet compiled, a race condition can occur (not always reproducible, but reported by several users on Stack Overflow). Hence the safest bet is to always write the code like this (note that if your cluster is distributed across many servers, this might not be enough):
using Distributed
@everywhere using Distributed
using CSV
@everywhere using CSV
2. Modify your code to run using before Pkg.activate.
3. I am not sure what you want to achieve with @everywhere Pkg.instantiate(), but what you are doing now cannot be good: you must make sure it is not run in more than one copy per cluster node.
4. Finally, there is no need to separate your code depending on the number of workers; see the safe pattern in point (1).
Hope that helps!
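A minimal sketch of one way to put these points together (the worker count, the directory layout, and the interpolated include path are assumptions for illustration):
using Distributed
addprocs(8)
@everywhere using Pkg
@everywhere Pkg.activate(".")       # project one level above src/
using CSV                           # load on the master first...
@everywhere using CSV               # ...then on every worker
driver_dir = @__DIR__               # directory of this driver file
@everywhere include(joinpath($driver_dir, "src", "myfuncs.jl"))
Building the include path from an absolute directory also sidesteps the src/src/myfuncs.jl confusion in point (2), since include resolves relative paths against the file that calls it.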

Choosing a better parallel architecture in Python

I am working on a data-wrangling problem in Python that processes a dirty Excel file into a clean Excel file.
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options: 1) multithreading, 2) the multiprocessing module, 3) the ParallelPython module.
I have a basic idea of the three methods; I would like to know which method is best and why.
In brief: processing a single dirty Excel file currently takes 3 minutes.
Objective: introduce parallelism/concurrency to process multiple files at once.
I am looking for the best method of parallelism to achieve this objective.
Since your process is mostly CPU-bound, multithreading won't be fast because of the GIL...
I would recommend multiprocessing or concurrent.futures since they are a bit simpler than ParallelPython (only a bit :) )
example:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
        print('%s is now clean!' % (file_path))
        # do something with clean_file if you want
Only if you need to distribute the load between servers would I recommend ParallelPython.

Using julia in a cluster

I've been using Julia in parallel on my computer successfully, but I want to increase the number of processors/workers I use, so I plan to use my departmental cluster (UCL Econ). When just using Julia on my computer, I have two separate files. FileA contains all the functions I use, including the main function funcy(x,y,z). FileB calls this function over several processors as follows:
addprocs(4)
require("FileA.jl")
solution = pmap(imw -> funcy(imw,y,z), 1:10)
When I try to run this on the cluster, the require statement does not seem to work (though I don't get an explicit error output, which is frustrating). Any advice?
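For what it's worth, require for source files was removed in later Julia versions; on current Julia (1.x) the rough equivalent of the driver above, reusing the names from the question, would be a sketch like this:
using Distributed
addprocs(4)
@everywhere include("FileA.jl")                  # defines funcy on the master and all workers
solution = pmap(imw -> funcy(imw, y, z), 1:10)   # y and z as defined in your driver
On a cluster you would typically replace the plain addprocs call with the mechanism your scheduler provides (for example via ClusterManagers.jl), which is often where a script that works locally starts to behave differently.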
