Using Julia in a cluster - parallel-processing

I've been using Julia in parallel on my computer successfully, but I want to increase the number of processors/workers I use, so I plan to use my departmental cluster (UCL Econ). When just using Julia on my computer, I have two separate files. FileA contains all the functions I use, including the main function funcy(x,y,z). FileB calls this function over several processors as follows:
addprocs(4)
require("FileA.jl")
solution = pmap(imw -> funcy(imw,y,z), 1:10)
When I try to run this on the cluster, the require statement does not seem to work (though I don't get an explicit error output, which is frustrating). Any advice?
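For context, require was removed in later Julia versions; a rough sketch of the same setup on a recent Julia might look like the following (FileA.jl, funcy, y and z are the poster's names, the worker count and placeholder values are illustrative):
using Distributed
addprocs(4)                        # or a cluster manager appropriate for the departmental cluster
@everywhere include("FileA.jl")    # define funcy(x, y, z) on the master and on every worker
y, z = 1.0, 2.0                    # placeholder values
solution = pmap(imw -> funcy(imw, y, z), 1:10)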

Related

What is a good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations on approx. 270,000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-power computer with 64 cores. Since Python natively only uses one processor, I want to use multiprocessing, which is currently working. However, it does not seem very fast: even when starting 64 processes it uses roughly 10% of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as CSV) and compare it to the contents of the radObj.metdata object. Both the metdata and my result file use dates, so all dates which exist in the metdata but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which, every 10 seconds, gets all items from the result queue and writes them to the result file. This function runs in its own multiprocessing.Process, like so:
fileWriteProcess = Process(target=fileWriter,args=(resultQueue,resultFileName)).start()
A ray trace function with a unique ID which does the following:
Get an index idx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
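A stripped-down sketch of the queue/process layout described above could look roughly like this (the function names, results.csv, and the worker count are illustrative, and the bifacial_radiance steps are reduced to a comment):
import time
import queue                      # only needed for the queue.Empty exception
from multiprocessing import Process, Queue

def file_writer(result_queue, result_file_name):
    # Every 10 seconds, drain the result queue and append the lines to the CSV file.
    while True:
        time.sleep(10)
        with open(result_file_name, "a") as f:
            while True:
                try:
                    f.write(result_queue.get_nowait() + "\n")
                except queue.Empty:
                    break

def ray_trace_worker(name, index_queue, result_queue):
    # Pull indices until the queue is empty; run the ray-tracing steps; push one CSV line per index.
    while True:
        try:
            idx = index_queue.get_nowait()
        except queue.Empty:
            return
        # ... radObj.gendaylit(idx), makeOct with a per-process file name,
        #     AnalysisObj / moduleAnalysis / analysis as in the steps above ...
        result_queue.put(f"{idx},...")   # the comma-separated result line

if __name__ == "__main__":
    index_queue, result_queue = Queue(), Queue()
    # ... fill index_queue with the dates missing from the result file ...
    # daemon=True so the writer exits with the main process; a real script may want a cleaner shutdown
    Process(target=file_writer, args=(result_queue, "results.csv"), daemon=True).start()
    workers = [Process(target=ray_trace_worker, args=(f"w{i}", index_queue, result_queue)) for i in range(64)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()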
This speeds up the whole simulation quite a bit (from 10 days down to about 1½ days), but as mentioned earlier the CPU is running at around 10% capacity and the GPU at around 25% capacity. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that the processes are not synchronized, as the results are written slightly out of order while the input EPW file is sorted.
My question is whether there is a better way to do this that might make it run faster. I can see in the source code that a boolean "hpc" is used to initialize some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.

monetdbe: multiple connections reading vs writing

I am finding that with monetdbe (embedded, Python), I can import data into two tables simultaneously from two processes, but I can't run two SELECT queries at the same time.
For example, if I run this in Bash:
(python stdinquery.py < sql_examples/wind.sql &); (python stdinquery.py < sql_examples/first_event_by_day.sql &)
then I get this error from one process, while the other finishes its query fine:
monetdbe.exceptions.OperationalError: Failed to open database: MALException:monetdbe.monetdbe_startup:GDKinit() failed (code -2)
I'm a little surprised that it can write two tables at once but not read two tables at once. Am I overlooking something?
My stdinquery.py is just:
import sys
import monetdbe
monet_conn = monetdbe.connect("dw.db")
cursor = monet_conn.cursor()
query = sys.stdin.read()
cursor.executescript(query)
print(cursor.fetchdf())
You are starting multiple concurrent Python processes. Each of those tries to create or open a database on disk at the dw.db location. That won't work because the embedded database processes are not aware of each other.
With the core C library of monetdbe, it is possible to write multi-threaded applications where each connecting application thread uses its own connection object; see the example written in C here. Again, this only works for concurrent threads within a single monetdbe process, not for multiple concurrent monetdbe processes claiming the same database location.
Unfortunately it is not currently possible with the Python monetdbe module to set up something analogous to the C example above, but in the next release it will probably be possible to use e.g. concurrent.futures to write something similar in Python.
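Purely as a sketch of what such a concurrent.futures version might look like once the module supports it (the connection-per-thread layout mirrors the C example and is not expected to work with the current release):
from concurrent.futures import ThreadPoolExecutor
import monetdbe

def run_query(sql):
    # One connection object per worker thread, all inside a single process.
    conn = monetdbe.connect("dw.db")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchdf()

queries = [open("sql_examples/wind.sql").read(),
           open("sql_examples/first_event_by_day.sql").read()]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_query, queries))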

Running pipelines with data parallelization

I've been running the Kedro tutorials (the Hello World and the Spaceflight ones) and I'm wondering if it's easily possible to do data parallelization using Kedro.
Imagine a situation where I have a node that needs to be executed on millions of files.
I've seen that there's the kedro run -p option, but this only does task parallelization (as stated here: https://kedro.readthedocs.io/en/latest/03_tutorial/04_create_pipelines.html).
Thanks for any feedback.
Kedro has a number of built-in DataSet classes. For IO parallelization there is SparkDataSet, which delegates IO parallelization to PySpark: https://kedro.readthedocs.io/en/latest/04_user_guide/09_pyspark.html#creating-a-sparkdataset
Another dataset is DaskDataSet, but this is still a work in progress in this PR: https://github.com/quantumblacklabs/kedro/pull/97 (if you want to use Dask, you could have a look at that PR and create your own custom dataset).
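For illustration, using the SparkDataSet mentioned above might look roughly like this (the file path and load arguments are made up, and the exact import path depends on the Kedro version; the linked PySpark guide is the authoritative reference):
from kedro.contrib.io.pyspark import SparkDataSet   # import path may differ in newer Kedro releases

cars = SparkDataSet(
    filepath="data/01_raw/cars.csv",
    file_format="csv",
    load_args={"header": True, "inferSchema": True},
)
df = cars.load()   # a pyspark.sql.DataFrame, so Spark handles the partitioned IO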

Does @everywhere not load a function on the master?

I made a module with an if condition on the number of cores. If the number of cores is more than 1, the route is parallel; otherwise it goes the serial route, as seen in the code below:
module mymodule
    import Pkg
    using Distributed
    if nworkers() > 1
        @everywhere using Pkg
        @everywhere Pkg.activate(".")
        @everywhere Pkg.instantiate()
        @everywhere using CSV
        @everywhere include("src/myfuncs.jl")
        function func()
            df=CSV.read(file);
            .......
        end
    else
        using Pkg
        Pkg.activate(".")
        Pkg.instantiate()
        using CSV
        include("src/myfuncs.jl")
        function func()
            df=CSV.read(file);
            .......
        end
    end
end #mymodule
1) When I instantiate a Julia session, e.g., julia -p 8, I get an error saying ERROR: UndefVarError: CSV not defined. On the other hand, when a session is instantiated simply as julia, there is no error. The Project.toml & Manifest.toml files are one level higher than src. Do I have to load on the master before using @everywhere, like
include("src/myfuncs.jl")
@everywhere include("src/myfuncs.jl")
2) Moreover, I find that when the program goes the serial route it can't find the myfuncs.jl file because it is already in the src folder (it looks for src/src/myfuncs.jl); this behavior is confusing me.
Can someone share their thoughts here?
@everywhere does execute on all workers and the master. However:
1. Sometimes, if you have bad luck and the module that you are importing is not yet compiled, a race condition can occur (not always reproducible, but reported by several users on Stack Overflow). Hence, the best bet is to always write the code like this (note that if your cluster is distributed across many servers, this might not be enough):
using Distributed
@everywhere using Distributed
using CSV
@everywhere using CSV
2. Modify your code to run using before Pkg.activate.
3. I am not sure what you want to achieve with @everywhere Pkg.instantiate(), but what you are doing now cannot be good (you must make sure that it is not run in more than one copy per cluster node).
4. Finally, there is no need to separate your code depending on the number of workers; see the safe pattern in point (1).
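Putting those points together, a minimal sketch could look like the following (it keeps the poster's file layout and assumes a Julia version where workers started with -p inherit the active project, e.g. julia -p 8 --project=.; otherwise pass --project through addprocs' exeflags):
using Distributed
@everywhere using Distributed
@everywhere using CSV
@everywhere include("src/myfuncs.jl")   # defines the helper functions on the master and on every worker

function func(file)
    df = CSV.read(file)                 # CSV usage kept as in the question (older CSV.jl syntax)
    # ...
end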
Hope that helps!

Prevent overwriting modules in Julia parallelization

I've written a Julia module with various functions which I call to analyze data. Several of these functions depend on packages, which are loaded at the start of the file "NeuroTools.jl":
module NeuroTools
using MAT, PyPlot, PyCall;
function getHists(channels::Array{Int8,2}...
Many of the functions I have are useful to run in parallel, so I wrote a driver script to map functions to different threads using remotecall/fetch. To load the functions on each thread, I launch Julia with the -L option to load my module on each worker.
julia -p 16 -L NeuroTools.jl parallelize.jl
To bring the loaded functions into scope, the "parallelize.jl" script has the line
@everywhere using NeuroTools
My parallel function works and executes properly, but each worker thread spits out a bunch of warnings from the modules being overwritten.
WARNING: replacing module MAT
WARNING: Method definition read(Union{HDF5.HDF5Dataset, HDF5.HDF5Datatype, HDF5.HDF5Group}, Type{Bool}) in module MAT_HDF5...
(continues for many lines)
Is there a way to load the module differently or change the scope to prevent all these warnings? The documentation does not seem entirely clear on this issue.
Coincidentally, I was looking for the same thing this morning. You can redirect a worker's output with
(rd,wr) = redirect_stdout()
so you'd need to call
remotecall_fetch(worker_id, redirect_stdout)
on each worker. If you want to completely turn the output off, this will work. If you want to turn it back on, you could do:
out = STDOUT
(a,b) = redirect_stdout()
#then to turn it back on, do:
redirect_stdout(out)
This is fixed in the more recent releases, and @everywhere using ... is right if you really need the module in scope in all workers. This GitHub issue talks about the problem and has links to some of the other relevant discussions.
If you are still using an older version of Julia where this was the case, just write using NeuroTools in NeuroTools.jl after defining the module, instead of executing @everywhere using NeuroTools.
The Parallel Computing section of the Julia documentation for version 0.5 says,
using DummyModule causes the module to be loaded on all processes; however, the module is brought into scope only on the one executing the statement.
Executing @everywhere using NeuroTools used to tell each process to load the module on all processes, and the result was a pile of "replacing module" warnings.
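Concretely, the older-version workaround described above amounts to something like this (sketch only; the file, package and function names are the ones from the question):
# NeuroTools.jl -- loaded on every process via: julia -p 16 -L NeuroTools.jl parallelize.jl
module NeuroTools
using MAT, PyPlot, PyCall
# ... function definitions such as getHists ...
end
using NeuroTools    # at the end of the file, instead of @everywhere using NeuroTools in parallelize.jl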
