Julia parallel programming - Making existing function available to all workers - for-loop

I am faced with the following problem:
I have a function called TrainModel that runs for a very long time on a single thread. When it finishes computing, it returns a function as an output argument; let's call it f. When I inquire about the type of this f, Julia returns:
(generic function with 1 method)
(I am not sure if this last piece of information is useful to anyone reading this.)
Now, in a second step, I need to apply the function f to a very large array of values. This is a step that I would like to parallelise. Having started Julia with multiple processes, e.g.
julia -p 4
ideally, I would use:
pmap(f, my_values)
or perhaps:
aux = @parallel (hcat) for ii=1:100000000
f(my_values[ii])
end
Unfortunately, this doesn't work. Julia complains that the workers are not aware of the function f, i.e. I get the message:
ERROR: function f not defined on process 2
How can I make function f available to all workers? Obviously a "dirty" solution would be to run the time-consuming function TrainModel on all workers, like this perhaps:
@everywhere f = TrainModel( ... )
but this would be a waste of CPU time when all I want is for the result f to be available to all workers.
Though I searched for posts with similar problems, so far I could not find an answer...
Thanks in advance!
best,
N.

The approach of returning the function seems elegant, but unfortunately, unlike JavaScript, Julia does not resolve all the variables when creating the function.
Technically, your training function could produce the source code of the function with literal values for all the trained parameters. Then pass it to each of the worker processes, which can parse it in their environment to a callable function.
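As a rough illustration of that idea (written in current Julia syntax rather than that of the original question), a sketch in which the "generated source" is just a hard-coded string; the formula inside fsrc is made up and would in practice be produced by the training routine:

using Distributed
addprocs(4)

# Pretend TrainModel produced this source text, with the trained
# parameters baked in as literal values.
fsrc = "f(x) = 2.5 * x + 0.3"

# Parse and evaluate the source on every process (master and workers),
# so that a named function f exists everywhere.
@everywhere eval(Meta.parse($fsrc))

pmap(f, 1:10)   # f is now resolvable by name on all workers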
I suggest returning a data structure that contains all the information needed to produce the trained function: the weights of an ANN, support vectors, decision rules, ...
Define the "trained" function on the worker processes so that it utilizes the trained parameters. You will probably want the ability to save the results of the training to disk anyway, so that you can easily reproduce your computations.

There is a Unix-only solution based on the PTools.jl package (https://github.com/amitmurthy/PTools.jl).
It relies on parallelism via forking instead of the Julia in-built mechanism. Forked processes are spawned with the same workspace as the main process, so all functions and variables are directly available to the workers.
This is similar to the Fork clusters in R's parallel package, so it can be used in the same way as the mclapply function.
The function of interest is pfork(n::Integer, f::Function, args...), and one notable difference from mclapply in R is that the function f must take the index of the worker as its first argument.
An example:
Pkg.add("PTools")
Pkg.checkout("PTools") #to get the last version, else the package does not build at the time of writing
using PTools
f(workid,x) = x[workid] + 1
pfork(3, f, [1,2,3,4,5]) # Only the first three elements of the array will be computed
3-element Array{Any,1}:
2
3
4
I expect that an interface to pfork will be built so that the first argument of the function will not need to be the index of the worker, but for the time being it can be used to solve the problem.

Related

Calling two forward modules asynchronously in parallel in PyTorch

I have to compute the output of two modules in parallel. There is no dependency between the two computations. Is this possible in PyTorch? I only need to do this during eval, without batch processing.
My current code looks like this (input1.shape[0] = 1 and input2.shape[0] = 1)
x = self.layer1(input1)
y = self.layer2(input2)
I am wondering if something like this is feasible -
x, y = call_in_parallel(self.layer1(input1), self.layer2(input2))
Note that I do not need this functionality for training the model. I only need it during the evaluation phase during which I process the inputs in a batch size of 1.
Plus, in my case, I do not have simple MLP layers. I would like to call separate forward passes on more complex models like GNNs or transformers.

What is a good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270,000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-power computer with 64 cores. Since Python natively only uses one processor, I want to use multiprocessing, which is currently working. However, it does not seem very fast; even when starting 64 processes it uses roughly 10% of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
fileWriteProcess = Process(target=fileWriter,args=(resultQueue,resultFileName)).start()
A ray-trace function with a unique ID which does the following:
1) Get an index idx from the index queue (described above).
2) Use this index in radObj.gendaylit(idx).
3) Create the octfile. For this I have modified the name the octfile is saved with to use a prefix which is the name of the process; this is to avoid all the processes using the same octfile on the SSD: octfile = radObj.makeOct(prefix=name)
4) Run an analysis: analysis = bifacial_radiance.AnalysisObj(octfile, radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
5) Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation process quite a bit (from 10 days down to about 1½ days), but as said earlier the CPU is running at around 10% capacity and the GPU at around 25% capacity. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that the processes are not synchronized, as the results are written slightly out of order while the input EPW file is sorted.
My question is whether there is a better way to do this, one which might make it run faster. I can see in the source code that a boolean "hpc" is used to initialize some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.

Does @everywhere not load a function on the master?

I made a module with an if condition on the number of cores. If the number of cores is more than 1 the route is parallel; otherwise, it goes the serial route as seen in the code below
module mymodule
import Pkg
using Distributed
if nworkers() > 1
    @everywhere using Pkg
    @everywhere Pkg.activate(".")
    @everywhere Pkg.instantiate()
    @everywhere using CSV
    @everywhere include("src/myfuncs.jl")
    function func()
        df=CSV.read(file);
        .......
    end
else
    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()
    using CSV
    include("src/myfuncs.jl")
    function func()
        df=CSV.read(file);
        .......
    end
end
end #mymodule
1) When I start a Julia session with workers, e.g., julia -p 8, I get an error saying ERROR: UndefVarError: CSV not defined. On the other hand, when a session is started simply as julia there is no error. The Project.toml & Manifest.toml files are one level higher than src. Do I have to load on the master before using @everywhere, like
include("src/myfuncs.jl")
@everywhere include("src/myfuncs.jl")
2) Moreover, I find that when the program goes the serial route it can't find the myfuncs.jl file because it is already in the src folder (it looks for src/src/myfuncs.jl); this behavior is confusing me.
Can someone share their thoughts here?
@everywhere does execute on all workers and the master. However:
Sometimes, if you have bad luck and the module you are importing is not compiled, a race condition can occur (not always reproducible, but reported by several users on StackOverflow). Hence, the best bet is to always write the code like this (note that if your cluster is distributed across many servers, this might not be enough):
using Distributed
@everywhere using Distributed
using CSV
@everywhere using CSV
Modify your code to run using before Pkg.activate
I am not sure what you want to achieve with @everywhere Pkg.instantiate(), but what you are doing now certainly cannot be good (you must make sure that it is not run in more than one copy per cluster node)
Finally, there is no need to separate your code depending on the number of workers - see the safe pattern in point (1)
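Putting these points together, a hedged sketch of how the loading code might be collapsed into a single path that works both with and without workers. It is sketched as a plain setup script rather than inside the module, since the @everywhere calls are simplest to reason about at the top level; the file and function names come from the question, the joinpath(@__DIR__, ...) line is one way to address the src/src/myfuncs.jl lookup from point (2), and it assumes the package environment is also visible to the workers:

# loaded from src/, next to myfuncs.jl
using Distributed
@everywhere using Distributed

using Pkg
Pkg.activate(".")          # activate the project once, on the master

using CSV
@everywhere using CSV      # a harmless extra no-op when there are no workers

# Resolve the path on the master relative to this file, then interpolate it
# into @everywhere so every process includes the exact same file regardless
# of its working directory.
funcsfile = joinpath(@__DIR__, "myfuncs.jl")
include(funcsfile)
@everywhere include($funcsfile)

function func(file)
    df = CSV.read(file)
    # ...
end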
Hope that helps!

Using julia in a cluster

I've been using Julia in parallel on my computer successfully, but I want to increase the number of processors/workers I use, so I plan to use my departmental cluster (UCL Econ). When just using Julia on my computer, I have two separate files. FileA contains all the functions I use, including the main function funcy(x,y,z). FileB calls this function over several processors as follows:
addprocs(4)
require("FileA.jl")
solution = pmap(imw -> funcy(imw,y,z), 1:10)
When I try to run this on the cluster, the require statement does not seem to work (though I don't get an explicit error output, which is frustrating). Any advice?
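For what it's worth, in current Julia versions the usual replacement for require here is include on the master plus @everywhere include for the workers. A rough sketch of FileB along those lines, assuming FileA.jl is reachable from each worker's working directory (usually true on a cluster with a shared filesystem) and using placeholder values for y and z:

# FileB.jl
using Distributed
addprocs(4)                  # on a cluster this may instead be driven by the scheduler

# Load the function definitions on the master and on every worker.
include("FileA.jl")
@everywhere include("FileA.jl")

y, z = 1.0, 2.0              # placeholders for whatever funcy actually needs
solution = pmap(imw -> funcy(imw, y, z), 1:10)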

Is it faster to use a persistent variable than to reallocate memory each function call by e.g. zeros() if a function is called often?

Assume I have a function which is called often, say by an ODE-solver or similar. Is it faster to use a persistent variable than to reallocate it each time?
That is, which function would be faster and what is best practice?
function ret = thisfunction(a, b, c)
    A = zeros(3);
    foo = 3;
    bar = 34;
    % ...
    % process some in A
    % ...
    ret = A\c;
end
or
function ret = thatfunction(a, b, c)
    persistent A foo bar
    if isempty(A)
        A = zeros(3);
        foo = 3;
        bar = 34;
    end
    % ...
    % process some in A
    % ...
    ret = A\c;
end
Which one is faster can only be proven by testing, as it may depend on variable size, etc. However, I would say that if persistent variables are not required, they are usually also not recommended.
Therefore I would definitely recommend you to use option number one.
Sidenote: you probably want to check whether it exists rather than whether it is empty. Furthermore, I don't know what happens to your A when you leave the function scope; if you want to define it as persistent or global, you may have to do it one level higher.
When you have a single function such as this to test, I have found that it's very easy to set up a parent function, run the function you are testing, say, 10 million times, and time the results. Then consider the difference in time AND the possible trade-offs or side effects of using a persistent variable here. It may not be worth it if the difference is a few percent over 10 million calls and you are actually only going to call the function 10 thousand times in your application. YMMV.
In regard to best practice, I would dissuade you from using persistent variables in this manner, for two reasons.
Persistent variables can be cleared externally; e.g., running clear('thatfunction') from any other function that has "thatfunction" on the path would reset your persistent variables in "thatfunction". As such, it's possible that they'll be unwittingly reset elsewhere. This may not be a problem for you in this context, but if you want to keep results between function calls (which is the primary point of persistent variables) this can cause you headaches.
Also, if you modify them, you'll have to remember to clear them when you're done running, in order to reset your workspace to a clean state. Otherwise, if you (or someone else) runs your program again without clearing your persistent variable(s) first, it will pick up the results from the previous run. This isn't an issue if they're read-only, but you cannot enforce that they will be.
